Apparently, it's very easy to stall GCN's geometry engines by context switching. Interestingly, Nvidia's geometry engines are designed to context switch rapidly. So, the intelligent workgroup distributor in Vega (and GCN, generally) will group work as efficiently as possible to reduce context switching. It'll also hold work until there are enough instructions to fill a wavefront (64 threads). Nvidia operates with 32-thread warps, and each section of 32 cores inside an SM has its own warp scheduler. Pascal tops out at 2048 resident threads per SM, spread over 64 warps (likely virtualized).
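To make the wavefront vs. warp arithmetic concrete, here is a minimal Python sketch. The function names and the 96-thread example workload are my own illustration, not anything from AMD or Nvidia; the only hardware numbers used are the ones quoted above (64-thread wavefronts, 32-thread warps, 64 warps per Pascal SM).

```python
import math

WAVEFRONT_SIZE = 64           # GCN wavefront width, in threads
WARP_SIZE = 32                # Nvidia warp width, in threads
PASCAL_MAX_WARPS_PER_SM = 64  # resident warps per SM, as quoted above


def waves_needed(threads: int, wave_size: int) -> int:
    """Number of wavefronts/warps needed to cover `threads` work items."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return math.ceil(threads / wave_size)


def lane_utilization(threads: int, wave_size: int) -> float:
    """Fraction of SIMD lanes doing useful work if the group is issued as-is."""
    return threads / (waves_needed(threads, wave_size) * wave_size)


# 64 warps * 32 threads = the 2048-thread per-SM limit mentioned above.
max_threads_per_sm = PASCAL_MAX_WARPS_PER_SM * WARP_SIZE

if __name__ == "__main__":
    print(max_threads_per_sm)
    # Hypothetical 96-thread group: fills 3 warps exactly, but only 1.5 wavefronts.
    print(lane_utilization(96, WARP_SIZE), lane_utilization(96, WAVEFRONT_SIZE))
```

This only counts idle lanes from partially filled waves. It says nothing about stalls, scheduling, or how long the distributor actually holds work, so treat it as the simplest possible picture of why a narrower wave is easier to fill.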
It seems easier for Nvidia to fill and task its hardware effectively. GCN has typically struggled with underutilization of CUs, maybe from the workgroup distributor holding work when a 64-thread wavefront isn't reached or geometry engines stalling a bit. I'm really not sure. There is definitely pressure on vGPRs in gaming.
GCN is primarily in-order execution, with async compute tasks prioritized to the front of the processing queue, while Nvidia's architectures based on the GPC/SM design are out-of-order execution, which is why each group of 32 cores has its own scheduler. Their PolyMorph engines can also work out-of-order, so if geometry data isn't ready for something in the pipeline, they can context switch and render something that is ready, then go back and finish the previous work.
MSAA and SSAA hit Vega pretty hard, which suggests that AMD didn't do any work to improve these ROP-intensive anti-aliasing techniques. The market is moving away from them anyway and more towards shader-based techniques.
Vega is maxed out on ROPs. 1 raster engine can have a max of 16 ROPs. There are 4 raster engines in Vega64, as it is still a 4 shader engine design.
AMD uses 4 physical ROP blocks (render back-ends) per shader engine, each capable of processing 4 color pixels per clock. In die shots, you can see this design choice in each shader engine. This gives a total of 16 ROPs per shader engine, or 64 across Vega64's 4 shader engines. This design is well suited to GCN's 4-wide SIMD structure.
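The ROP count is just a product of three numbers, which a short Python sketch makes explicit. The function and parameter names are hypothetical, and the 6-engine case is only there to show what a 96 ROP layout would imply under the same per-engine design; it is not a real product configuration.

```python
def total_rops(shader_engines: int,
               rop_blocks_per_engine: int = 4,
               pixels_per_block: int = 4) -> int:
    """ROPs = shader engines * physical ROP blocks per engine * pixels per block."""
    return shader_engines * rop_blocks_per_engine * pixels_per_block


if __name__ == "__main__":
    print(total_rops(1))  # one shader engine
    print(total_rops(4))  # Vega64's 4 shader engine layout
    print(total_rops(6))  # hypothetical 6 shader engine layout
```

With the per-engine numbers held fixed, the shader engine count is the only thing left to change, so going past 64 ROPs means a different top-level layout.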
To add more ROPs, you'd need more shader engines, which means you'd need to redesign and rebalance the entire architecture. There's also no guarantee of extra performance with added ROPs; there's a point of diminishing returns, which is why Nvidia is very aggressive with compression in their ROPs. The entire architecture is intrinsically linked, so if you have say 96 ROPs, but they're all underutilized, you'd draw even more power (from extra hardware units) for little to no gain. Everything must be thought out and designed carefully.
Doubling L2 cache and vGPR (vector general purpose registers) sizes would probably help more, but that's expensive in terms of die area, so architects and engineers work towards efficiency of rendering pipelines and extensive reuse of data.
- Though the recent patent filing for AMD's Super SIMD also drastically increases vGPR efficiency and reduces register pressure overall (a weak point for GCN).
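To show why vGPR size matters, here is a rough occupancy sketch in Python. It assumes the commonly cited GCN figures of a 256-register vGPR budget per SIMD lane, at most 10 resident wavefronts per SIMD, and registers allocated in groups of 4. The doubled 512-register budget is purely hypothetical, and the function name is my own.

```python
import math


def waves_per_simd(vgprs_per_thread: int,
                   vgpr_budget: int = 256,  # assumed per-lane vGPR budget on GCN
                   max_waves: int = 10,     # assumed hardware cap per SIMD
                   granularity: int = 4) -> int:
    """Resident wavefronts per SIMD when vGPR usage is the only limiter."""
    if vgprs_per_thread <= 0:
        raise ValueError("vgprs_per_thread must be positive")
    # Round the request up to the allocation granularity.
    allocated = math.ceil(vgprs_per_thread / granularity) * granularity
    return min(max_waves, vgpr_budget // allocated)


if __name__ == "__main__":
    for vgprs in (24, 64, 84, 128):
        today = waves_per_simd(vgprs)
        doubled = waves_per_simd(vgprs, vgpr_budget=512)  # hypothetical doubling
        print(f"{vgprs:>3} vGPRs/thread: {today} waves now, {doubled} with 2x vGPRs")
```

This ignores sGPR and LDS limits, so real occupancy can be lower. The point is only that a register-heavy shader leaves few wavefronts in flight to hide latency, and that a bigger register file (or more efficient use of the existing one) raises that number.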
https://www.reddit.com/r/Amd/comments/9lltgj/64_rops_is_a_bottleneck_for_vega/
I swear I saw some kind of image on the Beyond3D forums where a proposed design showed a 96 ROP Vega with 6 ACEs and a reduced number of SPs grouped under each ACE, around 768 SPs each (4608 SPs total on the GPU).
The proposal, if I recall the design (which had some images), would address Vega's apparent failure to fully load all of its SPs, based on data collected from Vega 56 and Polaris on how well they performed, along with what appears to be a bottleneck in the ACEs themselves, and would also raise the back-end ROP count. They figured that with a 7nm shrink and these changes, overall die size would still be about 20% smaller than the current one, and the chip could potentially be a straight 50% faster than the current Vega, at LEAST in the areas where the ROPs and ACEs were lacking. But this presumed that the ACEs were the root cause of the SPs not loading fully at all times, and that the lack of ROPs was causing the drop in performance in the tasks that depend on them. They also suggested that it would probably be extremely helpful for ray tracing... but I can't for the life of me find the post. Perhaps it wasn't on Beyond3D, but I thought it was.