Interesting article on Reddit, summarising a YouTube video on why we haven't seen the full performance of Vega yet:
https://www.reddit.com/r/Amd/comments/6rm3vy/vega_is_better_than_you_think/
https://www.youtube.com/watch?v=-G1nOztqWm0
*snip*
Primarily that it supports 2X Thread Throughput. This might seem minor, but I'm not sure people quite grasped (NVIDIA did, because they got the GTX 1080 Ti and Titan Xp out to market ASAP following the official announcement of said features) is this actually is perhaps THE most remarkable aspect of the Architecture. So... what does this mean?
In essence the ACE on GCN 1.0 to 4.0 has 4 Pipelines, each is 128-Bit Wide. This means it processes 64-Bit on the Rising Edge, and 64-Bit on the Falling Edge of a Clock Cycle. Now each CU (64 Stream Processors) is actually 16 SIMD (Single Instruction, Multiple Data / Arithmetic Logic Units) each SIMD Supports a Single 128-Bit Vector (4x 32-Bit Components, i.e. [X, Y, Z, W]) and because you can process each individual Component ... this is why it's denoted as 64 "Stream" Processors, because 4x16 = 64.
As I note, the ACE has 4 Pipelines that Process, 4x128-Bit Threads Per Clock. The Minimum Operation Time is 4 Clocks ... as such 4x4 = 16x 128-Bit Asynchronous Operations Per Clock (or 64x 32-Bit Operations Per Clock)
GCN 5.0 still has the same 4 Pipelines, but each is now 256-Bit Wide. This means it processes 128-Bit on the Riding Edge, and 128-Bit on the Falling Edge. Each CU is also now 16 SIMD that support a Single 256-Bit Vector or Double 128-Bit Vector or Quad 64-Bit Vector (4x 64-Bit, 8x 32-Bit, 16x 16-Bit).
*snip*