[Mahigan:]
Yes... Async helps them achieve what is in this slide...
Latency gets hidden by overlapping the execution of wavefronts. That's why GCN keeps the same latency as you throw more and more kernels at it. GCN is far more parallel than competing architectures. I wouldn't say it is faster; it's just able to take on far more computational work (threads) at any given time.
If you throw too much work at Maxwell/Maxwell 2, it begins to bottleneck. We see this as the staircase effect on NVIDIA's architecture in Beyond3D's graphs. So while Maxwell 2 can compute a kernel containing 32 threads in 25ms, GCN can compute a kernel containing 64 threads (twice the commands) in 38-50ms. The problem is that if you throw a kernel containing only 32 threads at GCN, it will take the same 38-50ms. This is the result Beyond3D is getting, and it is why some there (Jawed, for example) conclude that Maxwell 2 is so superior at compute.
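To make the staircase point concrete, here is a rough sketch of the two latency curves using only the numbers quoted above. The function names and the "one step per 32 threads" model for Maxwell 2 are my own simplification for illustration, not something measured; the GCN side is only modelled up to 64 threads because that is all the figures above cover.

```python
import math

# Figures quoted in the post (milliseconds).
MAXWELL2_STEP_THREADS = 32       # threads handled per "stair"
MAXWELL2_STEP_MS = 25.0          # time per stair
GCN_MAX_THREADS = 64             # largest kernel the quoted GCN figure covers
GCN_MS_RANGE = (38.0, 50.0)      # flat latency range for GCN


def maxwell2_ms(threads: int) -> float:
    """Staircase model (assumption): one 25ms step per started group of 32 threads."""
    steps = math.ceil(threads / MAXWELL2_STEP_THREADS)
    return steps * MAXWELL2_STEP_MS


def gcn_ms(threads: int) -> tuple[float, float]:
    """Flat model: the same 38-50ms whether the kernel has 32 or 64 threads."""
    if not 0 < threads <= GCN_MAX_THREADS:
        raise ValueError("model only covers kernels of 1 to 64 threads")
    return GCN_MS_RANGE


for n in (32, 64):
    print(n, "threads:", "Maxwell 2", maxwell2_ms(n), "ms, GCN", gcn_ms(n), "ms")
```

At 32 threads Maxwell 2 looks clearly ahead (25ms vs 38-50ms); at 64 threads, under this model, it has already climbed to 50ms while GCN has not moved.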
If you add Async to the mix, you have that same 64-thread kernel taking 38-50ms on GCN with a graphics task running in parallel. So if we do the math, Maxwell 2 would take 50ms to handle a kernel with 64 threads plus the 8-12ms it takes to handle the graphics task, so 58-62ms in total against GCN's 38-50ms.
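The same math as a small sketch, for anyone who wants to plug in their own numbers. It assumes (as the argument above does) that Maxwell 2 runs the compute kernel and the graphics task back to back, that GCN overlaps them completely, and that the graphics task costs the same 8-12ms on both. The helper names are made up for this example.

```python
Range = tuple[float, float]  # (best case ms, worst case ms)


def serial_total(compute: Range, graphics: Range) -> Range:
    """No async: the graphics task waits for compute, so the times add up."""
    return (compute[0] + graphics[0], compute[1] + graphics[1])


def overlapped_total(compute: Range, graphics: Range) -> Range:
    """Ideal async (assumption): both run at once, the longer task sets the total."""
    return (max(compute[0], graphics[0]), max(compute[1], graphics[1]))


graphics = (8.0, 12.0)           # graphics task, figure from the post
maxwell2_compute = (50.0, 50.0)  # 64 threads = two 25ms steps
gcn_compute = (38.0, 50.0)       # 64-thread kernel on GCN

print("Maxwell 2, serial:", serial_total(maxwell2_compute, graphics))
print("GCN, async:", overlapped_total(gcn_compute, graphics))
```

Under those assumptions this gives 58-62ms for Maxwell 2 and 38-50ms for GCN. Real hardware will not overlap perfectly, so treat the GCN figure as a best case.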
I think the Beyond3D folks are CUDA programmers; if that's true, you can't fault them for not knowing.
At the end of this, Beyond3D will likely conclude that Oxide did something wrong when, in fact, the mistake is in Beyond3D's own tests.