GDC: Async Compute - What Nvidia says

"While this possibility was provided for in the AMD GPU for more than 4 years, they have not been able to get commercial profit, they have not been sold more expensive for the cause."

Isn't Async Compute a DX12 feature? Surely that's why it hasn't been used until now?
 
Ultimately what matters is performance for consumers/gamers, and how easily developers can get that performance. That is Nvidia's main point: if they can sell a GPU that is faster than the competition without needing to use Async, then what does it matter? In fact, if Nvidia can achieve that, then supporting Async may be an own goal, as it would make it more attractive for developers and AMD would see bigger gains. So the question is: is Async required to get the best possible performance from a GPU, and is the transistor budget justified?

The answer is probably yes: Async is an overall advantage, but it isn't quite so black and white, and its time to really shine is likely in the near future rather than right now. As GPUs add more and more shaders, more and more compute resources may be left under-utilised because it gets harder to keep the pipeline fully busy, and as transistor counts increase, the relative transistor cost of more dedicated async hardware likely shrinks.
 
To me that sounds like marketing speak for "AMD can do Async Compute but we can't."

Everything Nvidia is saying about Async Compute at the moment suggests they are doing their utmost not to say the very sentence I have just quoted from you. They just will not admit to it at all.

With the Maxwell architecture having been stripped of its Async Compute hardware for the sake of efficiency, I am now starting to wonder whether Async Compute was kept out of the earlier iterations of DirectX by Nvidia using its market share to influence the situation.

If AMD designed all their cards since 2011 to make use of it, then possibly it was on Microsoft's roadmap for them to do so. Maybe it was then sidelined until DX12.

If (and these are big ifs) Async Compute is not part of the Pascal hardware, and if DX12 and Async Compute are used a lot over the next few years, then I think the chickens may come home to roost for Nvidia with this new generation of cards. If that is the case, then Volta will come sooner than expected.

This is of course mere speculation from the mind of someone who loves conspiracy theories. :D
 
With the Maxwell architecture having been stripped of its Async Compute hardware for the sake of efficiency, I am now starting to wonder whether Async Compute was kept out of the earlier iterations of DirectX by Nvidia using its market share to influence the situation.

Async was never stripped out of Maxwell; prior to GK110 it wasn't even possible to dispatch commands to the rendering and compute hardware simultaneously. That ability has been part of an evolution up to second-generation Maxwell. If you put it that way, second-generation Maxwell has "more" async compute than ever before.

If (and these are big ifs) Async Compute is not part of the Pascal hardware, and if DX12 and Async Compute are used a lot over the next few years, then I think the chickens may come home to roost for Nvidia with this new generation of cards. If that is the case, then Volta will come sooner than expected.

Pascal is perfectly capable of simultaneous graphics and compute command queues. The complication is that on AMD's architectures you can more easily dispatch mixed/complex compute workloads alongside graphics indiscriminately (AMD have always been a bit ahead of nVidia in terms of ILP/out-of-order processing, but until now it has largely been a pointless advantage), whereas on nVidia the developer has to pay more attention to how they load up the compute queues and do more hands-on management of them. (Less so on Pascal, but Maxwell sometimes has to perform context switching because of the aforementioned deficiencies, and if that isn't managed properly it can carry a performance penalty.)
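
Just to put rough numbers on that context-switching point (a toy model with made-up figures, not how any real driver or GPU actually schedules work): if graphics and compute genuinely run concurrently, the frame costs roughly the longer of the two workloads; if the GPU has to switch contexts and run them back to back, you pay the sum of both plus the switch overhead.

Code:
# Toy model: concurrent vs serialized graphics + compute work.
# All numbers are invented purely for illustration.
GRAPHICS_MS = 10.0   # graphics queue work this frame
COMPUTE_MS = 4.0     # async compute work this frame
SWITCH_MS = 0.5      # hypothetical cost per context switch

# Hardware that can run both queues at the same time:
concurrent_frame = max(GRAPHICS_MS, COMPUTE_MS)               # 10.0 ms

# Hardware that has to context switch and run the work back to back:
serialized_frame = GRAPHICS_MS + COMPUTE_MS + 2 * SWITCH_MS   # 15.0 ms

print(concurrent_frame, serialized_frame)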
 
Ultimately what matters is performance for consumers/gamers, and how easily developers can get that performance. That is Nvidia's main point: if they can sell a GPU that is faster than the competition without needing to use Async, then what does it matter? In fact, if Nvidia can achieve that, then supporting Async may be an own goal, as it would make it more attractive for developers and AMD would see bigger gains. So the question is: is Async required to get the best possible performance from a GPU, and is the transistor budget justified?

The answer is probably yes: Async is an overall advantage, but it isn't quite so black and white, and its time to really shine is likely in the near future rather than right now. As GPUs add more and more shaders, more and more compute resources may be left under-utilised because it gets harder to keep the pipeline fully busy, and as transistor counts increase, the relative transistor cost of more dedicated async hardware likely shrinks.

^ If what IO Interactive were saying is true, and getting the most from it needs tuning on a per-GPU basis, then it seems most likely that cross-platform developers will use it and fine-tune it on consoles (a single configuration with a large install base) but not really bother fine-tuning it for PC GPUs. So AMD will get a bit of a boost off the back of console ports, but still won't reach their full potential.

By comparison, Nvidia will probably do what they usually do and just brute-force their way through it until they can be bothered to support it properly.
 
The way I read Nvidia's response is that they're saying Async Compute is useful for AMD because their rendering process has a lot of gaps (bubbles where hardware sits under-utilised), and Asynchronous Compute benefits AMD because it's a way of filling those bubbles with useful work; but that it's not a big deal for Nvidia because their pipeline is already firing on all cylinders and they don't have many bubbles to fill.
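
A quick back-of-the-envelope way to picture that bubble-filling argument (the percentages are completely made up, just to show the shape of it): the more idle shader time a GPU's graphics workload leaves behind, the more async compute has to gain by soaking it up.

Code:
# Rough illustration of why "filling bubbles" helps one GPU more than another.
# All figures are invented for the example.

def effective_utilization(base_util, fill_fraction):
    # base_util: fraction of shader time the graphics workload already uses.
    # fill_fraction: how much of the remaining idle time async compute fills.
    idle = 1.0 - base_util
    return base_util + idle * fill_fraction

# GPU whose graphics pipeline leaves a lot of bubbles:
print(round(effective_utilization(0.70, 0.6), 3))   # roughly 0.88

# GPU that is already close to fully busy:
print(round(effective_utilization(0.92, 0.6), 3))   # roughly 0.968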

http://media.bestofmicro.com/D/G/561652/gallery/AsyncCompute_On_Off_w_600.png :D
 
Usual Nvidia tactics, tbh. Also, I wouldn't hold my breath waiting for the first generation of Pascal to support Async.
Another reason Nvidia keeps pushing people to upgrade is their abysmal drivers and performance on older hardware at the moment.

Just two and a half years later, the R9 290 (not the X) at STOCK speeds beats the crap out of the 780 Ti, the TB and the 980, both in DX11 and DX12... (without async)

While the 380X is just a couple of FPS slower than the 970.

Explains a lot.

Thanks for making me chuckle. :D :D


Good for you and glad to see you happy that AMD are winning. What made me laugh was the way you posted, like you had just won Olympic gold lol

This is great; you have to love cherry-picked benchmarks to prove a point.

So here's one for you: the 970 is nearly 50% faster than the 380X, and the 780 Ti is nearly 30% faster than the 290X. (Edit: yes, that is meant to be the 290 non-X :o)

[Attached image: 1benchmark.jpg]

So yes, sarcasm is indeed a wonderful thing when you can back it up, and this one is even in English. :D:p:D


Disclaimer: this post is not to be taken seriously; it is just to show that pretty much anything can be proven with benchmarks if you want it to be.
 
Correct me if I'm wrong, but that actually supports what Nvidia said, does it not?

It may have been true at one point, but it changes over time. I've made a post here about async. I'm no expert, so I don't know whether a hypothetical Kepler could overcome "improper" code by using some sort of async compute, but talking about the actual Kepler: that architecture was quite fast and well optimised for its time, meaning nVIDIA was right at that time. Now, however...:

As for Kepler, GCN and Maxwell...

It has to do with compute utilization...


Just like Kepler's SMX, each of Maxwell's SMMs has four warp schedulers, but what changed between SMX and SMM is that each SMM's CUDA cores are assigned to a particular scheduler, so there are fewer shared units. This simplifies scheduling, as each of the SMM's warp schedulers issues to a dedicated set of CUDA cores equal to the warp width (warps are 32 threads wide, and each scheduler issues its warps to 32 CUDA cores). You can still dual-issue, like with Kepler, but a single issue already results in full CUDA core utilisation. This means you have fewer idling CUDA cores. (There's also the dedicated 64KB per SM on Maxwell versus Kepler's 16KB + 48KB design.)

Why is this important? Because console titles are being optimized for GCN. Optimizing for GCN means using wavefronts (not warps). Wavefronts are 64 threads wide (mapping directly onto two warps). Since a Maxwell SMM is composed of four 32-CUDA-core partitions, a single wavefront occupies two of those partitions (half an SMM). With Kepler you had 192 CUDA cores per SMX; try mapping wavefronts onto that and you need three of them to fill it. So if you only have a single wavefront, you're utilizing 50% of a Maxwell SMM but only 33.3% of a Kepler SMX. That's a lot of unused compute resources.

With NVIDIA's architecture, only kernels belonging to the same program can be executed on the same SM. So with SMX, that's 66.6% of the compute resources going unused. That's a huge loss.
If a program is optimized for GCN, then the work groups will be divisible in increments of 64 (matching a wavefront).

If a program is optimized for Kepler/Maxwell, then the work groups will be divisible in increments of 32 (matching a warp).

Prior to the arrival of GCN-based consoles, developers would map their work groups in increments of 32. This left part of every GCN compute unit idling and under-utilised.
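
Those utilisation figures are easy to sanity check with a few lines of Python (the SM sizes are the ones quoted above; the rest is just arithmetic):

Code:
# Sanity check of the warp/wavefront occupancy figures above.
WAVEFRONT = 64     # GCN wavefront width (= two 32-thread warps)

SMX_CORES = 192    # Kepler SMX: 192 CUDA cores
SMM_CORES = 128    # Maxwell SMM: 4 partitions of 32 CUDA cores

def utilization(threads, sm_cores):
    # Fraction of an SM's CUDA cores a single work item keeps busy.
    return threads / sm_cores

print(f"1 wavefront on SMM: {utilization(WAVEFRONT, SMM_CORES):.1%}")  # 50.0%
print(f"1 wavefront on SMX: {utilization(WAVEFRONT, SMX_CORES):.1%}")  # 33.3%

print(SMX_CORES // WAVEFRONT)   # 3 wavefronts to fill a Kepler SMX
print(SMM_CORES // WAVEFRONT)   # 2 wavefronts to fill a Maxwell SMM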

And as you can see, Kepler has lost ground in newer games while GCN has moved out in front. GCN is more of a general-purpose architecture, whereas nVIDIA tweak theirs here and there for whatever applications are on the market at the time.

Not sure what the future will bring, but GCN is a monster. In raw compute performance, an R9 290X is equal to a 980 Ti. If we had AI and physics done on the GPU, or other effects in compute, the situation could be a lot different, and it still can be, depending on what future games bring, but I won't hold my breath over that.

Also, using async compute, I think GCN is better in VR due to lower latency.
 
Also, using async compute, I think GCN is better in VR due to lower latency.

The VR thing was about preemption for asynchronous timewarp, NOT anything to do with asynchronous compute.

According to the latest Oculus blog, ATW is now working well on both vendors' hardware.

If anything, the forums for some of the VR games are currently saying AMD cards are taking a big nosedive with launch titles like Elite.
 
If it has been fixed on Nvidia, cool. However, on AMD, VR uses async compute to work properly, if I'm not misunderstanding this:

https://techreport.com/news/29917/radeon-software-crimson-edition-16-3-2-is-rift-ready

Before today, AMD says a high-priority task on a Radeon graphics card would have to request GPU resources using preemption. In that case, the GPU would temporarily suspend its other work, process the interruption, and return to its regular workload. AMD says preemption is a sub-optimal approach for time-sensitive tasks, since it doesn't provide any guarantee that a task will start and end within a given time frame. The company says that task-switching overhead and other delays associated with this method could also manifest as stuttering or lag in an application.

The Quick Response Queue, on the other hand, gives developers a special asynchronous queue where tasks can get preferential treatment from the GPU while the chip continues to perform other work. AMD says that since GCN asynchronous compute engines are both programmable and manage scheduling in hardware, it can enable the Quick Response Queue with nothing more than a software update on second-generation GCN GPUs and newer.

As an example of what this queue can do, AMD notes that Oculus implemented the asynchronous timewarp (or ATW) feature in version 1.3 of its SDK using the Quick Response Queue for AMD hardware. The company claims that using this feature makes it more likely that the ATW shader will be able to complete before the next vsync interval, even if it's submitted late in the rendering process. That's important since ATW is meant to reduce immersion-breaking judder in one's VR experience, and a warped frame is better than a dropped one. The company also touts the fact that running the ATW shader asynchronously means that the graphics card can continue to perform other tasks at the same time, like starting work on a new frame.

Also - https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved

Modern rendering engines must execute a large number of individual tasks to generate each visible frame. Each task includes a shader program that runs on the GPU. Normally these tasks are processed sequentially in a fixed order, which is referred to as synchronous execution. Asynchronous shader technology allows more flexibility in terms of the timing and order of execution for independent tasks. When used effectively, the result is better utilization of the GPU, faster frame times, and improved responsiveness.

While the feature is already being employed in games like Ashes of the Singularity and Hitman, there is much more to come. Developers are just starting to experiment with the basic functionality, and the new wave of virtual reality applications starting to appear this year are poised to make great use of it. Meanwhile at AMD we have been working on enhancing the technology with the goal of making it even more powerful.
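
To make the preemption vs Quick Response Queue distinction above a bit more concrete, here's a crude toy of the scheduling order only (my own sketch, nothing like AMD's actual implementation, and on real hardware the ACEs interleave the high-priority work with everything else rather than serialising it): the late ATW request simply jumps the queue instead of having to suspend all other work first.

Code:
import heapq

# Toy priority queue: lower number = higher priority,
# submission order breaks ties so normal work stays FIFO among itself.
queue = []
for order, name in enumerate(["shadow pass", "lighting pass", "post-process"]):
    heapq.heappush(queue, (1, order, name))   # normal-priority frame work

# A late asynchronous timewarp request arrives just before vsync:
heapq.heappush(queue, (0, 99, "ATW shader"))  # high priority

while queue:
    _, _, task = heapq.heappop(queue)
    print(task)
# prints: ATW shader, shadow pass, lighting pass, post-process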

So async compute is important at least for AMD, and I doubt developers will leave one IHV's player base out in the cold. And since you have to put async in anyway, it means you can boost performance on a lot of cards with a simple driver update.
With Elite it doesn't surprise me, it's an nVIDIA game. :p

PS: Does VR work properly on anything Nvidia has outside of Maxwell v2?
 