
GTX 1060 Vs RX 480 - head to head showdown

I get that. You submit a request, get a callback. All that you said is correct and so is the link. From a client point of view it's all about asking for something and moving on to do other things while that something gets done. You can do it from a single thread like you said, or from multiple threads (if you are already multi-threaded).

But I'm not talking about the client side of things.

I'm referring to the driver+hardware on the other side that actually runs asynchronously.

AMD can dispatch work from driver to hardware in parallel, whereas NVidia cannot: it can only interrupt one thing for another very fast (in Pascal).

So that's where the steady 5-10% in favour of AMD comes from.

These are two legitimate ways to implement an async service. Both are valid. One just generally runs faster than the other in most cases (not all).


" whereas NVidia cannot: it can only interrupt one thing for another very fast (in Pascal)." This is just plain wrong, on Pascal (and Maxwell) compute and graphic work is being executed in parallel, and it is trivial to see that, e.g. look at Futuremarks GPUs analysis:

http://www.futuremark.com/pressreleases/a-closer-look-at-asynchronous-compute-in-3dmark-time-spy
Above is a corresponding trace from an NVIDIA GTX 1080. As can be seen the general structures resemble those which are found on AMD Radeon Fury, albeit with extra queues that do not originate from the engine and which contain only synchronization items. From this image we can see that the GTX 1080 has an additional compute queue which accepts packets in parallel with the 3D queue.


Time Spy doesn't enable asynchronous multi-engine support for Maxwell, but there are other benchmarks out there where you can achieve similar DX12 multi-engine support. The problem with Maxwell is the gains are smaller and the architecture is much more sensitive to the developer correctly balancing loads.
 
I wouldn't be buying either card for a couple of months until we've seen how drivers change the DX12 / Vulkan gap (if at all).

My gut says that DX12 will be important, with AMD in the consoles, so Nvidia will have to work to improve.
 
Has this been posted yet? It explains AMD's massive gain from Vulkan. Fury X thrashes the 1070, the RX 480 is close to the 980 Ti; it seems a lot of reviewers used an AA setting that turns async off.

 
Give over, Simon. Whilst D.P. might not always be correct, he knows his stuff, and far more than you and me put together.

Well, I can verify that his interpretation of asynchronous is definitely correct.

Basically, when programming, there is the concept of a 'thread' which is the sequence of steps that are executed. Your code runs in a thread meaning that you lay out the steps and the CPU goes through them one by one.

Code makes use of libraries through APIs (like Vulkan/DX11/DX12). And when you reach a step that invokes the API, you basically hand control of the next steps over to the library. It looks something like this:

- step 1 is .... (whatever)
- step 2 is .... (whatever)
- step 3 is ask DX11 to do X
- step 4 is ... (whatever)

The thing to notice here is that when the CPU gets to step 3, it enters a series of steps that the DX11 library provides (which may be 100 steps) and only comes back to step 4 of your code when all those are done. This is SYNCHRONOUS. You wait until step 3 is done before going to step 4. When you get to step 4, whatever work you asked of DX11 is done already and you can see the results.
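If it helps, here is a minimal C++ sketch of that synchronous shape. The function name is just a placeholder for 'ask DX11 to do X', not a real DX11 call:

```cpp
#include <iostream>

// Placeholder standing in for "ask DX11 to do X": the caller is stuck
// here until the library's internal steps have all finished.
int do_x_synchronously() {
    // ... imagine the library's ~100 internal steps running here ...
    return 42;
}

int main() {
    // step 3: ask the library to do X and wait for it
    int result = do_x_synchronously();
    // step 4: by the time we get here, X is already done
    std::cout << "X = " << result << '\n';
    return 0;
}
```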

Now, an asynchronous API works differently in that you can ask for things and check results later when they are ready. So it looks like this:

- step 1 is .... (whatever)
- step 2 is .... (whatever)
- step 3 is ask DX12 'get started on X, I will check back with you later'
- step 4 is ... (whatever)

The difference here (and this is a single thread of steps) is that when you get to step 4, X is just 'in progress'. You can do other things while X is being worked on and then get the results. The simplest way is something called 'polling', where you explicitly inquire, but there's also the concept of 'callbacks', which is what is often used in JavaScript in web browsers. But anyway, let's take the simple case of polling; it looks like this (a rough code sketch follows the list):

- step 1 is .... (whatever)
- step 2 is .... (whatever)
- step 3 is ask DX12 'get started on X, I will check back with you later'
- step 4 is ... (whatever)
- step 5 is ... (whatever)
- step 6 is ask DX12 'give me the result of X, wait if you have to'
- step 7 is ... (use result of X as needed)
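Here is a minimal C++ sketch of that polling pattern, using std::async/std::future as a stand-in for the API (it's not real DX12 code, just the same ask-now, collect-later shape):

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

// Placeholder for "X": some work the library does on our behalf.
int do_x() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return 42;
}

int main() {
    // step 3: kick X off and keep a handle we can check later
    std::future<int> x = std::async(std::launch::async, do_x);

    // steps 4 and 5: do other things while X is in progress,
    // polling now and then to see whether it has finished
    while (x.wait_for(std::chrono::milliseconds(0)) != std::future_status::ready) {
        // ... other work goes here ...
    }

    // steps 6 and 7: collect and use the result (get() would block
    // here if X weren't already done)
    std::cout << "X = " << x.get() << '\n';
    return 0;
}
```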

The problem is that we're NOT talking about the async API which I described above. We're talking about how well AMD/NVidia implement it. See, under the hood (between steps 3 and 6) the DX12 drivers talk to the cards in order to get X done. Meanwhile, you can ask for more things through DX12 in steps 4 and 5 and the drivers/cards must get to those as well.

That's where I've so far been under the impression that the AMD cards and driver can talk via multiple channels, thus being able to do things in parallel (as in at the same time, like when one CPU core is doing X and another is doing Y), whereas NVidia cards/drivers cannot and instead use preemption (as in the driver can say 'pause X, I need you to do Y urgently').

This ability becomes even more important when the client code has multiple threads. So multiple CPU cores are talking via DX12 to the card and asking for things.
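As a rough illustration of that multi-threaded case (again plain C++ with std::async standing in for the API, not real DX12), several requests can be put in flight and collected independently; in a real engine those submissions might come from different CPU threads, and whether they get serviced in parallel or one after another on the other side is exactly what the AMD vs NVidia argument is about:

```cpp
#include <future>
#include <iostream>
#include <vector>

// Placeholder unit of work, standing in for "ask DX12 for something".
int do_work(int id) {
    return id * id;
}

int main() {
    // Several requests are put in flight; the caller doesn't care
    // whether they run side by side or one after another.
    std::vector<std::future<int>> pending;
    for (int id = 0; id < 4; ++id) {
        pending.push_back(std::async(std::launch::async, do_work, id));
    }

    // Collect the results as they become available.
    for (auto& f : pending) {
        std::cout << f.get() << '\n';
    }
    return 0;
}
```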

I'll read more into the material fs123 posted tonight as it seems like a nice resource.
 

Right, the difference is in hardware: AMD have a series of command processors on the GPU (ACE units) communicating independently, in parallel, with the API.
It's a bit like a multi-core CPU, applied to the GCN architecture.

Nvidia don't have that; instead they organise the instructions into a series of virtual threads, a bit like hyper-threading on an i7.
 
That's where I've so far been under the impression that the AMD cards and driver can talk via multiple channels, thus being able to do things in parallel (as in at the same time, like when one CPU core is doing X and another is doing Y), whereas NVidia cards/drivers cannot and instead use preemption (as in the driver can say 'pause X, I need you to do Y urgently').

This ability becomes even more important when the client code has multiple threads. So multiple CPU cores are talking via DX12 to the card and asking for things.

I'll read more into the material fs123 posted tonight as it seems like a nice resource.


You still don't understand preemption though. Even AMD GCN has to rely on preemption; DX12 multi-engine just wouldn't work without it. The difference between Maxwell and GCN was that preemption could be expensive on Maxwell if not well tuned. That changes with Pascal, where preemption is much finer-grained, which is why Pascal sees a big speed-up in DX12 async-enabled scenarios like Time Spy.



You seem to be under the impression that Nvidia cards aren't running compute and graphics work in parallel. You're wrong, it is as simple as that.

Here is a useful resource on Maxwell vs GCN that clearly spells out the pros and cons of each solution:
http://ext3h.makegames.de/DX12_Compute.html


For Pascal things are very different though.
 

WHO CARES ABOUT TECHNICALITIES? Whatever it is, and it could be magic rainbow unicorn dust, AMD cards have it and Nvidia don't, and the longer it goes on, the less likely it seems that they will be getting it.
 

That really is the lamest argument I have ever seen; you're relying on people not understanding anything you're talking about.

Of course DX12 will organise and prioritise instructions into threads (pre-emption).

The receiving and propagating end, the GPU, is what's entirely different: Nvidia's GPU still only has the same single processor in hardware, whereas AMD's has 8 in the case of Fury and 4 on the RX 480.

Nvidia's task here, in layman's terms, is to take a single-core CPU and put something between it and the API to emulate multiple threads for it.

That presents several problems: for one, it might not even be possible, but even if it is, Nvidia are relying on emulation at a higher level to pull it off, and at the end of it the whole system is still limited by its weakest link; it still only has one dude crunching the numbers at the end of the line.

AMD have 8 and there is nothing between them and the API; it's API directly to metal x8 vs API to middleman to x1.
 
The receiving and propagating end, the GPU, is what's entirely different: Nvidia's GPU still only has the same single processor in hardware, whereas AMD's has 8 in the case of Fury and 4 on the RX 480.

The RX 480 has 4 ACEs and 2 HWS. Each HWS can do the work of 2-3 ACEs on the RX 480. Fiji is also the same configuration, although I believe the HWS modules on GCN3 can only do the work of 2 ACEs.
 

Ok I've read through all this and I see how even GCN uses preemption. But still there is no mention of Pascal.

However, even in the material above the important limiting factor seems to be (after all is said and done) that Maxwell can't do graphics and compute at the same time. When a job finishes you get idle SMs and these stay idle (and can't switch to say another compute job). And this is entirely a hardware limitation. It has nothing to do with the async API or even the driver. It's really a matter of architecture capability.

If Pascal is the same, it doesn't matter how fast it can perform a context switch (and even so, I doubt it can be faster than GCN's single-cycle speed).
 
So, just finished work, read the reviews, and I find it shocking that a card with a 192-bit memory bus is able to beat out the RX 480. NVidia really have done well. If the price was a bit lower it would be a steal, but I guess with nothing to compete with the 1060, NVidia can charge what they want :(
 
Well that doesn't really ring true when you look at the actual resource being consumed here - power.

I know you mean brute force in terms of algorithms, etc.

But whilst a big V8 might burn a lot of fuel compared to a 1.6, it's actually AMD who are using more juice than nVidia to achieve the performance.

So not quite such a good analogy, eh? :p

Car Analogies never work!! :)
 
Ok I've read through all this and I see how even GCN uses preemption. But still there is no mention of Pascal.

However, even in the material above the important limiting factor seems to be (after all is said and done) that Maxwell can't do graphics and compute at the same time. When a job finishes you get idle SMs and these stay idle (and can't switch to say another compute job). And this is entirely a hardware limitation. It has nothing to do with the async API or even the driver. It's really a matter of architecture capability.

If Pascal is the same, it doesn't matter how fast it can perform a context switch (and even so, I doubt it can be faster than GCN's single-cycle speed).

As I said, that article predates Pascal; for Pascal you can read the Futuremark article:
http://www.futuremark.com/pressreleases/a-closer-look-at-asynchronous-compute-in-3dmark-time-spy


Above is a corresponding trace from an NVIDIA GTX 1080. As can be seen the general structures resemble those which are found on AMD Radeon Fury, albeit with extra queues that do not originate from the engine and which contain only synchronization items. From this image we can see that the GTX 1080 has an additional compute queue which accepts packets in parallel with the 3D queue.
 
WHO CARES ABOUT TECHNICALITIES? Whatever it is, and it could be magic rainbow unicorn dust, AMD cards have it and Nvidia don't, and the longer it goes on, the less likely it seems that they will be getting it.
Have what? If you are ignoring technicalities then you can make up any rubbish you want, like the content of most of your posts.
 
While I find these technical debates fascinating, I don't really learn anything from them.

It's pretty clear from discussions like this one on various forums, that nobody really knows for sure what is happening. And people are really only guessing, educated guesses, but still guesses.
 
That really is the lamest argument I have ever seen; you're relying on people not understanding anything you're talking about.

Of course DX12 will organise and prioritise instructions into threads (pre-emption).

The receiving and propagating end, the GPU, is what's entirely different: Nvidia's GPU still only has the same single processor in hardware, whereas AMD's has 8 in the case of Fury and 4 on the RX 480.

Nvidia's task here, in layman's terms, is to take a single-core CPU and put something between it and the API to emulate multiple threads for it.

That presents several problems: for one, it might not even be possible, but even if it is, Nvidia are relying on emulation at a higher level to pull it off, and at the end of it the whole system is still limited by its weakest link; it still only has one dude crunching the numbers at the end of the line.

AMD have 8 and there is nothing between them and the API; it's API directly to metal x8 vs API to middleman to x1.


The 8 ACE units on GCN 1.2 are entirely removed from the argument. That is a specific hardware implementation of DX12 multi-engine queues, but there is absolutely no requirement to emulate that specific approach. You don't need multiple ACEs to have compute, copy and graphics queues that do work in parallel. Maxwell has 4 AWS (Asynchronous Warp Schedulers) per SMM/shader engine, which are largely responsible for the same work. Work distribution is done in software compared to GCN; this has some pros (flexibility, saved transistor budget) and cons (reduced balancing, more driver optimisation).


Neither of these is a right or wrong architecture; they are both fully DX12 compliant, both provide performance increases if used correctly, and both can be detrimental if used poorly.


We now have a vendor-neutral, unbiased benchmark where we can compare the two different approaches. Time Spy shows both Pascal and Polaris/Fiji getting a small performance boost from using DX12 multi-engine, exactly as expected.
 
While I find these technical debates fascinating, I don't really learn anything from them.

It's pretty clear from discussions like this one on various forums, that nobody really knows for sure what is happening. And people are really only guessing, educated guesses, but still guesses.

Not fully true. AMD and NVidia have released information, and they have provided more detailed information to developers. The problem is these are complex concepts that most people don't understand, so instead they go off some weird headline like "Only AMD have true async" or some such nonsense.

Really, the only thing that consumers should care about is in-game performance. Sadly there have been some very poor DX12 games released. Now with Time Spy there is a standard, unbiased benchmark that can be viewed fairly. Hopefully we will start seeing more games come out that aren't heavily sponsored by one vendor or the other.
 
While I find these technical debates fascinating, I don't really learn anything from them.

It's pretty clear from discussions like this one on various forums, that nobody really knows for sure what is happening. And people are really only guessing, educated guesses, but still guesses.

Some people do know their "technicalities" but that's not what will swing the majority of buyers. Price and actual in-game benchmarks with custom versions will, imo.
 