AMD Polaris architecture – GCN 4.0

Fix your new engine, it's a mess!

I think they lost a couple of their better engine devs to id.

Originally, Oxide said the Nvidia drivers were advertising that async compute was supported, but when the game code executed the async codepath there were severe performance problems.
Rather dubious on Nvidia's part to fake async support in this way, since drivers can be made to report support for any feature even if it's not actually supported at all. Programs like GPU-Z, Hardwareinfo, etc. would show the card as supporting a feature by interrogating the driver only.
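As a rough illustration (my own sketch, not anything Oxide or the tool vendors published), this is the level at which such tools and applications see capabilities in D3D12 - there isn't even an explicit async compute cap bit, only whatever the driver chooses to report:

// Illustrative sketch only: reading driver-reported capabilities in D3D12.
// Tools like GPU-Z only see what the driver reports here; whether async
// compute actually overlaps work can only be observed by timing real workloads.
// Build against d3d12.lib on Windows 10.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdio>
using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<ID3D12Device> device;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0,
                                 IID_PPV_ARGS(&device)))) {
        std::puts("No D3D12 device available");
        return 1;
    }

    // Note: there is no "async compute supported" bit; these are the generic
    // option tiers the driver reports for the device.
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS,
                                &options, sizeof(options));
    std::printf("Resource binding tier: %d\n", (int)options.ResourceBindingTier);
    std::printf("Tiled resources tier:  %d\n", (int)options.TiledResourcesTier);
    return 0;
}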

We will find out soon enough whether the 1080 really does support it or whether it's the same again.

Problem was if they just indiscriminately loaded up the pipeline - an approach that works fine for GCN can, on Maxwell, easily cause stalls within the GPU or even take out Windows itself.
 
Problem was if they just indiscriminately loaded up the pipeline - an approach that works fine for GCN can, on Maxwell, easily cause stalls within the GPU or even take out Windows itself.

If that was really the case then there was no need to disable it altogether for Nvidia. Why not reduce the load instead, since the Maxwell cards are supposedly able to do async similarly to older GCN cards like Tahiti, which have only a few ACE units?
 
It's not that they have not developed a path; it's due to Nvidia requesting it be switched off. There is nothing stopping Nvidia asking for it to be turned on. Nvidia originally requested that the developer Oxide not use asynchronous compute at all. This request was not adhered to.

No, Maxwell parts need a different implementation of async for best performance; Pascal will need a different implementation again.

Async compute is not switched on in AotS for any Nvidia card at this time; that is straight from the developers.
 
If that was really the case then there was no need to disable it altogether for Nvidia. Why not reduce the load instead, since the Maxwell cards are supposedly able to do async similarly to older GCN cards like Tahiti, which have only a few ACE units?

Because the implementation as exposed originally in Maxwell is too easy to break and stall the entire driver out.

Pascal can do it more like the older GCN cards (but only in terms of the effective outcome, not the way in which it operates); 2nd-gen Maxwell is realistically more like a 1+1 pipeline in terms of anything resembling GCN.
 
I think they lost a couple of their better engine devs to id.



Problem was if they just indiscriminately loaded up the pipeline - an approach that works fine for GCN can, on Maxwell, easily cause stalls within the GPU or even take out Windows itself.

A lot of them also went to work for Cloud Imperium Games, Brian Chambers for example.
CryEngine 5 is a complete shambles.
 
No, Maxwell parts need a different implementation of async for best performance; Pascal will need a different implementation again.

Async compute is not switched on in AotS for any Nvidia card at this time; that is straight from the developers.

No. I speak the truth.


An Oxide dev clearly states it was at the request of Nvidia to turn it off.
Might I recommend everybody read this if they are interested in how Nvidia originally reacted to a dev using async compute.
Also, where did I state it was turned on for Nvidia? I never.


http://wccftech.com/oxide-games-dev-replies-ashes-singularity-controversy/
 
Just watched the PCper podcast from last week and they came up with an interesting way of putting the whole ASync compute thing together.

Basically, just like AMD's GCN shaders and Nvidia's CUDA cores arrive at the same end result but get there in different ways, it is the same with async compute. Async compute is a concept, and both companies have decided to go about it in completely different ways: Nvidia's method is a rather brute-force way of doing things whereas AMD's approach has more finesse. The whole thing is not helped at all because the developer has to code their games to use async, and it is not a one-method-suits-all situation.
Their conclusion on the issue was: does it matter? It is the end performance that matters.

Disclaimer: I know this opinion will not be popular with some people, but it is just that, another opinion on the subject.
 
Just watched the PCper podcast from last week and they came up with an interesting way of putting the whole ASync compute thing together.

Basically, just like AMD's GCN shaders and Nvidia's CUDA cores arrive at the same end result but get there in different ways, it is the same with async compute. Async compute is a concept, and both companies have decided to go about it in completely different ways: Nvidia's method is a rather brute-force way of doing things whereas AMD's approach has more finesse. The whole thing is not helped at all because the developer has to code their games to use async, and it is not a one-method-suits-all situation.
Their conclusion on the issue was: does it matter? It is the end performance that matters.

Disclaimer: I know this opinion will not be popular with some people, but it is just that, another opinion on the subject.

Problem is - taking 2nd-gen Maxwell as an example - it can execute 1 graphics + 31 compute queues in parallel, but with limited/no "asynchronicity": simplified, if 2 or more queues rely on each other they have to return to the software scheduler, which can cause massive slowdowns on the overall pipeline. If you just load it up indiscriminately as if it were a generic GCN architecture you don't necessarily get good results; if you load it up understanding what it is, it is a different matter.
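For anyone who hasn't touched the API, a minimal C++/D3D12 sketch of what that multi-engine setup looks like from the application side (function and variable names are just illustrative) - the API only lets you create queues and express dependencies with fences, and says nothing about whether the hardware actually runs them concurrently:

#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Illustrative sketch: one graphics ("direct") queue plus one compute queue,
// with a cross-queue dependency expressed via a fence. How (or whether) the
// two queues actually overlap is decided by the driver/hardware - this is
// where GCN, Maxwell and Pascal behave very differently.
void SetUpMultiEngine(ID3D12Device* device) {
    ComPtr<ID3D12CommandQueue> gfxQueue, computeQueue;
    ComPtr<ID3D12Fence> fence;
    UINT64 fenceValue = 0;

    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;    // graphics + compute + copy
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&gfxQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // compute + copy only
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // ... record command lists and ExecuteCommandLists() on each queue ...

    // Dependency between the queues: compute work submitted after the Wait()
    // cannot start until the graphics queue has passed the Signal().
    gfxQueue->Signal(fence.Get(), ++fenceValue);
    computeQueue->Wait(fence.Get(), fenceValue);   // GPU-side wait, no CPU stall
}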
 
Just watched the PCper podcast from last week and they came up with an interesting way of putting the whole ASync compute thing together.

Basically, just like AMD's GCN shaders and Nvidia's CUDA cores arrive at the same end result but get there in different ways, it is the same with async compute. Async compute is a concept, and both companies have decided to go about it in completely different ways: Nvidia's method is a rather brute-force way of doing things whereas AMD's approach has more finesse. The whole thing is not helped at all because the developer has to code their games to use async, and it is not a one-method-suits-all situation.
Their conclusion on the issue was: does it matter? It is the end performance that matters.

Disclaimer: I know this opinion will not be popular with some people, but it is just that, another opinion on the subject.


It is an accurate description. And as I said before, async compute is a solution to a problem that Nvidia GPUs just don't have to the same extent as later GCN architectures. Nvidia's cards perform close to their theoretical optimum and have high utilization of the compute shaders. AMD GPUs have far more theoretical compute resources that they struggle to fully utilize; async compute helps AMD much more than Nvidia simply because Nvidia GPUs were designed differently, to maximize real-world utilization without some of the front-end bottlenecks that are limiting the Hawaii and Fiji architectures. Just look at the Fiji theoretical FP32 compute performance compared to Maxwell, and then at real game performance. That is why AMD are heavily marketing async compute.

If you look at the DX12 specs, they only describe a mixed multi-engine pipeline, nothing to do with async compute. There are many ways to achieve the DX12 requirements; even the old Fermi architecture meets the specifications. Later GCN architectures definitely have a much more advanced approach that is easier for developers, but then those cards stand to gain much more, so the cost in transistor budget was probably worth it. AMD went along the lines of brute force, with huge compute resources that are hard to fully exploit. As GCN developed it became more and more bottlenecked, and AMD saw value in adding dedicated async scheduling engines in hardware to try to maximize utilization. Nvidia spent more of the transistor budget on removing bottlenecks in command processing, geometry throughput etc. People laughed when they saw the Fiji compute shader count compared to the 980 Ti and we had all these wild claims that the Fury X would be 1.5x faster. The reality was very different.


Pascal's pixel-level and instruction-level preemption combined with the dynamic load balancing should greatly improve multi-engine support and reduce the complexity for developers. Is it the same approach as AMD's? No. Does it need to be? No. Does any of it matter? No. The only thing that matters to the consumer is the ultimate performance: if Pascal is faster than Polaris/Volta using advanced preemption instead of ACE units then Nvidia has the superior solution. If smaller Polaris kicks GP104 in the nads then AMD have a vastly superior solution. Performance is the only metric that matters, not how the transistor budget was used to achieve that performance. This isn't something like fragment shaders or tessellation that is required in hardware to achieve correct rendering; it is merely a way to increase GPU utilization and efficiency. If the GPU doesn't have utilization problems then it's fairly irrelevant.

AMD's approach to DX12 multi-engine is likely a very good approach in future GPUs when we are hitting 6-10,000 shaders.
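To put rough numbers on the utilization point (spec-sheet figures, so treat them as approximate - Fury X: 4096 shaders at ~1050 MHz, GTX 980 Ti: 2816 at ~1000 MHz base):

#include <cstdio>

// Back-of-the-envelope FP32 throughput: shaders x 2 FLOPs (FMA) x clock.
// Spec-sheet inputs, so the outputs are only ballpark figures.
int main() {
    auto tflops = [](double shaders, double clock_ghz) {
        return shaders * 2.0 * clock_ghz / 1000.0;
    };
    std::printf("Fury X     ~%.1f TFLOPS FP32\n", tflops(4096, 1.05)); // ~8.6
    std::printf("GTX 980 Ti ~%.1f TFLOPS FP32\n", tflops(2816, 1.00)); // ~5.6
    // ~50% more theoretical FP32 on Fiji, yet real game performance was close -
    // that gap is the under-utilization async compute is meant to address.
    return 0;
}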
 
One of the problems is AMD have slapped async over the top of a large area of functionality covering a lot of parallel processing, which often means people are arguing about slightly different things while thinking they are talking about the same thing :S
 
It is an accurate description. And as I said before, async compute is a solution to a problem that Nvidia GPUs just don't have to the same extent as later GCN architectures. Nvidia's cards perform close to their theoretical optimum and have high utilization of the compute shaders. AMD GPUs have far more theoretical compute resources that they struggle to fully utilize; async compute helps AMD much more than Nvidia simply because Nvidia GPUs were designed differently, to maximize real-world utilization without some of the front-end bottlenecks that are limiting the Hawaii and Fiji architectures. Just look at the Fiji theoretical FP32 compute performance compared to Maxwell, and then at real game performance. That is why AMD are heavily marketing async compute.

If you look at the DX12 specs, they only describe a mixed multi-engine pipeline, nothing to do with async compute. There are many ways to achieve the DX12 requirements; even the old Fermi architecture meets the specifications. Later GCN architectures definitely have a much more advanced approach that is easier for developers, but then those cards stand to gain much more, so the cost in transistor budget was probably worth it. AMD went along the lines of brute force, with huge compute resources that are hard to fully exploit. As GCN developed it became more and more bottlenecked, and AMD saw value in adding dedicated async scheduling engines in hardware to try to maximize utilization. Nvidia spent more of the transistor budget on removing bottlenecks in command processing, geometry throughput etc. People laughed when they saw the Fiji compute shader count compared to the 980 Ti and we had all these wild claims that the Fury X would be 1.5x faster. The reality was very different.


Pascal's pixel-level and instruction-level preemption combined with the dynamic load balancing should greatly improve multi-engine support and reduce the complexity for developers. Is it the same approach as AMD's? No. Does it need to be? No. Does any of it matter? No. The only thing that matters to the consumer is the ultimate performance: if Pascal is faster than Polaris/Volta using advanced preemption instead of ACE units then Nvidia has the superior solution. If smaller Polaris kicks GP104 in the nads then AMD have a vastly superior solution. Performance is the only metric that matters, not how the transistor budget was used to achieve that performance. This isn't something like fragment shaders or tessellation that is required in hardware to achieve correct rendering; it is merely a way to increase GPU utilization and efficiency. If the GPU doesn't have utilization problems then it's fairly irrelevant.

AMD's approach to DX12 multi-engine is likely a very good approach in future GPUs when we are hitting 6-10,000 shaders.


One thing you say I know is true.
Whichever brand has the best performance has the better solution.

So far Nvidia has a poor solution, and instead of saying it's coming, it's coming, they have to prove it now to avoid what could potentially be millions in lost sales - which is why it's highly likely it's already performing as well as it can.
50% more expensive than what a Polaris 390X equivalent will be (maybe less) and only 20% faster in AotS, and possibly Total War: Warhammer - many other games to follow.
I have never known Nvidia not tell people about anything that might jeopardise sales, yet they have no facts regarding how their implementation of asynchronous compute competes with AMD's. Also, technically they should not be calling it async, as that is false marketing unless they prove compute tasks can run in parallel, not just sequentially or pseudo-parallel. There is software available that is able to establish this, and I look forward to seeing the results when the 1000 series arrives.
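The sort of test that software does is conceptually simple - time a graphics batch alone, then the same batch with compute submitted to a second queue, and see whether the total looks like the longer of the two (overlap) or the sum (serialized). A hedged C++/D3D12 sketch of the idea (not any particular tool; command-list recording omitted):

#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <chrono>
using Microsoft::WRL::ComPtr;

// Block the CPU until everything submitted to `queue` so far has completed.
static void FlushQueue(ID3D12CommandQueue* queue, ID3D12Fence* fence, UINT64& value) {
    queue->Signal(fence, ++value);
    HANDLE evt = CreateEventW(nullptr, FALSE, FALSE, nullptr);
    fence->SetEventOnCompletion(value, evt);
    WaitForSingleObject(evt, INFINITE);
    CloseHandle(evt);
}

// Returns milliseconds for the submission. Run once with withCompute=false and
// once with true: ~max(gfx, compute) suggests the queues overlapped,
// ~gfx + compute suggests the driver serialized them.
double TimeSubmission(ID3D12CommandQueue* gfxQueue, ID3D12CommandQueue* computeQueue,
                      ID3D12Fence* fence, UINT64& fenceValue, bool withCompute) {
    auto t0 = std::chrono::steady_clock::now();
    // gfxQueue->ExecuteCommandLists(1, &gfxList);          // workloads omitted
    // if (withCompute) computeQueue->ExecuteCommandLists(1, &computeList);
    FlushQueue(gfxQueue, fence, fenceValue);
    if (withCompute) FlushQueue(computeQueue, fence, fenceValue);
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}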
 
Also, technically they should not be calling it async, as that is false marketing unless they prove compute tasks can run in parallel, not just sequentially or pseudo-parallel. There is software available that is able to establish this, and I look forward to seeing the results when the 1000 series arrives.

It isn't about stuff running in parallel - someone correct me if I'm wrong but I believe the difference is this - say you have 5 queues A, B, C, D and E where E depends on the results of A and C:

GCN:

|--A--||---E---|
|------B-----|
|--C--|
|-----D-----|

Maxwell (2nd gen):

|--A--|-------|---E---|
|------B-----|
|--C--|
|-----D-----|

(Where the length of each is the time taken)

But due to the architecture differences that is much less of an issue for nVidia than if the same thing happened on an AMD GPU - especially with Pascal, where the queue B that takes a long time could be run concurrently with E later on and take a lot less time, which AMD can't do, i.e. it would look something like:


|--A--||---E---|
-------|----B---|
|--C--|
-------|---D---|

(This is grossly oversimplifying, but I think it illustrates the effective differences)

EDIT: I think with Pascal it could also look like combinations of:

|--A--||--E--|
|--B--||--B--|
|--C--|
|--D--||-D-|

as well depending on the workload.

Though according to the Oxide guy: "He also states that while Maxwell 2 (GTX 900 family) is capable of parallel execution, “The hardware doesn’t profit from it much though, since it has only little ‘gaps’ in the shader utilization either way. So in the end, it’s still just sequential execution for most workloads.”"
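For completeness, a rough C++/D3D12 sketch (names illustrative, queue and command-list creation omitted) of how the A/C -> E dependency in those diagrams would actually be expressed - the API just states the ordering, and how much really overlaps is the architectural difference being argued about:

#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// E runs on the same queue as A, after it; B, C and D go to other queues.
void SubmitDependentWork(ID3D12Device* device,
                         ID3D12CommandQueue* queueAE,   // runs A then E
                         ID3D12CommandQueue* queueB,
                         ID3D12CommandQueue* queueC,
                         ID3D12CommandQueue* queueD) {
    ComPtr<ID3D12Fence> fenceC;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fenceC));

    // A, B, C and D are submitted independently and *may* overlap.
    // queueAE->ExecuteCommandLists(1, &listA);
    // queueB->ExecuteCommandLists(1, &listB);
    // queueC->ExecuteCommandLists(1, &listC);
    queueC->Signal(fenceC.Get(), 1);                 // fires when C finishes
    // queueD->ExecuteCommandLists(1, &listD);

    // E depends on A (same queue, so ordering is implicit) and on C
    // (cross-queue, so wait on C's fence before submitting E).
    queueAE->Wait(fenceC.Get(), 1);
    // queueAE->ExecuteCommandLists(1, &listE);
}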
 
Read this to help you understand async a bit better.

http://stackoverflow.com/questions/...between-asynchronous-and-parallel-programming

It's annoying that Nvidia are trying to confuse everybody with their smoke and mirrors, but I stand firm and do not believe Nvidia has async compute in their hardware, and it is not right for them to say they do.

I stand to be corrected when someone shows me proof, but I have been waiting for twelve months already while Nvidia have been saying async is coming. I do not believe it will be part of Pascal either; otherwise we would all know about it.
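The distinction that linked thread draws can be shown without any GPU at all - a tiny C++ sketch: issuing work asynchronously is a property of the interface, while parallel execution depends on what actually runs it (which is basically the whole argument about Nvidia's scheduling):

#include <future>
#include <cstdio>

int expensiveTask() { return 42; }   // stand-in for a long-running job

int main() {
    // Issued asynchronously: main() carries on immediately. With the default
    // launch policy the task may run on another thread (parallel) or be
    // deferred and run only when get() is called (sequential) - the
    // asynchronous interface looks identical either way.
    std::future<int> result = std::async(expensiveTask);

    std::puts("doing other work while the task may (or may not) be running...");
    std::printf("result: %d\n", result.get());
    return 0;
}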
 
I understand that - none of it covers what I'm trying to convey in terms of some simple way of looking at the equivalency of AMD's and nVidia's approaches without getting bogged down in the technical debate.
 
One thing you say I know is true.
Whichever brand has the best performance has the better solution.

So far Nvidia has a poor solution, and instead of saying it's coming, it's coming, they have to prove it now to avoid what could potentially be millions in lost sales
You're seriously overestimating how much people care about the tiny minority of DX12 titles utilizing async shaders to a strong degree, man.
 
I understand that - none of it covers what I'm trying to convey in terms of some simple way of looking at the equivalency of AMD's and nVidia's approaches without getting bogged down in the technical debate.

And that is exactly the way Nvidia want it to be. How can any of us discuss it properly when Nvidia do not have the decency to discuss it? We can't.
But I will say again, they have given zero proof of async compute and I doubt that will change.
I'm sure their plan is to string everyone along for another 18 months or so until they launch hardware with async engines. They have done that for twelve months already. They are nearly halfway there.
It's only right AMD should try to capitalise on this strength before Nvidia does manage to catch up. It's good for competition and will ultimately benefit all gamers.

Don't read into anything, just look at benchmarks. It's all the proof anyone should need, period. It baffles me why people defend what Nvidia say regarding the matter even though games featuring async have not once benefited Nvidia GPUs at all in respect of the performance gains async compute brings.
 
It isn't about stuff running in parallel - someone correct me if I'm wrong but I believe the difference is this - say you have 5 queues A, B, C, D and E where E depends on the results of A and C:

GCN:

|--A--||---E---|
|------B-----|
|--C--|
|-----D-----|

Maxwell (2nd gen):

|--A--|-------|---E---|
|------B-----|
|--C--|
|-----D-----|

(Where the length of each is the time taken)

But due to the architecture differences that is much less of an issue for nVidia than if the same thing happened on an AMD GPU - especially with Pascal, where the queue B that takes a long time could be run concurrently with E later on and take a lot less time, which AMD can't do, i.e. it would look something like:


|--A--||---E---|
-------|----B---|
|--C--|
-------|---D---|

(This is grossly oversimplifying, but I think it illustrates the effective differences)

EDIT: I think with Pascal it could also look like combinations of:

|--A--||--E--|
|--B--||--B--|
|--C--|
|--D--||-D-|

as well depending on the workload.

Though according to the Oxide guy: "He also states that while Maxwell 2 (GTX 900 family) is capable of parallel execution, “The hardware doesn’t profit from it much though, since it has only little ‘gaps’ in the shader utilization either way. So in the end, it’s still just sequential execution for most workloads.”"

That is pretty much my understanding. And the Oxide developer quote just reinforces the point that Maxwell just doesn't have the under-utilization problem that Hawaii and Fiji suffer from, so they have much less to gain. A quick peek at the theoretical FP32 compute performance compared to real-world game benchmarks is evidence enough.
 