AMD’s DirectX 12 Advantage Explained – GCN Architecture More Friendly To Parallelism Than Maxwell

drunkenmaster · 26 Aug 2015 at 13:35

layte said:
It does put some further fuel under the theory that the engine has been designed specifically to target strengths and weaknesses in the two different architectures.

As my post, no it doesn't. This puts further fuel that Nvidia has spent years harming everyone by paying devs to optimise for their own hardware. This is a game that is optimised FOR DX12/low level api/features on modern GPUs. Nvidia has been optimising for everything but these things for years. Physx for years being crippled on x87 code, being forced to run on a single thread for years. Nvidia is the opposite of AMD. AMD pushes for future standards, adds new features, Nvidia does it's best to add nothing new and slows down every push forward.

Comparing Nvidia pushing gameworks, their own code, into games which has harmed WAY more games than it's helped, usually reduces performance, adds bugs and is generally worse for AMD to another game that is optimised for DX12, an industry standard API.

It's however not at all surprising that the Nvidia fanboys are trying to equate one to the other.

layte · 26 Aug 2015 at 13:44

Tessellation in Crysis 2 was a DX11 feature, not additional code. You should look up the meaning of the word hypocrisy sometime.

Words from somebody intimately familiar with the engine: https://forum.beyond3d.com/threads/nvidia-game-works-good-or-bad.55289/page-22#post-1866833

If you stop and think about it, things actually look pretty bad for AMD. This title has been built from the ground up as a showcase for their architecture and has benefited from AMD sponsorship, yet the best it can manage on AMD's latest and greatest is a wash with the 980ti. How are titles benefiting from NV legwork or just titles without major OEM involvement going to work out for AMD? That's a worrying sign.

This tedious narrative you are attempting to create regarding Gameworks needs to stop. Gameworks itself can do nothing other than generate DX11 calls, as it works on AMD hardware as well, yes it most likely is not optimised for AMD hardware, but then again AOTS is certainly not optimised for NV hardware. Do you see where I am going or do we need to add Cognitive dissonance to the list of things you need to look up.

queamin · 26 Aug 2015 at 13:51

Was tessellation the water which was unseen under the ground a feature no it was added to slow down amd more than nvidia's cards just like over tessellation the straight edges in the concrete bollard's and that forced amd to add a slider to the drivers to stop nvidia dirty tricks.

Mtom · 26 Aug 2015 at 14:07

Tesselation is a feature, overusing it is not.
Async shaders, and parallelization are features of dx12, and they are not overused, just used. We know for long time that those are the main strenghts of GCN over maxwell, still you pretending to be all surprised.

layte · 26 Aug 2015 at 14:11

Mtom said:
Tesselation is a feature, overusing it is not.
Async shaders, and parallelization are features of dx12, and they are not overused, just used. We know for long time that those are the main strenghts of GCN over maxwell, still you pretending to be all surprised.

How can you claim they are not being overused? We have precisely one DX12 benchmark to test on. One that just happens to have been created to showcase a specific manufacturers architecture.

Perhaps we should wait and see how further DX12 titles perform before everyone starts acting as an expert on GPU architecture and performance. People have a data sample set of one and are now trying to fit pieces to a puzzle we don't know the complete picture of.

layte · 26 Aug 2015 at 14:11

queamin said:
Was tessellation the water which was unseen under the ground a feature no it was added to slow down amd more than nvidia's cards just like over tessellation the straight edges in the concrete bollard's and that forced amd to add a slider to the drivers to stop nvidia dirty tricks.

This is interesting: http://docs.cryengine.com/display/SDKDOC2/Debug+Views#DebugViews-Wireframe

This will draw the entire scene in wireframe, including objects hidden from view. (Can over complicate a busy scene).

humbug · 26 Aug 2015 at 14:11

Harlequin said:
taken from hexus forum

think I know what is happening.

Ashes of the Singularity makes use of Asynchronous Shading. Now we know that AMD have been big on advertising this feature. It is a feature which is used in quite a few Playstation 4 titles. It allows the Developer to make efficient use of the compute resources available. GCN achieves this by making use of 8 Asynchronous Compute Engines (ACE for short) found in GCN 1.1 290 series cards as well as all GCN 1.2 cards. Each ACE is capable of queuing up to 8 tasks. This means that a total of 64 tasks may be queued on GCN hardware which features 8 ACEs.

nVIDIA can also do Asynchronous Shading through its HyperQ feature. The amount of available information, on the nVIDIA side regarding this feature, is minimal. What we do know is that nVIDIA mentioned that Maxwell 2 is capable of queuing 32 Compute or 1 Graphics and 31 Compute for Asynchronous Shading. nVIDIA has been

Anandtech made a BIG mistake in their article on this topic which seems to have become the defacto standard article for this topic. Their information has been copied all over the web. This information is erroneous. Anandtech claimed that GCN 1.1 (290 series) and GCN 1.2 were Capable of 1 Graphics and 8 Compute queues per cycle. This is in fact false. The truth is that GCN 1.1 (290 series) and GCN 1.2 are capable of 1 Graphics and 64 Compute queues per cycle.
Anandtech also had barely no information on Maxwell's capabilities. Ryan Smith, the Graphics author over at Anandtech, assumed that Maxwell's queues were its dedicated compute units. Therefore Anandtech published that Maxwell 2 had a total of 32 Compute Units. This information is false.
The truth is that Maxwell 2 has only a single Asynchronous Compute Engine tied to 32 Compute Queues (or 1 Graphics and 31 Compute queues).
I figured this out when I began to read up on Kepler/Maxwell/2 CUDA documentation and I found what I was looking for. Basically Maxwell 2 makes use of a single ACE-like unit. nVIDIA name this unit the Grid Management Unit.

How it works?

The CPUs various Cores send Parallel streams to the Stream Queue Management. The Stream Queue Management sends streams to the Grid Management Unit (Parallel to Serial thus far). The Grid Management unit can then create multiple hardware work queues (1 Graphics and 31 Compute or 32 Compute) which are then sent in a Serial fashion to the Work Distributor (one after the other or in Serial based on priority) . The Work Distributor, in a Parallel fashion, assigns the work loads to the various SMMs. The SMMs then assigns the work to a specific array of CUDA cores. nVIDIA call this entire process "HyperQ".

Here's the documentation: (minimum of 5 posts before I can post the URL)

GCN 1.1 (290 series)/GCN 1.2, on the other hand, works in a very different manner. The CPUs various Cores send Parallel streams to the Asynchronous Compute Engines various Queues (up to 64). The Asynchronous Compute Engines prioritizes the work and then sends it off, directly, to specific Compute Units based on availability. That's it.

Maxwell 2 HyperQ is thus potentially bottlenecked at the Grid Management and then Work Distributor segments of its pipeline. This is because these stages of the Pipeline are "in order". In other words HyperQ contains only a single pipeline (Serial not Parallel).

AMDs Asynchronous Compute Engine implementation is different. It contains 8 Parallel Pipelines working independently from one another. This is why AMDs implementation can be described as being "out of order".

A few obvious facts come to light. AMDs implementation incurs less latency as well as having the ability of making more efficient use of the available Compute resources.

This explains why Maxwell 2 (GTX 980 Ti) performs so poorly under Ashes of the Singularity under DirectX 12 and when compared to even a lowly R9 290x. Asynchronous Shading kills its performance compared to GCN 1.1 (290 series)/GCN 1.2. The latter's performance is barely impacted.
GCN 1.1 (290 series)/GCN 1.2 are clearly being limited elsewhere, and I believe it is due to their Peak Rasterization Rate or Gtris/s. Many objects and units permeate the screen under Ashes of the Singularity. Each one is made up of Triangles (Polygons). Since both the Fury-X and the 290x/390x have the same amount of hardware rasterization units, I believe that this is the culprit. Some people have attribute this to the amount of ROps (64) that both Fury-X and 290/390x share. I thought the same at first but then I remembered about the Color Compression found in the Fury/Fury-X cards. The Fury/X make use of Color Compression algorithms which have shown to alleviate the Pixel Fill Rate issues which were found in the 290/390x cards. Therefore I do not believe that ROps (Render Back Ends) are the issue. Rater the Triangle Setup Engine (Raster/Hierarchical Z) are the likely culprits.

I've been away from this stuff for a few years so I'm quite rusty but Direct X 12 is getting me interested once again

Anand you suck.

Kaap are you reading this?

Mtom · 26 Aug 2015 at 14:20

layte said:
How can you claim they are not being overused? We have precisely one DX12 benchmark to test on. One that just happens to have been created to showcase a specific manufacturers architecture.

Perhaps we should wait and see how further DX12 titles perform before everyone starts acting as an expert on GPU architecture and performance.

How can you overuse parallelization? It sends the data, the hardware copes with it as it can. The GCN can run twice as much pipelines so it copes better. This is not a graphic feature which you use more than you need.
By the way you must have missed when oxide said that all manufacturers had access to their source code for a year now.

But i agree we should wait and see how it unfolds when real dx12 games arrive

I recommend reading the post above mine...some good info there

andybird123 · 26 Aug 2015 at 14:24

drunkenmaster said:
Firstly it isn't AMD's game but don't let that stop you, it's a DX12 feature, that Nvidia is supposed to support, not ADDITIONAL code that gets added and paid for and is unoptimised for one.

But lets really break down what you're saying. If you have enough information you want processed that you can fill up 4 i7 cores, would it be faster if you serialise the code to run all on one processor? Because that is what you're saying. If you serialise it you would be forcing way too much data into a limited pipeline.

We're talking about getting effective and efficient usage of a large number of shaders vs inefficiently having thousands of shaders but getting low utilisation. You can run superpi in a single thread on a single core very slowly or do the same calculation with multiple threads on multiple cores taking a small fraction of the time.

Firstly, take a look here and see what logo's are on the bottom of the homepage;
http://ashesofthesingularity.com/

Something about Ashes DX12 implementation is negatively impacting Nvidia's performance (from the DX11 implementation)... I'm quite willing to accept that Asynchronous Shaders should give AMD a performance boost (equalling Nvidia's DX11 performance no less)... however if AS is negatively impacting performance (as the guy from OC.net claims) then surely adding a separate code path that DOESN'T negatively impact another hardware vendor would be the prudent coarse of action?

no idea why you are talking about CPU's as we were discussing the Grid Management Unit on the GPU

the worst case for DX12 should be that at equals DX11, no? Not that DX12 actually causes a performance decrease

layte · 26 Aug 2015 at 14:32

Mtom said:
How can you overuse parallelization? It sends the data, the hardware copes with it as it can. The GCN can run twice as much pipelines so it copes better. This is not a graphic feature which you use more than you need.
By the way you must have missed when oxide said that all manufacturers had access to their source code for a year now.

But i agree we should wait and see how it unfolds when real dx12 games arrive

By generating far more commands than are required? This would saturate how Maxwell is believed to operate just as overusing Tessellation saturates GCN hardware.

Oh and I didn't miss anything Oxide have said. But with DX12 the onus is on Oxide to optimise, not the OEM. Oxide have built this from the ground up as a Mantle showcase, so it would be mad to argue that it is not heavily favouring GCN.

humbug · 26 Aug 2015 at 14:39

andybird123 said:
Firstly, take a look here and see what logo's are on the bottom of the homepage;
http://ashesofthesingularity.com/

Something about Ashes DX12 implementation is negatively impacting Nvidia's performance (from the DX11 implementation)... I'm quite willing to accept that Asynchronous Shaders should give AMD a performance boost (equalling Nvidia's DX11 performance no less)... however if AS is negatively impacting performance (as the guy from OC.net claims) then surely adding a separate code path that DOESN'T negatively impact another hardware vendor would be the prudent coarse of action?

no idea why you are talking about CPU's as we were discussing the Grid Management Unit on the GPU

the worst case for DX12 should be that at equals DX11, no? Not that DX12 actually causes a performance decrease

That was a problem with Nvidia's Drivers.

fs123 · 26 Aug 2015 at 14:45

layte said:
By generating far more commands than are required? This would saturate how Maxwell is believed to operate just as overusing Tessellation saturates GCN hardware.

Oh and I didn't miss anything Oxide have said. But with DX12 the onus is on Oxide to optimise, not the OEM. Oxide have built this from the ground up as a Mantle showcase, so it would be mad to argue that it is not heavily favouring GCN.

In that case it is the onus of the PCars, Witcher 3 devs to optimize too right? Sadly they flat out blamed AMD drivers for the poor performance and said the could not optimize for AMD gpu's.

ubersonic · 26 Aug 2015 at 14:45

D.P. said:
And yes, the fact that t=Dx12 ran slower than Dx11 on Nvidia GPUs is just proof that the game engine is flawed.

Flawed is a strong term, back when scrypt coin mining was all the rage the Radeon cards battered their Geforce gaming counterparts because their architecture was better suited to it. From the articles it seems Nvidia do better in DX11 than DX12 because their architecture is better suited to the DX11 style of programming than it is DX12 one and they have no ebay to get around that with drivers.

Correct me if I'm wrong here but that's the way it reads.

layte · 26 Aug 2015 at 14:55

fs123 said:
In that case it is the onus of the PCars, Witcher 3 devs to optimize too right? Sadly they flat out blamed AMD drivers for the poor performance and said the could not optimize for AMD gpu's.

Neither of those titles were DX12. We know that DX11 and earlier required heavy optimisation from the hardware vendor for optimum performance. Unfortunately AMD did not seem very interested in doing this until it became a PR issue.

CAT-THE-FIFTH · 26 Aug 2015 at 14:57

r7slayer said:
Lets hope dx12 games don't make use of too much asynchronous shading then

I don't know about DX12 games but there had been noise from devs about how asynchronous shaders are giving performance boosts for the consoles.

Harlequin · 26 Aug 2015 at 15:01

another reply from the poster on hexus

Well to start,

Tahiti does have two ACE's and they do have a queue depth of 8 each for a total of 16. That's why you see an R9 280x surpass most nVIDIA Maxwell parts (not the Maxwell 2 parts of course). I have since done some rather extensive research and even contacted Oxide on the CPU side of things in order to get a better understanding of the Nitrous engine on that front. AotS makes use of SSE2 optimizations for compatibility reasons. On top of that you can hit a system memory bandwidth bottleneck on just 40% of the frame when using a socket 1155 Ivy Bridge CPU and Z77 motherboard (with a memory bandwidth limit of around 20GB/s). So it's quite memory bandwidth starved on the CPU front. This explains AMD FX's poor CPU performance (SSE2 and the memory bandwidth issue).

On the GPU side...

From what I've been able to gather, DirectX 12 was built to leverage the power of the XBox One, on one end, and the PC on the other. On the XBox One side, there was already a low level API at work but it was not built to execute code in a Parallel fashion. The 20% boost, claimed by Tim Sweeney of Epic games, which is to come to the Xbox One will be on games which are programmed for DirectX 12. The reason they will get a 20% boost is because of the use of Asynchronous Shading (which finally allows XBox One devs to make use of the two ACE's available).

There are two thing which Asynchronous shading does well which benefit PC games, running on GCN (nVIDIAs Kepler, Maxwell and Maxwell 2 parts will be a bit of a mixed bag) as well as the XBox One.

1. Post Processing Effects
2. Lighting

There are also other improvements but they're mostly tailored to VR so I won't mention them. For Post Processing Effects (Examples include blur filters, anti-aliasing, depth-of-field, light blooms, tone mapping, and color correction) the ACEs can operate in parallel with the graphics command processor and two DMA engines. The graphics command processor handles graphics queues, the ACEs handle compute queues, and the DMA engines handle copy queues. Each queue can dispatch work items without waiting for other tasks to complete, allowing independent command streams to be interleaved on the GPU’s Shader Engines and execute simultaneously. This results in an increase in compute unit utilization and performance by filling gaps in the pipeline, where the GPU would otherwise be forced to wait for certain tasks to complete before working on the next one in sequence. So say you have lightweight compute/copy queues (requiring relatively few processing resources) they can be overlapped with heavyweight graphics queues. This allows the smaller tasks to be executed during stalls or gaps in the execution of larger tasks, you end up improving utilization of processing resources and allowing more work to be completed in the same time frame.

Post Processing effects get around 46% boost in performance while lighting effects get a 10% boost per light. So you can fill your screen with various cinematic effects with very little effect on performance compared to when you're not using asynchronous shading.

On another note, Physics also gains a boost so we should see TressFX perform admirably.

nVIDIA could already derive that degree of efficiency under DirectX11, AMD couldn't. The end result is that GCN Compute Units were sitting there idling, waiting for a Graphics command to finish execution, before working on a compute command. It was sort of like being stuck at a red light at an intersection.

With asynchronous shading, GCN is able to hit its theoretical compute throughput. This results in an AMD Radeon R9 290x (with a theoretical compute capability of around 5.8 Tflops) of being able to hit those theoreticals. The various ACE queues simply prioritize loads and also share the available compute unit resources in order to ensure there's always work being done (keeping the compute units fed).

So while it appeared to us that, say, a GTX 780 was a faster graphics card than an R9 290x, this wasn't the case. GCNs parallel nature simply couldn't benefit from the serial nature of DirectX 11. We end up with a scenario where an R9 290x can even keep up with nVIDIA's latest and greatest the GTX 980 Ti.

Of course in DirectX 12 titles, which are not compute bound, a GTX 980 Ti will pull ahead. But in compute bound scenarios, the extra degree of latency derived from nVIDIA HyperQ solution, their Asynchronous shading nomenclature, by having two units setup in a hierarchical fashion (Grid Management Unit feeding a work Distributor which then feeds an available SMX) leads to a loss in efficiency. HyperQ is also limited in the amount of workloads it can queue up and prioritize as I mentioned in my previous post. Therefore the GTX 980 Ti can't hit its theoretical compute limits.

As for Fiji, I've since changed my views on what is bottlenecking it. I first assumed it was the ROPs (64) but the color compression additions to Fiji actually leads it to perform better on those types of operations over Hawaii. Then I thought maybe it was the Rasterizers (triangle throughput) but the 3D Mark API Overhead test actually shows AMD GCN Hawaii and Fiji doing better than nVIDIA Maxwell/2 despite having less theoretical rasterizer throughput (expressed in Gtris/s).

I've since been unable to explain Fiji's odd behaviour.

Once thing is certain, in compute heavy scenarios... a card launched in Q4 2013 can keep up with a card launched last May. That's pretty impressive.

is he on to something?

humbug · 26 Aug 2015 at 15:10

AMD's take on DX12 and ASYNC Shaders.

https://community.amd.com/community/gaming/blog/2015/05/12/major-new-features-of-directx-12

fs123 · 26 Aug 2015 at 15:19

Harlequin said:
another reply from the poster on hexus

is he on to something?

Can you link to the thread..sounds interesting so would like to follow it. Thanks.

andybird123 · 26 Aug 2015 at 16:09

Harlequin said:
another reply from the poster on hexus

is he on to something?

Yes, he is still saying that Ashes is specifically coded to take advantage of GCN, probably to the detriment of Maxwell

how other games approach this remains to be seen

basically he is highlighting what we already know - with DX12 optimisation is in the hands of developers

JediFragger · 26 Aug 2015 at 16:14

andybird123 said:
Y

basically he is highlighting what we already know - with DX12 optimisation is in the hands of developers

No doubt nVidia will be chucking its wallet around a fair bit