think I know what is happening.
Ashes of the Singularity makes use of Asynchronous Shading. Now we know that AMD have been big on advertising this feature. It is a feature which is used in quite a few Playstation 4 titles. It allows the Developer to make efficient use of the compute resources available. GCN achieves this by making use of 8 Asynchronous Compute Engines (ACE for short) found in GCN 1.1 290 series cards as well as all GCN 1.2 cards. Each ACE is capable of queuing up to 8 tasks. This means that a total of 64 tasks may be queued on GCN hardware which features 8 ACEs.
nVIDIA can also do Asynchronous Shading through its HyperQ feature. The amount of available information, on the nVIDIA side regarding this feature, is minimal. What we do know is that nVIDIA mentioned that Maxwell 2 is capable of queuing 32 Compute or 1 Graphics and 31 Compute for Asynchronous Shading. nVIDIA has been
Anandtech made a BIG mistake in their article on this topic which seems to have become the defacto standard article for this topic. Their information has been copied all over the web. This information is erroneous. Anandtech claimed that GCN 1.1 (290 series) and GCN 1.2 were Capable of 1 Graphics and 8 Compute queues per cycle. This is in fact false. The truth is that GCN 1.1 (290 series) and GCN 1.2 are capable of 1 Graphics and 64 Compute queues per cycle.
Anandtech also had barely no information on Maxwell's capabilities. Ryan Smith, the Graphics author over at Anandtech, assumed that Maxwell's queues were its dedicated compute units. Therefore Anandtech published that Maxwell 2 had a total of 32 Compute Units. This information is false.
The truth is that Maxwell 2 has only a single Asynchronous Compute Engine tied to 32 Compute Queues (or 1 Graphics and 31 Compute queues).
I figured this out when I began to read up on Kepler/Maxwell/2 CUDA documentation and I found what I was looking for. Basically Maxwell 2 makes use of a single ACE-like unit. nVIDIA name this unit the Grid Management Unit.
How it works?
The CPUs various Cores send Parallel streams to the Stream Queue Management. The Stream Queue Management sends streams to the Grid Management Unit (Parallel to Serial thus far). The Grid Management unit can then create multiple hardware work queues (1 Graphics and 31 Compute or 32 Compute) which are then sent in a Serial fashion to the Work Distributor (one after the other or in Serial based on priority) . The Work Distributor, in a Parallel fashion, assigns the work loads to the various SMMs. The SMMs then assigns the work to a specific array of CUDA cores. nVIDIA call this entire process "HyperQ".
Here's the documentation: (minimum of 5 posts before I can post the URL)
GCN 1.1 (290 series)/GCN 1.2, on the other hand, works in a very different manner. The CPUs various Cores send Parallel streams to the Asynchronous Compute Engines various Queues (up to 64). The Asynchronous Compute Engines prioritizes the work and then sends it off, directly, to specific Compute Units based on availability. That's it.
Maxwell 2 HyperQ is thus potentially bottlenecked at the Grid Management and then Work Distributor segments of its pipeline. This is because these stages of the Pipeline are "in order". In other words HyperQ contains only a single pipeline (Serial not Parallel).
AMDs Asynchronous Compute Engine implementation is different. It contains 8 Parallel Pipelines working independently from one another. This is why AMDs implementation can be described as being "out of order".
A few obvious facts come to light. AMDs implementation incurs less latency as well as having the ability of making more efficient use of the available Compute resources.
This explains why Maxwell 2 (GTX 980 Ti) performs so poorly under Ashes of the Singularity under DirectX 12 and when compared to even a lowly R9 290x. Asynchronous Shading kills its performance compared to GCN 1.1 (290 series)/GCN 1.2. The latter's performance is barely impacted.
GCN 1.1 (290 series)/GCN 1.2 are clearly being limited elsewhere, and I believe it is due to their Peak Rasterization Rate or Gtris/s. Many objects and units permeate the screen under Ashes of the Singularity. Each one is made up of Triangles (Polygons). Since both the Fury-X and the 290x/390x have the same amount of hardware rasterization units, I believe that this is the culprit. Some people have attribute this to the amount of ROps (64) that both Fury-X and 290/390x share. I thought the same at first but then I remembered about the Color Compression found in the Fury/Fury-X cards. The Fury/X make use of Color Compression algorithms which have shown to alleviate the Pixel Fill Rate issues which were found in the 290/390x cards. Therefore I do not believe that ROps (Render Back Ends) are the issue. Rater the Triangle Setup Engine (Raster/Hierarchical Z) are the likely culprits.
I've been away from this stuff for a few years so I'm quite rusty but Direct X 12 is getting me interested once again