AMD’s DirectX 12 Advantage Explained – GCN Architecture More Friendly To Parallelism Than Maxwell

pete910 · 1 Sep 2015 at 20:42

@Matt

Do you know if Vulkan uses this?

Edit:
Asynchronous Shading that is

Edit 2:

Never mind, just read it

Final8y · 2 Sep 2015 at 15:36

[Mahigan:]
Yes... Async helps them achieve what is in this slide...

Latency becomes hidden by overlapping executions of Wavefronts. That's why GCN retains the same degree of latency as you throw more and more Kernels at it. GCN is far more parallel than competing architectures. I wouldn't say it is faster, it's just able to take on far more computational workloads (Threads) at any given time.

If you throw too much work at Maxwell/2, it begins to bottleneck. We see this result with the staircase effect on NVIDIA's architecture in Beyond3D's graphs. So while Maxwell 2 can compute a kernel containing 32 threads in 25ms, GCN can compute a kernel containing 64 threads (twice the commands) in 38-50ms. The problem is that if you throw a kernel containing 32 threads at GCN, it will still take the same 38-50ms. This is the result Beyond3D is getting, and it leads some there (Jawed, for example) to conclude that Maxwell 2 is far superior at compute.

If you add Async to the mix, you have that same 64-thread kernel taking 38-50ms alongside a parallel graphics task. So if we do the math, Maxwell 2 would take 50ms to handle a kernel with 64 threads, plus the 8-12ms it takes to handle the graphics task.
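To make the arithmetic in the post above explicit, here is a minimal Python sketch. The timings are the ones quoted in the post; the function names are made up for illustration, and two modelling assumptions are labelled in the comments: Maxwell 2 is assumed to work through 32-thread batches back to back and then run graphics afterwards, while GCN is assumed to overlap the graphics task completely with compute.

```python
import math

# Figures quoted in the post (milliseconds). Ranges are (low, high).
MAXWELL2_MS_PER_32_THREADS = 25.0
GCN_MS_PER_KERNEL = (38.0, 50.0)   # same cost for 32 or 64 threads, per the post
GRAPHICS_MS = (8.0, 12.0)


def maxwell2_serial_ms(threads):
    """Compute first, then graphics (assumed fully serial model)."""
    batches = math.ceil(threads / 32)          # assumption: 32-thread batches
    compute = batches * MAXWELL2_MS_PER_32_THREADS
    return (compute + GRAPHICS_MS[0], compute + GRAPHICS_MS[1])


def gcn_async_ms():
    """Compute and graphics fully overlapped (assumed ideal async model)."""
    return (max(GCN_MS_PER_KERNEL[0], GRAPHICS_MS[0]),
            max(GCN_MS_PER_KERNEL[1], GRAPHICS_MS[1]))


print(maxwell2_serial_ms(64))  # (58.0, 62.0)
print(gcn_async_ms())          # (38.0, 50.0)
```

Under these assumptions the 64-thread case comes out at 58-62ms when serialised against 38-50ms when overlapped, which is the comparison the post is making.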

I think the Beyond3D folks are CUDA programmers; if true, you can't fault them for not knowing.

At the end of this, Beyond3D will likely conclude that Oxide did something wrong when, in fact, they did something wrong in their tests.

http://www.overclock.net/t/1569897/various-ashes-of-the-singularity-dx12-benchmarks/1720

Some interesting results.

So that's the 2nd test on a 980Ti,

1st:
Compute: 5.67ms
Graphics: 16.77ms

Graphics + Compute: 21.15ms
Graphics + Compute (Single Commandlist): 20.70ms

And for 512th:
Compute: 76.11ms
Graphics: 16.77ms
Graphics + Compute: 97.38ms
Graphics + Compute (Single Commandlist): 2294.69ms

---------------------------------------

For both the 1st and the 512th, Async mode adds the times together. Single Commandlist mode went nuts.

Serial:
A (Compute) + B (Graphics) = A + B

Async:
A + B = A OR B (whichever takes longer)

Right? Or is that not how we are meant to interpret the data of this test?
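One way to check that reading against the numbers is sketched below in Python. It compares a measured combined time with two idealised predictions: serial (the sum of the two) and fully overlapped (the longer of the two). The function name is hypothetical, and both models ignore scheduling overhead, which is an assumption rather than something the benchmark states.

```python
def closest_model(compute_ms, graphics_ms, combined_ms):
    """Return which idealised model the measured combined time is nearer to."""
    serial = compute_ms + graphics_ms            # A + B
    overlapped = max(compute_ms, graphics_ms)    # longer of A and B
    if abs(combined_ms - serial) <= abs(combined_ms - overlapped):
        return "serial"
    return "overlapped"


# 980 Ti figures from the post above: (compute, graphics, graphics + compute)
print(closest_model(5.67, 16.77, 21.15))    # serial
print(closest_model(76.11, 16.77, 97.38))   # serial
```

Both the 1st and the 512th measurements sit much closer to the sum than to the longer of the two times.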

980 Ti
Compute only:
1. 6.79ms

Graphics only: 16.21ms

Graphics + compute:
1. 20.22ms

Graphics, compute single commandlist:
1. 20.04ms

Your result is in line with the others. Running Graphics + Compute gives an additive result, close to the sum of compute + graphics.

Also, your (forced) single commandlist results show ever-rising timings, as we've seen with the others, up to the 281st with a time of 2117.00ms!

Is this what Oxide is talking about? When they tried to force direct async mode, it would mess up.

First post here. I was curious about this matter, so I ran both on my spare and main card.

Just sharing results if it might prove useful.

AsyncCompute
written by MDolenc

7950 Catalyst 15.8b

Graphics only: 57.37ms (29.24G pixels/s)
Graphics + compute: 238.70ms (7.03G pixels/s)
Graphics, compute single commandlist: 295.77ms (5.67G pixels/s)

980 Forceware 355.82

Graphics only: 23.23ms (72.21G pixels/s)
Graphics + compute: 103.58ms (16.20G pixels/s)
Graphics, compute single commandlist: 2433.35ms (0.69G pixels/s)

In the last example the 980 clearly beats the 7950, as it's a much more powerful card, until you get to the last test, Parallel Asynchronous Compute.
https://forum.beyond3d.com/threads/dx12-performance-thread.57188/page-15
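As a sanity check on the 7950 and 980 figures above, the Python sketch below multiplies each reported time by its reported throughput to recover the implied workload, then compares the two cards on the single-commandlist test. The numbers are copied from the post; the dictionary layout and function name are just for illustration.

```python
# (test, time in ms, reported throughput in G pixels/s), copied from the post
RESULTS = {
    "7950": [("graphics only", 57.37, 29.24),
             ("graphics + compute", 238.70, 7.03),
             ("single commandlist", 295.77, 5.67)],
    "980": [("graphics only", 23.23, 72.21),
            ("graphics + compute", 103.58, 16.20),
            ("single commandlist", 2433.35, 0.69)],
}


def implied_gpixels(time_ms, gpixels_per_s):
    """Workload in G pixels implied by a time and a throughput figure."""
    return time_ms / 1000.0 * gpixels_per_s


for card, rows in RESULTS.items():
    for test, ms, rate in rows:
        print(card, test, round(implied_gpixels(ms, rate), 2))

# How much slower the 980 is than the 7950 on the single-commandlist test
slowdown = RESULTS["980"][2][1] / RESULTS["7950"][2][1]
print(round(slowdown, 1))
```

Every row works out to roughly 1.68 G pixels, so the workload is the same in each test and the times are directly comparable. On the single-commandlist test the 980 takes a little over 8 times as long as the 7950.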

drunkenmaster · 2 Sep 2015 at 17:07

I'm fully under the impression that Nvidia supports async only as far as letting games make a call for async, at which point the driver just serialises the process. Context switching then causes horrible performance issues once it overloads the hardware.

Final8y · 2 Sep 2015 at 17:23

Yeah, except it's not functional at all:

There's an almost constant step-up between the blue and the red lines, and that step is almost always equal to the constant value of the green line. This means the GPU is doing rendering + context switching + compute task.
There's no Async Compute happening on the hardware level at all.

#322
ToTTenTranz, 26 minutes ago

https://forum.beyond3d.com/posts/1869587/

humbug · 2 Sep 2015 at 18:52

drunkenmaster said:
I'm fully under the impression that Nvidia supports async only as far as letting games make a call for async, at which point the driver just serialises the process. Context switching then causes horrible performance issues once it overloads the hardware.

I think there is a bit of a contradiction in that statement.

Nvidia can't be running ASync if the operations are serialised. The serial switching between tasks is the result of a lack of parallel operations.

ASync is parallel; without it, it's running tasks in serial, which is exactly what Maxwell is doing.

drunkenmaster · 2 Sep 2015 at 19:01

I didn't say they were. I said they support it as far as enabling games to make calls, not that it's doing them in hardware. I'm saying it advertises that it can, but reorders commands at the driver level, taking async calls from the game and turning them into serialised calls for the hardware.

It's effectively pretending to support async fully, while not actually supporting it at the hardware level.

FredFlint · 2 Sep 2015 at 19:39

drunkenmaster said:
I didn't say they were. I said they support it as far as enabling games to make calls, not that it's doing them in hardware. I'm saying it advertises that it can, but reorders commands at the driver level, taking async calls from the game and turning them into serialised calls for the hardware.

It's effectively pretending to support async fully, while not actually supporting it at the hardware level.

AKA: Emulating it.

drunkenmaster · 2 Sep 2015 at 20:17

FredFlint said:
AKA: Emulating it.

I wondered if I should use the term, but ultimately emulation usually means doing it in software, much more slowly. This is more a case of not supporting it at all but telling the outside world you do support it.

AMD and Nvidia, afaik, support quite a few things only in software, but a lot of that can be done cheaply and either couldn't have hardware made for it in the current gen or was simply deemed not required. There are other things that take a huge performance hit running in hardware and are nearly useless done that way, but are at least still possible.

Nvidia's async seems more a case of faking it than emulating it.

humbug · 3 Sep 2015 at 00:18

I don't think you can emulate ASync: either the hardware is parallel or it's serial. You can't emulate multi-core CPUs on a single-core CPU.

What you can do is use software to organise task queuing to reduce latency, though I'm pretty sure Nvidia are already doing that.

Orangey · 3 Sep 2015 at 00:19

Simulating is the word I think, within a computer-science context.

humbug · 3 Sep 2015 at 00:34

Orangey said:
Simulating is the word I think, within a computer-science context.

I guess it's possible if you write a higher-level subroutine to 'simulate' parallel scheduling, but then wouldn't that add its own latency? And where do you pool it? The L2 is not designed for it; is it big enough?
Some of it is missing on my GPU

FredFlint · 3 Sep 2015 at 04:33

humbug said:
I don't think you can emulate ASync: either the hardware is parallel or it's serial. You can't emulate multi-core CPUs on a single-core CPU.

What you can do is use software to organise task queuing to reduce latency, though I'm pretty sure Nvidia are already doing that.

Any hardware function can be emulated in software. It will be slow compared to dedicated hardware, but it can be done. Also, multi-core CPUs are emulated in VMs.
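As a toy illustration of the point about emulation, the Python sketch below time-slices two tasks on a single thread of execution, which is the basic trick a hypervisor or OS scheduler uses to present more logical processors than physically exist. The task names and step counts are hypothetical, and this is not a model of any GPU or driver.

```python
from collections import deque


def task(name, steps):
    """A hypothetical unit of work that hands back control after every step."""
    for i in range(steps):
        yield f"{name}{i}"


def run_time_sliced(tasks):
    """Round-robin the tasks on one 'core' and return the execution order."""
    queue = deque(tasks)
    order = []
    while queue:
        current = queue.popleft()
        try:
            order.append(next(current))   # run one step of this task
        except StopIteration:
            continue                      # task finished, drop it
        queue.append(current)             # otherwise go to the back of the line
    return order


order = run_time_sliced([task("G", 3), task("C", 2)])
print(order)       # ['G0', 'C0', 'G1', 'C1', 'G2']
print(len(order))  # 5
```

Both tasks make progress in turn, but the number of steps executed is still the sum of the two, so the interleaving by itself saves no wall-clock time.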

Rroff · 3 Sep 2015 at 04:53

nVidia's stinger device used to have quite impressive software performance for missing hardware features, but I can't see a virtual async compute system emulated via serial compute being able to provide useful performance.

On a semi-related note, they used to have a real-time 1:1 performance software emulation of the Kepler core for testing that required a server array the size of a shipping container heh.
