AMD Polaris architecture – GCN 4.0

GFX-Kid · 22 May 2016 at 20:51

Lots of people posting here have obvious biases, just because this guy made a quick buck from amd shares does not make his opinion suddenly invalid. At least he was not being dishonest and shared the fact with everyone in the first place.

Let's leave it at that.

I am panting here for some real Polaris news!

Where is that flopper fella? He always enlightens us all with high quality information

h4rm0ny · 22 May 2016 at 20:53

Ferrari100 said:
You say you care not for this harmony poster but if you were to retrace recent posts you might now realise that he actually has been, for want of a better word, stalking me.

Don't be ridiculous. I've responded to you once in that train-wreck thread where you claimed that Zen was going to be as fast as Skylake "at least". I've also responded to you here. That does not constitute "stalking".

GFX-Kid said:
Lots of people posting here have obvious biases, just because this guy made a quick buck from amd shares does not make his opinion suddenly invalid. At least he was not being dishonest and shared the fact with everyone in the first place.

But they weren't honest, that's why people kicked up a fuss. Someone asked them if they had an interest after something they posted in another thread and they said they had no interest. Then someone noticed their username was the same as a poster on another forum and that they had a forty-grand of shares in AMD and that they'd also worked for one of the AIBs.

My main issue is they post inaccurate things and call any criticism "mocking" and complain that people are trying to destroy AMD. It's little to do with what they choose to do with their money.

Ferrari100 · 22 May 2016 at 20:55

Gregster said:
NVidia might well have to rely on the brute force approach and going by AoTS, that seems to be doing ok so far. I am far from an engineer and only those at NVidia know what is what so far, so give it time to settle and then will know for sure. You can then say "I told you so" or I can

Absolutely and I am happy with that. I am just reading between clear lines, that all.

The 1080s brute strength is what will get it through, absolutely. It was a thought to me that maybe Nvidia would not have clocked the 1080 so high if it was not for AMD async. The last thing they would want is a 1080 to match Fury X in Async games , right?
I think Nvidia have been forced into a position that they have managed as best they can - again that is my opinion.

Gregster · 22 May 2016 at 20:59

Ferrari100 said:
Absolutely and I am happy with that. I am just reading between clear lines, that all.

The 1080s brute strength is what will get it through, absolutely. It was a thought to me that maybe Nvidia would not have clocked the 1080 so high if it was not for AMD async. The last thing they would want is a 1080 to match Fury X in Async games , right?
I think Nvidia have been forced into a position that they have managed as best they can - again that is my opinion.

Yer, I am just reading the 1080 Whitepaper and it looks like Pascal needs a different programming method.

Pascal uses Pixel level preemption as opposed to ACEs to deal with it.

Asynchronous Compute
Modern gaming workloads are increasingly complex, with multiple independent, or “asynchronous,”
workloads that ultimately work together to contribute to the final rendered image. Some examples of
asynchronous compute workloads include:
 GPU-based physics and audio processing
 Postprocessing of rendered frames
 Asynchronous timewarp, a technique used in VR to regenerate a final frame based on head
position just before display scanout, interrupting the rendering of the next frame to do so
These asynchronous workloads create two new scenarios for the GPU architecture to consider.
The first scenario involves overlapping workloads. Certain types of workloads do not fill the GPU
completely by themselves. In these cases there is a performance opportunity to run two workloads at
the same time, sharing the GPU and running more efficiently—for example a PhysX workload running
concurrently with graphics rendering.
For overlapping workloads, Pascal introduces support for “dynamic load balancing.” In Maxwell
generation GPUs, overlapping workloads were implemented with static partitioning of the GPU into a
subset that runs graphics, and a subset that runs compute. This is efficient provided that the balance of
work between the two loads roughly matches the partitioning ratio. However, if the compute workload
takes longer than the graphics workload, and both need to complete before new work can be done, and
the portion of the GPU configured to run graphics will go idle. This can cause reduced performance that
may exceed any performance benefit that would have been provided from running the workloads
GeForce GTX 1080 Whitepaper GeForce GTX 1080 GPU Architecture In-Depth
| 15
overlapped. Hardware dynamic load balancing addresses this issue by allowing either workload to fill the
rest of the machine if idle resources are available.
Figure 10: Pascal's Dynamic Load Balancing Reduces GPU Idle Time When Graphics Work Finishes Early,
Allowing the GPU to Quickly Switch to Compute
Time critical workloads are the second important asynchronous compute scenario. For example, an
asynchronous timewarp operation must complete before scanout starts or a frame will be dropped. In
this scenario, the GPU needs to support very fast and low latency preemption to move the less critical
workload off of the GPU so that the more critical workload can run as soon as possible.
As a single rendering command from a game engine can potentially contain hundreds of draw calls, with
each draw call containing hundreds of triangles, and each triangle containing hundreds of pixels that
have to be shaded and rendered. A traditional GPU implementation that implements preemption at a
high level in the graphics pipeline would have to complete all of this work before switching tasks,
resulting in a potentially very long delay.
To address this issue, Pascal is the first GPU architecture to implement Pixel Level Preemption. The
graphics units of Pascal have been enhanced to keep track of their intermediate progress on rendering
work, so that when preemption is requested, they can stop where they are, save off context information
about where to start up again later, and preempt quickly. The illustration below shows a preemption
request being executed.

http://international.download.nvidi...al/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf

Calin Banc · 22 May 2016 at 20:59

Calin Banc said:
I think they said in a post a while back that there is no specific code for async, just that nvidia has it turned off by default. I think you can still turn it on through a .ini config or at least that's what I've heard. Even if they had some special code for async, it looks like Pascal doesn't truly support it (it just doesn't take a performance hit while doing it), so it's rather useless anyway.

At this point I'd say nothing is really "trivial" in dx12 and most devs won't say much about which GPU is limiting or supporting what (no matter if it is AMD or nVIDIA), due to obvious reasons.

humbug said:
My guess is AMD are banking on their console monopoly for ASync.

My guess is "not trivial" means you can't get performance out of what nvidia has, so you'll end up with a mess that as a developer and perhaps as a publisher, is just best to stay away from it for the moment.

From Oxide themselves:

n regards to the purpose of Async compute, there are really 2 main reasons for it:

1) It allows jobs to be cycled into the GPU during dormant phases. In can vaguely be thought of as the GPU equivalent of hyper threading. Like hyper threading, it really depends on the workload and GPU architecture for as to how important this is. In this case, it is used for performance. I can't divulge too many details, but GCN can cycle in work from an ACE incredibly efficiently. Maxwell's schedular has no analog just as a non hyper-threaded CPU has no analog feature to a hyper threaded one.

2) It allows jobs to be cycled in completely out of band with the rendering loop. This is potentially the more interesting case since it can allow gameplay to offload work onto the GPU as the latency of work is greatly reduced. I'm not sure of the background of Async Compute, but it's quite possible that it is intended for use on a console as sort of a replacement for the Cell Processors on a ps3. On a console environment, you really can use them in a very similar way. This could mean that jobs could even span frames, which is useful for longer, optional computational tasks.

Saying we heavily rely on async compute is a pretty big stretch. We spent a grand total of maybe 5 days on Async Shader support. It essentially entailed moving some ( a grand total of 4, IIRC) compute jobs from the graphics queue to the compute queue and setting up the dependencies. Async compute wasn't available when we began architecting (is that a word?) the engine, so it just wasn't an option to build around even if we wanted to. I'm not sure where this myth is coming from that we architected around Async compute. Not to say you couldn't do such a thing, and it might be a really interesting design, but it's not OUR current design.

Saying that Multi-Engine (aka Async Compute) is the root of performance increases on Ashes between DX11 to DX12 on AMD is definitely not true. Most of the performance gains in AMDs case are due to CPU driver head reductions. Async is a modest perf increase relative to that.

Regarding Async compute, a couple of points on this. FIrst, though we are the first D3D12 title, I wouldn't hold us up as the prime example of this feature. There are probably better demonstrations of it.

Our use of Async Compute, however, pales with comparisons to some of the things which the console guys are starting to do. Most of those haven't made their way to the PC yet, but I've heard of developers getting 30% GPU performance by using Async Compute.

There are 2 main types of tasks for a GPU, graphics and compute. D3D12 exposes main 2 queue types, a universal queue (compute and graphics), and a compute queue. For Ashes, use of this feature involves taking compute jobs which are already part of the frame and marking them up in a way such that hardware is free to coexecute it with other work. Hopefully, this is a relatively straightfoward tasks. No additional compute tasks were created to exploit async compute. It is merely moving work that already exists so that it can run more optimally. That is, if async compute was not present, the work would be added to the universal queue rather than the compute queue. The work still has to be done, however.

The best way to think about it is that the scene that is rendered remains (virtually) unchanged. In D3D12 the work items are simply arranged and marked in a manner that allows parallel execution. Thus, not using it when you could is seems very close to intentionally sandbagging performance.

We don't 'optimize' for it per say, we detangled dependencies in our scene so it can execute in parallel. Thus, I wouldn't say we optimized or built around it, we just moved some of the rendering work to compute and scheduled it to co-execute. Since we aren't a console title, we're not really tuning it like someone might on an XboxOne or PS4. However, consoles guys I've talked to think that 20% increase in perf is about the range that is expected for good use on a console anyway.

http://www.overclock.net/t/1575638/...able-legends-dx12-benchmark/110#post_24475280
http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/2130#post_24379702
http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/1400#post_24360916
http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/1200#post_24356995
http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/1200#post_24356995
http://www.overclock.net/t/1569897/...ingularity-dx12-benchmarks/1200#post_24356995

Basically what Oxide did was to take work that was already there and that required processing, marking it so that the GPU can recognize it and that's it. No further magic optimization for AMD. A total grand of 5 days isn't that problematic as well in terms of manpower.

So in the end if you take out the dx11 bottleneck plus some async you get to the theoretical performance of the card - ergo the amd cards doing better than usual. Most likely if nvidia could have done this, "trivial" would have not even been mentioned.

GFX-Kid · 22 May 2016 at 21:09

h4rm0ny said:
But they weren't honest, that's why people kicked up a fuss. Someone asked them if they had an interest after something they posted in another thread and they said they had no interest. Then someone noticed their username was the same as a poster on another forum and that they had a forty-grand of shares in AMD and that they'd also worked for one of the AIBs.

My main issue is they post inaccurate things and call any criticism "mocking" and complain that people are trying to destroy AMD. It's little to do with what they choose to do with their money.

I see. Not been paying too much attention to it myself. Just visit by for Polaris news.

Seriously, flopper, where you at?

Mauller said:
Can people go and derail a different thread.

+1

old gamer · 22 May 2016 at 21:13

GFX-Kid said:
I see. Not been paying too much attention to it myself. Just visit by for Polaris news.

Seriously, flopper, where you at?

+1

He's gone to buy some shades, the future's so bright

Ferrari100 · 22 May 2016 at 21:21

Gregster said:
Yer, I am just reading the 1080 Whitepaper and it looks like Pascal needs a different programming method.

Pascal uses Pixel level preemption as opposed to ACEs to deal with it.

http://international.download.nvidi...al/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf

Thanks for that. It looks like this is the best implementation they could come up with without hardware. It looks like a lot of overhead that those fast cores will have to cope with.

I am still not sure if you can technically call this async compute though because Async means it is possible to do many things at once. This preemption and context switching suggests one task has to wait while another executes rather than allowing more than one task to execute at the same time. AMD aces can just dish out compute tasks to any available core it finds.
Do you see where I am coming from?

I am really excited to see what Polaris brings with its new arch. The best, fastest GPU will be compared to the 390x no doubt. I am hoping for 980ti performance with the possibility of a little OC.

CAT-THE-FIFTH · 22 May 2016 at 21:22

old gamer said:
He's gone to buy some shades, the future's so bright

Hopefully not too bright otherwise we won't see anything on the screen.

Mauller · 22 May 2016 at 22:03

CAT-THE-FIFTH said:
Hopefully not too bright otherwise we won't see anything on the screen.

Brighter than the sun!

No!

Just a HDR monitor connected to a Polaris GPU!

:cool:

andybird123 · 22 May 2016 at 23:07

Ferrari100 said:
Thanks for that. It looks like this is the best implementation they could come up with without hardware. It looks like a lot of overhead that those fast cores will have to cope with.

I am still not sure if you can technically call this async compute though because Async means it is possible to do many things at once. This preemption and context switching suggests one task has to wait while another executes rather than allowing more than one task to execute at the same time. AMD aces can just dish out compute tasks to any available core it finds.
Do you see where I am coming from?

I am really excited to see what Polaris brings with its new arch. The best, fastest GPU will be compared to the 390x no doubt. I am hoping for 980ti performance with the possibility of a little OC.

No. Nvidia GPU's CAN do graphics and compute tasks simultaneously, just less of them than AMD's cards, if you try to do too many it forces a context switch. Devs optimising for AMD cards or using AMD guidelines are going to be trying to force too many tasks so NVidia have just sidestepped the whole issue by improving pre-emption and context switching to the point where even poorly optimised code will still run full pelt.

Nvidia don't get as big a bonus from async as their GPU's and drivers are already well optimised for typical gaming loads, but its disingenuous to claim they don't support it and are just serialising the entire workflow.

Ferrari100 · 22 May 2016 at 23:16

andybird123 said:
No. Nvidia GPU's CAN do graphics and compute tasks simultaneously, just less of them than AMD's cards, if you try to do too many it forces a context switch. Devs optimising for AMD cards or using AMD guidelines are going to be trying to force too many tasks so NVidia have just sidestepped the whole issue by improving pre-emption and context switching to the point where even poorly optimised code will still run full pelt.

Nvidia don't get as big a bonus from async as their GPU's and drivers are already well optimised for typical gaming loads, but its disingenuous to claim they don't support it and are just serialising the entire workflow.

Async compute does not mean graphics and compute tasks executing simultaneously, it means compute tasks operating Asynchronously (i.e one or more at the same time and whatever hardware/thread/process is executing the compute task, it does not have to wait for one task to finish before dispatching another.)

If Nvidia or somebody else can show that is happening I will hapilly accept their GPUs can do Asynchronous compute. Any other way is not true async compute.

So, what are your views on Polaris?

CAT-THE-FIFTH · 22 May 2016 at 23:22

Maybe we need to give it a few months before we can denote a winner in the Async Wars - we need devs to play around with the different approaches first?

humbug · 22 May 2016 at 23:25

andybird123 said:
Is it really a 30% performance boost, or is it just removing a bottleneck that AMD suffers from casuing a 30% performance drop in other titles.

It is the latter, I explained this some page's back.

Ferrari100 · 22 May 2016 at 23:31

humbug said:
It is the latter, I explained this some page's back.

So does this remove all AMD'S bottlenecks? If so Nvidia are left with a significant bottle neck somewhere.
For example, if a 390x is generally around 980 performance without Async but experiencing a 20% boost in performance when using ASYNC, where does the 980 bottleneck lie that prevents it from keeping up? I'm interested to know.

andybird123 · 22 May 2016 at 23:36

Eh?
One GPU gaining 20% doesnt mean the other loses 20%

Ferrari100 · 22 May 2016 at 23:40

andybird123 said:
Eh?
One GPU gaining 20% doesnt mean the other loses 20%

Read it again, I explain it better. I mean why can't it keep up. There is a bottleneck if what you say is true. I am trusting what you say and want to know where Nvidia's bottleneck is?
Thanks

queamin · 22 May 2016 at 23:42

I thought 390x and 980ti had about the same GFLOPs it is just harder to get Amd cards to max out their GFLOPs before dx12, but that don't mean 390x is as powerful as a 980ti just means sometimes it can close the gap a bit in dx12 to dx11.

Ferrari100 · 22 May 2016 at 23:48

queamin said:
I thought 390x and 980ti had about the same GFLOPs it is just harder to get Amd cards to max out their GFLOPs before dx12, but that don't mean 390x is as powerful as a 980ti just means sometimes it can close the gap a bit in dx12 to dx11.

That hypothesis can proven in DX12 games that do not use ASync compute and I believe it will hold true due to DX12 making it easier to use multiple CPU cores where available.

Don't get confused with what DX12 offers in this respect and Async compute, they are two different things that improve performance in different ways.

andybird123 · 22 May 2016 at 23:48

Ferrari100 said:
Read it again, I explain it better. I mean why can't it keep up. There is a bottleneck if what you say is true. I am trusting what you say and want to know where Nvidia's bottleneck is?
Thanks

No, thats not what i said at all - youre creating a strawman argument and being rather facetious in the process.
Without async the 390x is not fully utilised, its a bigger chip than a 980 and SHOULD be faster.

If they both get 60fps without async, the 390x goes up to 70 with async, the 980 still gets 60 because its already fully utilised at 60, it hasnt gone down, theres no bottleneck, the 390X has had a bottleneck removed and so being a bigger chip it now gets a bigger score