AMD Polaris architecture – GCN 4.0

Mtom · 25 Mar 2016 at 10:02

CAT-THE-FIFTH said:
According to the full article,Polaris was meant to be late Q1 or early Q2. It has been delayed a quarter already. The Capsaicin roadmap is the new one with the delayed releases.

They have just massively cut their head start over Nvidia if what Gibbo says is true,ie,late summer now for Polaris and "summer" for Pascal.

I just hope what Mauller says is true - all the prebuilts get released more in early summer.

AMD releasing on a new node at the same time as Nvidia has not happened for a very long time and it would be disaster for them.

Edit!!

Late summer as Gibbo says would make it delayed over a quarter then.

So that probably confirms good old GF are being used not Samsung.

Or AMD knows how NV stands with Pascal, and they have time to play around. no one seen a working Pascal yet, and i heard even OEMs haven't got any.

Steampunk · 25 Mar 2016 at 10:12

Mtom said:
Or AMD knows how NV stands with Pascal, and they have time to play around. no one seen a working Pascal yet, and i heard even OEMs haven't got any.

Which you think would be a good reason to get Polaris out sooner, but maybe AMD are clearing old inventory, building new stock to avoid Polaris shortages, tweaking Polaris to be better on yields and clocks, etc. Maybe those extra few months are worth it to AMD, because it means they can do a full launch with high availability, especially if they know the product is good.

David Bisset · 25 Mar 2016 at 10:15

Kaapstad said:
I only need to remember one single fact about HBM1 and that is - the performance is bad @1080p but improves if you overclock it on LN2.

Obviously no one is going to game using LN2 but it proves the basic concept that HBM1 clockspeed is way too low and throttles performance.

HBM1 is flawed end of.

You continue to blame HBM for low-res performance levels without any basis whatsoever. Just to be clear: clockspeed that you keep citing as the problem is literally just a number. It matters only for it's impact on other numbers, such as bandwidth, latency etc. And HBM wins in these regards.

Many threads have had the same conversation with people pointing out to you that your assumption about the cause of the performance problems isn't right but you keep saying it - you're a huge enthusiast and I love reading about your crazy setups but it's a shame when you get something stuck in your head then insist that as you own the cards no-one else could be right. Yes, 1080p performance is relatively poor but as others have pointed out there are a bunch of factors that could be contributing, such as driver overhead, restrictions in other areas of the architecture such as DM is pointing out with the ROPs.

Even if leaving all speeds the same bar the memory yields performance gains this still doesn't tell you the memory is the problem - e.g. if the ROPs are holding performance back they may gain improved throughput with faster memory access thus you see gains. However, we can't see what GDDR5 would be like with the same setup - given the lower bandwidth and similar latency the problem would be exacerbated if ROP throughput was the problem, indicating the memory tech isn't to blame. We also can't try increasing the ROP count unfortunately to see how much of a boost that gives. (This was a wholly hypothetical example, I've not looked for if performance changes significantly when overclocking only the memory nor have I looked into why. I suspect that my somewhat flippant answer above is garbage

Edit: Though it does appear that latency & bandwidth of memory access is very relevant to ROP throughput so LN2 should help on any memory tech)

Anyhow, I've waffled on enough, I suppose I should do some work

Oh boy the months are passing slowly - I want new shiny techs to choose between! :'(

Kaapstad · 25 Mar 2016 at 10:54

Mauller said:
And yet this piece of drivel you keep spouting has been debunked multiple times when driver overhead is no longer an issue. Or do you like to ignorantly ignore all dx12 benches when fijis dx12 performance jumps ahead?

Or even scratch that, even in dx11 Fiji runs better than ever and runs faster than a 980ti in farcry primal and a few other games now.

The problem is not HBM and never has been, it has been shader utilisation.

David Bisset said:
You continue to blame HBM for low-res performance levels without any basis whatsoever. Just to be clear: clockspeed that you keep citing as the problem is literally just a number. It matters only for it's impact on other numbers, such as bandwidth, latency etc. And HBM wins in these regards.

Many threads have had the same conversation with people pointing out to you that your assumption about the cause of the performance problems isn't right but you keep saying it - you're a huge enthusiast and I love reading about your crazy setups but it's a shame when you get something stuck in your head then insist that as you own the cards no-one else could be right. Yes, 1080p performance is relatively poor but as others have pointed out there are a bunch of factors that could be contributing, such as driver overhead, restrictions in other areas of the architecture such as DM is pointing out with the ROPs.

Even if leaving all speeds the same bar the memory yields performance gains this still doesn't tell you the memory is the problem - e.g. if the ROPs are holding performance back they may gain improved throughput with faster memory access thus you see gains. However, we can't see what GDDR5 would be like with the same setup - given the lower bandwidth and similar latency the problem would be exacerbated if ROP throughput was the problem, indicating the memory tech isn't to blame. We also can't try increasing the ROP count unfortunately to see how much of a boost that gives. (This was a wholly hypothetical example, I've not looked for if performance changes significantly when overclocking only the memory nor have I looked into why. I suspect that my somewhat flippant answer above is garbage Edit: Though it does appear that latency & bandwidth of memory access is very relevant to ROP throughput so LN2 should help on any memory tech)

Anyhow, I've waffled on enough, I suppose I should do some work

Oh boy the months are passing slowly - I want new shiny techs to choose between! :'(

Yet overclocking HBM1 gives more performance.

HBM1 already has more than enough bandwidth due to it's wide bus, that leaves clockspeed which is giving the extra performance when overclocked.

HBM1 is flawed because it does not run at a higher clockspeed.

Mauller · 25 Mar 2016 at 11:16

Kaapstad said:
Yet overclocking HBM1 gives more performance.

HBM1 already has more than enough bandwidth due to it's wide bus, that leaves clockspeed which is giving the extra performance when overclocked.

HBM1 is flawed because it does not run at a higher clockspeed.

Strawman argument, overclocking normally yields more performance like people have stated numerous time. Also HBM has far greater parallel access compared to GDDR5 so overclocking will always yield more performance.

It has also been shown numerous times with DX12 that 1080p performance is more driver limiting for fiji than anything else. But as i also said, that seems to be decreasing as fiji is catching up and overtaking GM100 in 1080p in some newer games in dx11. And in general, fiji's 1080p performance is nowhere as bad as it used to be at launch.

The problem is architectural and fiji's inability to utilise its shaders effectively in more single threaded workloads, which GCN4 fixes and appears to give the GCN architecture a large boost in shader performance and utilisation.

Mauller · 25 Mar 2016 at 12:09

Keeping on Topic, a rumour has been going around that GCN 4.0 will utilise CPU style Power gating on it's CU's, allowing it to put them into a low power state when not needed.

Seems like a very good feature for mobile parts.

Edit - It appears to come from the open linux gpu commits for baffin and Ellesmere

https://lists.freedesktop.org/archives/dri-devel/2016-March/103402.html?utm_source=anzwix

Eric Huang (9):
drm/amd/powerplay: add thermal control for elm/baf
drm/amd/powerplay: add UVD&VCE DPM and powergating support for elm/baf
drm/amd/powerplay: add all blocks clockgating support through
SMU/powerplay
drm/amd/powerplay: add GFX/SYS clockgating support for ELM/BAF
drm/amd/powerplay: add GFX per cu powergating support through
SMU/powerplay
drm/amd/powerplay: add GFX per cu powergating for Baffin
drm/amd/amdgpu: add medium grain powergating support for Baffin
drm/amd/amdgpu: add power gating initialization support for GFX8.0
drm/amd/amdgpu: add power gating init for Baffin

Edit2 - Baffin and elesmere appear support up to 8 ACES (or maybe 4HWS blocks since a single HWS performs the work of two ACES apparently) each and 32 Threads along with 256bit general purpose registers. Asuming i am iterpreting the GPRS stuff correctly.

https://lists.freedesktop.org/archives/dri-devel/2016-March/103428.html

+ case CHIP_BAFFIN:
+ ret = amdgpu_atombios_get_gfx_info(adev);
+ if (ret)
+ return ret;
+ adev->gfx.config.max_gprs = 256;
+ adev->gfx.config.max_gs_threads = 32;
+ adev->gfx.config.max_hw_contexts = 8;

+ case CHIP_ELLESMERE:
+ ret = amdgpu_atombios_get_gfx_info(adev);
+ if (ret)
+ return ret;
+ adev->gfx.config.max_gprs = 256;
+ adev->gfx.config.max_gs_threads = 32;
+ adev->gfx.config.max_hw_contexts = 8;

Edit 3 - it appears that Ellesmere can also support per CU powergating but is not enabled as of yet in the driver.

https://lists.freedesktop.org/archives/dri-devel/2016-March/103419.html

+/* This function is for Baffin only for now,
+ * Powerplay will only control the static per CU Power Gating.
+ * Dynamic per CU Power Gating will be done in gfx.
+ */
+int ellesmere_phm_enable_per_cu_power_gating(struct pp_hwmgr *hwmgr, bool enable) <- line denoting ellesmere support although not enabled.

Edit 4 - confirmation of 256bit bus with 8 GDDR channels. Also Baffin appears to support 8 Gddr banks, most likely ganging two chips on one channel.

https://lists.freedesktop.org/archives/dri-devel/2016-March/103474.html

+ case BW_CALCS_VERSION_ELLESMERE:
+ vbios.number_of_dram_channels = 8;
+ vbios.dram_channel_width_in_bits = 32;
+ vbios.number_of_dram_banks = 8;

+ case BW_CALCS_VERSION_BAFFIN:
+ vbios.number_of_dram_channels = 4;
+ vbios.dram_channel_width_in_bits = 32;
+ vbios.number_of_dram_banks = 8;

Edit 4.2 - they appear to have GDDR running at 6000mhz with a default core clock of 1152mhz if this is correct.

+ vbios.high_yclk = bw_int_to_fixed(6000); <-Ram clock stuff
+ vbios.mid_yclk = bw_int_to_fixed(3200);
+ vbios.low_yclk = bw_int_to_fixed(1000);
+ vbios.low_sclk = bw_int_to_fixed(300); <-GPu clock stuff
+ vbios.mid_sclk = bw_int_to_fixed(974);
+ vbios.high_sclk = bw_int_to_fixed(1154);

Kaapstad · 25 Mar 2016 at 13:53

Mauller said:
Strawman argument, overclocking normally yields more performance like people have stated numerous time. Also HBM has far greater parallel access compared to GDDR5 so overclocking will always yield more performance.

It has also been shown numerous times with DX12 that 1080p performance is more driver limiting for fiji than anything else. But as i also said, that seems to be decreasing as fiji is catching up and overtaking GM100 in 1080p in some newer games in dx11. And in general, fiji's 1080p performance is nowhere as bad as it used to be at launch.

The problem is architectural and fiji's inability to utilise its shaders effectively in more single threaded workloads, which GCN4 fixes and appears to give the GCN architecture a large boost in shader performance and utilisation.

So you are telling with all the bandwidth HBM1 has it still needs to be overclocked.

You have just lost your argument.

r7slayer · 25 Mar 2016 at 14:09

When next gen high end cards come out from both camps with HBM i can see some people changing their tunes lol.

David Bisset · 25 Mar 2016 at 14:16

Kaap please read my post you quoted. Nothing you're seeing suggests HBM is poor. Indeed what you see backs what DM says is the problem, as if ROPs constrain performance then memory bandwidth becomes even more important. As Mauller also pointed out, if it were HBMs 'fault' then driver changes wouldn't be closing the gap. I'd suggest both of their answers are relevant & are both contributing to what we see, but that HBM is not in any way a factor.

To repeat clockspeed means nothing. At all. Literally irrelevant as a number when looking at performance. You have two main metrics, latency and bandwidth. Latency is similar between GDDR5 and HBM (and system RAM) as it's about the limits of the actual chips. Bandwidth is better with HBM. Clockspeed is simply a mechanism to get bandwidth through your interface, not something that gives any performance in itself. The reason increasing clocks helps is that it increases bandwidth. GDDR5 would be worse, even though clockspeeds of 7000 sound impressive compared with 500 it's totally a pointless comparison.

The only time clockspeed can be used as even a slight performance indicator is when comparing two systems using the same tech, and even then it's poor as we don't generally see the timings of GDDR so sometimes faster clockspeed is causing increased latency so you're trading one relevant metric for another meaning situationally worse performance.

You might find this, though written a while ago, provides some useful information: https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/
It's still fairly relevant to todays cards.

(Edit any pedants: yes, bandwidth and latency can alternately be represented by the trio of frequency, bus width and timings ... but as bus width is clearly better with HBM we're just comparing the combo of width with frequency for bandwidth, and timings with frequency for latency, making frequency on it's own not a relevant stat. As we don't normally see timings we can't compare easily there, but we can compare latency and doing so reveals that even using HBM as 'wide GDDR' rather than looking at better addressing we see matching latency)

Orangey · 25 Mar 2016 at 15:00

What if you need to fetch many small data held in contiguous addresses?

Mauller · 25 Mar 2016 at 15:07

Kaapstad said:
So you are telling with all the bandwidth HBM1 has it still needs to be overclocked.

You have just lost your argument.

I am not saying it needs overclocking, you are the one implying that. I was saying that overclocking always increases performance regardless of what system you are using.

You are forgetting that per pin the data rate increases more when overclocked on HBM compared to GDDR, the bandwidths quoted are just aggregate bandwidth. Per pin GDDR has more bandwidth but greater latency as you are queueing up more commands. With HBM it is lower per pin but you are going wide so you queue up far less command's due to the greater parallel access. No different to how performance is increased with systems such as Nand flash.

So as a percentage when you overclock HBM you increase bandwidth per pin to a greater extent than with GDDR5.

Also, you keep ignoring our quoting improved 1080p performance with newer drivers and DX12. So stop with the straw man arguments and get over it.

You either go slower, wide and parallel. Or you go thinner, higher clocks and serial. The overall band with will be the same in the end, just the latter can have latency issues. Learn how memory systems work.

pmc25 · 25 Mar 2016 at 16:02

Kaapstad said:
So you are telling with all the bandwidth HBM1 has it still needs to be overclocked.

You have just lost your argument.

No he hasn't. You have over and over and over. Just like your ridiculous proclamations re: RAM capacity. Fiji has an obscene number of shaders. It's bw gated, as seen by how little core OC without memory OC yields in performance increase.

You've still yet to explain how you think memory clocks make a card faster or slower in and of themselves, as opposed to how they determine bandwidth as a function of bus width.

layte · 25 Mar 2016 at 19:14

You could make an argument that the relatively low clock speed of HBM means that the number of memory operations per second is much lower than in GDDR5. So, where a large number of discreet memory requests are being made they would be throttled in comparison.

How much of an impact that makes in reality is another matter. HBM or a derivative of is obviously the choice memory solution going forward.

Mauller · 25 Mar 2016 at 19:27

layte said:
You could make an argument that the relatively low clock speed of HBM means that the number of memory operations per second is much lower than in GDDR5. So, where a large number of discreet memory requests are being made they would be throttled in comparison.

How much of an impact that makes in reality is another matter. HBM or a derivative of is obviously the choice memory solution going forward.

That is the thing, HBM can accommodate more unique memory operations per second because of how wide it is. You can also read and write different blocks on the same chip concurrently, while with GDDR5 you have to queue up your commands then you are either reading or writing to an entire chip at once.

The only times when GDDR5 would be at an advantage with latency is when writing a file that is larger than the per pin bandwidth. But when it comes to shader operations they will always be performing many small I/O operations because there is only so much level 1 and 2 cache in the shaders. you can't store an entire texture in a shader cache unless it is tiny.

Which is why HBM in the end is better for GPU's as the number of shaders start to further increase. And also why it is coming to compute parts first since it is a better fit for those kind of workloads.

Kaapstad · 25 Mar 2016 at 19:29

David Bisset said:
Kaap please read my post you quoted. Nothing you're seeing suggests HBM is poor. Indeed what you see backs what DM says is the problem, as if ROPs constrain performance then memory bandwidth becomes even more important. As Mauller also pointed out, if it were HBMs 'fault' then driver changes wouldn't be closing the gap. I'd suggest both of their answers are relevant & are both contributing to what we see, but that HBM is not in any way a factor.

To repeat clockspeed means nothing. At all. Literally irrelevant as a number when looking at performance. You have two main metrics, latency and bandwidth. Latency is similar between GDDR5 and HBM (and system RAM) as it's about the limits of the actual chips. Bandwidth is better with HBM. Clockspeed is simply a mechanism to get bandwidth through your interface, not something that gives any performance in itself. The reason increasing clocks helps is that it increases bandwidth. GDDR5 would be worse, even though clockspeeds of 7000 sound impressive compared with 500 it's totally a pointless comparison.

The only time clockspeed can be used as even a slight performance indicator is when comparing two systems using the same tech, and even then it's poor as we don't generally see the timings of GDDR so sometimes faster clockspeed is causing increased latency so you're trading one relevant metric for another meaning situationally worse performance.

You might find this, though written a while ago, provides some useful information: https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/
It's still fairly relevant to todays cards.

(Edit any pedants: yes, bandwidth and latency can alternately be represented by the trio of frequency, bus width and timings ... but as bus width is clearly better with HBM we're just comparing the combo of width with frequency for bandwidth, and timings with frequency for latency, making frequency on it's own not a relevant stat. As we don't normally see timings we can't compare easily there, but we can compare latency and doing so reveals that even using HBM as 'wide GDDR' rather than looking at better addressing we see matching latency)

And yet cards with a lot less bandwidth like the TitanX and 980 Ti are faster @1080p.

HBM1 needs MHz to compete.

layte · 25 Mar 2016 at 19:50

Mauller said:
That is the thing, HBM can accommodate more unique memory operations per second because of how wide it is. You can also read and write different blocks on the same chip concurrently, while with GDDR5 you have to queue up your commands then you are either reading or writing to an entire chip at once.

That's quite the case though, as it's not 1 command per pin is it. I'm sure I read somewhere that Fiji has 8 512bit memory controllers http://www.bit-tech.net/hardware/graphics/2015/06/24/amd-radeon-r9-fury-x-review/1 , whilst the 980ti has 12 32bit controllers. You can only handle one operation per cycle per controller. So massive bandwidth on sequential read or writes, but constraints when larger number of discreet operations are required.

Edit: I still agree that HBM is the premier solution. I just find this an interesting discussion.

pmc25 · 25 Mar 2016 at 20:00

Kaapstad said:
And yet cards with a lot less bandwidth like the TitanX and 980 Ti are faster @1080p.

HBM1 needs MHz to compete.

You've clearly either been trolling the entire time, or are too embarrassed to admit that you're wrong.

You haven't provided a single substantiating claim, or even postulated how it could work like you say it does (with zero evidence).

And anyone with even the slightest sense could tell you that NVIDIA and AMD's cards are very dissimilar in hardware architecture.

ICDP · 25 Mar 2016 at 20:16

CAT-THE-FIFTH said:
I really hope you are right!! AMD needs a homerun with Polaris and a few months of Maxwell looking meh would help them.

Unless they have some inside information saying Pascal is delayed or Polaris is much better,I would not be cutting it close with Nvidia. I still remember when Nvidia sold more FX cards than ATI with the 9000 series,despite the former being crap!

The reason why the HD5000 series did so well is since they had a six month headstart on Nvidia,had decent price,decent performance and decent power consumption. But they could not still outsell Nvidia but did gain back marketsahre,and once Fermi released Nvidia gained back sales.

CAT I consider you one of the most sensible posters on this forum but you are reading too much into this latest SA article.

In Nov SA (allegedly) had a source claiming Polaris "VG10" would be released in late Q1/early Q2. Or to put it another way, March/April 2016. The latest SA article is taking the info from AMD Caspaicin event that states "mid 2016/before back to school" as evidence of a slip by at least one quarter.

So the official info from AMD stating mid 2016 has not changed and any earlier dates were simply conjecture. This is just SA saying Polaris is not being release in March/April as their earlier Nov article claimed.

h4rm0ny · 25 Mar 2016 at 20:17

Kaapstad said:
And yet cards with a lot less bandwidth like the TitanX and 980 Ti are faster @1080p.

HBM1 needs MHz to compete.

I'm not sure if this is me, but I don't think that makes sense.

Mauller · 25 Mar 2016 at 20:48

layte said:
That's quite the case though, as it's not 1 command per pin is it. I'm sure I read somewhere that Fiji has 8 512bit memory controllers http://www.bit-tech.net/hardware/graphics/2015/06/24/amd-radeon-r9-fury-x-review/1 , whilst the 980ti has 12 32bit controllers. You can only handle one operation per cycle per controller. So massive bandwidth on sequential read or writes, but constraints when larger number of discreet operations are required.

Edit: I still agree that HBM is the premier solution. I just find this an interesting discussion.

Even with just 8 the memory controllers will act very differently due to the way HBM works, they wouldnt use the same methods for controlling the chips as the GDDR memory controller would. (a bit of trivia is that the HBM memory controllers are smaller and less complex.

) And Nvidias approach is to use a single memory controller for every GDDR channel. The Hawaii architecture also uses 8 memory controllers with each controller managing 2 GDDR channels. The controllers themselves may act very differently to each-other so not exactly a guide to performance.