
AMD Polaris architecture – GCN 4.0

Draw call throughput bottlenecking: at 1080p the CPU is doing more of the work; the higher the resolution goes, the more of that work is offloaded to the GPU. That's when the true power of the GPU architecture comes into play, where it's not bottlenecked by a combination of a poor API and its own drivers.

Crudely put:
DirectX 11 is serial programming throughput?
DirectX 12 is async and parallel throughput?

I don't believe that the CPU bottleneck is that bad at 1080p. It's always been my opinion that Fury/Tonga are too wide on the front end, and with the 8 ACE units and global data share, as they lack a hardware scheduler they rely on the DirectX 11 software for their front-end commanding, which doesn't help feed the wider parallel architecture. The solution in Polaris seems to be a hardware scheduler and better shader throughput + DirectX 12.
 

You don't believe it, that much is obvious.

Selection Bias - A person picks only the evidence that fits the conclusion they have already come to.

With respect, we all do it at some point.

Geometry tessellation management, colour and texture compression: those things get more difficult as the resolution goes up, not easier. So if you want to invoke front-end scheduling, then the opposite of your argument is what should be happening.

DX12 is not just a parallel extraction layer; DX11 is also parallel. The difference here, and I think this is what you are driving at, is that DX12 will use 4 draw-call threads whereas DX11 will usually only use 1, though it is capable of using 2.

The biggest difference between DX11 and 12 is efficiency: DX12 will extract around 3x more calls on each thread than DX11.
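The threading difference described above can be sketched with a toy model. This is not real graphics code: the "draw calls" and helper names are invented, and it only illustrates the structural contrast between one thread recording everything (DX11-style) and several threads each recording their own command list before a single submission (DX12-style).

```python
# Toy model: DX11-style single-threaded draw-call recording vs
# DX12-style recording across 4 worker threads. All names are invented.
from concurrent.futures import ThreadPoolExecutor

DRAWS = list(range(8000))  # pretend draw calls for one frame

def record(draws):
    """Pretend to record draw calls into a command list."""
    return [("draw", d) for d in draws]

# DX11 style: one thread records everything, then the driver consumes it.
dx11_lists = [record(DRAWS)]

# DX12 style: the frame is split across 4 threads, each recording its own
# command list; all lists are then submitted together to one queue.
chunks = [DRAWS[i::4] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    dx12_lists = list(pool.map(record, chunks))

# Same total work either way; only the recording is parallelised.
assert sum(len(cl) for cl in dx12_lists) == len(dx11_lists[0])
```

The point of the sketch is that the API difference is about where recording work can happen, not about how much work there is per frame.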

The only way you can explain higher performance scaling at higher resolutions is through external architectural bottlenecking, and that can only be the API and/or drivers.
 
I want to add to this.

AMD do have, or have had, a bit of an issue with tessellation and texture compression when compared with Nvidia.

But with texture compression on the AMD side, it's usually overcome with brute force rather than efficiency. AMD traditionally have a wider memory bus: 384-bit vs 256-bit, 512-bit vs 384-bit, 4096-bit vs 384-bit.
The width of that bus, in combination with memory speed, dictates the memory bandwidth: 250 GB/s, 320 GB/s, 512 GB/s. Memory bandwidth is what matters with texture LOD; the more bandwidth you have, the higher the performance.
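The bus-width and bandwidth figures above follow from simple arithmetic: peak bandwidth is bus width times effective per-pin data rate, divided by 8 bits per byte. A quick check (the per-pin rates below are illustrative GDDR5/HBM figures, and the helper function is invented):

```python
def bandwidth_gbs(bus_bits, effective_gbps_per_pin):
    """Peak memory bandwidth in GB/s: pins * bits/s per pin / 8 bits per byte."""
    return bus_bits * effective_gbps_per_pin / 8

print(bandwidth_gbs(512, 5))   # 512-bit bus at 5 Gbps/pin  -> 320.0 GB/s
print(bandwidth_gbs(4096, 1))  # 4096-bit HBM at 1 Gbps/pin -> 512.0 GB/s
```

This is why a narrow bus with fast memory and a wide bus with slow memory can land on similar bandwidth numbers.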

So while Nvidia have better and more efficient texture compression, AMD have more muscle for it.

As for tessellation, there isn't a lot AMD can do about that other than just having more raw power.

AMD have now, by the looks of it, addressed these things: Tonga and Fiji have far more efficient texture compression, and Polaris looks like it will improve that further and address the tessellation problem.
 

Sounds about right. It will be interesting going forward, as the new wave of HBM2 cards will have very similar, if not exactly the same, bandwidth on both AMD and Nvidia cards. It will give us an even platform to see whose features are faster/more efficient.
Of course, in the grand scheme of things I can see both sides being close, just as they have been for the last god knows how many years. It certainly won't be one side massively further ahead than the other.
 
It tends to take 3-4 years of R&D time to go from concept to release for a new architecture.

My question was more along the lines of "can they change anything in hardware without having to tape out again?" I'm sure I read that any architectural hardware change would require another tape out.

Given that Radeon Tech Group was formed in September, they didn't have time to change - well, anything? - in hardware and keep the same release window.

So besides changing the name to Polaris, anything else would have to be software/driver improvements, rather than hardware changes?

I don't "know" any of this, which is why I'm phrasing it as a question rather than a statement :p

Also this:
http://www.amd.com/en-us/press-releases/Pages/amd-demonstrates-2016jan04.aspx

AMD expects shipments of Polaris architecture-based GPUs to begin in mid-2016

I think Charlie's talk of releasing some cards in two months is overly optimistic.
 
Yes it has, if you are basing the 40% IPC gain of Zen vs Carrizo.
Carrizo is Excavator.
Don't blame me, I didn't bring up Zen in the thread; I'm just stating facts and stopping people getting AMD-hyped.

Very little of what you've stated are facts. One, Carrizo is optimised for the process for the area it's being used in. You choose the right process and use different metals sometimes, different transistor designs for different frequencies. It is better optimised for lower leakage, better idle power cut off and lower clock frequencies. You could EASILY take the same exact architecture and even on the same node and most of the same process use a different transistor design, different spacing and different number of layers to quite significantly change the electrical performance to make it much more efficient at higher clocks.

So no, Carrizo doesn't 'not have any power improvements' because you are comparing apples with oranges when comparing chips manufactured for different purposes. An architecture doesn't simply work one way and one way only. Layout, exact node and where you want to target performance have a very large effect on power output at various clocks.

Also, IPC has absolutely everything to do with clock speed, so anyone (which is both of you) stating it doesn't, doesn't know what they are talking about.

High IPC designs in general want as low latency as possible, meaning a very short pipeline. Why? Because the time the CPU is doing the least work is when it's mispredicted a branch and has to flush and refetch. So you want as short a pipeline as you can get, which is directly linked to the achievable clock speeds of an architecture.

Stages in the pipeline and clock speed the architecture will run at are intrinsically linked. Having a wider core means to fill it efficiently you need larger prediction logic to keep it filled better and more consistently and you want a short pipeline to go along with it or it just isn't optimised. A very high clock speed design will want more stages in the pipeline and a narrower core.

IPC, clock speed, width of the core and amount of supporting core logic to keep the cores filled up are all very heavily linked together. There is a reason they didn't continue with a 5Ghz Core 2 Duo. P4, long pipeline, high clock speed, narrow core. Core 2 Duo, wider core, short pipeline, low clock speed.
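The P4-versus-Core 2 trade-off described above can be caricatured with toy numbers. Everything here is an illustrative assumption, not a measurement: the model pretends clock scales with the square root of pipeline depth and that a branch mispredict costs roughly one pipeline refill.

```python
# Toy model of the pipeline-depth trade-off: deeper pipelines clock higher
# (assumed sublinear here) but pay a bigger refill penalty on mispredicts.
def throughput(depth, mispredict_rate=0.05, base_cpi=1.0):
    clock_ghz = 1.2 * depth ** 0.5            # deeper pipeline -> higher clock (toy)
    cpi = base_cpi + mispredict_rate * depth  # refill penalty grows with depth
    return clock_ghz / cpi                    # "work per unit time" in toy units

# A short/wide design and a long/narrow design can land in a similar place:
print(round(throughput(14), 2))  # Core 2-ish depth -> 2.64
print(round(throughput(31), 2))  # Pentium 4-ish depth -> 2.62
```

Under these made-up constants the two designs come out nearly even, which is the point: pipeline depth, clock speed and misprediction cost trade off against each other rather than one dominating.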


When AMD go out there and say X architecture is y% faster than Z architecture, they aren't talking about a specific implementation but in general for that architecture, because as above you can take any architecture and change the layout/transistor/node to drastically change its performance. They are saying a mobile Zen aimed at low power would have a 40% higher IPC than a similar Excavator design, and a Zen FX for desktop and high power would be 40% higher IPC than an Excavator done for desktop/high power.
 
Sounds about right. It will be interesting going forward, as the new wave of HBM2 cards will have very similar, if not exactly the same, bandwidth on both AMD and Nvidia cards. It will give us an even platform to see whose features are faster/more efficient.
Of course, in the grand scheme of things I can see both sides being close, just as they have been for the last god knows how many years. It certainly won't be one side massively further ahead than the other.

Yes. If AMD can get the same sort of level of tessellation and compression throughput efficiency as Nvidia, and maintain an architectural performance lead, it means Nvidia will have to up their game with architectural performance to keep pace.

HBM2 should be the same for both, so that's texture LOD performance, but that still leaves the raw power in the architecture for a lot of other things. AMD do have more raw power in their GPUs for the same die size and power consumption.
 
I want to add to this.

AMD do have, or have had, a bit of an issue with tessellation and texture compression when compared with Nvidia.

But with texture compression on the AMD side, it's usually overcome with brute force rather than efficiency. AMD traditionally have a wider memory bus: 384-bit vs 256-bit, 512-bit vs 384-bit, 4096-bit vs 384-bit.
The width of that bus, in combination with memory speed, dictates the memory bandwidth: 250 GB/s, 320 GB/s, 512 GB/s. Memory bandwidth is what matters with texture LOD; the more bandwidth you have, the higher the performance.

So while Nvidia have better and more efficient texture compression, AMD have more muscle for it.

As for tessellation, there isn't a lot AMD can do about that other than just having more raw power.

AMD have now, by the looks of it, addressed these things: Tonga and Fiji have far more efficient texture compression, and Polaris looks like it will improve that further and address the tessellation problem.


Nvidia traditionally are the brute force method. In terms of memory compression, AMD have almost always had lower memory usage, and this is due to compression. You're mixing things up: having a wider bus doesn't mean they have worse memory compression; it means their architecture is designed for more throughput, not that the data being pushed through is less compressed.

There is a reason why so many games use less memory on AMD cards than Nvidia: their compression is better, not worse, and it's been that way since at least the 4870.

On the other side, memory bus: wider and slower is, in architecture terms, considered a bit more elegant than a narrower, heavily clocked memory bus. Nvidia had a lot of problems with their memory buses in the 280-580 era, where they usually had a wider memory bus than AMD. These days they've gone for the simpler, narrower memory controller and ramped the clock speeds up. Neither is really better or worse.
 

I'm in danger of falling foul of my own selection bias trap here, but as far as I'm aware AMD do not use less VRAM than Nvidia, other than sometimes with their Fiji GPUs. Nor should they: the texture is not compressed once in the buffer. It's an algorithm not too unlike compression tools; it's compressed on pick-up and decompressed in the buffer.

And I do think Maxwell especially is just as good as Fiji, and certainly better than Hawaii, at texture LOD despite having less raw throughput. I believe a GTX 980 has about 224 GB/s on its 256-bit bus vs the 390X at about 384 GB/s on its 512-bit bus.
 
It's not texture compression, it's colour data.

People took something they didn't understand and made it into "texture compression" because that was intuitive. And the misinfo spread like wildfire.
 
Texture compression
is a specialized form of image compression designed for storing texture maps in 3D computer graphics rendering systems. Unlike conventional image compression algorithms, texture compression algorithms are optimized for random access.
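The "optimized for random access" property in that definition comes from fixed-size blocks: because each compressed block occupies a known number of bytes, the block holding any given texel can be located with plain arithmetic, with no sequential decompression. A rough sketch using BC1/DXT1-style layout (the 8-bytes-per-4x4-block figure is from the published format; the helper function itself is invented for illustration):

```python
def bc1_block_offset(x, y, width):
    """Byte offset of the 8-byte BC1 block containing texel (x, y).

    BC1/DXT1 stores each 4x4 texel block in 8 bytes, so the block for any
    texel is found directly, which is what makes random access cheap.
    """
    blocks_per_row = (width + 3) // 4   # blocks per row, rounding width up
    return ((y // 4) * blocks_per_row + (x // 4)) * 8

# Texel (17, 5) in a 256-wide texture: block row 1, block column 4.
print(bc1_block_offset(17, 5, 256))  # (1 * 64 + 4) * 8 = 544
```

A general-purpose compressor like zlib can't do this, because decoding any byte depends on everything before it; that is the distinction the quoted definition is drawing.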
 
Very little of what you've stated are facts. One, Carrizo is optimised for the process for the area it's being used in. You choose the right process and use different metals sometimes, different transistor designs for different frequencies. It is better optimised for lower leakage, better idle power cut off and lower clock frequencies. You could EASILY take the same exact architecture and even on the same node and most of the same process use a different transistor design, different spacing and different number of layers to quite significantly change the electrical performance to make it much more efficient at higher clocks.

Like I said in post 272, I was spot on with my point that the process used will determine whether Carrizo is worth porting to high frequency/high TDP, and you have just confirmed it in your own explanation, so bugger off, I was not incorrect!

''Carrizo on its current low-performance process is tuned for mobile. IPC, clock speed and headroom are very relevant if you wish to port this design onto desktop 65-95W (Bristol Ridge). Also, if benchmarks are going to be based on the performance of a throttling Excavator APU, then that affects the comparison being made to their IPC gains. They cannot simply use the GF 28LP for desktop; if they port it to 14nm then awesome, but I imagine they'll use the 28SHP, which to be honest is pointless.''

So no, Carrizo doesn't 'not have any power improvements' because you are comparing apples with oranges when comparing chips manufactured for different purposes. An architecture doesn't simply work one way and one way only. Layout, exact node and where you want to target performance have a very large effect on power output at various clocks.

Agreed, which is my entire point. If people are stating Zen will bring 40% IPC over Excavator, then what version of Excavator are we using for a comparison?

Also, IPC has absolutely everything to do with clock speed, so anyone (which is both of you) stating it doesn't, doesn't know what they are talking about.

I said IPC is relevant to clock speeds and scaling, so glad you were agreeing with my point.

High IPC designs in general want as low latency as possible meaning a very short pipeline. Why, because the time the CPU is doing the least work is when it's mispredicted an instruction and has to fetch a new one. So you want as short a pipeline as you can get, which is directly linked to the achievable clock speeds of an architecture.

Stages in the pipeline and clock speed the architecture will run at are intrinsically linked. Having a wider core means to fill it efficiently you need larger prediction logic to keep it filled better and more consistently and you want a short pipeline to go along with it or it just isn't optimised. A very high clock speed design will want more stages in the pipeline and a narrower core.

IPC, clock speed, width of the core and amount of supporting core logic to keep the cores filled up are all very heavily linked together. There is a reason they didn't continue with a 5Ghz Core 2 Duo. P4, long pipeline, high clock speed, narrow core. Core 2 Duo, wider core, short pipeline, low clock speed.


When AMD go out there and say X architecture is y% faster than Z architecture, they aren't talking about a specific implementation but in general for that architecture, because as above you can take any architecture and change the layout/transistor/node to drastically change its performance. They are saying a mobile Zen aimed at low power would have a 40% higher IPC than a similar Excavator design, and a Zen FX for desktop and high power would be 40% higher IPC than an Excavator done for desktop/high power.

We shall see when the time comes, but the only baseline of an Excavator at the moment is a Carrizo which is limited to a 15W TDP OEM part.
 
Don't try and weasel out like you always do.

You were talking about additions to the GPU tech itself, not in general. You have been trying to set yourself up as an all-knowing expert lately because you played around with a CryEngine tutorial a bit, tried to waffle and got corrected.

At least own up to it for once.

I'm sure a lot of cover-up blather will follow, trying to dodge around it and have it forgotten, including trying to turn it back on me when all I did was make a simple correction that can be confirmed by several people reading this very thread.
 
One thing that still bugs me is that textures still look like arse in the majority of games, even in modern games on ultra.

Lighting effects have improved dramatically, but not so much on the texture side of things.

If a game received proper 3D positional audio with reflections, diffusion and diffraction of sounds, along with some kind of decent global illumination, that would be awesome.

Those demos of Crytek's SVOTI look awesome at only a 3 fps cost per scene, and it was not like the scenes they demonstrated were simple either.
 

Totally agree with you on this. When I was at school, which was 16 years ago, I often wondered how photorealistic graphics would be in 10 years' time.
Well, 16 years on, I'm still waiting :)
 