
AMD Working On An Entire Range of HBM GPUs To Follow Fiji And Fury Lineup – Has Priority To HBM2 Cap

The memory latency does not work like that. The chip read latency is constant regardless of the interface clock frequency. Latency is measured in clock cycles, which is why DDR4 has higher latency figures than DDR3. And the memory cells being used are no different to those in GDDR5; they are just stacked, etc.

Plus, as I showed, the 1080p performance is fine when using Mantle, so the low 1080p performance is being caused by something other than the HBM.

I thought about it last night and I think the cause is over-aggressive memory management in the driver. In Mantle mode the memory management for Thief is handled by the game, so it is not affected by AMD's driver memory management, which is what causes the lower-resolution performance to tank.

If you look at game memory usage at lower resolutions, such as in Greg's pCARS video, the FX is using half the RAM of the TX yet there is plenty of memory to use.
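
As a quick worked example of the cycles-versus-nanoseconds point above (a Python sketch; the timings are assumed but typical: DDR3-1600 at CL10 and DDR4-2400 at CL15):

[CODE]
# Absolute CAS latency from the cycle count and the transfer rate.
# DDR means two transfers per clock, so the I/O clock is half the MT/s figure.

def cas_latency_ns(cas_cycles: int, transfer_rate_mts: int) -> float:
    clock_mhz = transfer_rate_mts / 2
    return cas_cycles / clock_mhz * 1000  # cycles / MHz = microseconds, x1000 = ns

print(cas_latency_ns(10, 1600))  # DDR3-1600 CL10 -> 12.5 ns
print(cas_latency_ns(15, 2400))  # DDR4-2400 CL15 -> 12.5 ns
[/CODE]

So DDR4's "higher" CL figure works out to the same absolute delay; the cells themselves have not got slower.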

Yeah, well done, but clearly I wasn't talking about the actual memory chips' latency. I am talking about sequential reads/writes over a 4096-bit bus needing to be optimised to actually make use of such a wide bus.

Making sure your application and/or drivers make full use of 512 bytes at a time, when previously you only had to contend with 48- or 64-byte chunks, is quite a big change, I would imagine.
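
To put rough numbers on that (a back-of-envelope Python sketch; the interface figures are assumed: a Fiji-style 4096-bit HBM bus at about 1 Gbps per pin versus a 384-bit GDDR5 bus at 7 Gbps per pin):

[CODE]
# Bytes moved across the bus per transfer, and peak bandwidth, for each interface.

def bytes_per_beat(bus_width_bits: int) -> int:
    return bus_width_bits // 8

def peak_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    return bus_width_bits * gbps_per_pin / 8

print(bytes_per_beat(4096), peak_bandwidth_gbs(4096, 1.0))  # 512 bytes/beat, 512 GB/s
print(bytes_per_beat(384), peak_bandwidth_gbs(384, 7.0))    # 48 bytes/beat, 336 GB/s
[/CODE]

So a single transfer really is an order of magnitude wider, even though the per-pin rate drops.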
 
The memory latency does not work like that. The chip read latency is constant regardless of the interface clock frequency. Latency is measured in clock cycles, which is why DDR4 has higher latency figures than DDR3. And the memory cells being used are no different to those in GDDR5; they are just stacked, etc.

Plus, as I showed, the 1080p performance is fine when using Mantle, so the low 1080p performance is being caused by something other than the HBM.

I thought about it last night and I think the cause is over-aggressive memory management in the driver. In Mantle mode the memory management for Thief is handled by the game, so it is not affected by AMD's driver memory management, which is what causes the lower-resolution performance to tank.

If you look at game memory usage at lower resolutions, such as in Greg's pCARS video, the FX is using half the RAM of the TX yet there is plenty of memory to use.

As I said last night, you don't see a lot of Maxwell cards that can run Mantle.
 
I would imagine that pic is basically just a mock-up and not an actual working final product.

People probably assumed it was a mock-up, but it was not. Jen showed a working prototype. Every pic I saw was of the front of the card, but I have just found a really interesting pic I had never seen before showing the back of the Pascal card Jen held, and it really wasn't a mock-up. It shows the GPU base bracket and a mezzanine connector, which I think is the NVLink connector that will replace the SLI connector.

[Image: uZtoAIb.png, back of the Pascal prototype card]


http://www.techbang.com/posts/17670-things-that-nvidia-does-not-make-it-clear-at-the-gtc
 
I did, I gave you a link: Greg's Fury X being neck and neck with his Titan X.

Plus, the FX doing better by a large percentage at 1080p in Mantle shows a driver/API issue, not a hardware one. Regardless of the comparison being with a Titan X, the FX is still beating itself in Mantle compared to DX11. Also, no reviewers used Mantle for their Fury X benchmarks to show this, since AMD asked them not to because the driver is immature, yet it still shows a decent uplift.

Going to finish here since there is no point continuing this conversation.



The Fury X is 20% slower in Mantle than DX11 in BF4! So no, it's not a driver issue at all.
 
I don't believe HBM is what is holding the Fury back at 1080p-type resolutions, or at least it's not the whole story. When you start pushing things like SP counts up to huge numbers it becomes increasingly harder to fully utilise those capabilities, and a 4K workload tends to suit that kind of thing better.

Aside from process limitations and new hardware features, it's part of the reason we aren't on variants of the R600 or G92 core but with something like 16384 shaders, 2048 ROPs, etc.

Yep, I think throwing a load of extra pixel shaders on Tonga simply hasn't scaled well. AMD are just suffering from the law of diminishing returns.
 
AMD's plan may simply be to buy up all the stock to limit Nvidia. If they just purchased stock and left it in a warehouse they might leave themselves open to anti-competition litigation; if they buy up all the stock and shove it in cards that don't really need it then they are relatively safe.

I doubt that is the case, because HBM is very expensive so it makes no sense for a cheap card. However, at some point AMD have to decide if they want a shared memory controller over the entire range, or whether the lower range has to be yet another architecture with GDDR5.



Anyway, if HBM is supposedly going to be in short supply next year, that would explain why the Fury X is vaporware and paints a very bad picture for GPUs in 2016.

I suppose it depends on what is in low supply. If it's HBM2, because of the higher clock speeds and the number of functioning dies per stack, then there could be an abundance of HBM stacks which don't meet HBM2 specifications but are operable at lower frequencies, which could be ideal for midrange GPUs with modest bandwidth requirements. If the constraint is in assembly then that's a bit of a roadbump.

Looking at the product lineup, HBM makes sense for the high-performance mobile GPUs and the desktop midrange, i.e. what would have been the 256-bit and some of the higher-performance 128-bit parts. I can see OEMs being interested in the reduced power and space. That just leaves what would have been the 64-bit chips up in the air, which will probably just be existing 28nm chips. I can see AMD doing a full 16nm lineup with HBM, in part because they may have lower overall market demand to satisfy and it will give them an edge design-wise. Something else to consider: since HBM allows a smaller die, they can harvest more dies per wafer at 16nm than if they had to add, for example, a 256-bit memory interface. That may not be a huge influence, but perhaps significant enough to consider for the midrange and above.
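
To illustrate the dies-per-wafer point with made-up numbers (a Python sketch using the usual candidate-dies-per-wafer approximation on a 300mm wafer; the die areas are invented purely for comparison, e.g. a hypothetical ~350mm² HBM die versus the same die at ~400mm² with a GDDR5 PHY added):

[CODE]
import math

# Candidate dies per wafer, ignoring defect yield.
def dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    d = wafer_diameter_mm
    return int(math.pi * (d / 2) ** 2 / die_area_mm2
               - math.pi * d / math.sqrt(2 * die_area_mm2))

print(dies_per_wafer(350))  # ~166 candidate dies
print(dies_per_wafer(400))  # ~143 candidate dies, i.e. roughly 16% fewer
[/CODE]

Not decisive on its own, but it is the sort of margin that could matter at midrange volumes.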

On the other hand, Nvidia might stick with GDDR5 for some 16nm 64-128-bit chips, and of course whichever of their 28nm dies they continue to produce in 2016. The reasoning being: higher volumes to supply; limited, and maybe low-priority, access to HBM; a preference to reserve HBM for high-performance dies for either availability or cost reasons; the extra die size from small GDDR5 interfaces having a negligible influence on yields; the overall GPU plus PCB still being quite small and, more importantly, slotting easily into existing designs and manufacturing on the OEMs' part (a straight swap into existing laptop models); and, with the shift to HBM, maybe an oversupply of cheap GDDR5.

All speculation but reasonable enough. Not sure if there are any other considerations I have forgotten.
 
But as I mentioned, the game running in Mantle mode shows the memory is not the cause of the problem.

It has to be the aggressive memory management causing the problem.

Plus, I think you are a little confused about clock frequency and read latency. Even if data is being sent at a lower frequency over a wider bus, the overall latency will be lower due to greater parallelism, not to mention the lower transmission/error-correction overhead that shorter traces provide.

If you are talking about the transmission latency of a packet of data over a single line, then yes, the lower frequency will have greater latency, but only if the packet is larger than the line's bandwidth at a given frequency. We are still talking about data that is being sent at near the speed of light, so whether it is packed into a whole memory cycle or half of one makes little difference.
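
A rough sketch of the wider-but-slower-clocked point (Python; the bandwidth figures are assumed as before, roughly 512 GB/s for HBM1 and 336 GB/s for 384-bit GDDR5, and access latency is ignored entirely):

[CODE]
# Time to serialise a chunk of data onto the bus at a given peak bandwidth.
# Conveniently, bytes divided by GB/s comes out in nanoseconds.

def transfer_time_ns(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    return num_bytes / bandwidth_gb_per_s

CHUNK = 4096  # bytes in one hypothetical batch of requests
print(transfer_time_ns(CHUNK, 512.0))  # HBM1: 8.0 ns
print(transfer_time_ns(CHUNK, 336.0))  # GDDR5: ~12.2 ns
[/CODE]

In other words, the wider bus gets a same-sized chunk across in less time despite the lower clock, provided there is enough data queued up to fill it.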
 
But as I mentioned, the game running in Mantle mode shows the memory is not the cause of the problem.

It has to be the aggressive memory management causing the problem.

Plus, I think you are a little confused about clock frequency and read latency. Even if data is being sent at a lower frequency over a wider bus, the overall latency will be lower due to greater parallelism, not to mention the lower transmission/error-correction overhead that shorter traces provide.

If you are talking about the transmission latency of a packet of data over a single line, then yes, the lower frequency will have greater latency, but only if the packet is larger than the line's bandwidth at a given frequency. We are still talking about data that is being sent at near the speed of light, so whether it is packed into a whole memory cycle or half of one makes little difference.

Well, that's what I'm questioning: are they currently wasting bandwidth by sending too many small packets, wasting width at lower resolutions? Unless you actually work for AMD's driver team, I am willing to bet it isn't a question you are able to answer.

It's a balance between buffering enough data to make use of the wider bus and sending data as quickly as possible, taking the hit on bandwidth by leaving potentially usable packet space empty.

It would explain why AMD have said they have to hand-tune memory management with HBM on pretty much a per-game basis.

As with draw calls, there could also be added latency in the driver stack on VRAM transactions under DX11 compared with other APIs, which faster VRAM hides but which becomes an issue with a slower, wider bus.
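
A toy model of the buffering-versus-width trade-off described above (Python; every number is invented purely for illustration, and real HBM channel and burst behaviour is more involved than this):

[CODE]
import math

# Fraction of the data moved that was actually requested, given some
# minimum transfer size the interface deals in.
def bus_utilisation(request_bytes: int, min_transfer_bytes: int) -> float:
    transfers = math.ceil(request_bytes / min_transfer_bytes)
    return request_bytes / (transfers * min_transfer_bytes)

print(bus_utilisation(48, 64))    # small request, narrow transfers: 0.75
print(bus_utilisation(48, 512))   # same request on a 512-byte-wide transfer: ~0.09
print(bus_utilisation(480, 512))  # several requests batched into one wide transfer: ~0.94
[/CODE]

Which is exactly the tension: either buffer requests until a wide transfer is worth it, or send them straight away and waste most of the width.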
 
As I said last night, you don't see a lot of Maxwell cards that can run Mantle.

You are ignoring the point I am making. The FX is not held back by its choice of RAM, especially when I stated that it beats itself at 1080p in Mantle in Thief and equalises with the Titan X.

I am explaining where the bottleneck mostly is, not saying it is better than the TX in Mantle.

And on D.P.'s point, I did say that AMD told people the Mantle driver was immature. Plus, BF4 has always had issues in Mantle with some cards, particularly around memory management.
 
Well, that's what I'm questioning: are they currently wasting bandwidth by sending too many small packets, wasting width at lower resolutions? Unless you actually work for AMD's driver team, I am willing to bet it isn't a question you are able to answer.

It's a balance between buffering enough data to make use of the wider bus and sending data as quickly as possible, taking the hit on bandwidth by leaving potentially usable packet space empty.

It would explain why AMD have said they have to hand-tune memory management with HBM on pretty much a per-game basis.

The aggressive memory management is more to do with avoiding the 4GB memory limit at 4K, but I still believe it is hindering lower-resolution performance if the same memory algorithms are being used.
 
The aggressive memory management is more to do with avoiding the 4GB memory limit at 4K, but I still believe it is hindering lower-resolution performance if the same memory algorithms are being used.

Well, yeah, if they are using the same algorithms at 1080p as at 4K then they are shooting themselves in the foot to a certain extent.
 
I'm pretty sure I remember reading that HBM and GDDR5 have pretty much the same latency, so from a performance standpoint the extra bandwidth seems to be HBM's only advantage. IMO most of the benefits AMD see in HBM are in the physical packaging and operational properties.
 
I'm pretty sure I remember reading that HBM and GDDR5 have pretty much the same latency, so from a performance standpoint the extra bandwidth seems to be HBM's only advantage. IMO most of the benefits AMD see in HBM are in the physical packaging and operational properties.

Yes, it is obvious that AMD needed HBM to get anywhere near releasing a 980 Ti-level product under 300W, so they were forced to take on whatever disadvantages came with it.
 
HBM is a waste of time for low-end cards, which are likely to be used at 1080p where the new type of memory does not do well.

Better to stick with old-fashioned GDDR5, which is probably cheaper too.

Why is that? It uses a lot less power and it allows for a smaller PCB. Not to mention you don't necessarily have to clock the memory as high for the low end; they could use lower-clocked memory there too, so it has its place.
This is great for the low end!
 
I am sure priority access will really give AMD a much needed advantage. Nvidia will have to make do with whatever HBM2 is left after AMD have produced their 40 cards.

I think it's unlikely to prove a huge problem for NVIDIA, as Pascal is going to be very late regardless (which is a very big problem).
 
I think it's unlikely to prove a huge problem for NVIDIA, as Pascal is going to be very late regardless (which is a very big problem).
Current estimates are still 10-12 months, aren't they? Now, if they took two years to release something new and it didn't beat the competition, that would be a big problem.
 
I think it's unlikely to prove a huge problem for NVIDIA, as Pascal is going to be very late regardless (which is a very big problem).

Not seen anything suggesting it's going to be really late so far; there is always the possibility TSMC won't actually be ramped up to volume production in the claimed timescale, or unexpected issues in debugging.
 
After how badly AMD mucked up the whole launch of the Fury X, who really has any faith left in them even with them getting priority? Meh. Just reads like someone being overly hopeful. Again.
 