Are there some new nvidia drivers coming

bru · 6 Jun 2013 at 17:00

Don't be silly Roff, how dare we even suggest that Nvidia may be capable of doing the same thing that AMD did and bring out a driver that gives a nice boost to performance. :rolleyes:

Gregster · 6 Jun 2013 at 17:04

bru said:
Don't be silly Roff, how dare we even suggest that Nvidia may be capable of doing the same thing that AMD did and bring out a driver that gives a nice boost to performance.

I will be the first to moan at them for taking so long (as soon as I had finished benching that is)

James J · 6 Jun 2013 at 18:13

Gregster said:
I will be the first to moan at them for taking so long (as soon as I had finished benching that is)

haha

humbug · 6 Jun 2013 at 18:15

Rroff said:
My gut instinct is we aren't seeing the full performance GK110 can bring to gaming... IMO we should be seeing closer to that ~70% (but at the current clocks rather than clock for clock) that humbug originally mentioned not the ~50% it is - not sure if I can explain this very well...

When your dealing with crunching through gigabytes of data the optimal approach is often to batch up large amounts of data at once, delay/reorder some operations to get the best long term performance which is great for plowing through lots of data but not so optimal for typical gaming scenarios where you want to quickly process smaller amounts of data - in a simplistic sense this is why AMD's old vliw architecture had such high theoretical performance and does so well at some things but struggles to bring that level of performance to gaming.

I think we are seeing something similiar with Titan, it might be that some level of performance is unavoidably lost by dispatching data sub-optimally for game type processing on a compute focused design - its possible tho they still haven't fully optimised to get the best out of it and in shader heavy games/benchmarks we are likely to see upto ~20% increases (upto 60% in context with figures humbug mentioned) with future drivers.

Different architectures and different jobs, if a GPU has 4Tflops then it might be really good at compute but that's not to say its going to be any better at GFX rendering than another 2Tflop GPU.

Adding more performance to a GPU is not as simple as beefing a part of it it up.

The GTX Titan does not scale 70% with 75% extra SP's compared to the GTX 680 because it is not 75% more GPU, not even close.

In exactly the same way the 2048 SP 7970 does not scale 15% up from the 1792 SP 7950, as we all know its about 5 to 7%.
Reason: they both have exactly the same memory bandwidth and ROP's (the rest of the GPU)

The 1792 SP 7950 scales much better from the 1280 SP 7870 because it has a wider bus vs the 7870 (384Bit vs 256Bit) but its not 40% which is the difference in SP count, its about 30%, Tahiti LE with 1536 SP's is about 20%~ slower than the 7950 with 15% less SP's and about ~10% faster than the 7870 with 20% more SP's.

Long story short, for a +75% SP's GPU to scale +75% you need to scale the rest of the GPU up 100%.
And that's not what the GTX Titan is, it has 75% more SP's held back by only 50% extra bandwidth and ROP's

Take 50% off 75 and you have 37, add the 50% you gained from the rest of the GPU and you have about 55% total gain clock for clock, which is exactly what it is

Rroff · 6 Jun 2013 at 18:24

humbug said:
Different architectures and different jobs, if a GPU has 4Tflops then it might be really good at compute but that's not to say its going to be any better at GFX rendering than another 2Tflop GPU.

Adding more performance to a GPU is not as simple as beefing a part of it it up.

The GTX Titan does not scale 70% with 75% extra SP's compared to the GTX 680 because it is not 75% more GPU, not even close.

In exactly the same way the 2048 SP 7970 does not scale 15% up from the 1792 SP 7950, as we all know its about 5 to 7%.
Reason: they both have exactly the same memory bandwidth and ROP's (the rest of the GPU)

The 1792 SP 7950 scales much better from the 1280 SP 7870 because it has a wider bus vs the 7870 (384Bit vs 256Bit) but its not 40% which is the difference in SP count, its about 30%, Tahiti LE with 1536 SP's is about 20%~ slower than the 7950 with 15% less SP's and about ~10% faster than the 7870 with 20% more SP's.

Long story short, for a GPU to scale +75% you need to scale the rest of the GPU up 100%.
And that's not what the GTX Titan is, it has 75% more SP's held back by only 50% extra bandwidth and ROP's

Take 50% off 75 and you have 37, add the 50% you gained from the rest of the GPU and you have about 55% total gain clock for clock, which is exactly what it is

It will depend on how much shader workload makes up the overall performance, the different architectures (you can't transplant AMD shader performance scaling ratios directly onto nVidia) and some more complicated issues with pipeline depth latency that can start to impact scaling on higher end GPUs in some cases. Don't forget that the 680 has a relative poor compute/shader performance for what it is while still having more than adequate pixel and polygon pushing capabilities so you don't need to scale those up so much to get more performance overall out of the GPU as you would with say the 7970.

humbug · 6 Jun 2013 at 18:36

Rroff said:
It will depend on how much shader workload makes up the overall performance, the different architectures (you can't transplant AMD shader performance scaling ratios directly onto nVidia) and some more complicated issues with pipeline depth latency that can start to impact scaling on higher end GPUs in some cases. Don't forget that the 680 has a relative poor compute/shader performance for what it is while still having more than adequate pixel and polygon pushing capabilities so you don't need to scale those up so much to get more performance overall out of the GPU as you would with say the 7970.

Then lets use Nvidia.

GTX 680: 1536 SP's / 256Bit = 100%
GTX 670: 1344 SP's / 256Bit = ~95% with 20% less SP's and the same memory bandwidth.
GTX 660TI: 1344 SP's / 192Bit = ~80% with the same number of SP's as the GTX 670 but slower bus = 10%~ slower than that GTX 670.

Duff-Man · 6 Jun 2013 at 18:49

Honestly, I don't expect any huge improvements to GK110 from drivers. Perhaps another couple of percent in comparison to the GK104, but for the most part the GK110 -> GK104 comparison is roughly where you would expect it to be.

On fully GPU-limited tests, the performance already scales fairly linearly. The GTX780 is a good example; 50% more memory bandwidth, and about 40-45% more compute power (the in-application clocks on the 780 seem more variable than the 680). When they're clocked to the same speeds, you're not far off the 50% improvement that the increased compute power and memory bandwidth implies.

Besides, while GK110 has a multitude of complex internal interactions that aren't present in GK104, for gaming very few of these are ever utilised. The data pathways for gaming-type data will be very similar between GK110 and GK104.

As a GTX780 owner I'd love to see some more "free" performance, but I'm not holding my breath.

Rroff · 6 Jun 2013 at 19:01

humbug said:
Then lets use Nvidia.

GTX 680: 1536 SP's / 256Bit = 100%
GTX 670: 1344 SP's / 256Bit = ~95% with 20% less SP's and the same memory bandwidth.
GTX 660TI: 1344 SP's / 192Bit = ~80% with the same number of SP's as the GTX 670 but slower bus = 10%~ slower than that GTX 670.

Your forgetting the effects of the boost clock.

Duff-Man · 6 Jun 2013 at 19:22

humbug said:
Then lets use Nvidia.

GTX 680: 1536 SP's / 256Bit = 100%
GTX 670: 1344 SP's / 256Bit = ~95% with 20% less SP's and the same memory bandwidth.
GTX 660TI: 1344 SP's / 192Bit = ~80% with the same number of SP's as the GTX 670 but slower bus = 10%~ slower than that GTX 670.

It's not just the number of shaders, but the clockspeed. The number of shaders and the clockspeed together give you the floating-point performance, which is the thing you want to be comparing:

Floating point performance = #parallel threads X #clocks per second X number of computations per clock.

e.g:

GTX680: 1536 SPs X 1.006Ghz X 2 FLOPs per cycle = 3090 GFLOPS

7970 (original): 2048 SPs X 0.925Ghz X 2 FLOPs per cycle = 3789 GFLOPS

(etc)

That's the primary measure of performance ("pixel pushing power"), but you also have to consider memory bandwidth as well. For a true comparison you want both to increase in ratio. That's why the GTX680 is an interesting case - 50% bump in SPs, AND a 50% bump in memory bandwidth. When the clocks are set the same the GTX780 is pretty much a 50% bump over the GTX680, and in fully GPU-limited cases you would expect to see framerates increase by 50% as a result. Right now we're not too far off that, so I can't see any driver-related miracles coming through for gaming-type data.

Rroff · 6 Jun 2013 at 19:53

Duff-Man said:
Besides, while GK110 has a multitude of complex internal interactions that aren't present in GK104, for gaming very few of these are ever utilised. The data pathways for gaming-type data will be very similar between GK110 and GK104.

From what I've heard the difference in cache and scheduler behavior on GK110 over GK104 geared towards better handling of ILP, etc. results in some loss of efficency in handling gaming type data but its not really an area I'm an expert on.

Duff-Man · 6 Jun 2013 at 20:08

Rroff said:
From what I've heard the difference in cache and scheduler behavior on GK110 over GK104 geared towards better handling of ILP, etc. results in some loss of efficency in handling gaming type data but its not really an area I'm an expert on.

As I understand it, these features are more useful for relatively unpredictable data of varying types, that require a fairly dynamic allocation of the available resources. Gaming data tends to be fairly predictable - you have a mass of pixels, each of which needs a fairly similar set of processes applied to it.

It'd be interesting to see some PhysX-heavy benchmarks of the GTX680 vs GTX780 - that's one area where I can imagine GK110 pulling away from GK104 quite nicely.

...I'd hardly describe myself as an expert in this area either though. I've done a fair amount of GPGPU work, but nothing at a low level. I tell the thing to solve a sparse matrix system for me, or do some matrix-vector multiplications, and it goes away and does it

humbug · 6 Jun 2013 at 23:46

Rroff said:
Your forgetting the effects of the boost clock.

My illustration is quite obviously clock for clock, we all know the GTX 670 is only about 5%~ behind the GTX 680 at the same clocks, I mean how many times has that been said in this room? and that the GTX 660Ti is slower clock for clock than the GTX 670 with the same SP's, its because it has a slower bus, again something that is widely known.

Duff-Man said:
It's not just the number of shaders, but the clockspeed. The number of shaders and the clockspeed together give you the floating-point performance, which is the thing you want to be comparing:

Floating point performance = #parallel threads X #clocks per second X number of computations per clock.

e.g:

GTX680: 1536 SPs X 1.006Ghz X 2 FLOPs per cycle = 3090 GFLOPS

7970 (original): 2048 SPs X 0.925Ghz X 2 FLOPs per cycle = 3789 GFLOPS

(etc)

That's the primary measure of performance ("pixel pushing power"), but you also have to consider memory bandwidth as well. For a true comparison you want both to increase in ratio. That's why the GTX680 is an interesting case - 50% bump in SPs, AND a 50% bump in memory bandwidth. When the clocks are set the same the GTX780 is pretty much a 50% bump over the GTX680, and in fully GPU-limited cases you would expect to see framerates increase by 50% as a result. Right now we're not too far off that, so I can't see any driver-related miracles coming through for gaming-type data.

Its been said already, and not just by me, you can't compare different architectures.

AMD's GPU's have more stream processors, they have a higher compute performance rating than the GTX 680, about 3 times as powerful, you wouldn't think it looking at the figures (3Tflops vs 4) yet in practice they are.

Its because they use those resources differently and are geared to work with different instructions, CUDA vs OpenCL for example.

AMD's 7970 has 25% more SP's than the GTX 680, yet it is not any better at rendering GFX.
The closest example would be my 7870 having 1536 SP's, 32 ROP's and a 256Bit bus, exactly the same as the GTX 680, yet in GFX rendering the GTX 680 eats my 7870 for breakfast, some 25% faster. but with that same number of SP's my 7870 eats the GTX 680 alive in compute.

They are different GPU's with different ways of getting to result A.

But with both green and red muscular performance is just as important as stamina, just adding one only has half the overall effect.

You can see that for yourself every time you play with your clocks to get the highest scores or highest FPS, when you increase your memory speed you are increasing the other half of the GPU's performance and with that you get the full effect, increase your GPU clocks by 10% and you get 5%, increase your memory clocks by 10% as well and you get the other 5% to add to your 5% giving you the full 10%.

Rroff · 7 Jun 2013 at 00:03

humbug said:
My illustration is quite obviously clock for clock, we all know the GTX 670 is only about 5%~ behind the GTX 680 at the same clocks, I mean how many times has that been said in this room? and that the GTX 660Ti is slower clock for clock than the GTX 670 with the same SP's, its because it has a slower bus, again something that is widely known.

Clock for clock the 670 does not have 20% lower compute performance than the 680 (let alone 20% less SPs) - its barely 10% slower compute (gflops) clock for clock (~12% less SPs) its only 20% less gflops if you compare out the box stock clocks without taking into account boost clocks.

Unless I'm missing something pretty much everything you've said about kepler is inaccurate because your not allowing for the way boost works.

humbug said:
You can see that for yourself every time you play with your clocks to get the highest scores or highest FPS, when you increase your memory speed you are increasing the other half of the GPU's performance and with that you get the full effect, increase your GPU clocks by 10% and you get 5%, increase your memory clocks by 10% as well and you get the other 5% to add to your 5% giving you the full 10%.

This will depend where the bottleneck is both at a software level and the ratios on the hardware. i.e. tomb raider I can raise the core clocks up 10% and get a 10% increase in performance without even touch the memory clocks until I'm considerably above stock clocks because its not memory bandwidth limited.

Some GPUs come with barely adequate memory bandwidth out the box, others come with considerably more memory bandwidth than they need until you've massive increased the core clocks over stock.

humbug · 7 Jun 2013 at 00:28

Rroff said:
Clock for clock the 670 does not have 20% lower compute performance than the 680 (let alone 20% less SPs) - its barely 10% slower compute (gflops) clock for clock (~12% less SPs) its only 20% less gflops if you compare out the box stock clocks without taking into account boost clocks.

Unless I'm missing something pretty much everything you've said about kepler is inaccurate because your not allowing for the way boost works.

This will depend where the bottleneck is both at a software level and the ratios on the hardware. i.e. tomb raider I can raise the core clocks up 10% and get a 10% increase in performance without even touch the memory clocks until I'm considerably above stock clocks because its not memory bandwidth limited.

Some GPUs come with barely adequate memory bandwidth out the box, others come with considerably more memory bandwidth than they need until you've massive increased the core clocks over stock.

Clock for clock the 670 does not have 20% lower compute performance than the 680 (let alone 20% less SPs) - its barely 10% slower compute (gflops) clock for clock (~12% less SPs) its only 20% less gflops if you compare out the box stock clocks without taking into account boost clocks.

Compute performance has little to do with game performance unless its a compute heavy game. like TombRaider

We are talking about up stepping from the GTX 680 to GTX Titan and why the scaling is not 100% for 100%
The way I worded it, I.E. wrongly is where I confused you, I apologise. the GTX 680 has 15% more SP's (I think I said 20, I meant 15)
But your also ignoring the performance difference between the GTX 670 and GTX 660TI despite the same amount of SP's, its the 192Bit bus that slows the GTX 660TI from the GTX 670.

This will depend where the bottleneck is both at a software level and the ratios on the hardware. i.e. tomb raider I can raise the core clocks up 10% and get a 10% increase in performance without even touch the memory clocks until I'm considerably above stock clocks because its not memory bandwidth limited.

You may well be seeing 10% for 10% in some games, you may see it in Dirt Showdown, Sleeping Dogs and TombRiader, the reason probably is that they are GPU compute heavy games, (which is why they run so fast on AMD GPU's compared with Nvidia) your not going to see that in BF3, or the large majority of games, at least not yet, not until AMD have all those developers pie's they now have their fingers in going down the in-game GPU compute rout.

humbug said:
AMD's GPU's have more stream processors, they have a higher compute performance rating than the GTX 680, about 3 times as powerful, you wouldn't think it looking at the figures (3Tflops vs 4) yet in practice they are.

Its because they use those resources differently and are geared to work with different instructions, CUDA vs OpenCL for example.

AMD's 7970 has 25% more SP's than the GTX 680, yet it is not any better at rendering GFX.
The closest example would be my 7870 having 1536 SP's, 32 ROP's and a 256Bit bus, exactly the same as the GTX 680, yet in GFX rendering the GTX 680 eats my 7870 for breakfast, some 25% faster. but with that same number of SP's my 7870 eats the GTX 680 alive in compute.

They are different GPU's with different ways of getting to result A.

melmac · 7 Jun 2013 at 00:45

Rroff said:
My gut instinct is we aren't seeing the full performance GK110 can bring to gaming... IMO we should be seeing closer to that ~70% (but at the current clocks rather than clock for clock) that humbug originally mentioned not the ~50% it is - not sure if I can explain this very well...

When your dealing with crunching through gigabytes of data the optimal approach is often to batch up large amounts of data at once, delay/reorder some operations to get the best long term performance which is great for plowing through lots of data but not so optimal for typical gaming scenarios where you want to quickly process smaller amounts of data - in a simplistic sense this is why AMD's old vliw architecture had such high theoretical performance and does so well at some things but struggles to bring that level of performance to gaming.

I think we are seeing something similiar with Titan, it might be that some level of performance is unavoidably lost by dispatching data sub-optimally for game type processing on a compute focused design - its possible tho they still haven't fully optimised to get the best out of it and in shader heavy games/benchmarks we are likely to see upto ~20% increases (upto 60% in context with figures humbug mentioned) with future drivers.

I agree with some of what you are saying and it ties in to everything that some of us have been saying about the GK110 chip, well what nvidia themselves have said about the GK110 chip, that it was primarily designed as compute part and that impacts the performance it can bring to games.

And this is where I disagree with you a little, I don't think there is much performance to be got from drivers. Well, there will be no increase like AMD got with GCN. The core architecture is still kepler. Just like there was no driver that gave a massive increase to the second generation of fermi cards. That driver came after the release of the 480/470. Fermi was completely new, Kepler is just an evolution of Fermi (well that's what I hear all the experts say)

If you read the whitepaper on the Gk110 you will see that a lot of the changes are at a hardware level. All changes geared towards making the card better at compute. I would say that will be drivers released that will increase the compute performance but not sure there will be one to increase gaming performance (unless said game uses compute of course

)

melmac · 7 Jun 2013 at 00:51

bru said:
Don't be silly Roff, how dare we even suggest that Nvidia may be capable of doing the same thing that AMD did and bring out a driver that gives a nice boost to performance.

For the record you didn't say "may" And if you don't like our reasons why Nvidia won't have a driver to increase performance like AMD did for GCN, then please share with us your reasons why they will.

Rroff · 7 Jun 2013 at 00:54

Thats where I was going with the 3rd paragraph of the bit you quoted - I'm not sure if its even possible to bring any gains from what I'm saying via drivers or not.

bru · 7 Jun 2013 at 02:35

melmac said:
For the record you didn't say "may" And if you don't like our reasons why Nvidia won't have a driver to increase performance like AMD did for GCN, then please share with us your reasons why they will.

nope I didn't say may I said

bru said:
Well AMD squeezed out a driver that gave a nice solid boost to performance, so there is no reason why Nvidia cannot do the same.

I'm not saying that they defiantly will I'm just saying that it is possible.

but of course I will now be pulled up on the semantics of what I said rather than the gist of the comment as is the way these things invariable work here. :rolleyes:

Duff-Man · 7 Jun 2013 at 02:51

Mr humbug... what are you chatting about?!

humbug said:
Its been said already, and not just by me, you can't compare different architectures.

[etc]

That was to demonstrate how the computational performance is calculated, so you can work out the relative performance between cards - instead of using the clumsy "oh this is 100%, so this one is about 120%" method

Quite obviously performance is different across different architectures - they have entirely different internal pathways, scheduling, and internal inefficiencies. If that wasn't the case, then we would just be comparing total compute performance and not bothering with benchmarks wouldn't we?!

When comparing GK104 to GK110 it's a reasonably valid comparison. While GK110 has a lot of additional transistors dedicated to general purpose computing applications, the core architecture used for simple and predictable parallel workloads (as we encounter in gaming) is very similar.

Compute performance has little to do with game performance unless its a compute heavy game. like TombRaider

Not sure if you're joking or not here

You realise that the process of applying shaders to pixels is a series of multiply-add ("MADD") computations, right? And that's the vast majority of what games do these days (lighting, reflections, refraction, translucency effects, subsurface scattering - all shaders). GPUs are massively parallel computing devices, designed to perform simple computational operations very quickly - it's why they're ideal for processing graphical effects.

Other processes such as geometry setup or tessellation are all floating point computations as well, but they're generally slightly less well-ordered than shader data, and so are more susceptible to inefficiencies in the architecture of the GPU.

Physics would be considered further along the scale of "complexity", in that the data coming in is less predictable and more "lumpy". To perform physics computations effectively on a GPU you require access to a much wider range of data, which requires better branch prediction, and needs the GPU (and perhaps more importantly the the controlling software) to be designed in such a way as to handle it efficiently. This is why CPU-based physics can still compete with GPU physics, but if you tried to render or pixel-shade a typical graphics scene the CPU would do so at well under 1fps. CPUs are good at handling complexity in data-structures - GPUs prefer a steady stream of predictable data (as generally encountered in gaming).

At the far end of the scale you have "general purpose compute", or GPGPU activities. These can cover a vast range of scientific and financial simulations, and due to this wide range of different data requirements, improving performance in these areas is the most challenging task. GPUs need complex internal links between components to allow the data-structures that are stored in the GPU memory to be assigned efficiently to the optimal pipelines. Cache and fast interconnects, as well as efficient scheduling and branch prediction are key to performance in these areas (something Nvidia first took seriously with Fermi).

... So, I'd love to know what you mean by "a compute heavy game like tomb raider". They're all "compute heavy". That's what modern games are!

humbug · 7 Jun 2013 at 03:20

Duff-Man said:
Mr humbug... what are you chatting about?!

That was to demonstrate how the computational performance is calculated, so you can work out the relative performance between cards - instead of using the clumsy "oh this is 100%, so this one is about 120%" method Quite obviously performance is different across different architectures - they have entirely different internal pathways, scheduling, and internal inefficiencies. If that wasn't the case, then we would just be comparing total compute performance and not bothering with benchmarks wouldn't we?!

When comparing GK104 to GK110 it's a reasonably valid comparison. While GK110 has a lot of additional transistors dedicated to general purpose computing applications, the core architecture used for simple and predictable parallel workloads (as we encounter in gaming) is very similar.

Not sure if you're joking or not here

You realise that the process of applying shaders to pixels is a series of multiply-add ("MADD") computations, right? And that's the vast majority of what games do these days (lighting, reflections, refraction, translucency effects, subsurface scattering - all shaders). GPUs are massively parallel computing devices, designed to perform simple computational operations very quickly - it's why they're ideal for processing graphical effects.

Other processes such as geometry setup or tessellation are all floating point computations as well, but they're generally slightly less well-ordered than shader data, and so are more susceptible to inefficiencies in the architecture of the GPU.

Physics would be considered further along the scale of "complexity", in that the data coming in is less predictable and more "lumpy". To perform physics computations effectively on a GPU you require access to a much wider range of data, which requires better branch prediction, and needs the GPU (and perhaps more importantly the the controlling software) to be designed in such a way as to handle it efficiently. This is why CPU-based physics can still compete with GPU physics, but if you tried to render or pixel-shade a typical graphics scene the CPU would do so at well under 1fps. CPUs are good at handling complexity in data-structures - GPUs prefer a steady stream of predictable data (as generally encountered in gaming).

At the far end of the scale you have "general purpose compute", or GPGPU activities. These can cover a vast range of scientific and financial simulations, and due to this wide range of different data requirements, improving performance in these areas is the most challenging task. GPUs need complex internal links between components to allow the data-structures that are stored in the GPU memory to be assigned efficiently to the optimal pipelines. Cache and fast interconnects, as well as efficient scheduling and branch prediction are key to performance in these areas (something Nvidia first took seriously with Fermi).

... So, I'd love to know what you mean by "a compute heavy game like tomb raider". They're all "compute heavy". That's what modern games are!

I don't know why you feel the need to expand on it to such an extent, all it does is confirm that Stream Processors are not the be all and end all of GPU's. hence 100% SP's on the GTX Titan do not = 100% performance scaling.

Do you realise that in very long widened way your agreeing with me?

They don't all use the same levels of GPU compute, "a compute heavy game like TombRaider" means exactly that, they use a lot of it. I don't know how you can see that as anything else, it affects one aspect of the GPU more than usual, assuming the usual is not to such level's, high dependant compute lighting, shadows, physics, 'realistic hair rendering'