single/double point accuracy?

hey all, read a bit about Fermi where the article said it had impressive double precision accuracy but not so much single precision... i'm lost with this; can someone explain it for me please! :)
 
can you (or anyone else) elaborate on that for me?

i knew games used single precision, but i'm in the dark as to what it really means, and its relation to double precision
 
The floating point precision, I believe, refers to the number of bits the processor operates on in a floating point operation.

Floating point numbers are essentially numbers with decimal points; single precision is 32 bits and double precision is 64 bits. As far as we're concerned, it's the number of figures the processor will keep track of. Obviously it's slower to do double precision calculations the whole time, because the processor then has to work with more digits.

For example, say I run this:
Code:
float singlep =    123123123123.123123123123; //store this as single precision 
double doublep =   123123123123.123123123123; // store this as double precision

When the code is actually run, what the computer sees is:
Code:
singlep = 1.2312312e+011
doublep = 123123123123.12312

You can see that the single precision number has been rounded off and displayed with an exponent.

Ultimately, if you're not a programmer, you probably don't have to worry about double precision performance at all. Most if not all games only use single precision for shader effects on the GPU at this point, and the things that do use double precision in consumer applications mostly aren't performance critical (again, I could be wrong).
 
IEEE compliance is the big thing. The standard specifies all the sizes and the behaviour for exceptions.

More in-depth here: http://en.wikipedia.org/wiki/IEEE_754-2008

Basically your gaming texture shaders use floating point numbers for the colours on the texture. The shader program changes those colours depending on light etc., and then the GPU transforms them before rendering to the frame buffer (screen). These texture pixel colours are stored as four floating point numbers.
Historically, for speed they'd be 16-bit or 24-bit, without any of the error handling (divide by zero, for example) that a normal CPU would perform.

So if you remember all the fights about 16-bit vs 24-bit vs 32-bit shaders, that's where the 'bitness' comes in.

A shader is a very simple, very specialised and highly efficient number processor. It's only got a few operations compared to a CPU, but when it does a calculation it does all four channels (R/G/B/A) of the pixel in one go. One such operation is multiply-add, Result := A*B + C. This is the most common operation in matrix maths... we'll come back to matrices later.

Now a GPU uses a massive number of shaders to process each texture pixel in parallel. So a texture may be 4096x4096 pixels where each pixel has red, green, blue and alpha values stored as floating point numbers.
So if a GPU has 512 shaders then it can really process a lot of numbers at one go!

In GPGPU the technique is to abuse this number crunching power for heavily mathematical purposes.
What you do is load your numbers into the RGBA values for each pixel in the texture and then load a shader with the processing you want to do. You then render the rectangle texture to the frame buffer without displaying it on screen. This frame buffer holds all your results, which you then read back from the GPU into main memory.
The result is that you can use your 512 shaders to process 4096x4096x4 floating point numbers in one go. That's 67,108,864 floating point numbers in one go!*
Compare that to your CPU's SSE instructions, which work on four single precision numbers at a time!

Now with games you can use that power to do proper collision detection, physics calculations etc. Games only require low precision because they don't have to be accurate, so the faster single precision is acceptable.

Now in supercomputing and modelling, accuracy and precision (yes, they're two different things) are important. A minor error will, as the model progresses, increase in size and cause problems. Models usually have error calculations too, but the behaviour of the number processing needs to be predictable. Hence the IEEE behaviour compliance is required.
Supercomputing tends to be high precision too - the numbers are 64 or even 128 bits in size. However there are techniques (error calculations etc.) that can be used to reduce the precision required as a trade-off for speed, to get results back quickly - often these are small runs on smaller subsets of data, done on the coder's PC overnight or a dedicated PC run for a week as a test.

You have to remember that supercomputing performs trillions of matrix multiplications and additions/subtractions on terabytes of data. The machines that do this are hideously expensive, so time on them is not available for 'testing', and the cost of a "run" is also high.

Having 64-bit IEEE double precision floating point numbers allows Fermi to target this area of supercomputing. Analysis of readings from ground or ship based radar to detect oil, for example - a quick estimate on the ship lets the company make many quick checks rather than having to wait for supercomputer time to do the calculations.

Ok, a bit War and Peace, but I hope that helps.

* yes, this is simplified, but it shows the spirit of the use rather than getting into the complexities of SPMD, shader groups etc. Also, "a go" means one execution command from the CPU to the GPU - thus 512 or fewer pixels are actually in parallel at any instant, depending on the shader program, but the GPU will still finish all the texture pixels before completing the shader program (thus all 67M numbers).
 
wow, very concise explanations guys :) i'm starting to understand this a bit more now.

as has been said, supercomputers do massive calculations, which need double precision. from what i've read in the past, supercomputers use cpus - thousands of them - to deal with the massive throughput they're designed for.. am i right in thinking they don't use gpus?

thanks very much for the answers so far guys :)
 

Traditionally that's right, but for some algorithms the GPU can be used instead of the CPU. This can be up to 50-100x faster.
 
IEEE compliance is the big thing. The standard specifies all the sizes and the behaviour for exceptions.

One of the least reported things is that AMD's 5XXX series is IEEE 754-2008 compliant as well. From what I've read around the place, the reason the last cards weren't is that the customers they spoke to wanted some of the features of 754-2008 but not others, so a few were left out.

Either way, this is why I say AMD have rather stumbled onto a GPGPU monster. Although they had better double precision performance than Nvidia last gen, they didn't spend tens of millions promoting it like Nvidia did, for the minuscule market it has. Nvidia spent millions and millions on advertising, pushing GPGPU and making cards, all for $80mil revenue that made them basically zero profit. AMD are doing the smart thing: making the cards, making them fully compliant and very powerful, adopting open standards, and letting Nvidia pay to expand the market. Once it's a big enough market to be profitable, AMD come in and say: btw, our card does everything theirs does, for half the cost, on time, and faster :p

Nvidia are making a massive push on all these innovative features Fermi has, which AMD had most of last gen. Nvidia have had a 10-fold increase in double precision, but that's because they had 1/5 the double precision performance of AMD last gen, which has since doubled, so they both have similar theoretical performance. AMD can basically do everything Nvidia can do in terms of GPGPU.

Likewise, Nvidia haven't really drastically changed their cards to be GPGPU focused over graphics; it's just the only thing they can talk about without working hardware at final speeds. Everything they have is normal - even the next gen S3 products, crap as they are, should support all these features and run OpenCL/DirectCompute fine. It's basically the standard for the coming generation, but Nvidia seem to be proclaiming it as new innovative features no one else has, even though AMD already have them on shipping cards.
 
wow, the guy just asked what it was?

why did you write a massive post about how rubbish Nvidia is and all hail ATI?
 
wow, very concise explanations guys :) i'm starting to understand this a bit more now.

as has been said, supercomputers do massive calculations, which need double precision. from what i've read in the past, supercomputers use cpus - thousands of them - to deal with the massive throughput they're designed for.. am i right in thinking they don't use gpus?

thanks very much for the answers so far guys :)

Some more recent ones do. The 5th most powerful supercomputer in the world uses Radeon HD 4870 X2s.

http://www.top500.org/list/2009/11/100
 
Cray has used Opterons with specialist FPGA chips, so the programmer not only writes the C program that runs on the Opteron but also the FPGA description for specialist operations, making the system even faster.

The structures of supercomputers vary wildly, and they're usually so custom that they're built for a specific purpose - ie the Met Office ones are specifically tailored to their climate model.

In supercomputing every operation is sacred, and every data load/store is usually checked carefully. a := b, for example, is a load and a store.
The net effect of a one cycle saving is usually magnified millions of times because of the amount of data being processed. That can shave hours or even days off the time it takes to complete a run.

The smaller areas that Fermi is looking at are COTS products, such as aerodynamics packages, that could use it. The problem is that using a GPU rather than a CPU isn't just a case of recompiling - in fact there's no compiler that could do that at the moment (although it's a hotbed of activity).
Instead, the data structures and the operations on them need to be altered to suit the "texture" format. This means you have to look carefully at the sparseness of the data in memory for the operation (dense is better), and also at the interdependencies of the operations as well as the concurrency aspects. Processing in parallel works best when there are no dependencies and no concurrency.

A GPU's shaders work independently. This means if the shader for pixel 29 writes to position 1 of the output and the shader for pixel 57 also writes to position 1, then it's undefined which value ends up in position 1 of the output. There's no locking and there are no atomic operations.

It's better to think of a GPU processing texture pixels with a shader program like this:
Code:
DO IN PARALLEL pixel OF LOCATION 0 TO 4096x4096
    shaderProgram(pixel, inputTextureA, .. inputTextureH, outputTexture)
END PARALLEL

So a shader has to calculate from the inputs and then write to the output once. It's given a reference, "pixel"; it knows that the inputs are read only and the output is write only. So you can't specify the output as an input (the effect is completely undefined).

Shared memory areas have appeared in GPUs in the last generation to help with concurrency and dependent data. However, this doesn't change the fundamental point: if you lock something, you bring the entire parallel processing from 512 pixels down to one pixel at a time. In short, there's a different way of coding, arranging the processing of data and specifying programs. It's not like CPU threading with locks etc. (which are bad, and a sign of bad data structures for parallel processing!). A different mindset.

I should probably point out I took numeric computation (ie supercomputers) for my degree and have been involved (ATI researcher) since 2006.. it's not something that I bring up at parties.. but with the job issues this has taken a back seat (hence I'm out of touch with the latest stuff)
 
wow, the guy just asked what it was?

why did you write a massive post about how rubbish Nvidia is and all hail ATI?

it's still an interesting read :)

i started with a simple 'wut is it?' but as conversations go, they evolve, and it's interesting to see what effect it has on vendors' products
 
In the end it all boils down to the cost of re-implementing existing functionality, and the viability and risk of new technology*.

* I should point out that SGI was abusing their 3D graphics chips in this way in the 80s..
 
wow, very concise explanations guys :) i'm starting to understand this a bit more now.

as has been said, supercomputers do massive calculations, which need double precision. from what i've read in the past, supercomputers use cpus - thousands of them - to deal with the massive throughput they're designed for.. am i right in thinking they don't use gpus?

thanks very much for the answers so far guys :)

the top 4 supercomputers in the world today use multiple AMD hexcore chips; the 5th fastest is a new one from China which has Intel Xeons but uses 5100-ish 4870s for the processing. So GPUs are certainly being used in supercomputers. I also find it funny they use AMD cards and not Nvidia ones :p

It completely and entirely depends on the calculations being done. The massively parallel nature of GPUs is very suitable for certain calculations and absolutely hopeless for others; the integer power of CPUs is massively, massively greater than that of GPUs, which are all about FPU power. If the program/algorithm/whatever you're running relies on integer performance, a CPU is still the way to go. And AMD, who are supposedly not very into GPGPU, are used in the 5th most powerful computer on the planet - you could put a pretty decent bet on it doing hugely complex FPU computing.

Bulldozer, AMD's next gen core for 2011, looks set to shift the on-die balance of power in favour of integer. This is planning both for the increase in offloading FPU work to GPUs and, once things shrink down to the next process size, for having small GPUs on die, which can handle a heck of a lot of FPU throughput. I've not seen what Intel's post-Nehalem architecture is yet, but I'd put serious money on them moving in the same direction. We know they are putting GPUs on die, and GPUs are FPU monsters - it makes very little sense to dedicate that much FPU power to the "cpu" parts of the die.

Looking even further into the future, you'll see more "custom designable" versions of CPUs, where supercomputer builders can choose between different versions: you have a 16/32 core "cpu" but can pick a version with maybe 15 integer units and 1 FPU, preferring to add discrete GPUs for FPU power (or maybe just not needing FPU power), or a 4 integer, 12 FPU version that's all about on-die FPU power.

When we get to that point, we'll probably see the end of the need to offload FPU work to discrete GPUs, when you can just have more on die.
 
I don't think hardware is the issue to be honest. I think it's more the compiler and program design paradigms.

Programmers are used to working in a serial paradigm with threading being an extension of that serial paradigm.

The latest parallel compilers are also wrestling with the problem that when you run a parallel application on punter A's computer, the architecture requires a specific form of data layout and processing to get the best out of it. That 'form' may not be the best for punter B's computer. Hence the pressure to create JIT processing for parallel code (and the speculation about nV's recruitment of hardware/software engineers with experience in transforming code to work on different structures).

Current compilers, such as GCC, are still based on stone age code analysis for the serial paradigm. This is the reason why CUDA is actually a pre-processor that processes the program before the CPU compiler does.
CUDA requires the programmer to use specific CUDA keywords and to define the data structures in a friendly way. The CUDA pre-processor then steals these areas, turns them into shader programs and substitutes CUDA library calls before passing the rest on to the CPU compiler.

It's well known that C is not a parallel friendly language. C++ is even worse. Hence the appearance of specialist parallel languages.

Make no mistake: supercomputing is about designing the computer around the application - down to the required integer/float mix and even the data structures, to minimise the cross-communication between processing nodes. In short, attempting to maintain a high level of independence so the system works as parallel as possible.
 
I should probably point out I took numeric computation (ie supercomputers) for my degree and have been involved (ATI researcher) since 2006.. it's not something that I bring up at parties.. but with the job issues this has taken a back seat (hence I'm out of touch with the latest stuff)

maybe so but you deff. seem to know 10x more about it than anyone else on this forum...
 
wow, the guy just asked what it was?

why did you write a massive post about how rubbish Nvidia is and all hail ATI?

Wow, if you read what I wrote, it was interesting - as a poster later said, it was on the subject; discussions evolve. The IEEE 754-2008 spec is pretty important, as someone pointed out: it's entirely involved in the GPGPU side of the cards, which is basically where single/double precision accuracy comes into the equation, which is what the OP was about.

I wasn't being anti Nvidia. If you notice, I said they'll HAVE what others have this generation - I didn't say they wouldn't have it, and I wasn't saying they'd be crap at it, though they will be late with it. I just found it funny they are promoting it as new and innovative when, errm, everyone has it. In other posts on this forum, while some people were taking AMD's DP numbers as solid and Nvidia's as bad in comparison, I was pointing out that in reality Nvidia will end up quite a bit more powerful in double precision than AMD. I guess I was being pro AMD there by calling Nvidia faster - boo me and my fanboy ways of complimenting the opposition.

Every company talks up their features; I'm just laughing about it. This time around it's akin to Adidas coming out with a new pair of shoes with that new fangled Velcro stuff on it - woo, how new.

You really can't not mock/have a go at Nvidia these days, even when talking nicely about them. Even if Fermi's the best GPGPU part ever, never bettered, it's still late, big, low yield and expensive as all hell, with few if any features other cards don't have. That's simply the truth; that's how it is. It's not even that much their fault - I'm pretty much the only one who'll point out it's TSMC's fault rather than Nvidia's. Though that goes both ways: the late R600 was also 90% TSMC and very little to do with AMD, though I don't remember any Nvidia fanboys back then blaming anyone but AMD and how crap they are.

Likewise, when Nvidia have a card clocked 20-30% lower than they wanted, it will be largely TSMC's fault, and if they underperform, like the R600, it won't be down to them. I won't forego that little fact just to stick it to Nvidia.
 
I don't think hardware is the issue to be honest. I think it's more the compiler and program design paradigms.


I agree completely in general, except on that first part. Coding needs to shift to be more parallel, but the keyword, I guess, would be "where possible". There will always be different types of data, and certain things will always need to be sequential. With basic CPUs being around for so long, sequential code was the focus, and largely due to the massive use of computers everywhere these days, a significant switch in coding style and computer architecture is just ridiculously hard to implement.

Then you hit a bit of a catch-22 situation. Coders are practiced in current languages and styles of programming, and even if you teach them better methods you still have lots and lots of programs based on the older style that need continuous support and updating and can't just change instantly. So even if everyone were fantastic at programming for parallel and multicore use, they'd still be mostly working on sequential programs, and so that's what they'd stay better and more practiced at.

I'm not really sure how you can avoid that either. In reality most programmers are good at programming the way they have been for years, so when they want to make something new they might try to make it work better and more in parallel, but in the end, if they can make the same program quicker and easier in their usual style, they will. Which keeps them updating and supporting that type of code and not moving on.

It's akin to being stuck at 32-bit for so long. It would have been nice if Vista had been 64-bit only, with no hint of anything 32-bit in sight - just move on and forward quickly - but it's just not possible.

But even after all that, even after everyone's great at writing code for GPUs, you'll still have certain programs and data that run better sequentially and do better on a smaller CPU at a much higher frequency.

That's why a more modular approach to CPU design seems incredibly likely in the future, but when that can happen, and when such chips can be effectively programmed for, isn't clear.
 
Parallel and concurrent programming is quite hard; one of my modules was based around programming with threads in Java.

The amount of planning that goes in before you even type any syntax is time consuming. Avoiding problems like starvation and deadlock can be tricky. I salute anyone who does this type of programming on a daily basis.
 