Let Battle Commence

So you will not be using any sort routines on your computer. Excellent, good luck with that, Final8y :p

I'm not using any CUDA routines.
The thing is, you're purely talking about what could be and not what is, and most people like "what is" before parting with cash.
Unless you're saying that CUDA is running right now on my ATI card without me realising it.
I will not bother sticking my tongue out; it would make me look childish.
 
Sheesh, this thread is painful to read. A lot of opinions being expressed based on poor information.

To those talking about CUDA and other GPGPU applications - how many of you have actually written anything in CUDA? I have, and I can assure you that while it is a very powerful tool for interfacing with the GPU, and very convenient due to its similarity to C (which everyone is already familiar with), it is certainly not the "magic bullet" of computing. Far from it.

Anyone suggesting that a GPU can be used to accelerate "almost anything" doesn't understand parallelism at the algorithmic level, or how GPUs operate internally. There are a very limited number of applications that a GPU can accelerate. Very few algorithms can be distributed without massive loss of efficiency due to the absence of rapid internal communication.

While CUDA is convenient, it's also highly limited in scope (even with respect to what is possible with GPUs). OpenCL offers a much wider scope, although it is more difficult to get into initially. As with all hardware-specific and/or proprietary formats, CUDA is destined for a short lifespan. OpenCL offers the most logical, portable and hardware-independent access to the parallel floating point power of a GPU. There is very little reason to write anything in CUDA now, except for convenience. As far as scientific computing goes (which is my field), the nice thing about CUDA so far has been the availability of standard linear algebra packages (e.g. BLAS) for CUDA. Since these can now be executed in OpenCL, most of the attention is now being focussed here. No-one wants to deal with a hardware-specific API unless they have no other choice.

Anyway, do you guys think there is a chance you could stop just urinating into the wind, and start discussing the upcoming nvidia hardware for a change? Do we have any more information (or even speculation) on the spec? Have there been any new hints as to whether nvidia have gone for a MIMD approach? If they HAVE then this would be a massive coup as far as GPGPU computing goes, and would greatly expand the number of algorithms (and hence end-user applications) which could benefit from acceleration with a GPU. A MIMD approach would truly give people interested in GPGPU applications a reason to go for nvidia. But CUDA is certainly not that - at least not any more.
 
Do we have any more information (or even speculation) on the spec? Have there been any new hints as to whether nvidia have gone for a MIMD approach? If they HAVE then this would be a massive coup as far as GPGPU computing goes, and would greatly expand the number of algorithms (and hence end-user applications) which could benefit from acceleration with a GPU.

nVidia always planned on a MIMD architecture for the 300 series, and all the work they've done has been based on that approach, so it would be pretty odd if they came out without it...
 
Sheesh, this thread is painful to read. A lot of opinions being expressed based on poor information.

To those talking about CUDA and other GPGPU applications - how many of you have actually written anything in CUDA? I have, and I can assure you that while it is a very powerful tool for interfacing with the GPU, and very convenient due to its similarity to C (which everyone is already familiar with), it is certainly not the "magic bullet" of computing. Far from it.

Anyone suggesting that a GPU can be used to accelerate "almost anything" doesn't understand parallelism at the algorithmic level, or how GPUs operate internally. There are a very limited number of applications that a GPU can accelerate. Very few algorithms can be distributed without massive loss of efficiency due to the absence of rapid internal communication.

While CUDA is convenient, it's also highly limited in scope (even with respect to what is possible with GPUs). OpenCL offers a much wider scope, although it is more difficult to get into initially. As with all hardware-specific and/or proprietary formats, CUDA is destined for a short lifespan. OpenCL offers the most logical, portable and hardware-independent access to the parallel floating point power of a GPU. There is very little reason to write anything in CUDA now, except for convenience. As far as scientific computing goes (which is my field), the nice thing about CUDA so far has been the availability of standard linear algebra packages (e.g. BLAS) for CUDA. Since these can now be executed in OpenCL, most of the attention is now being focussed here. No-one wants to deal with a hardware-specific API unless they have no other choice.

Anyway, do you guys think there is a chance you could stop just urinating into the wind, and start discussing the upcoming nvidia hardware for a change? Do we have any more information (or even speculation) on the spec? Have there been any new hints as to whether nvidia have gone for a MIMD approach? If they HAVE then this would be a massive coup as far as GPGPU computing goes, and would greatly expand the number of algorithms (and hence end-user applications) which could benefit from acceleration with a GPU. A MIMD approach would truly give people interested in GPGPU applications a reason to go for nvidia. But CUDA is certainly not that - at least not any more.

Indeed, the GPU has many limitations, which I read about a few months ago.
When you look at the OpenCL API, you'll understand very quickly that the CPU has some skills today that the GPU does not have yet... With many cores, you need a lot of cache to store the data between the steps of your OpenCL procedure calls. If you go off the socket or off your GPU, your performance sucks... The tricks that nVidia use with CUDA only work if you touch your data once. NV uses the thread scheduler to freeze a thread when it stalls on a memory access, move on to the next thread, and come back to the first one when the memory request is done. It's a hardware scheduler.
This works when your algorithm is not f(n-1), i.e. when iteration n has no dependence on iteration n-1 (see the sketch below).
Whatever you have seen out of CUDA so far is a bunch of corner-case algorithms, and to prove that it is not generic: you cannot get any version of SPECint or SPECfp running on those GPUs, even though some parts of SPEC are "very parallelisable".
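
To illustrate that f(n-1) point with a minimal sketch (the kernel and function names below are made up for illustration, not taken from any real code): the first case maps cleanly onto a GPU because every element can be computed independently, while the second has a loop-carried dependence, so thousands of threads buy you nothing.

Code:
__global__ void scale(const float *in, float *out, float k, int n)
{
    // Case 1: out[i] depends only on in[i], so the hardware scheduler can hide
    // memory stalls by switching between thousands of independent threads.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];
}

void recurrence(const float *in, float *out, float a, int n)
{
    // Case 2: the f(n-1) situation. Each result needs the previous one,
    // so the work is inherently serial and a GPU brings nothing.
    out[0] = in[0];
    for (int i = 1; i < n; ++i)
        out[i] = a * out[i - 1] + in[i];
}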

AMD found that x86 is very flexible for 128-bit vectors, with many loops using each other's results... and that is the base of x86 (via SSE2)! I am sure they will add more support for the GPU where it makes sense, where the work is massively parallel and can take advantage of GPU acceleration. With AVX coming, it will be a really big challenge to beat the processor at 256-bit float and double processing.

The other part is the memory size limit; today's GPUs are seriously limited, and the places where you need TFLOPS also require a lot of memory... Right now, GPGPUs are using their GDDR5 as a cache to main memory, and PCI Express is a very poor cache protocol...

GPUs live in a world that requires the programmer to make their code 100% perfect: loads have to be perfectly aligned, stores too, and vectorisation has to be done exactly right... It is a world of pain, and if you don't have a programmer with a PhD in parallelism or SIMD, you will not see the end of the project... If you want OpenCL to perform well on a GPU, you still have to plan for all of those tricks.
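
As a rough sketch of the "loads have to be aligned" point (hypothetical kernels, just to show the access patterns): adjacent threads reading adjacent addresses get coalesced into a handful of wide memory transactions, while a strided pattern turns into many partial transactions and throughput collapses.

Code:
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    // Thread i touches element i: neighbouring threads hit neighbouring
    // addresses, so the accesses coalesce into few wide transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    // Neighbouring threads hit addresses 'stride' elements apart, so the
    // hardware issues many mostly-empty transactions and bandwidth is wasted.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = i * stride;
    if (j < n)
        out[j] = in[j];
}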

So, AMD will probably increase their support for GPGPU over time; they have their own stable starting point, and it is always x86.
You will see the CPU winning all the very iterative workloads in OpenCL.
http://www.xtremesystems.org/forums/showpost.php?p=3952704&postcount=11
 
Yeah, because a sort function is ooohhhh so specialised.

Dangerstat... If you actually think you can design an algorithm to perform data sorting, which can scale to arbitrary numbers of independent threads, then go for it. Publish an academic paper - everyone will want to kiss your arse. Or better yet - keep it to yourself and design some useful software with it, then sell it for millions.

I'd be interested to find out why you think data searching / sorting algorithms are suitable for GPU implementation :confused: Anyway, as it is we have plenty of N*log(N) complexity algorithms which can be executed efficiently in real time on a CPU for lists up to N in the billions.



Edit - good news regarding MIMD, Roff :) I have my fingers crossed!
 
GPUs live in a world that requires the programmer to make their code 100% perfect: loads have to be perfectly aligned, stores too, and vectorisation has to be done exactly right... It is a world of pain, and if you don't have a programmer with a PhD in parallelism or SIMD, you will not see the end of the project... If you want OpenCL to perform well on a GPU, you still have to plan for all of those tricks.

You just need a programmer with good instincts who will know what parts of the code can take advantage of the GPU along with the limitations... it doesn't take that much knowledge, just a bit of imagination... unfortunately there are waaaay too many career programmers flooding the market who are very, very good at the theory but generally rubbish at thinking outside the box or having a real feel for programming.
 
Dangerstat... If you actually think you can design an algorithm to perform data sorting, which can scale to arbitrary numbers of independent threads, then go for it. Publish an academic paper - everyone will want to kiss your arse. Or better yet - keep it to yourself and design some useful software with it, then sell it for millions.

I'd be interested to find out why you think data searching / sorting algorithms are suitable for GPU implementation :confused: Anyway, as it is we have plenty of N*log(N) complexity algorithms which can be executed efficiently in real time on a CPU for lists up to N in the billions.



Edit - good news regarding MIMD, Roff :) I have my fingers crossed!

Depends what kind of sorting you're talking about... something like a basic bubble sort routine can be expanded for use on CUDA with a fair degree of scalability* and massive performance gains... but for anyone interested, look up the studies on using the Smith-Waterman algorithm on GPGPU and see just how complicated it can be to port it over, but it's also a good example of just how fast GPGPU can be...

EDIT: * I guess in real industrial use, though, you'd be dropping to main memory relatively quickly, losing your main performance gains.
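
For anyone wondering what "expanding" a bubble sort for CUDA actually looks like, here's a rough sketch of the usual approach, an odd-even transposition sort. This is my own illustration, not code from any of those studies, and it handles a single block only.

Code:
__global__ void odd_even_sort(int *data, int n)
{
    // Parallel form of bubble sort: in each phase, n/2 threads compare-and-swap
    // disjoint neighbouring pairs, alternating between even and odd pairings.
    int tid = threadIdx.x;
    for (int phase = 0; phase < n; ++phase) {
        int i = 2 * tid + (phase & 1);
        if (i + 1 < n && data[i] > data[i + 1]) {
            int tmp = data[i];
            data[i] = data[i + 1];
            data[i + 1] = tmp;
        }
        __syncthreads();   // finish every swap in this phase before the next one
    }
}

// e.g. odd_even_sort<<<1, n / 2>>>(d_data, n);  (single block, so small n only)

Anything bigger than one block has to be split up and merged, which is where the main-memory caveat above starts to bite.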
 
Final8ty - what you have quoted touches on the key issue for efficiency of parallelisation. If your algorithm relies on constant access to updated data, then it is very inefficient to make a parallel implementation of it (irrespective of the API used). You will spend all your time waiting for data to be updated, and only ever use one, or a very small number, of pipelines.

Algorithms which can be broken into separate and semi-independent "chunks" are suitable for GPU implementation. Pixel processing is one of these, which is to be expected since it drove the development of GPUs long before the days of programmable shaders etc. Others include computational modeling, where individual cells or elements can require a large amount of computation before they need to be passed updated data.
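
A minimal sketch of that per-pixel case (an illustrative kernel, not from any real application): every pixel is processed completely independently, which is exactly the kind of "chunk" a GPU handles well.

Code:
__global__ void to_greyscale(const uchar4 *rgba, unsigned char *grey, int n_pixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pixels) {
        uchar4 p = rgba[i];
        // Standard luminance weights; no pixel ever needs data from another pixel.
        grey[i] = (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
    }
}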

The natural progression for this technology is the merging of CPU and GPU onto one piece of silicon. This will allow latencies to be minimised (waiting for data stored on the same piece of silicon is a hell of a lot faster than waiting for it to travel through the PCI-e port and then through GPU VRAM, etc.). Once we have this at a hardware level, the real difficulty will fall to the design of suitable compilers. These compilers will need to analyse the structure of the program and distribute the data for the best efficiency, taking into account multiple CPU cores and the massive number of GPU pipelines.

I believe that combined CPU-GPUs will be commonplace within 5-10 years, but I honestly think that designing effective systems for automatically compiling and distributing code like this will be one of the major computing challenges of the 21st century. We will reach a plateau in hardware long before we take full advantage of its power for general applications.
 
Sheesh, this thread is painful to read. A lot of opinions being expressed based on poor information.

To those talking about CUDA and other GPGPU applications - how many of you have actually written anything in CUDA? I have, and I can assure you that while it is a very powerful tool for interfacing with the GPU, and very convenient due to its similarity to C (which everyone is already familiar with), it is certainly not the "magic bullet" of computing. Far from it.

Anyone suggesting that a GPU can be used to accelerate "almost anything" doesn't understand parallelism at the algorithmic level, or how GPUs operate internally. There are a very limited number of applications that a GPU can accelerate. Very few algorithms can be distributed without massive loss of efficiency due to the absence of rapid internal communication.

While CUDA is convenient, it's also highly limited in scope (even with respect to what is possible with GPUs). OpenCL offers a much wider scope, although it is more difficult to get into initially. As with all hardware-specific and/or proprietary formats, CUDA is destined for a short lifespan. OpenCL offers the most logical, portable and hardware-independent access to the parallel floating point power of a GPU. There is very little reason to write anything in CUDA now, except for convenience. As far as scientific computing goes (which is my field), the nice thing about CUDA so far has been the availability of standard linear algebra packages (e.g. BLAS) for CUDA. Since these can now be executed in OpenCL, most of the attention is now being focussed here. No-one wants to deal with a hardware-specific API unless they have no other choice.

Anyway, do you guys think there is a chance you could stop just urinating into the wind, and start discussing the upcoming nvidia hardware for a change? Do we have any more information (or even speculation) on the spec? Have there been any new hints as to whether nvidia have gone for a MIMD approach? If they HAVE then this would be a massive coup as far as GPGPU computing goes, and would greatly expand the number of algorithms (and hence end-user applications) which could benefit from acceleration with a GPU. A MIMD approach would truly give people interested in GPGPU applications a reason to go for nvidia. But CUDA is certainly not that - at least not any more.

I write CUDA applications; in fact I have done so for about the last year. This "Anyone suggesting that a GPU can be used to accelerate 'almost anything' doesn't understand parallelism at the algorithmic level" is quite frankly crap. I've not found anything of any importance that I have been unable to accelerate with a bit of crafty coding utilising a GPU with a decent number of cores; whilst it might not be pretty or overly efficient, the resultant application *is* quicker. Those that have proved difficult are things that require too much host-card communication or things that require features only available in 200-series cards, and even then it *very* rarely needs massive amounts of work to fudge a solution.

Like I've said (numerous times), there are certain applications where the speed-up is significant, but I've not found anything yet that I've had problems speeding up. It usually just takes a slight rethink of your approach.
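
To give a rough idea of the host-card communication point (the kernel below is a placeholder I've made up, not anything from my actual applications): copy the data over PCI-e once, keep every iteration on the card, and copy the result back once at the end.

Code:
#include <cuda_runtime.h>

__global__ void process_step(float *d, int n)       // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] *= 1.0001f;
}

void run_on_gpu(float *host_data, int n, int iterations)
{
    float *dev_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&dev_data, bytes);
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);  // one copy in

    for (int it = 0; it < iterations; ++it)
        process_step<<<(n + 255) / 256, 256>>>(dev_data, n);         // stays on card

    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);  // one copy out
    cudaFree(dev_data);
}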

Perhaps you should stop "urinating into the wind"
 
Depends what kind of sorting you're talking about... something like a basic bubble sort routine can be expanded for use on CUDA with a fair degree of scalability* and massive performance gains... but for anyone interested, look up the studies on using the Smith-Waterman algorithm on GPGPU and see just how complicated it can be to port it over, but it's also a good example of just how fast GPGPU can be...

A fair point, but if you're dealing with lists small enough for things like bubble-sort algorithms (which are order N^2) to be effective then there is no need to even consider a GPU implementation. If you're dealing with larger datasets, then switching to a more efficient algorithm (eg one of the N*log(N) ones) is going to give you a far bigger improvement as the dataset size grows.


Regarding the Smith-Waterman algorithm (which isn't really a 'search' algorithm, but is admittedly applied to a similar problem), Wikipedia has this to say about its GPU implementation:

A GPGPU implementation of the algorithm in the CUDA language by NVIDIA is also available.[9] When compared to the best known CPU implementation (using SIMD instructions on the x86 architecture), by Farrar, the performance tests of this solution using a single NVidia GeForce 8800 GTX card show a slight increase in performance for smaller sequences, but a slight decrease in performance for larger ones. However the same tests running on dual NVidia GeForce 8800 GTX cards are almost twice as fast as the Farrar implementation for all sequence sizes tested.

So, a slight improvement. Nothing near the x100 - x1000 improvements which can be seen with truly parallelisable algorithms though.
 
You just need a programmer with good instincts who will know what parts of the code can take advantage of the GPU along with the limitations... it doesn't take that much knowledge, just a bit of imagination... unfortunately there are waaaay too many career programmers flooding the market who are very, very good at the theory but generally rubbish at thinking outside the box or having a real feel for programming.

Sorry, but I disagree with the 'You just need a programmer with good instincts...' etc. After defining the problem via a mathematical model and then the numerical model, you probably have to refactor the discretised equation sets to account for highly parallel operation and then implement that lot in the most effective way possible... I've met some 'very good programmers' who just didn't have a clue how to visualise this stuff! (My thesis was in Parallel Solution of Computational Fluid Dynamics Problems - wish I had a GTX295 back then :D) Now, a highly trained programmer with good hardware knowledge might just do a good job... libraries will go a long way, but not ALL the way.
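
As a small example of the kind of refactoring I mean (textbook stuff, not taken from my thesis): a Gauss-Seidel sweep updates each cell from freshly updated neighbours, so it is essentially serial, whereas recasting it as a Jacobi sweep that writes to a second array makes every cell update independent and therefore GPU-friendly, at the cost of slower convergence.

Code:
__global__ void jacobi_step(const float *u_old, float *u_new, int nx, int ny)
{
    // Every interior cell reads only the previous iterate (u_old), never u_new,
    // so all cells can be updated in parallel with no ordering constraints.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        u_new[j * nx + i] = 0.25f * (u_old[j * nx + (i - 1)] +
                                     u_old[j * nx + (i + 1)] +
                                     u_old[(j - 1) * nx + i] +
                                     u_old[(j + 1) * nx + i]);
    }
}

// The host swaps u_old and u_new after each step and repeats until converged.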
 
Sorry, but I disagree with the 'You just need a programmer with good instincts...' etc. After defining the problem via a mathematical model and then the numerical model, you probably have to refactor the discretised equation sets to account for highly parallel operation and then implement that lot in the most effective way possible...

Definitely. It's a question of the nature of the mathematical algorithm, not just the talent of the programmer. In some cases it's possible to adapt the algorithm to be more conducive to parallelisation, for a reasonable reduction in efficiency, but usually it's not realistic.

...as a side note, I could really do with someone with your background working on my current project. We had someone similar, but he left last November. We've been missing someone specialised in parallel implementations ever since, and we've suffered for it! Guys who know this stuff well and also understand the basics of CFD and other computational modelling are like fekking hen's teeth!
 
I think ATi is preparing themselves for a price drop. They positioned the 5870 next to the GTX 285 - about the same price, and better performance. It's not as fast as a 295 overall (though it's close in some cases), but it is significantly cheaper. I expect their 5870x2 will beat the 295 quite comfortably.

However, the 4890 is now HALF the price of the 5870, and as its performance sits somewhere between a 275 and a 285 at the moment, that makes it an incredibly good deal. I think this means ATi has room to manoeuvre to bring the price of the 5870 down. Something like:

Late September: 5870 launches, loads of early adopters buy it.
Early November: GT300 launches, is faster than the 5870, but more expensive (say £350).
Mid-November: 5870 drops to £250, the 5870x2 releases at £400, everyone goes crazy with their Christmas shopping.

(This is pure speculation, I have no insider information :) )

I think ATi got the timing right on this, putting pressure on nVidia to come in at an incredibly attractive price point just to be able to compete.

One of the tech sites was saying Nvidia are gunning for Black Friday. Thanksgiving is the Thursday (26th) this year, so it will be that Friday (27-11-2009). At least, they would hope to be out by then.

AMD are also supposedly already working on the 5890. I really doubt we will see a 'GTX 380' for less than £399 this year, unless the 5870X2 pulls off some biblical results.

When was the last time they released a flagship card sub £400 on launch anyway?
 
So, a slight improvement. Nothing near the x100 - x1000 improvements which can be seen with truly parallelisable algorithms though.

The latest set of info on implementing those kinds of algorithms showed up to 50x gains on a G92 core, depending on many factors, but even the worst case was still over twice as fast as a quad-core CPU.

Sorry, but I disagree with the 'You just need a programmer with good instincts...' etc. After defining the problem via a mathematical model and then the numerical model, you probably have to refactor the discretised equation sets to account for highly parallel operation and then implement that lot in the most effective way possible... I've met some 'very good programmers' who just didn't have a clue how to visualise this stuff! (My thesis was in Parallel Solution of Computational Fluid Dynamics Problems - wish I had a GTX295 back then :D) Now, a highly trained programmer with good hardware knowledge might just do a good job... libraries will go a long way, but not ALL the way.

Problem visualisation is the key... but a good instinctive programmer can shortcut (admittedly it's not good practice) all the modelling and just know what can be done and what can't... whereas career programmers have to rely on what they've been taught, which usually means feasibility studies, problem modelling, etc., and a lot of time wasted to get to the same place that a more natural programmer would have reached in little more than a split second.
 
The latest set of info on implementing those kinds of algorithms showed up to 50x gains on a G92 core, depending on many factors, but even the worst case was still over twice as fast as a quad-core CPU.

IIRC, didn't Nvidia demonstrate a radix sort algorithm up to 4 times faster than a multi-core CPU setup too?
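
I can't say which demo that was, but for anyone curious how little code a GPU sort takes these days, the free Thrust library does it in a couple of lines; as far as I know it dispatches to a radix sort for plain integer keys. A rough sketch:

Code:
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <vector>
#include <cstdlib>

int main()
{
    std::vector<int> h(1 << 20);                        // ~1M random keys
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = std::rand();

    thrust::device_vector<int> d(h.begin(), h.end());   // copy to the card
    thrust::sort(d.begin(), d.end());                   // sorted on the GPU
    thrust::copy(d.begin(), d.end(), h.begin());        // copy the result back
    return 0;
}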
 
I have visited the future and all looks good. ;)

Looks like it's 2006 all over again for ATI :(

[attached image: nvidiaftw.png]
 
Speaking of anal leakage.....

Quote:
Originally Posted by Wilderbeast
In £ that's £199 and £299.



Did you pull those figures out of a dark crack?

When the 3800s and 4800s came out, they were $199 and $299.

People need to get over this 'rip off Britain' nonsense.

We pay the exchange rate equivalent + VAT.

That makes a 5850 £140 and a 5870 £210.

What went wrong?
 