CUDA: Concurrent Communicating Kernels in CUDA?

Soldato
Joined
13 Mar 2011
Posts
7,485
Location
Bada Bing
Let's say I have a device with 2880 CUDA cores.

I want to run a Monte Carlo simulation where:
-2000 threads are each running a sample
-880 threads are generating random numbers

This is because:
-I only want 2000 samples, so the other 880 would otherwise be sitting idle
-I know that generating random numbers can be slow

Therefore I want to make a pool of random numbers that is replenished continuously by the 880 threads and from which the 2000 sample threads can draw when required.

Is this possible? If so, please provide an example.
 
Associate
Joined
16 Aug 2010
Posts
1,365
Location
UK
Yes, of course, although I am assuming you are not too worried about optimisation at this point? I.e. avoiding warp divergence, minimising global memory accesses and so forth.

As a start, seeing as each thread has an id (as you know, from a combination of the block index, thread index etc.), if the id is less than a certain number, do your Monte Carlo (countless papers on CUDA and Monte Carlo methods no doubt go into this in detail), else do your random number generation.

You could keep this pool of random numbers in global memory (as a start, even though I know it's the slowest). I suppose you could push it into shared memory for each block, ready for a thread within that block to take a number when needed.

Do you really need separate kernels, as in your thread title, instead of just doing the Monte Carlo or the number generation based upon thread position? You can launch and queue kernels up anyway, ready to run as needed.
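As a rough, hedged sketch of that thread-id split (none of the names or sizes below come from this thread - N_SAMPLES, PER_SAMPLE, the kernel name and the use of cuRAND are all assumptions), one way to avoid an in-kernel producer/consumer handshake, which is hard to get right across blocks, is to double-buffer the pool: the sample threads read a buffer that was filled before this launch, while the spare threads refill a second buffer for the next launch, and the host swaps the two pointers between launches.

#include <curand_kernel.h>

#define N_SAMPLES   2000        // threads 0..1999 each run one Monte Carlo sample
#define PER_SAMPLE  256         // worst-case numbers one sample might consume
#define N_RNG       880         // remaining threads act as random-number generators

// One kernel, two roles chosen purely by thread id. 'current' was filled
// before this launch; 'next' is refilled here for the following launch.
__global__ void mc_and_rng(float *results,
                           const float *current, float *next,
                           unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < N_SAMPLES) {
        // Consumer: run one sample, reading its own slice of the current pool.
        const float *my_rands = current + tid * PER_SAMPLE;
        float acc = 0.0f;
        for (int i = 0; i < PER_SAMPLE; ++i)
            acc += my_rands[i];                  // stand-in for the real per-step model
        results[tid] = acc / PER_SAMPLE;
    } else if (tid < N_SAMPLES + N_RNG) {
        // Producer: refill the spare buffer, striding so the writes never overlap.
        curandState rng;
        curand_init(seed, tid, 0, &rng);
        for (int i = tid - N_SAMPLES; i < N_SAMPLES * PER_SAMPLE; i += N_RNG)
            next[i] = curand_uniform(&rng);
    }
}

On the host you would allocate two pools, fill one once up front (e.g. with the cuRAND host API), then alternate which pointer is passed as 'current' and which as 'next' on each launch.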

Been 8 months since I was using CUDA...boy do I miss it :p.
 
Soldato
OP
Joined
13 Mar 2011
Posts
7,485
Location
Bada Bing
Yes, of course, although I am assuming you are not too worried about optimisation at this point? I.e. avoiding warp divergence, minimising global memory accesses and so forth.

The modelling method has inherent warp divergence, unfortunately. I have also tried to minimise the amount of memory the algorithm uses; I thought this would also be a good time-saving feature.

As a start, seeing as each thread has an id (as you know, from a combination of the block index, thread index etc.), if the id is less than a certain number, do your Monte Carlo (countless papers on CUDA and Monte Carlo methods no doubt go into this in detail), else do your random number generation.

Ah! That's a clever way of doing it. The common way is to compute the thread id and then only run the code if the id is within bounds, but we can use the threads that go over that bound to make random numbers for us.

You could keep this pool of random numbers in global memory (as a start, even though I know it's the slowest). I suppose you could push it into shared memory for each block, ready for a thread within that block to take a number when needed.

The problem is I don't know how many random numbers will be required, so I would like a buffer. Maybe if we were to utilise the shared memory, we could keep an array of random numbers the same size as the thread block, i.e.:

block size = 200, therefore generate more random numbers whenever the array holds fewer than 200
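A hedged sketch of that shared-memory idea (block size, kernel and array names are placeholders, and it assumes a large pre-filled pool already sitting in global memory): stage one block-sized batch at a time into shared memory, let threads take what they need via a counter, and refill whenever the batch runs out.

#define BLOCK      200        // threads per block = numbers staged per refill
#define POOL_SIZE  (1 << 20)  // pre-generated numbers available in global memory

__global__ void mc_shared_pool(const float *global_pool, int *global_head,
                               const int *needed_per_sample, float *results)
{
    __shared__ float stage[BLOCK];
    __shared__ int   taken;       // how many staged numbers have been claimed
    __shared__ int   base;        // where this block reads from in the global pool

    int tid       = blockIdx.x * blockDim.x + threadIdx.x;   // assumes the grid exactly covers the samples
    int remaining = needed_per_sample[tid];                   // e.g. anywhere from 5 to 200
    float acc     = 0.0f;

    // Keep looping as long as anyone in the block still needs numbers;
    // __syncthreads_or keeps the whole block in step.
    while (__syncthreads_or(remaining > 0)) {
        if (threadIdx.x == 0) {
            taken = 0;
            base  = atomicAdd(global_head, BLOCK);   // reserve a batch for this block
        }
        __syncthreads();

        // Cooperative refill: each thread fetches one number. The modulo wraps
        // round if the pool is exhausted (fine for a sketch; a real run would
        // size it generously or refill it).
        stage[threadIdx.x] = global_pool[(base + threadIdx.x) % POOL_SIZE];
        __syncthreads();

        // Each thread takes as many staged numbers as it can still use.
        while (remaining > 0) {
            int i = atomicAdd(&taken, 1);
            if (i >= BLOCK) break;        // batch exhausted; wait for the next refill
            acc += stage[i];              // stand-in for the real Monte Carlo step
            --remaining;
        }
    }
    results[tid] = acc;
}

It would be launched with BLOCK threads per block, one sample per thread, with the grid sized so every sample gets a thread.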


Do you really need separate kernels, as in your thread title, instead of just doing the Monte Carlo or the number generation based upon thread position? You can launch and queue kernels up anyway, ready to run as needed.

No, I don't need separate kernels. I just didn't know a better way of doing it.

Been 8 months since I was using CUDA...boy do I miss it :p.

thanks :)
 
Associate
Joined
16 Aug 2010
Posts
1,365
Location
UK
Also, are you thinking 2880 CUDA cores (e.g. the Tesla K40) means 2880 threads? That is not the case, if I recall correctly.

Even so, you don't usually use all the threads at once, due to block sizes, streaming multiprocessors and more things I don't quite remember. There are occupancy rates and so on.
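On the occupancy point, the runtime can suggest a block size for you; a small sketch, where my_kernel, its body and n are stand-ins for whatever you are actually running:

#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) data[tid] *= 2.0f;      // stand-in for the real work
}

void launch(float *d_data, int n)
{
    int min_grid = 0, block = 0;
    // Ask the runtime for a block size that maximises occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, my_kernel, 0, 0);

    int grid = (n + block - 1) / block;  // enough blocks to cover all n work items
    my_kernel<<<grid, block>>>(d_data, n);
}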
 
Soldato
OP
Joined
13 Mar 2011
Posts
7,485
Location
Bada Bing
Also, are you thinking 2880 CUDA cores (e.g. the Tesla K40) means 2880 threads? That is not the case, if I recall correctly.

Even so, you don't usually use all the threads at once, due to block sizes, streaming multiprocessors and more things I don't quite remember. There are occupancy rates and so on.

I don't really understand how threads map to CUDA cores.

But I want to try to use the full GPU to make the code as fast as possible.

(naive CUDA novice here :p)


EDIT:

I wrote some simple code (GTX 580, 512 CUDA cores) and it seemed to perform best when launched like this:

testrand2<<<1, 512>>>(d_simulations_each);

Therefore I assumed that you'd get the best performance when you made 1 thread = 1 CUDA core.
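For what it's worth, a single block only ever runs on one streaming multiprocessor, so a <<<1, 512>>> launch leaves the GTX 580's other SMs idle. A small sketch of spreading the same work over a grid of blocks instead (testrand2 and d_simulations_each are from the post above; the sizes are assumptions):

const int N_THREADS = 2000;                          // e.g. one thread per sample
const int BLOCK     = 128;                           // threads per block
const int GRID      = (N_THREADS + BLOCK - 1) / BLOCK;

testrand2<<<GRID, BLOCK>>>(d_simulations_each);

// Inside the kernel, each thread computes a global index and exits early if
// it falls past the end of the work:
//   int tid = blockIdx.x * blockDim.x + threadIdx.x;
//   if (tid >= N_THREADS) return;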
 
Associate
Joined
16 Aug 2010
Posts
1,365
Location
UK
It really depends on a lot of factors; things can be memory-bandwidth limited (global memory is very slow to access, so CUDA tries to hide memory latency by loading data in while other work is being calculated).

Still, even a "noob" implementation of CUDA is way faster than the multicore CPU equivalent for certain problems (not all; some problems have memory access destroy their performance, for example unstructured meshes in fluid dynamics).

My first CUDA implementation was around 200x faster than solving the problem on 4 cores. After some tuning and learning, it was around 512x (this was on a Tesla K40).
 
Soldato
Joined
13 Jan 2003
Posts
23,666
Just think of the data as a set of pieces that each need processing by a 'thread'. The GPU doesn't have a full 1:1 mapping of data to GPU cores, so it abstracts using 'threads'.

I've been doing OpenCL rather than CUDA, but I did do parallel and distributed systems as the main specialism of my degree. Decided against doing the PhD in supercomputing on perfume movement through rooms and air.

Any parallel problem will have data dependencies for each operation - the gathers and scatters (whether singular or multiple) and the state of each data element in totality.
If you're just running a serial program in parallel... that will work if there are no data concurrency issues.

I'll not mention the IEEE floating point issues GPUs have ;) Although they are better now, they're still not perfect...

For real parallelism you end up breaking the existing way of working down so that it has minimal interdependencies and the integrity remains true over time.

I assume the assignment is "just get this running on the GPU" :)

It used to be that all cores were running at the same program counter - so you had to minimise the number of branches/loops created by data - often it meant splitting the data into different blocks so you could do a full parallel process on each without having one 'thread' or one core holding the others up waiting to finish.

That same issue causes problems if the data I/O is strangled - it's better for the task to have a smaller number of cores that can saturate the memory without one core oversaturating things and causing stalling. Having the data in the right format for the best read/write access is also a big point for processing.
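A hedged illustration of that "right format" point (the kernel names and the factor of 2 are purely illustrative): consecutive threads touching consecutive addresses coalesce into a few memory transactions, while each thread walking its own private chunk scatters the warp's accesses.

__global__ void coalesced(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid] * 2.0f;            // neighbouring threads read neighbouring floats
}

__global__ void strided(const float *in, float *out, int n, int chunk)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < chunk; ++i) {
        int idx = tid * chunk + i;            // each thread walks its own private chunk...
        if (idx < n)
            out[idx] = in[idx] * 2.0f;        // ...so a warp hits widely scattered addresses
    }
}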

nVidia confuse things - they've made a virtual model that you program to (kind of like Java) and then the drivers have to map that to the hardware available on the card.

GPU cores now are grouped - offering shared memory between the grouped cores... this is faster but can slow the cores down due to concurrency or replication issues.

The last thing I did on the GPU was a self-modelling 2D Finite Impulse Response filter that could deconvolve the image given a 2D non-symmetrical Airy disk. It's so fast because everything is processed in parallel using local memory where possible, even with splitting the filter to maximise parallel operations and data independence.
 
Soldato
Joined
13 Jan 2003
Posts
23,666
Personally - you could create a pool/cache of random numbers ahead of use. Then do the simulation, and then for the next group of 2800, create the random numbers, then the simulation...

The problem is you've got the situation where you're attempting to halt the PC if the consuming threads run out of random numbers.
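A minimal sketch of that "pool ahead of use" idea using the cuRAND host API (the sizes and the commented-out kernel name are assumptions, not from this thread): generate a big batch of uniforms straight into device memory, launch the simulation, then regenerate before the next batch. Built with nvcc and linked against -lcurand.

#include <cuda_runtime.h>
#include <curand.h>

int main(void)
{
    const size_t N_SAMPLES  = 2000;
    const size_t PER_SAMPLE = 500;                 // generous upper bound per sample
    const size_t POOL       = N_SAMPLES * PER_SAMPLE;

    float *d_pool;
    cudaMalloc(&d_pool, POOL * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_pool, POOL);      // fill the pool on the device

    // monte_carlo<<<grid, block>>>(d_pool, PER_SAMPLE /*, ... */);
    // cudaDeviceSynchronize();
    // ...regenerate and launch again for the next group...

    curandDestroyGenerator(gen);
    cudaFree(d_pool);
    return 0;
}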
 
Soldato
OP
Joined
13 Mar 2011
Posts
7,485
Location
Bada Bing
Just think of the data as a set of pieces that each need processing by a 'thread'. The GPU doesn't have a full 1:1 mapping of data to GPU cores, so it abstracts using 'threads'.

I've been doing OpenCL rather than CUDA, but I did do parallel and distributed systems as the main specialism of my degree. Decided against doing the PhD in supercomputing on perfume movement through rooms and air.

Any parallel problem will have data dependencies for each operation - the gathers and scatters (whether singular or multiple) and the state of each data element in totality.
If you're just running a serial program in parallel... that will work if there are no data concurrency issues.

I'll not mention the IEEE floating point issues GPUs have ;) Although they are better now, they're still not perfect...

This is a Monte Carlo analysis, so thankfully we don't need a huge amount of accuracy as it will all be averaged anyway.

For real parallelism you end up breaking the existing way of working down so that it has minimal interdependencies and the integrity remains true over time.

Yes, I am already thinking about how to rewrite the code.

I assume the assignment is "just get this running on the GPU" :)

We do want a good amount of speed-up as well, but it doesn't need to be the most optimised thing in the world :)

It used to be that all cores were running at the same program counter - so you had to minimise the number of branches/loops created by data - often it meant splitting the data into different blocks so you could do a full parallel process on each without having one 'thread' or one core holding the others up waiting to finish.

It used to be? Has that changed? I thought branching was still a major issue.

That same issue causes problems if the data I/O is strangled - it's better for the task to have a smaller number of cores that can saturate the memory without one core oversaturating things and causing stalling. Having the data in the right format for the best read/write access is also a big point for processing.

nVidia confuse things - they've made a virtual model that you program to (kind of like Java) and then the drivers have to map that to the hardware available on the card.

GPU cores now are grouped - offering shared memory between the grouped cores... this is faster but can slow the cores down due to concurrency or replication issues.

The last thing I did on the GPU was a self-modelling 2D Finite Impulse Response filter that could deconvolve the image given a 2D non-symmetrical Airy disk. It's so fast because everything is processed in parallel using local memory where possible, even with splitting the filter to maximise parallel operations and data independence.

Wow - impressive

Personally - you could create a pool/cache of random numbers ahead of use. Then do the simulation, and then for the next group of 2800, create the random numbers, then the simulation...

The problem is I don't know how many random numbers each sample is going to need - it could be 5, it could be 200. So I can't just generate a set amount, as that will either be too few or waste time with too many.

The problem is you've got the situation where you're attempting to halt the PC if the consuming threads run out of random numbers.
 
Soldato
Joined
13 Jan 2003
Posts
23,666

It used to be? Has that changed? I thought branching was still a major issue.

In the old GPUs all the core program counters were tied together. In modern GPUs the cores are often grouped - each group has its own program counter that ties together only that group's cores. It means you get more parallel execution when you have loops and the like; it allows loops, but then requires concurrency synchronisation points to be coded into the kernel.
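A hedged illustration of such a synchronisation point (the block size of 256 and the doubling are arbitrary): the branches diverge, but every thread in the block must reach __syncthreads() before any of them can safely read what the others wrote into shared memory.

__global__ void sync_point_example(float *data, int n)
{
    __shared__ float stage[256];          // assumes up to 256 threads per block
    int t   = threadIdx.x;
    int tid = blockIdx.x * blockDim.x + t;

    if (tid < n && (t & 1) == 0)
        stage[t] = data[tid] * 2.0f;      // even threads take one branch...
    else
        stage[t] = 1.0f;                  // ...odd (or out-of-range) threads take the other

    __syncthreads();                      // the whole block re-converges here

    if (tid < n)
        data[tid] = stage[(t + 1) % blockDim.x];   // safe: the neighbour's slot is now written
}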

The problem is I don't know how many random numbers each sample is going to need - it could be 5, it could be 200. So I can't just generate a set amount, as that will either be too few or waste time with too many.

When compressed into RGBA, your 200 is very small... The images I was processing were tens of megabytes... so caching generated random numbers is an option.

If your GPU can support grouped cores, then you could use a group to generate random numbers.
 
Soldato
OP
Joined
13 Mar 2011
Posts
7,485
Location
Bada Bing
When compressed into RGBA, your 200 is very small... The images I was processing were tens of megabytes... so caching generated random numbers is an option.

If your GPU can support grouped cores, then you could use a group to generate random numbers.

This sounds very interesting - could you flesh out some more details, please? :)

So, could I use one group of cores to generate random numbers whilst the others are calculating?

I would like to run 2000 samples, and each one will need many (~200-500) random numbers (floats).
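For a variable draw count like that, one common pattern - not something suggested earlier in the thread, so treat it as an assumption - is to skip the pool entirely and give each sample thread its own cuRAND state, drawing numbers on demand. A minimal sketch with placeholder names (needed_per_sample and the averaging stand in for the real model, which would simply draw whenever it needs a number):

#include <curand_kernel.h>

#define N_SAMPLES 2000

__global__ void mc_on_demand(float *results, const int *needed_per_sample,
                             unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= N_SAMPLES) return;

    curandState rng;
    curand_init(seed, tid, 0, &rng);      // one independent sequence per sample

    int   n   = needed_per_sample[tid];   // however many this sample turns out to need
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += curand_uniform(&rng);      // stand-in for the real per-step model
    results[tid] = acc / n;
}

// Launch, e.g.: mc_on_demand<<<(N_SAMPLES + 255) / 256, 256>>>(d_results, d_needed, 1234ULL);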
 