Just think of the data as a set of pieces that each need processing by a 'thread'. The GPU doesn't have a full 1:1 mapping of data elements to GPU cores, so it abstracts over them using 'threads'.
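Roughly like this, if it helps (a minimal CUDA sketch, not your assignment code; the kernel name and launch sizes are made up):

```cuda
#include <cuda_runtime.h>

// One logical 'thread' per data element. The grid-stride loop covers the
// case where there are far more elements than threads launched.
__global__ void scale(float *data, int n, float factor)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    scale<<<256, 256>>>(d, n, 2.0f);  // fewer threads than elements is fine
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```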
I've been doing OpenCL rather than CUDA, but I did do parallel and distributed systems as the main specialism of my degree. Decided against doing the PhD in supercomputing (modelling perfume movement through rooms and air).
Any parallel problem will have data dependencies for each operation: the gathers and scatters (whether single or multiple), and the state of each data element taken as a whole.
If you're just running a serial program in parallel... that will work as long as there are no data concurrency issues.
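To make the gather/scatter distinction concrete (a hedged CUDA sketch; `scatter_add` and friends are just illustrative names): gathers are reads and can't race, scatters are writes and can.

```cuda
#include <cuda_runtime.h>

// Gather: each thread READS from an indexed location - no race possible,
// even if two threads read the same source element.
__global__ void gather(const float *src, const int *idx, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[idx[i]];
}

// Scatter: each thread WRITES to an indexed location. If two threads share
// a target index, a plain store is exactly the data concurrency issue
// above, hence the atomic accumulate.
__global__ void scatter_add(const float *src, const int *idx, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&dst[idx[i]], src[i]);
}
```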
I'll not mention the IEEE floating point issues GPUs have
Although they are better now, they're still not perfect..
This is a Monte Carlo analysis, so thankfully we don't need a huge amount of accuracy as it will all be averaged anyway.
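That's the nice property of Monte Carlo: per-sample float error washes out in the average. A toy CUDA version of the idea (estimating pi; the seed and trial counts are arbitrary):

```cuda
#include <cstdio>
#include <curand_kernel.h>

// Each thread runs its own independent trials and accumulates hits into a
// global counter; single-precision noise per trial averages out.
__global__ void pi_trials(unsigned long long seed, int trials_per_thread,
                          unsigned long long *hits)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState st;
    curand_init(seed, tid, 0, &st);

    unsigned long long local = 0;
    for (int t = 0; t < trials_per_thread; ++t) {
        float x = curand_uniform(&st);
        float y = curand_uniform(&st);
        if (x * x + y * y <= 1.0f) ++local;
    }
    atomicAdd(hits, local);
}

int main()
{
    const int blocks = 64, threads = 256, per_thread = 4096;
    unsigned long long *d_hits, h_hits = 0;
    cudaMalloc(&d_hits, sizeof(*d_hits));
    cudaMemset(d_hits, 0, sizeof(*d_hits));
    pi_trials<<<blocks, threads>>>(1234ULL, per_thread, d_hits);
    cudaMemcpy(&h_hits, d_hits, sizeof(h_hits), cudaMemcpyDeviceToHost);
    double total = (double)blocks * threads * per_thread;
    printf("pi ~= %f\n", 4.0 * h_hits / total);
    cudaFree(d_hits);
    return 0;
}
```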
For real parallelism you end up breaking the existing way of working down so that it has minimal interdependencies and its integrity remains true over time.
Yes, I am already thinking about how to re-write the code
I assume the assignment is "just get this running on the GPU"
We do want a good amount of speed-up as well, but it doesn't need to be the most optimised thing in the world
It used to be that all cores ran at the same program counter, so you had to minimise the number of branches/loops driven by the data. Often that meant splitting the data into different blocks so you could do a fully parallel pass on each, without one 'thread' or one core holding the others up while they waited for it to finish.
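A toy sketch of the kind of data-driven branch that hurts (CUDA; the kernel name and maths are made up):

```cuda
#include <cuda_runtime.h>

// If threads executing together disagree at the branch, both paths get
// executed one after the other, with the non-taken side's threads masked
// off and idle - so the cheap path still pays for the expensive one.
__global__ void divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)
        out[i] = expf(in[i]);  // expensive path
    else
        out[i] = 0.0f;         // cheap path, but it waits for the other
}
```

Splitting the data up front so each block takes a single path, as described above, is the classic way round it.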
it used to be? has that changed? I thought branching was still a major issue
The same sort of problem appears if the data I/O is strangled: it's better for the task to have a smaller number of cores that can saturate the memory bandwidth than to have one core over-saturate it and cause stalling. Having the data in the right format for the best read/write access is also a big point for processing.
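For the data-format point, the classic illustration is coalescing (CUDA sketch; the stride is only there to fake a bad layout, and the toy indexing assumes `i * stride` stays within int range):

```cuda
// Coalesced: consecutive threads touch consecutive addresses, so a group
// of threads' loads collapse into a handful of memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: same data volume, but neighbouring threads' loads are scattered
// across memory, needing many more transactions - the strangled I/O case.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];
}
```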
nVidia confuse things: they've made a virtual model (the PTX 'virtual ISA') that you program to, kind of like Java, and then the drivers have to map that to the hardware available on the card.
GPU cores are now grouped, offering shared memory between the cores in a group.. this is faster but can slow the cores down due to concurrency or replication issues.
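In CUDA terms that grouping is a thread block sharing `__shared__` memory (sketch; assumes n is an exact multiple of the 256-thread block size):

```cuda
// Each block stages a tile in fast on-chip shared memory. The barrier is
// the concurrency cost: every thread in the block must reach it before any
// thread can safely read another thread's write.
__global__ void reverse_blocks(const float *in, float *out, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();  // all writes must land before any cross-thread read
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```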
Last thing I did on the GPU was a self-modelling 2D Finite Impulse Response filter that could deconvolve an image given a 2D non-symmetrical Airy disk. It's so fast because everything processes in parallel and uses local memory where possible, even with the filter split up to maximise parallel operations and data independence.
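Nothing like the real thing, but the basic shape of a tiled 2D FIR pass looks something like this (CUDA sketch; the radius, tile size, and constant-memory taps are all illustrative, and it assumes 16x16 thread blocks):

```cuda
#define R    4   // filter radius (a 9x9 tap filter) - illustrative only
#define TILE 16  // launch with 16x16 thread blocks

__constant__ float d_taps[(2 * R + 1) * (2 * R + 1)];

// Each block stages a (TILE+2R)^2 tile plus halo in shared memory, then
// every thread computes one output pixel entirely from that local copy.
__global__ void fir2d(const float *in, float *out, int w, int h)
{
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];
    int bx = blockIdx.x * TILE, by = blockIdx.y * TILE;

    // Cooperative load of tile + halo, clamping at the image border.
    for (int dy = threadIdx.y; dy < TILE + 2 * R; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2 * R; dx += TILE) {
            int sx = min(max(bx + dx - R, 0), w - 1);
            int sy = min(max(by + dy - R, 0), h - 1);
            tile[dy][dx] = in[sy * w + sx];
        }
    __syncthreads();

    int x = bx + threadIdx.x, y = by + threadIdx.y;
    if (x >= w || y >= h) return;

    float acc = 0.0f;
    for (int ky = 0; ky < 2 * R + 1; ++ky)
        for (int kx = 0; kx < 2 * R + 1; ++kx)
            acc += d_taps[ky * (2 * R + 1) + kx]
                 * tile[threadIdx.y + ky][threadIdx.x + kx];
    out[y * w + x] = acc;
}
```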
Wow - impressive