Pass Array of Structures (AoS) to CUDA Kernel? Only element 0 works

panyan · 17 Feb 2016 at 11:37

I want to pass an array of structures to a CUDA Kernel.

My structure:

Code:

    struct Group_Output_Places
{
    float Parameter[3];
    int Place_ID[3];
};

Host device makes the AoS

Code:

struct Group_Output_Places Group_Places[31]; // 31 places

    Group_Places[0].Parameter[0] = 360.2f; // f at the end tells it it is a float so it doesnt complain about it being a double
    Group_Places[0].Place_ID[0] = 1;

    Group_Places[0].Parameter[1] = 128.4f;
    Group_Places[0].Place_ID[1] = 2;
...
struct Group_Output_Places *Dev_Group_Places;

cudaMalloc((void**)&Dev_Group_Places, sizeof(struct Group_Output_Places)* 31);

cudaMemcpy(Dev_Group_Places, &Group_Places, sizeof(struct Group_Output_Places)* 31, cudaMemcpyHostToDevice); // sizeof(Group_Output_Places)* 31 becuase it is an array

AddInts << <1, 1 >> >(Dev_Group_Places);

And then to see if it copied to the Kernel correctly:

Code:

__global__ void AddInts(struct Group_Output_Places *Dev_Group_Places){
    struct Group_Output_Places GPU_Group_Places;
    GPU_Group_Places = *Dev_Group_Places;
}

The problem is that only the first element of Group_Places gets to the Kernel. How can I get the whole AoS to go over to the kernel?

djabs786 · 17 Feb 2016 at 12:12

Don't know if this would make a difference, but do you need to use the struct keyword when getting the size of the struct?
it makes more sense to just have:

Code:

cudaMalloc((void**)&Dev_Group_Places, sizeof(Group_Output_Places)* 31);
cudaMemcpy(Dev_Group_Places, &Group_Places, sizeof(Group_Output_Places)* 31, cudaMemcpyHostToDevice);

And the same with your kernel argument and definition of GPU_Group_Places, assuming you have the struct defined in your kernel file.

panyan · 17 Feb 2016 at 12:20

djabs786 said:
Don't know if this would make a difference, but do you need to use the struct keyword when getting the size of the struct?

I dont really know. Ive been googling and some people seem to write it, some dont.

Scotty123 · 17 Feb 2016 at 12:46

Just make a vector of structs host side using thrust library (as I told you

). Then you can copy it in one simple statement - DVector = HVector. Then you'll have the array (since vectors don't exist in devices) in the device and can access it with normal array notation, e.g. MyArray[20].Parameter[2].

Thrust makes a lot of things easier and quicker to do. I highly recommend it. You can do iterator stuff, lambdas, reductions and more.

NickK · 17 Feb 2016 at 13:07

From a low level - array structures, alignment etc can all play a problem. Typically if element 0 works then it's a memory representation issue. Seems libraries exist to make that a non-issue

One thing to think about is the format of how the array is accessed by hardware - i.e. which indices are sequenced to get the maximum GPU performance. This may differ from the target hardware and between devices depending on memory subsystems. Sometimes it's better accessing the data grouped together so that each core isn't having to gather across a large memory area.

OpenCL is .. well like being in the stone age

Scotty123 · 17 Feb 2016 at 13:16

I found for lattice boltzmann it was actually better to have structure of arrays (x1, x2, y1, y2,) instead of array of structures (x1, y1, x2, y2). It's always problem dependant. No doubt someone's already figured out the best ways for your specific problem at hand. Google memory coalescing if you want to know more.

NickK · 17 Feb 2016 at 13:28

panyan said:
The problem is that only the first element of Group_Places gets to the Kernel. How can I get the whole AoS to go over to the kernel?

Hmm the way that GPUs work is that you don't pass a full array to the kernel to then iterate over. The parallelism occurs outside but the kernel knows it's offset within the group.

Example as a pseudo code:
Define Array a[1..100]
Define Array output[1..100]
Execute Kernel MyKernel(a, 1, 100, output)

Define MyKernel(Array *a, Array *out) {
position = find group position(); // get the kernel position in the 1 to 100 range
dataElement = a[position];
out[position] = dataElement
}

Here for each 1-100 value the kernel is executed by a core - this can be between 1-100 in parallel executions.
The find group position gives the index for that execution - typically everything is build around it. You can gather using any a[1.100] index you want but writing needs to really be parallel out[position] otherwise the memory may be over written if all the kernels simply wrote to out[0].

My experience is in OpenCL but this seems odd:
GPU_Group_Places = *Dev_Group_Places;

You're getting a location value then setting GPU_Group_Places - so how does the GPU core know which element to access.. this would (as a C program) simply read element 0 each time and copy to element 0.
Then every core will do the same.. so are you sure there's no index or subscript required for that?

panyan · 17 Feb 2016 at 15:23

Scotty123 said:
Just make a vector of structs host side using thrust library (as I told you ).

The problem is that my knowledge of C/C++/CUDA is very limited. I appreciate you telling me what to do, the problem is that I didnt understand and certainly didnt know how to implement it.

Then you can copy it in one simple statement - DVector = HVector.

Excellent, thats nice and easy.

Then you'll have the array (since vectors don't exist in devices) in the device and can access it with normal array notation, e.g. MyArray[20].Parameter[2].

Perfect, thats exactly what I want.

Thrust makes a lot of things easier and quicker to do. I highly recommend it. You can do iterator stuff, lambdas, reductions and more.

NickK said:
From a low level - array structures, alignment etc can all play a problem. Typically if element 0 works then it's a memory representation issue. Seems libraries exist to make that a non-issue

Yes, it seems the Thrust library is the one I should try to get working.

One thing to think about is the format of how the array is accessed by hardware - i.e. which indices are sequenced to get the maximum GPU performance. This may differ from the target hardware and between devices depending on memory subsystems. Sometimes it's better accessing the data grouped together so that each core isn't having to gather across a large memory area.

Something to consider once im past the current hurdle I think

OpenCL is .. well like being in the stone age

Scotty123 said:
I found for lattice boltzmann it was actually better to have structure of arrays (x1, x2, y1, y2,) instead of array of structures (x1, y1, x2, y2). It's always problem dependant. No doubt someone's already figured out the best ways for your specific problem at hand. Google memory coalescing if you want to know more.

Structures of arrays? Were they better than device vectors from the thrust library?

NickK said:
Hmm the way that GPUs work is that you don't pass a full array to the kernel to then iterate over. The parallelism occurs outside but the kernel knows it's offset within the group.

I understand that. For this particular problem the threads need to read numbers from a database (array/vector/structure) to then do their computation.

Example as a pseudo code:
Define Array a[1..100]
Define Array output[1..100]
Execute Kernel MyKernel(a, 1, 100, output)

Define MyKernel(Array *a, Array *out) {
position = find group position(); // get the kernel position in the 1 to 100 range
dataElement = a[position];
out[position] = dataElement
}

Here for each 1-100 value the kernel is executed by a core - this can be between 1-100 in parallel executions.
The find group position gives the index for that execution - typically everything is build around it. You can gather using any a[1.100] index you want but writing needs to really be parallel out[position] otherwise the memory may be over written if all the kernels simply wrote to out[0].

My experience is in OpenCL but this seems odd:
GPU_Group_Places = *Dev_Group_Places;

You're getting a location value then setting GPU_Group_Places - so how does the GPU core know which element to access.. this would (as a C program) simply read element 0 each time and copy to element 0.
Then every core will do the same.. so are you sure there's no index or subscript required for that?

Scotty123 · 17 Feb 2016 at 18:39

It may be MyArray[20]->Parameter[2] not MyArray[20].Parameter[2] actually. I've not done C++ since summer 2015. My excuse

. It's a pointer to a structure so should be arrow notation. I've been doing too much Python and C# recently...

azteched · 17 Feb 2016 at 19:15

panyan said:
And then to see if it copied to the Kernel correctly:

Code:

__global__ void AddInts(struct Group_Output_Places *Dev_Group_Places){ struct Group_Output_Places GPU_Group_Places; GPU_Group_Places = *Dev_Group_Places; }

The problem is that only the first element of Group_Places gets to the Kernel. How can I get the whole AoS to go over to the kernel?

What NickK said.. this code just dereferences the pointer to get the first structure, surely. Unless there's some CUDA magic (I've only used it very lightly ages ago, mostly do plain old CPU threading).

What happens if you access another structure at *(Dev_Group_Places + <nonzero offset>)?

panyan · 17 Feb 2016 at 21:25

Scotty123 said:
It may be MyArray[20]->Parameter[2] not MyArray[20].Parameter[2] actually. I've not done C++ since summer 2015. My excuse . It's a pointer to a structure so should be arrow notation. I've been doing too much Python and C# recently...

azteched said:
What happens if you access another structure at *(Dev_Group_Places + <nonzero offset>)?

Ill test both these things tomorrow and report back

NickK · 17 Feb 2016 at 21:35

Code:

// Device code __global__ void VecAdd(float* A, float* B, float* C, int N) {
int i = blockDim.x * [B]blockIdx.x[/B] + [B]threadIdx.x[/B];
if (i < N) C[i] = A[i] + B[i];
}

CUDA still needs a the subscripting - blockIdx & threadIdx. So your kernel code needs updating.
Also I think your call to kick the kernel is also <<<1,1>>> which is likely to only execute one kernel.

Code:

// Host code int main() {
int N = ...; size_t size = N * sizeof(float); // Allocate input vectors h_A and h_B in host memory
float* h_A = (float*)malloc(size);
float* h_B = (float*)malloc(size);
// Initialize input vectors ...
// Allocate vectors in device memory
float* d_A; cudaMalloc(&d_A, size); float* d_B;
cudaMalloc(&d_B, size); float* d_C;
cudaMalloc(&d_C, size);
// Copy vectors from host memory to device memory
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Invoke kernel int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

VecAdd<<<[B]blocksPerGrid[/B], [B]threadsPerBlock[/B]>>>(d_A, d_B, d_C, N);

// Copy result from device memory to host memory
// h_C contains the result in host memory
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
// Free host memory ...
}

So you need the following to cover 1-30... something like this:

AddInts << <1, 30 >> >(Dev_Group_Places);

Code:
__global__ void AddInts(struct Group_Output_Places *Dev_Group_Places){
struct Group_Output_Places GPU_Group_Places;
GPU_Group_Places[threadIdx.x] = *Dev_Group_Places[threadIdx.x];
}

panyan · 18 Feb 2016 at 10:05

NickK said:
Code:

// Device code __global__ void VecAdd(float* A, float* B, float* C, int N) { int i = blockDim.x * [B]blockIdx.x[/B] + [B]threadIdx.x[/B]; if (i < N) C[i] = A[i] + B[i]; }

CUDA still needs a the subscripting - blockIdx & threadIdx. So your kernel code needs updating.
Also I think your call to kick the kernel is also <<<1,1>>> which is likely to only execute one kernel.

Code:

// Host code int main() { int N = ...; size_t size = N * sizeof(float); // Allocate input vectors h_A and h_B in host memory float* h_A = (float*)malloc(size); float* h_B = (float*)malloc(size); // Initialize input vectors ... // Allocate vectors in device memory float* d_A; cudaMalloc(&d_A, size); float* d_B; cudaMalloc(&d_B, size); float* d_C; cudaMalloc(&d_C, size); // Copy vectors from host memory to device memory cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice); // Invoke kernel int threadsPerBlock = 256; int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock; VecAdd<<<[B]blocksPerGrid[/B], [B]threadsPerBlock[/B]>>>(d_A, d_B, d_C, N); // Copy result from device memory to host memory // h_C contains the result in host memory cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost); // Free device memory cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); // Free host memory ... }

So you need the following to cover 1-30... something like this:

AddInts << <1, 30 >> >(Dev_Group_Places);

Code:
__global__ void AddInts(struct Group_Output_Places *Dev_Group_Places){
struct Group_Output_Places GPU_Group_Places;
GPU_Group_Places[threadIdx.x] = *Dev_Group_Places[threadIdx.x];
}

Hi Nickk - I appreciate that you are trying to help with this matter, however I think there is some confusion.

At the moment I need the whole data set going to each thread. I do not want each thread to work on an individual element of the data set. I will probably need that at some point, but not at the moment.

From what I understand of your responses, what you are trying to explain is the standard CUDA way where you give each thread an element of a vector to do something with. What I need is the whole vector going to each thread, hence my test scenario using the <<<1,1>>> kernel configuration.

NickK · 18 Feb 2016 at 14:11

panyan said:
Hi Nickk - I appreciate that you are trying to help with this matter, however I think there is some confusion.

At the moment I need the whole data set going to each thread. I do not want each thread to work on an individual element of the data set. I will probably need that at some point, but not at the moment.

From what I understand of your responses, what you are trying to explain is the standard CUDA way where you give each thread an element of a vector to do something with. What I need is the whole vector going to each thread, hence my test scenario using the <<<1,1>>> kernel configuration.

Personally I'd switch to typedef struct a { elements; } type; I never liked seeing struct everywhere as you're using it as a type..

So you've created a block of data (assuming data alignment no problem), loaded the data, copied the data to the GPU's memory but in the kernel all you have is this:

GPU_Group_Places = *Dev_Group_Places;

The block you have coped is just that - a data block. It just so happens to contain a set of structures - it's not an array thus the system only knows it's a block.

When you do "GPU_Group_Places = *Dev_Group_Places" it will copy the structure point to. It will not copy the array (in fact it does not know it's an array nor does it know how big the array is).

Just like C - if you say struct and copy the struct by assigning (regardless of memory pointers) then you'll get the size of the single struct copied.

This is why it only works for the first point. To access the further structures you would need to assume the common data alignment between CPU and GPU infrastructures and try the very unsafe copy by adding a byte offset. The CUDA system is doing the alignment changes in the background (I hope).

Ie to test and use a single kernel core to copy all the data:
GPU_Group_Places = *Dev_Group_Places
becomes (assuming CUDA understands structs and alignment)
*(GPU_Group_Places + sizeof(struct ...) * member) = *(Dev_Group_Places+ sizeof(struct ...) * member)
Then you have to loop through 0-30 members of the blocked structures to copy each member.

Failing that:
(GPU_Group_Places + sizeof(struct ...) * member)->elementA =(Dev_Group_Places+ sizeof(struct ...) * member)->elementA;
(GPU_Group_Places + sizeof(struct ...) * member)->elementB =(Dev_Group_Places+ sizeof(struct ...) * member)->elementB;
etc then cycle through the member to create a form of array offset.

The difficulty is that this assumes the byte alignments between C and CUDA are the same.. like the old problem of putting an array of structs into a byte block, sending over TCP/IP to a system with a different architecture/compiler and then asking it to access the byte blocks in the same way using native offsets.. (can you tell I don't like it!)

To get away from this you'd need to work with CUDA's arrays which typically then show their ancestry of GPUs.. with dense-same type arrays.. (i.e. RGBA channels with int/floats only). You need to structure host->gpu and then gnu->host if you don't start adopting the changes through out your code..

panyan · 18 Feb 2016 at 15:06

NickK said:
Personally I'd switch to typedef struct a { elements; } type; I never liked seeing struct everywhere as you're using it as a type..

I dont know what typedef struct is - ill have a look into it. Before I knew about CUDA i thought i could just use a 3D vector to sort this problem out without any issues.

So you've created a block of data (assuming data alignment no problem), loaded the data, copied the data to the GPU's memory

At the moment the block of data isnt being populated correctly. But copying it to the GPUs memory seems fine.

but in the kernel all you have is this:

GPU_Group_Places = *Dev_Group_Places;

Yes, I thought i might have to de-reference the array to get it working in the kernel.

The block you have coped is just that - a data block. It just so happens to contain a set of structures - it's not an array thus the system only knows it's a block.

Right. I am much more similar with C++ vectors where when you pass a vector the whole thing is accessible by the function.

When you do "GPU_Group_Places = *Dev_Group_Places" it will copy the structure point to. It will not copy the array (in fact it does not know it's an array nor does it know how big the array is).

So I need to iterate through and copy the lot?

Just like C - if you say struct and copy the struct by assigning (regardless of memory pointers) then you'll get the size of the single struct copied.

Oh, so I need to copy it as if it were an array rather than a structure?

This is why it only works for the first point. To access the further structures you would need to assume the common data alignment between CPU and GPU infrastructures and try the very unsafe copy by adding a byte offset. The CUDA system is doing the alignment changes in the background (I hope).

Ie to test and use a single kernel core to copy all the data:
GPU_Group_Places = *Dev_Group_Places
becomes (assuming CUDA understands structs and alignment)
*(GPU_Group_Places + sizeof(struct ...) * member) = *(Dev_Group_Places+ sizeof(struct ...) * member)
Then you have to loop through 0-30 members of the blocked structures to copy each member.

Failing that:
(GPU_Group_Places + sizeof(struct ...) * member)->elementA =(Dev_Group_Places+ sizeof(struct ...) * member)->elementA;
(GPU_Group_Places + sizeof(struct ...) * member)->elementB =(Dev_Group_Places+ sizeof(struct ...) * member)->elementB;
etc then cycle through the member to create a form of array offset.

The difficulty is that this assumes the byte alignments between C and CUDA are the same.. like the old problem of putting an array of structs into a byte block, sending over TCP/IP to a system with a different architecture/compiler and then asking it to access the byte blocks in the same way using native offsets.. (can you tell I don't like it!)

Right. So iterating over it and copying each structure is a bad idea.

To get away from this you'd need to work with CUDA's arrays which typically then show their ancestry of GPUs..

CUDA's arrays? Can you explain that in more detail please?

with dense-same type arrays.. (i.e. RGBA channels with int/floats only).

So all the data types have to be the same.

You need to structure host->gpu and then gnu->host if you don't start adopting the changes through out your code..

I dont understand what you are saying here, sorry.

Just for reference, this is the current code:

Code:

__global__ void AddInts(Group_Output_Places *Dev_Group_Places){

		float a = Dev_Group_Places[0].Parameter[0];
}

vector<double> PN(vector<int> Insp) {

	// 1. Make the host structure
	
	thrust::host_vector<Group_Output_Places> Group_Places(31);			// host version
	thrust::device_vector<Group_Output_Places> d_Group_Places(31);		    // device version
	
	Group_Places[0].Parameter[0] = 102.3f; // f at the end tells it it is a float so it doesnt complain about it being a double
	Group_Places[0].Place_ID[0] = 1;

	Group_Places[0].Parameter[1] = 105.3f;
	Group_Places[0].Place_ID[1] = 2;

	Group_Places[0].Parameter[2] = 122.3f;
	Group_Places[0].Place_ID[2] = 6;

	Group_Places[1].Parameter[0] = 108.3f;
	Group_Places[1].Place_ID[0] = 2;

	Group_Places[1].Parameter[1] = 108.7f;
	Group_Places[1].Place_ID[1] = 6;

	Group_Places[1].Parameter[2] = 250.1f;
	Group_Places[1].Place_ID[2] = 7;

	d_Group_Places = Group_Places;

	AddInts << <1, 1 >> >(thrust::raw_pointer_cast(d_Group_Places.data()));

	vector<double> output(2, 888);

	return output;
}

NickK · 18 Feb 2016 at 21:23

Note thrust::raw_pointer_cast() returns int*

That may be confusing to the compiler if it creates prototypes at compile time.

Interesting: http://codeyarns.com/2011/04/09/how-to-pass-thrust-device-vector-to-kernel/

Looks like you're onto the right form.

I don't like having data formats anonymously embedded into byte arrays but that seems the way that CUDA does it.