Hello all,
So I've written a fair amount of OpenCL code to do various math operations on large integers. It all works fine when executed on a single CPU core (the functions implemented so far are not meant to be parallelized themselves; they are called from the parallelized function). Today I moved on to the part of the program that is meant to run in parallel. First I switched to the GPU and ran a hello-world program to make certain my host program was functioning correctly on it. I figured all the rest would work as well, but that's when I hit problems...

I've narrowed the problem down to what appears to be a difference in integer sizes. To test this, I wrote the small kernel shown below, which takes two arguments (an output buffer and an input buffer) and is supposed to copy the input buffer into the output buffer, but the copy does not come out correctly. Can anyone help out here?

Code :
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
__kernel void hello(__global uint * out, __global uint * in) {
    size_t tid = get_global_id(0);  // one work-item per element
    out[tid] = in[tid];             // straight 32-bit copy
}

In the host program, I have defined the input and output buffers as follows:

Code :
unsigned int * outH = new unsigned int[2];   // host-side output array
unsigned int * input = new unsigned int[2];  // host-side input array
input[0] = 555;
input[1] = 666;
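
For context, the host-side calls follow what I understand to be the standard pattern, roughly like the sketch below (simplified from memory; context, queue, kernel, and err come from setup code not shown, and error checking is omitted):

Code :
// create device buffers sized for two 32-bit values
cl_mem inBuf  = clCreateBuffer(context, CL_MEM_READ_ONLY,
                               2 * sizeof(cl_uint), NULL, &err);
cl_mem outBuf = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                               2 * sizeof(cl_uint), NULL, &err);

// copy the host input array to the device
err = clEnqueueWriteBuffer(queue, inBuf, CL_TRUE, 0,
                           2 * sizeof(cl_uint), input, 0, NULL, NULL);

// arguments in the order the kernel declares them: out, then in
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &outBuf);
err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &inBuf);

// one work-item per element
size_t globalSize = 2;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize,
                             NULL, 0, NULL, NULL);

// blocking read of the result back into outH
err = clEnqueueReadBuffer(queue, outBuf, CL_TRUE, 0,
                          2 * sizeof(cl_uint), outH, 0, NULL, NULL);

Every size in that chain should be 2 * sizeof(cl_uint), i.e. 8 bytes; if any of them were smaller, the data would come back truncated.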

When the output comes back, the zeroth element is 6619691 and the first element is 0. Looking at the bit pattern of the zeroth element (6619691 is 0x0065022B in hex), the 12 least significant bits match the bit pattern of 555 (0x22B), and the 4 bits above those are 0, so if I cast the zeroth element to a short it reads as 555 correctly. With this knowledge I queried the device's CL_DEVICE_ADDRESS_BITS, which returns 32, while my host system is 64-bit. I think this has something to do with my problem, but I can't justify it, as I'm pretty certain the address width would not change the size of an unsigned int. Can anyone offer some insight?
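
For what it's worth, here is the kind of quick check I can add on the host side to rule out a size mismatch (just a sketch; it assumes <cstdio> and CL/cl.h are included and runs after the read-back):

Code :
// print host vs. OpenCL integer sizes, then dump the result in hex
printf("sizeof(unsigned int) = %zu, sizeof(cl_uint) = %zu\n",
       sizeof(unsigned int), sizeof(cl_uint));
printf("outH[0] = 0x%08X, outH[1] = 0x%08X\n", outH[0], outH[1]);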

Thanks