Hey guys,

I have been looking for answers for about a week now and can't find anything useful, so here goes.

I have a kernel that takes a global float* as an input parameter and another as an output. Because of the massive number of global memory accesses, the CPU runs the algorithm faster than the GPU, and I need it the other way around. I tried passing in a local float* to stage temporary data from global memory in local memory, but that causes the code to error, and the output is exactly the same numbers as the last time I ran my program.

I tried this:

Code :
__kernel void simple(
	global const float* input1, // input
	global float* output,       // output
	constant float* input3,     // another input
	local float* tempArg,       // temp array
	private int numData,
	private int numData2)
{
	int index = get_global_id(0);
	...
	// for testing purposes
	tempArg[index] = index;
	write_mem_fence(CLK_GLOBAL_MEM_FENCE);
	...

	output[index] = tempArg[index]; // this is where it breaks, giving me incorrect values
	// output[index] = index; // works if I don't have the local arg in the kernel parameters
}

Is it because I am running out of memory, or is something else wrong? I am trying to make it faster, but it just keeps giving me garbage values.
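
One thing I am not sure about is the indexing. As far as I understand, a local buffer belongs to a single work-group, so if tempArg only holds one float per work-item in the group, indexing it with get_global_id(0) would run past its end in every group after the first. Here is a rough plain-C emulation of the staging pattern I am aiming for, just to pin down the arithmetic. Everything in it is my own assumption for illustration: LOCAL_SIZE, run_staged and local_index_in_bounds are made-up names, the work-group size of 4 is arbitrary, and the loops only stand in for work-items running on the device.

```c
#include <stddef.h>

/* Assumed work-group size; hypothetical value chosen for illustration. */
#define LOCAL_SIZE 4

/* Returns 1 if a GLOBAL id would still be a valid index into a scratch
   buffer that holds only LOCAL_SIZE elements, 0 otherwise. */
int local_index_in_bounds(size_t gid)
{
    return gid < LOCAL_SIZE;
}

/* Emulates a 1D launch of global_size work-items (assumed to be a multiple
   of LOCAL_SIZE). Each work-group gets its own scratch buffer, standing in
   for local memory, and indexes it by LOCAL id, not global id. */
void run_staged(float *output, size_t global_size)
{
    for (size_t group = 0; group < global_size / LOCAL_SIZE; ++group) {
        float scratch[LOCAL_SIZE]; /* one buffer per work-group */

        /* Phase 1: every work-item in the group fills its own slot. */
        for (size_t lid = 0; lid < LOCAL_SIZE; ++lid) {
            size_t gid = group * LOCAL_SIZE + lid;
            scratch[lid] = (float)gid;
        }

        /* In a kernel, a work-group barrier would separate the phases. */

        /* Phase 2: copy the staged values out at the global index. */
        for (size_t lid = 0; lid < LOCAL_SIZE; ++lid) {
            size_t gid = group * LOCAL_SIZE + lid;
            output[gid] = scratch[lid];
        }
    }
}
```

In the real kernel the scratch index would come from get_local_id(0), and the gap between the two loops would be a barrier(CLK_LOCAL_MEM_FENCE), if I am reading the spec right.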