I am attempting to implement an algorithm to solve ODEs in OpenCL. I've been struggling to find the best way to do this, and I have an idea in mind that I think should work in theory, but when I try even a simple case I get unexpected results. I would very much appreciate any help.

Here's what I have:

Code:

__kernel void calculate(__global float *n_array, __global float *n_avg, int step)
{
  int j = get_global_id(0);
  const float k = 1.0f; /* placeholder: my real code multiplies by some constants here */
  n_array[j] += n_array[j] * k;
  n_avg[step] += n_array[j];
}

I'm executing this with 1000 global work items (so that "j" ranges from 0-999), and the kernel is enqueued 1000 times in a for loop (so that it gets executed a total of 1000*1000 times). I place values in "n_array" initially with clEnqueueWriteBuffer, and that part is working correctly.

The issue is that n_avg ends up with very unexpected values. I suspect it has something to do with the buffer being shared memory, and perhaps the timing is such that values aren't being written properly. When I pick out a single value of j (i.e. with if(j == 10)), it works exactly as expected. The problem is adding all the values into the n_avg[step] element: say I have 1000 values of n_array[j] = 1, I would expect n_avg[step] to equal 1000 (since I added 1 to it 1000 times), but this is not always what actually occurs.

I know I may not have been clear here; I am still learning OpenCL, so I don't know exactly how to frame my question. If you think you can help but need more information, please let me know and I will do my best to provide it. Thank you very much for taking the time to look at my question.