PDA

View Full Version : Do I need atomic_add in that case?



linyufly
09-06-2012, 06:38 PM
Hi,

For example, given an integer array where every element is within [0..255], I want to get the number of repetitions of each value in [0..255].

__kernel void Histogram(__global int *arr, __global int *histo) {
int id = get_global_id(0);

histo[arr[id]]++;
}

Is it OK?

Thanks!

ibbles
09-07-2012, 12:51 AM
This sounds very much like a common histogram calculation. In OpenCL this is a non-trivial operation if it's to be done efficiently, but I think that a few of the SDKs include examples of this in their samples. However, an atomic add should work in you case, but it won't give very high performance.

linyufly
09-07-2012, 07:30 AM
Thanks ibbles!

Is it correct not to use atomic_add in this case?

ofer_rose
09-08-2012, 11:18 PM
Thanks ibbles!

Is it correct not to use atomic_add in this case?

No. you have to use some atomic mechanism to sync the reads and writes to histo[].

The operation of "histo[arr[id]]++;" (or any increment of global memory) is actually composed of three operations: read the value from global memory, increment it, write it back. In any multi-core device (even CPU) you have no guarantee that these three operations will be done atomically. :roll:

Using global atomic_inc in this case is feasible, but not the best way. As mentioned, there are examples in SDKs which show optimized ways to do histogram calculations (use local memory for middle-histogram calculations, do reduction in the end)

linyufly
09-09-2012, 08:23 AM
Thanks ibbles!

Is it correct not to use atomic_add in this case?

No. you have to use some atomic mechanism to sync the reads and writes to histo[].

The operation of "histo[arr[id]]++;" (or any increment of global memory) is actually composed of three operations: read the value from global memory, increment it, write it back. In any multi-core device (even CPU) you have no guarantee that these three operations will be done atomically. :roll:

Using global atomic_inc in this case is feasible, but not the best way. As mentioned, there are examples in SDKs which show optimized ways to do histogram calculations (use local memory for middle-histogram calculations, do reduction in the end)

Thanks ofer_rose!

I have a further question. If two threads write to one global memory address at the same time with different values, say, 3 and 4, will the result be anything other than 3 and 4?

Thanks again!

ofer_rose
09-10-2012, 11:21 AM
In theory you're right - but the "problem" is that GPUs are vector machines (also CPUs, if you use SSE/AVX).

The meaning is that threads are not executed solely, but in SIMD fashion (SIMT, if you insist on NVIDIA terminology) - a "packs" of threads, which is equal to the vector width, is running the same command for N threads simultaneously. For NVIDIA, this pack is called warps and the size is 32. For AMD its called wavefront and the size is 64.

So the sequence that I described (read/increment/write) is done by 32 or 64 threads simultaneously. Atomicity is preserved at the warp/wavefront granularity (read done simultaneously for all, increment done for all, etc.). the result is that the range of possible "jumps" in histogram bin value is more than 1 or 2 - pending on the number of threads from the same pack which hit the same bin.

linyufly
09-10-2012, 02:12 PM
In theory you're right - but the "problem" is that GPUs are vector machines (also CPUs, if you use SSE/AVX).

The meaning is that threads are not executed solely, but in SIMD fashion (SIMT, if you insist on NVIDIA terminology) - a "packs" of threads, which is equal to the vector width, is running the same command for N threads simultaneously. For NVIDIA, this pack is called warps and the size is 32. For AMD its called wavefront and the size is 64.

So the sequence that I described (read/increment/write) is done by 32 or 64 threads simultaneously. Atomicity is preserved at the warp/wavefront granularity (read done simultaneously for all, increment done for all, etc.). the result is that the range of possible "jumps" in histogram bin value is more than 1 or 2 - pending on the number of threads from the same pack which hit the same bin.

Hi ofer_rose,

Thanks!

I should have directly described my application. There are several properties, 0..N-1 and M objects. I have an int array for properties, A[0..N-1] and I use one thread for one object.

In the end of one thread I will know what property the object has, say, k, and then I let A[k] be the thread id (equal to the object id). If more than one object have the same property k, then A[k] can be any of them.

My original question should be: do I need atomic add when setting A[k] equal to the current thread id for correctness?

Thanks again!