Hi all,

Let's assume the following kernel:

Code :
#define nx      (signed)get_global_id(0)
#define ny      (signed)get_global_id(1)
#define Nx      (signed)get_global_size(0)
#define Ny      (signed)get_global_size(1)
 
__kernel void parallelSum(__global float* matrix, __global float* sum)
{
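    // Every work item performs a non-atomic read-modify-write on sum[0].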
    sum[0] += matrix[nx + Nx * ny];
}

Every work item has to do a read-modify-write on sum[0] at some point, and those accesses cannot happen in parallel (in fact, since the += is not atomic, they race with each other and the result is undefined). So there is not much parallelization to be gained in this example, right?
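For comparison, here is a minimal sketch of how such a sum is usually parallelized: a local-memory tree reduction per work-group, followed by a small second pass over the partial sums. This assumes the Nx * Ny matrix is launched as a flat 1D NDRange whose work-group size is a power of two; the kernel name parallelSumReduce and the buffers partialSums and scratch are placeholder names of my own, not from the original post.

Code :
// Sketch: per-work-group tree reduction (assumes local size is a power of two).
__kernel void parallelSumReduce(__global const float* matrix,
                                __global float* partialSums,
                                __local float* scratch)
{
    const size_t gid = get_global_id(0);
    const size_t lid = get_local_id(0);

    // Each work item copies one element into local memory.
    scratch[lid] = matrix[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction inside the work-group: log2(local size) parallel steps.
    for (size_t offset = get_local_size(0) / 2; offset > 0; offset /= 2)
    {
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work item 0 of each group writes that group's partial sum.
    if (lid == 0)
        partialSums[get_group_id(0)] = scratch[0];
}

Each work-group contributes one value to partialSums, so only get_num_groups(0) values remain; those can be summed on the host or by a second, much smaller kernel launch. (On OpenCL 2.0+ devices the built-in work_group_reduce_add can replace the hand-written loop.) The point is that the per-group reduction runs in O(log n) parallel steps instead of serializing every work item on a single memory location.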