The host provides a 3D vector field, stored as a 4D float array:

field[Nx][Ny][Nz][3]

The first three dimensions index the lattice; the fourth, of length 3, stores the three vector components x, y, z at a given lattice point. Before this structure is passed to the kernel, it is flattened to a 1D array of length 3*Nx*Ny*Nz. Inside the kernel, each lattice point (i.e. each vector) has to be iterated for, let's say, 10 steps. The catch: every iteration step needs the values of the adjacent lattice points (6 for each lattice point). Without this dependency I could just let each work-item run all 10 iteration steps for its lattice point, since the points would all be independent. With it, every lattice point has to reach the current iteration step before the next step can be done for any lattice point.
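
To make the layout and the dependency concrete, here is a minimal sketch of one iteration step in plain C (not OpenCL). It assumes the host flattens the array in row-major (C) order. The names `flat_index`, `clamp_coord` and `step`, the neighbour-averaging update and the clamped boundary are placeholders for illustration only; the real update rule is not part of the question.

```c
#include <stddef.h>

/* Flat index of component c (0..2) at lattice point (x, y, z).
   Assumption: field[Nx][Ny][Nz][3] is flattened in row-major order. */
static size_t flat_index(size_t x, size_t y, size_t z, size_t c,
                         size_t Ny, size_t Nz)
{
    return ((x * Ny + y) * Nz + z) * 3 + c;
}

/* Clamp a neighbour coordinate to the lattice (placeholder boundary rule). */
static size_t clamp_coord(long v, size_t n)
{
    if (v < 0) return 0;
    if ((size_t)v >= n) return n - 1;
    return (size_t)v;
}

/* One iteration step. Every value in `out` is computed from the six
   neighbours in `in`, so `in` must hold the complete previous step
   before any point of the next step can be computed. */
static void step(const float *in, float *out,
                 size_t Nx, size_t Ny, size_t Nz)
{
    static const int d[6][3] = {
        { 1, 0, 0}, {-1, 0, 0},
        { 0, 1, 0}, { 0, -1, 0},
        { 0, 0, 1}, { 0, 0, -1}
    };

    for (size_t x = 0; x < Nx; ++x)
    for (size_t y = 0; y < Ny; ++y)
    for (size_t z = 0; z < Nz; ++z)
    for (size_t c = 0; c < 3; ++c) {
        float sum = 0.0f;
        for (int n = 0; n < 6; ++n) {
            size_t nx = clamp_coord((long)x + d[n][0], Nx);
            size_t ny = clamp_coord((long)y + d[n][1], Ny);
            size_t nz = clamp_coord((long)z + d[n][2], Nz);
            sum += in[flat_index(nx, ny, nz, c, Ny, Nz)];
        }
        /* Placeholder update: average of the six neighbours. */
        out[flat_index(x, y, z, c, Ny, Nz)] = sum / 6.0f;
    }
}
```

Running 10 steps in this serial form means calling `step` 10 times and swapping `in` and `out` between calls. The question is how to get the same ordering when the loop body runs as parallel work-items.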

Is there a way to handle this in OpenCL? I'm not very experienced with it.