Dear OpenCL users, I recently ported a kernel from CUDA to OpenCL.
This kernel processes a 2D image (~512) and, for each pixel, fetches ~8000 coordinates from global memory.
It then uses these coordinates to fetch ~8000 times from the 2D image for each pixel.
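
To make the access pattern concrete, here is a rough CPU-side model in plain C. It is not the actual kernel. It assumes the coordinates are a shared list of (dx, dy) offsets relative to the current pixel, that out-of-range reads are clamped to the image edge, and that the fetched values are simply summed. None of this is stated above, and the names Offset, gather_pixel and gather_image are made up for the illustration.

```c
#include <assert.h>

/* Hypothetical coordinate entry: an offset relative to the current pixel. */
typedef struct { int dx, dy; } Offset;

/* Clamp v into [lo, hi], mimicking clamp-to-edge image addressing. */
static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Models one work-item: n coordinate reads plus n image reads for pixel (x, y).
   Summing the fetched values is an assumption made only for this sketch. */
static long long gather_pixel(const int *img, int w, int h,
                              const Offset *offs, int n, int x, int y) {
    assert(w > 0 && h > 0);
    long long acc = 0;
    for (int i = 0; i < n; ++i) {
        int sx = clampi(x + offs[i].dx, 0, w - 1);  /* coordinate fetch */
        int sy = clampi(y + offs[i].dy, 0, h - 1);
        acc += img[sy * w + sx];                    /* scattered image fetch */
    }
    return acc;
}

/* Models the whole launch: one gather_pixel call per pixel of a w x h image. */
static void gather_image(const int *img, long long *out, int w, int h,
                         const Offset *offs, int n) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = gather_pixel(img, w, h, offs, n, x, y);
}
```

With n around 8000, each pixel issues about 8000 coordinate reads and 8000 image reads, which is consistent with a memory-bound profile.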

The profiler says the bottleneck is memory fetches, not ALU operations.
On an Nvidia GTX 570, the kernel performs identically in CUDA and OpenCL.
When running on a Radeon 7850 (which I expected to perform close to the GTX 570), the code is 5 times slower.

I changed my code to use shared memory (local memory in OpenCL terms) and reduce the number of global memory fetches.
Now the profiler says the bottleneck is ALU ops.
But the 7850 is still 2.5 times slower than the GTX 570.

Any tips regarding:
- the reason why ATI is slower for this kind of kernel
- optimizing this kernel for ATI (my coordinate array is constant across all kernel launches)

PS: the 2D image is in fact a 32-bit greyscale picture.
I'm currently using a CL_R / CL_SIGNED_INT32 image format.
Could this explain the poor performance of my read_imagei() calls?

PPS: I changed this to CL_ARGB and updated the kernel to handle 4 consecutive pixels. Performance is the same.

Thanks a lot for your help.