One of the problems I've found when optimizing OpenCL programs is the lack of explicit cache control. For instance, ATI hardware caches both global memory and textures, NVIDIA G80/GT200 caches only textures, and Fermi is configurable.
You should add a buffer flag hint to indicate whether you want the hardware to cache reads, writes, or nothing. The implementation would have the last word, but the hint would be good for future hardware. That way the hardware can better know the programmer's intentions.
For example, allow us to use a new flag (CL_CACHED) in addition to the existing CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY and CL_MEM_READ_WRITE.
If CL_CACHED is not present, the buffer won't be cached, as is logical.
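
To make the idea concrete, here is a rough sketch in C of how such a hint might be combined with the access flags and read back by an implementation. CL_CACHED does not exist in OpenCL; its bit value, the cache_hint enum and the cache_hint_from_flags helper are all hypothetical names invented for illustration. The mapping from access mode to "cache reads" or "cache writes" is one possible reading of the proposal, not something the spec defines. The three access flags are redefined locally (with the same values as in CL/cl.h) so the sketch compiles without an OpenCL SDK.

```c
#include <stdint.h>

/* Local stand-ins mirroring the real values in CL/cl.h. */
#define CL_MEM_READ_WRITE (1ULL << 0)
#define CL_MEM_WRITE_ONLY (1ULL << 1)
#define CL_MEM_READ_ONLY  (1ULL << 2)

/* HYPOTHETICAL: the proposed hint. Not part of any OpenCL version.
   The bit position is arbitrary, chosen only to avoid the low bits. */
#define CL_CACHED         (1ULL << 40)

/* HYPOTHETICAL: what an implementation might derive from the flags. */
typedef enum {
    CACHE_NONE,    /* hint absent: do not cache this buffer */
    CACHE_READS,   /* read-only buffer with the hint */
    CACHE_WRITES,  /* write-only buffer with the hint */
    CACHE_BOTH     /* read-write buffer with the hint */
} cache_hint;

/* Interpret the combined flags. The result is only a hint: a real
   implementation would keep the last word and could ignore it. */
static cache_hint cache_hint_from_flags(uint64_t flags)
{
    if (!(flags & CL_CACHED))
        return CACHE_NONE;
    if (flags & CL_MEM_READ_ONLY)
        return CACHE_READS;
    if (flags & CL_MEM_WRITE_ONLY)
        return CACHE_WRITES;
    /* No access flag given defaults to read-write in OpenCL. */
    return CACHE_BOTH;
}

/* With a real SDK, the application side would then look like:
 *
 *   cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_CACHED,
 *                               size, NULL, &err);
 *
 * which fails today with CL_INVALID_VALUE on a conforming
 * implementation, because the flag is not recognized.
 */
```

Under these assumptions, cache_hint_from_flags(CL_MEM_READ_ONLY | CL_CACHED) yields CACHE_READS, while cache_hint_from_flags(CL_MEM_READ_ONLY) yields CACHE_NONE, matching the rule that a buffer without the flag is not cached.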