I have some experience with NVIDIA CUDA and recently switched to OpenCL. So far I have always targeted NVIDIA's architecture and optimized my kernels accordingly: coalesced memory access, no divergent branches within warps, avoiding shared (local, in OpenCL terminology) memory bank conflicts, etc.
Now I would like to write OpenCL kernels in such a way that they achieve optimal performance on both NVIDIA's and ATI's architectures. Is that even possible?
I don't know the ATI architecture or ATI Stream, and I have never used either. Is ATI's architecture similar to NVIDIA's? Do both require the same optimization techniques from the programmer? What are the main differences?