Hello,

I'm trying to optimize an OpenCL algorithm that performs some transformations on a matrix. I'm running it on an AMD GPU. The current version uses 1 x 256 threads (this is the local size of a...