In my code, several threads need to read a large number of variables from global memory, and every thread reads the same addresses. Unfortunately, the variables are too large to all fit in local memory. As a result, reading them takes about 80% of the run time, even though these reads make up less than 5% of the instructions.
Can anyone suggest a way to speed up the access to these shared variables?
(My procedure is somewhat similar to the multiplication of two matrices.)
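
To make the access pattern concrete, here is a simplified sketch in plain C, with a serial loop standing in for the threads. This is not my actual kernel: the names `thread_body`, `run_all`, `N_THREADS` and `N_SHARED` and the sizes are made up for illustration, and the real data is far larger.

```c
#include <stddef.h>

#define N_THREADS 4   /* hypothetical number of threads */
#define N_SHARED  3   /* hypothetical number of shared variables */

/* Work done by one thread: it combines its own row of data with every
   shared value. On the device, each shared_vals[k] access is a read
   from global memory, and every thread reads the same addresses. */
static float thread_body(size_t tid, const float *private_rows,
                         const float *shared_vals)
{
    float acc = 0.0f;
    for (size_t k = 0; k < N_SHARED; ++k) {
        acc += private_rows[tid * N_SHARED + k] * shared_vals[k];
    }
    return acc;
}

/* Serial stand-in for the kernel launch: one loop iteration per thread. */
void run_all(const float *private_rows, const float *shared_vals, float *out)
{
    for (size_t tid = 0; tid < N_THREADS; ++tid) {
        out[tid] = thread_body(tid, private_rows, shared_vals);
    }
}
```

The arithmetic per thread is trivial; the cost I am asking about is the repeated reads of `shared_vals`, which in my case does not fit in local memory in one piece.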