I've read somewhere (some forum I cannot recall right now) that statically allocating local memory ("shared" memory in NVIDIA CUDA nomenclature) inside the kernel, like below, should be avoided since it's implementation dependent:

Code :__local float s_elData[32];

Dynamic allocation using a kernel argument and clSetKernelArg should be used instead:

Code :__kernel void kernelP1( __local float* s_elData, //...

and (in host code):

Code :clSetKernelArg(kernel, 1, 32 * sizeof(float), NULL);
Unfortunately, when I use the latter method my register usage increases from 14 to 19, with no other change in the code, just the way of allocation. So I'd rather stick with the former, static, method of allocation. Is it safe, or should it really be avoided?
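For reference, here is a minimal device-code sketch of the two variants side by side. The kernel name and the 32-element size come from the snippets above; the kernel body (a simple load into local memory plus a barrier) is a hypothetical illustration, not my actual kernel:

```c
// OpenCL C device code -- illustrative sketch only.

// Variant 1: static allocation inside the kernel body.
// The size is fixed at kernel compile time.
__kernel void kernelP1_static(__global const float* in)
{
    __local float s_elData[32];              // static local allocation
    int lid = get_local_id(0);
    s_elData[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);            // make the data visible to the work-group
    // ... use s_elData ...
}

// Variant 2: dynamic allocation via a __local kernel argument.
// The size is chosen by the host at clSetKernelArg time:
//   clSetKernelArg(kernel, argIndex, 32 * sizeof(float), NULL);
// (a NULL arg_value with a nonzero size is how the OpenCL API
// requests a local-memory buffer of that size).
__kernel void kernelP1(__local float* s_elData,   // sized by the host
                       __global const float* in)
{
    int lid = get_local_id(0);
    s_elData[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    // ... use s_elData ...
}
```

The practical difference is that variant 2 lets the host pick the local buffer size per enqueue (e.g. to match the work-group size) without recompiling the kernel, at the cost of the compiler not knowing the size statically.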