Hi,

Before I start, I just want to say that (almost) everything works fine, so this is really a question about what you think of this particular subject.
Let's say that I have a 1D buffer of size N that I "cover" with 1D thread blocks of 128 threads (local size). Each thread divides the angle range [0, 2*pi] into 16 sectors. For each thread, I do something like this:
Code :
const float sector_size = 2.f * M_PI_F / 16;
for (int i=0; i<16; ++i) {
    float sin_i = sin(i*sector_size);
    float cos_i = cos(i*sector_size);
    (...)
}
As you can see, it's pretty straightforward. No need to add more details.
Then I thought it was pretty stupid to have every thread compute the same sin and cos values. I can just precompute them and put them into a local array that I fill in parallel using the first 16 threads of my group:
Code :
__local float sin_array[16];
__local float cos_array[16];
if (thread_id<16) {
    const float sector_size = 2.f * M_PI_F / 16;
    sin_array[thread_id] = sin(thread_id * sector_size);
    cos_array[thread_id] = cos(thread_id * sector_size);
}
barrier(CLK_LOCAL_MEM_FENCE);
 
(...)
 
for (int i=0; i<16; ++i) {
    float sin_i = sin_array[i];
    float cos_i = cos_array[i];
    ...
}
This works fine but it doesn't speed up anything. I'm used to being surprised in GPU coding. I guess the barrier here cancels the benefit of precomputing the array values.
Then I thought "why don't I initialize the arrays by hand?". So I tried the following approach:

Code :
__local float sin_array[16] = { 0.000000f,  0.382683f,  0.707107f,  0.923880f,
                                1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                0.000000f, -0.382683f, -0.707107f, -0.923880f,
                               -1.000000f, -0.923880f, -0.707107f, -0.382683f};
 
__local float cos_array[16] = { 1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                0.000000f, -0.382683f, -0.707107f, -0.923880f,
                               -1.000000f, -0.923880f, -0.707107f, -0.382683f,
                               -0.000000f,  0.382683f,  0.707107f,  0.923880f};
 
(...)
 
for (int i=0; i<16; ++i) {
    float sin_i = sin_array[i];
    float cos_i = cos_array[i];
    ...
}
We don't have a barrier here, so it should be faster, no? The problem: sin_array and cos_array are not filled correctly. So this is my main question: why?

The second question, once this one is solved: is it better to leave these arrays in __local memory, or should I put them in __constant memory (for instance, declared before the function definition)?
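
To make the second question concrete, here is a rough sketch of what I mean by the __constant version. The kernel name my_kernel and its arguments (out, n) are placeholders I made up for this post; only the two tables come from the code above. As far as I understand, program-scope __constant arrays are allowed to have an initializer, but I have not measured this variant.

```c
// Sketch only: lookup tables declared at program scope, before the kernel.
// __constant variables at program scope may carry an initializer.
__constant float sin_table[16] = { 0.000000f,  0.382683f,  0.707107f,  0.923880f,
                                   1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                   0.000000f, -0.382683f, -0.707107f, -0.923880f,
                                  -1.000000f, -0.923880f, -0.707107f, -0.382683f};

__constant float cos_table[16] = { 1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                   0.000000f, -0.382683f, -0.707107f, -0.923880f,
                                  -1.000000f, -0.923880f, -0.707107f, -0.382683f,
                                  -0.000000f,  0.382683f,  0.707107f,  0.923880f};

// Hypothetical kernel signature, for illustration only.
__kernel void my_kernel(__global float *out, const int n)
{
    const int gid = get_global_id(0);
    if (gid >= n) return;   // guard the tail when n is not a multiple of 128

    for (int i = 0; i < 16; ++i) {
        float sin_i = sin_table[i];   // read-only lookup, no barrier needed
        float cos_i = cos_table[i];
        // ... same per-sector work as in the versions above ...
    }
}
```

Since every thread reads the same index i at the same time, I would expect this access pattern to suit constant memory, but that is an assumption on my side.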

Many thanks,

Vincent