Hi,

Before I start, I just want to say that (almost) everything works fine, so this is just a question about what you think of this particular subject.

Let's say that I have a 1D buffer of size N that I "cover" with 1D thread blocks of 128 threads (local size). Each thread divides the angle range [0, 2*pi] into 16 sectors. For each thread, I do something like this:

Code :
const float sector_size = 2.f * M_PI_F / 16;
for (int i=0; i<16; ++i)
{
    float sin_i = sin(i*sector_size);
    float cos_i = cos(i*sector_size);
    (...)
}

As you can see, it's pretty straightforward. No need to add more details.

Then I thought it was pretty stupid to compute sin and cos many times. I can just pre-compute them and put them into a local array that I fill in parallel using the first 16 threads of my group:

Code :
__local float sin_array[16];
__local float cos_array[16];
if (thread_id<16)
{
    const float sector_size = 2.f * M_PI_F / 16;
    sin_array[thread_id] = sin(thread_id * sector_size);
    cos_array[thread_id] = cos(thread_id * sector_size);
}
barrier(CLK_LOCAL_MEM_FENCE);
(...)
for (int i=0; i<16; ++i)
{
    float sin_i = sin_array[i];
    float cos_i = cos_array[i];
    ...
}

This works fine, but it doesn't speed anything up. I'm used to being surprised in GPU coding. I guess the barrier here cancels the benefit of precomputing the array values.

Then I thought "why don't I initialize the array by hand?". So I tried the following approach :

Code :
__local float sin_array[16] = {
     0.000000f,  0.382683f,  0.707107f,  0.923880f,
     1.000000f,  0.923880f,  0.707107f,  0.382683f,
     0.000000f, -0.382683f, -0.707107f, -0.923880f,
    -1.000000f, -0.923880f, -0.707107f, -0.382683f};
__local float cos_array[16] = {
     1.000000f,  0.923880f,  0.707107f,  0.382683f,
     0.000000f, -0.382683f, -0.707107f, -0.923880f,
    -1.000000f, -0.923880f, -0.707107f, -0.382683f,
    -0.000000f,  0.382683f,  0.707107f,  0.923880f};
(...)
for (int i=0; i<16; ++i)
{
    float sin_i = sin_array[i];
    float cos_i = cos_array[i];
    ...
}

We don't have a barrier here so it should be faster no? Problem : sin_array and cos_array are not filled correctly. So this is my main question: Why?

The second question, if this one is solved, is: is it better to leave these arrays in __local memory, or should I put them in __constant memory (for instance, declared before the function definition)?

Many thanks,

Vincent