Hi,

If I run this test kernel, where input_data and output_data are pointers to global floats:

Code :
float test;
for (int i = 0; i < 1024; i++)
{
  barrier(CLK_GLOBAL_MEM_FENCE);
  float f = *(input_data + i);      // loaded, but f is never used
  test = (float)get_local_id(0);
}
barrier(CLK_GLOBAL_MEM_FENCE);
*(output_data + get_local_id(0)) = test;

then openclprof tells me that my global memory accesses are coalesced.

But if I run this kernel:

Code :
float test;
for (int i = 0; i < 1024; i++)
{
  barrier(CLK_GLOBAL_MEM_FENCE);
  float f = *(input_data + i);      // every work-item reads the same element
  //test = (float)get_local_id(0);
  test = f;                         // only change: keep the loaded value
}
barrier(CLK_GLOBAL_MEM_FENCE);
*(output_data + get_local_id(0)) = test;

then my global accesses aren't coalesced. This has me stumped. As far as I can tell, in each loop iteration every thread reads the same input_data address, and in the final store the threads write to adjacent output_data addresses. I believe the addresses are correctly aligned in both cases as well. So why does assigning the loaded value f to test inside the loop make my accesses uncoalesced?
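
To spell out the access pattern I am assuming, here is a minimal host-side sketch in plain C. It is not OpenCL, and it only models the indexing in the kernels above, not what the hardware or openclprof actually does. The helper names (read_offset, write_offset, reads_are_uniform, writes_are_contiguous) are made up for illustration, and offsets are in bytes relative to the start of each buffer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: byte offset into input_data that a work-item
   reads in loop iteration i. It does not depend on the local id. */
size_t read_offset(size_t local_id, size_t i)
{
    (void)local_id;              /* every work-item reads input_data[i] */
    return i * sizeof(float);
}

/* Byte offset into output_data that a work-item writes in the final
   store, i.e. output_data[local_id]. */
size_t write_offset(size_t local_id)
{
    return local_id * sizeof(float);
}

/* True if all work-items in the group read the same offset in iteration i. */
bool reads_are_uniform(size_t group_size, size_t i)
{
    assert(group_size > 0);
    for (size_t lid = 1; lid < group_size; lid++) {
        if (read_offset(lid, i) != read_offset(0, i))
            return false;
    }
    return true;
}

/* True if consecutive work-items write consecutive floats. */
bool writes_are_contiguous(size_t group_size)
{
    assert(group_size > 0);
    for (size_t lid = 1; lid < group_size; lid++) {
        if (write_offset(lid) != write_offset(lid - 1) + sizeof(float))
            return false;
    }
    return true;
}
```

Under this model, reads_are_uniform(16, i) holds for every i from 0 to 1023 and writes_are_contiguous(16) holds as well, where 16 is just an example group size. That is the pattern I expected to be reported as coalesced in both kernels.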

What am I missing here?