I've got a simple test kernel that writes into local memory, and then copies the data to an output buffer in global memory:

Code :
__kernel void foo( __global float *debug_data, __local float *shared_segment)
{
   // works
   int tid = get_local_id(1);                       // index within the 1x256 work-group
   *(shared_segment + tid) = tid;                   // write to local memory
   *(debug_data + tid) = *(shared_segment + tid);   // copy back out to global memory
}

This runs on a two-dimensional grid of 16K by 256, with a local work-group size of 1x256. For the shared memory size, I worked out 16K/256 = 64 bytes per thread in a work-group. This behaves the way I expect: I get back 256 floats in debug_data, with values running from 0 to 255.

But now if I add a line to my test kernel:

Code :
__kernel void foo( __global float *debug_data, __local float *shared_segment)
{
   // doesn't work
   int tid = get_local_id(1);                       // index within the 1x256 work-group
   *(shared_segment + tid) = tid;
   *(shared_segment + tid + 256) = tid;             // new line
   *(debug_data + tid) = *(shared_segment + tid);
}

I get back 24 floats of valid values in debug_data, followed by zeroes. I'm completely stumped. As far as I can tell, shared_segment should have room for 4096 4-byte values, and the highest index I write to in shared_segment is 511. I've double-checked the kernel argument for shared_segment, and I think it's correct: a size_t set to ((16*1024)/256) for the size parameter, followed by a NULL for the value.
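
To make the numbers above concrete, here is a minimal sketch of the size arithmetic in plain C. The two helper names are hypothetical (they are not part of OpenCL). The sketch assumes a 4-byte float, and it assumes that the size passed to clSetKernelArg for a __local argument is the byte count of the whole buffer shared by the work-group, which is how the OpenCL specification describes it, rather than a per-thread amount.

```c
#include <stddef.h>

/* Hypothetical helper: how many floats fit in a __local buffer
   whose clSetKernelArg size parameter is arg_size_bytes. */
size_t local_floats_available(size_t arg_size_bytes)
{
    return arg_size_bytes / sizeof(float);
}

/* Hypothetical helper: how many bytes the __local buffer needs
   so that writing to index highest_index stays in bounds. */
size_t local_bytes_needed(size_t highest_index)
{
    return (highest_index + 1) * sizeof(float);
}
```

Under those assumptions, local_floats_available((16*1024)/256) gives the number of floats my current size parameter allows for, and local_bytes_needed(511) gives the size needed to cover the write at tid + 256 when tid is 255.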

Can anyone point out to me what I'm misunderstanding about allocating and using local memory?