The behavior of the test kernel below makes absolutely no sense to me. Any clues to what's going on will be much appreciated!

The kernel basically initializes an output array, C, with 1s. It also contains a dummy for-loop that does nothing but initialize two local data structures. With only a few iterations of the for-loop, C is correctly filled with 1s, but with many iterations (e.g. 1000) something goes horribly wrong and C is not filled with 1s. Why does the 1000-iteration loop have this odd effect?

Bonus info: for some reason the problem does not occur when the input arrays are small. It does occur with the following sizes:
C: array of size 11,182,336
A: array of size 3,101,104
rowWidths and rowStartIdxs: arrays of size 3,344
I don't understand why the array sizes should have anything to do with this, but they do.
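
To put the sizes above in perspective, here is a small host-side C sketch of the arithmetic. It is not part of the kernel. It assumes 4-byte ints and a work-group width of 16 (matching BLOCK_SIZE), and the helper names buffer_bytes and groups_per_dim are made up for this illustration.

```c
#include <stddef.h>

/* Size in bytes of a buffer holding `count` 4-byte ints
   (assumption: the device int is 32 bits, as in OpenCL C). */
size_t buffer_bytes(size_t count)
{
    return count * 4u;
}

/* Number of work-groups along one dimension, assuming the global
   size is a multiple of the 16-wide work-group (BLOCK_SIZE). */
size_t groups_per_dim(size_t global_size)
{
    return global_size / 16u;
}
```

Under these assumptions C occupies 44,729,344 bytes and A occupies 12,404,416 bytes. Note also that 3,344 x 3,344 = 11,182,336, so C may be a square 3,344 x 3,344 matrix. If the launch used a 3,344 x 3,344 global range (a guess, the launch parameters are not shown here), that would be 209 x 209 = 43,681 work-groups, each executing the barrier 1000 times.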

I would welcome an explanation of the above mysterious phenomenon.

(FYI: the kernel is a boiled-down version of a larger and more meaningful kernel that suffers from the same "bug".)
Code:
#define BLOCK_SIZE 16
__kernel void test(__global int* C, int CSize, __global int* A, __global int* rowWidths, __global int* rowStartIdxs)
{
    // Work-group and work-item indices in both dimensions
    int bi = get_group_id(0);
    int bj = get_group_id(1);
    int ti = get_local_id(0);
    int tj = get_local_id(1);
    // Global indices handled by this work-item
    int rowAIdx = bi * BLOCK_SIZE + ti;
    int rowBIdx = bj * BLOCK_SIZE + tj;

    int cOut = 1;
    // Dummy loop: only writes to local memory and synchronizes; cOut is never modified
    for (int x = 0; x < 1000; x++) {
        __local int As[BLOCK_SIZE][BLOCK_SIZE];
        __local int Bs[BLOCK_SIZE][BLOCK_SIZE];
        As[ti][tj] = 1;
        Bs[ti][tj] = 1;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // Store the unchanged value 1 in this work-item's element of C
    int c = rowBIdx * CSize + rowAIdx;
    C[c] = cOut;
}
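
As a sanity check on the index arithmetic alone, the kernel's writes can be emulated on the CPU. The following host-side C sketch is only an illustration under stated assumptions: an n x n global range with n a multiple of BLOCK_SIZE, and CSize equal to the row stride n. Neither is confirmed by the code above, and the names count_unwritten and max_index are hypothetical.

```c
#include <stdlib.h>

#define BLOCK_SIZE 16

/* Emulates the kernel's index computation for an n x n global range
   (assumption: CSize == n). Returns how many elements of C are not 1
   afterwards, or -1 if the allocation fails. */
long count_unwritten(int n)
{
    size_t total = (size_t)n * (size_t)n;
    int *C = (int *)calloc(total, sizeof *C);
    if (C == NULL) {
        return -1;
    }

    int groups = n / BLOCK_SIZE;
    for (int bj = 0; bj < groups; bj++) {
        for (int bi = 0; bi < groups; bi++) {
            for (int tj = 0; tj < BLOCK_SIZE; tj++) {
                for (int ti = 0; ti < BLOCK_SIZE; ti++) {
                    int rowAIdx = bi * BLOCK_SIZE + ti;
                    int rowBIdx = bj * BLOCK_SIZE + tj;
                    size_t c = (size_t)rowBIdx * (size_t)n + (size_t)rowAIdx;
                    C[c] = 1;
                }
            }
        }
    }

    long bad = 0;
    for (size_t i = 0; i < total; i++) {
        if (C[i] != 1) {
            bad++;
        }
    }
    free(C);
    return bad;
}

/* Largest index the kernel would compute for an n x n range and a given
   CSize, evaluated in 64-bit arithmetic to expose 32-bit overflow. */
long long max_index(int n, long long csize)
{
    return (long long)(n - 1) * csize + (long long)(n - 1);
}
```

With n = 3344 and CSize = 3344 the largest index is 11,182,335, which is the last element of C and fits comfortably in an int. If CSize were instead the total element count (11,182,336), the largest index would exceed the 32-bit int range, so it is worth double-checking which value is actually passed.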