Hi everybody,

I'm doing matrix multiplication with OpenCL.
I split the multiplication across several work-groups,
and then each work-group adds its partial result into global memory.

The code below is the final step, which sums the partial results into the final result:
for(k=0 ; k<group_num ; ++k)
{
    /* each region is a band of 8 rows; start at a different band per group */
    region = (group_id+k)%group_num;
    l=local_id;
    /* work-items stride across the columns of the band */
    while(l<matrix_size)
    {
        c_mat[(8*region+0)*matrix_size+l] += local_output_matrix[(8*region+0)*matrix_size+l];
        c_mat[(8*region+1)*matrix_size+l] += local_output_matrix[(8*region+1)*matrix_size+l];
        c_mat[(8*region+2)*matrix_size+l] += local_output_matrix[(8*region+2)*matrix_size+l];
        c_mat[(8*region+3)*matrix_size+l] += local_output_matrix[(8*region+3)*matrix_size+l];
        c_mat[(8*region+4)*matrix_size+l] += local_output_matrix[(8*region+4)*matrix_size+l];
        c_mat[(8*region+5)*matrix_size+l] += local_output_matrix[(8*region+5)*matrix_size+l];
        c_mat[(8*region+6)*matrix_size+l] += local_output_matrix[(8*region+6)*matrix_size+l];
        c_mat[(8*region+7)*matrix_size+l] += local_output_matrix[(8*region+7)*matrix_size+l];
        l=l+group_size;
    }

    barrier(CLK_GLOBAL_MEM_FENCE);
}
With a matrix size of 64 this code works,
but when the size is increased to 128
the kernel fails with the message: fatal: si_isa_DS_WRITE_B32_impl: invalid address.

However, if I replace the update with either

c_mat[(8*region+0)*matrix_size+l] += const;

or

temp += local_output_matrix[(8*region+7)*matrix_size+l];

the kernel runs, but the answer is obviously wrong.
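
As a sanity check, the index arithmetic of the loop above can be replayed on the host in plain C. This is only a rough sketch under assumptions that are not shown in the kernel: that group_num is matrix_size / 8 (each region covers 8 rows), that the elements are float, and that local_output_matrix is meant to hold matrix_size * matrix_size elements. The helper names max_index_touched and buffer_bytes are made up for illustration.

```c
#include <stddef.h>

/* Highest element index touched by the summation loop, using the same
 * index expression as the kernel: (8*region + row)*matrix_size + l.
 * region runs over 0..group_num-1, row over 0..7, l over 0..matrix_size-1. */
size_t max_index_touched(size_t matrix_size, size_t group_num)
{
    size_t max_idx = 0;
    for (size_t region = 0; region < group_num; ++region) {
        for (size_t row = 0; row < 8; ++row) {
            for (size_t l = 0; l < matrix_size; ++l) {
                size_t idx = (8 * region + row) * matrix_size + l;
                if (idx > max_idx) {
                    max_idx = idx;
                }
            }
        }
    }
    return max_idx;
}

/* Bytes needed for a full matrix_size x matrix_size buffer of float
 * (assumption: the element type is float). */
size_t buffer_bytes(size_t matrix_size)
{
    return matrix_size * matrix_size * sizeof(float);
}
```

With 4-byte floats, buffer_bytes(64) comes to 16384 bytes and buffer_bytes(128) to 65536 bytes, while the highest index stays inside a matrix_size * matrix_size array in both cases. If local_output_matrix is a __local buffer, the byte count can be compared against the CL_DEVICE_LOCAL_MEM_SIZE value reported by clGetDeviceInfo for the device.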

Has anybody run into this fatal error before?