Hi there,

I was playing around with the memory model theses days until I saw an example
how to use local memory in matrix multiplication.

I got two kernels as follow:

Code :
// A[M][N] * B[N][P] = C[M][P]
kernel void mult_mem_global
    (
        const int Mdim,
        const int Ndim,
        const int Pdim,
        global int *A,
        global int *B,
        global int *C
    )
{
    int k, j, tmp;
    int i = get_global_id(0);
 
    if (i > Mdim) return;
 
    for (j = 0; j < Pdim; j++) {
        tmp = 0;
        for (k = 0; k < Ndim; k++)
            tmp += A[i*Mdim + k] * B[k*Ndim + j];
        C[i*Mdim + j] = tmp;
    }
}
 
kernel void mult_mem_private_local
    (
        const int Mdim,
        const int Ndim,
        const int Pdim,
        global int *A,
        global int *B,
        global int *C,
        local int *local_B_column
    )
{
    int k, j, tmp;
    int i = get_global_id(0);
    int iloc = get_local_id(0);
    int nloc = get_local_size(0);
    int private_A_line[1000];
 
    if (i > Mdim) return;
 
    /* private memory */
    for (k = 0; k < Ndim; k++)
        private_A_line[k] = A[i*Mdim + k];
 
    local int local_B_column[1000];
 
    for(j = 0; j < Pdim; j++) {
        for(k = iloc; k < Ndim; k += nloc) {
            local_B_column[k] = B[k*Ndim + j];
        }
 
        barrier(CLK_LOCAL_MEM_FENCE);
 
        for (k = tmp = 0; k < Ndim; k++)
           tmp += private_A_line[k] * local_B_column[k];
        C[i*Mdim + j] = tmp;
    }
}

The first kernel is a simple matrix multiplication and the second one
does the multiplication using local and private memory. The problem
is that the results are different.

I tried multiplying two 500x500 matrixes and using a local work size
of 250, so I have two work groups, as each work item computes one
row of the result. However, when comparing the results they are wrong
and if I use 50 as my local work size, the results are equal.

I tested it in two NVidia's Geforce 650M and 210. Does anybody know
what I'm missing?