Results 1 to 5 of 5

Thread: Local memory in matrix multiplication

  1. #1
    Junior Member
    Join Date
    Apr 2013
    Posts
    11

    Local memory in matrix multiplication

    Hi there,

    I was playing around with the memory model theses days until I saw an example
    how to use local memory in matrix multiplication.

    I got two kernels as follow:

    Code :
    // A[M][N] * B[N][P] = C[M][P]
    kernel void mult_mem_global
        (
            const int Mdim,
            const int Ndim,
            const int Pdim,
            global int *A,
            global int *B,
            global int *C
        )
    {
        int k, j, tmp;
        int i = get_global_id(0);
     
        if (i > Mdim) return;
     
        for (j = 0; j < Pdim; j++) {
            tmp = 0;
            for (k = 0; k < Ndim; k++)
                tmp += A[i*Mdim + k] * B[k*Ndim + j];
            C[i*Mdim + j] = tmp;
        }
    }
     
    kernel void mult_mem_private_local
        (
            const int Mdim,
            const int Ndim,
            const int Pdim,
            global int *A,
            global int *B,
            global int *C,
            local int *local_B_column
        )
    {
        int k, j, tmp;
        int i = get_global_id(0);
        int iloc = get_local_id(0);
        int nloc = get_local_size(0);
        int private_A_line[1000];
     
        if (i > Mdim) return;
     
        /* private memory */
        for (k = 0; k < Ndim; k++)
            private_A_line[k] = A[i*Mdim + k];
     
        local int local_B_column[1000];
     
        for(j = 0; j < Pdim; j++) {
            for(k = iloc; k < Ndim; k += nloc) {
                local_B_column[k] = B[k*Ndim + j];
            }
     
            barrier(CLK_LOCAL_MEM_FENCE);
     
            for (k = tmp = 0; k < Ndim; k++)
               tmp += private_A_line[k] * local_B_column[k];
            C[i*Mdim + j] = tmp;
        }
    }

    The first kernel is a simple matrix multiplication and the second one
    does the multiplication using local and private memory. The problem
    is that the results are different.

    I tried multiplying two 500x500 matrixes and using a local work size
    of 250, so I have two work groups, as each work item computes one
    row of the result. However, when comparing the results they are wrong
    and if I use 50 as my local work size, the results are equal.

    I tested it in two NVidia's Geforce 650M and 210. Does anybody know
    what I'm missing?

  2. #2
    Senior Member
    Join Date
    Oct 2012
    Posts
    110
    You need another barrier(CLK_LOCAL_MEM_FENCE) just before C[i*Mdim + j] = tmp.

  3. #3
    Junior Member
    Join Date
    Apr 2013
    Posts
    11
    Thanks, it really worked, but I didn't get way the previous kernel didn't work =/
    I though the barrier was only necessary in that line to synchronize the work-items.
    The C[i*Mdim + j] = tmp is writing to global memory, so I thought it was not needed
    to put a barrier after that. Why is this happened?

  4. #4
    Senior Member
    Join Date
    Oct 2012
    Posts
    110
    You can easily visualize it if you mentally unroll the main loop:

    Code :
    for(k = iloc; k < Ndim; k += nloc)
         local_B_column[k] = B[k*Ndim + j];
     
    barrier(CLK_LOCAL_MEM_FENCE);
     
    for (k = tmp = 0; k < Ndim; k++)
        tmp += private_A_line[k] * local_B_column[k];
     
    C[i*Mdim + j] = tmp;
     
    for(k = iloc; k < Ndim; k += nloc)
        local_B_column[k] = B[k*Ndim + j];
     
    barrier(CLK_LOCAL_MEM_FENCE);
    ...
    You can see that a barrier is needed near "C[i*Mdim + j] = tmp" because the content of local_B_column[] being read just before can be overwritten by the following write to local_B_column[].

  5. #5
    Junior Member
    Join Date
    Apr 2013
    Posts
    11
    Yeah, you're right. Unrolling the loop makes it clear.
    Thanks again for your help =D

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •