Page 1 of 2 12 LastLast
Results 1 to 10 of 12

Thread: First local memory work

  1. #1

    First local memory work

    Hi,

    I do my first steps for local memory working and try to begin with this global memory working code :
    datas : a={0,1,2,3,....}, b={0,3,6,9,.....}

    Code :
     
    __kernel void localTest(__global const unsigned char *a,
                            __global const unsigned char *b,
                            __global unsigned char *c)
    {
        int x = get_global_id(0);
        c[x] = a[x]+ b[x];
    }

    , where it's OK. (c={0,4,8,12,...})

    The following code generates wrong value :
    Code :
    #define TILE_DIM 16
     
    __kernel void localTest(__global const unsigned char *a,
                            __global const unsigned char *b,
                            __global unsigned char *c)
    {
        int x = get_global_id(0);
     
        __local unsigned char localTemp[2];
     
        localTemp[0] = a[x]; //(*)
        localTemp[1] = b[x]; //(**)
     
        c[x] = localTemp[0] + localTemp[1];
    }

    If I only comment (**) and do (c[x]=localTemp[0], values are correct,(c={0,1,2,3,...})
    if I only comment (*) and do (c[x]=localTemp[1], values are correct too.(c={0,3,6,9,...})
    But the code above give : (c={15,18,21,24,...})

    Can you help me to resolve this issue?
    Thanks in advance.

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: First local memory work

    Local memory is shared across all work-items in a work-group. Right now with that code, all work-items are fighting to read and write into the same localTemp[0] and localTemp[1].

    I suggest looking at some examples like convolution using local memory.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3

    Re: First local memory work

    I'll do that but some are already complex for begin :
    http://developer.download.nvidia.com...e/samples.html

    I "find" how to do work my little code :

    Code :
    #define TILE_DIM 16
     
    __kernel void localTest(__global const unsigned char *a,
                            __global const unsigned char *b,
                            __global unsigned char *c)
    {
        int x = get_global_id(0);
        int lx = get_local_id(0);
     
        __local unsigned char localTemp1[TILE_DIM];
        __local unsigned char localTemp2[TILE_DIM];
     
        localTemp1[lx] = a[x];
        localTemp2[lx] = b[x];
     
    	c[x] = localTemp1[lx] + localTemp2[lx];
    }
    , but I ask me if it's the good way to do, I mean create 2 local array? And, of course, it's not my goal to do work this code.

    If I can, let me ask some questions :
    1 - Does work with local memory means (always or not) work with group size, group id? For the example above , 'a' or 'b' array is much bigger then local arrays (max local memory size), how to manage it? "Max group size" by "max group size" data processing on local memory?
    2 - Does a[x+TILE_DIM] not trying to write to same location localTemp1[lx], but it seems like it work ?

    Thanks.

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: First local memory work

    Right. I should have provided some more info. The way you are doing it now looks good. Yes, the code looks good with two arrays since the data comes from two different places. You could put all the data into a single array, but it would not help.

    Does work with local memory means (always or not) work with group size, group id? For the example above , 'a' or 'b' array is much bigger then local arrays (max local memory size), how to manage it? "Max group size" by "max group size" data processing on local memory?
    The main idea with local memory is this: if you know that many work-items in the same work-group are going to read from the same locations in global memory, then put that data into local memory and read from there instead. Even though you pay the cost of copying data from global to local, it can be a big win if it saves you from having to read from global memory many times.

    Of course, local memory is much smaller than global memory, so you have to divide the work into pieces. Each work-group reads a small piece of data (some KB) from global memory into local memory, does some operations using that local memory and finally writes the result out.

    Do you remember the "local_work_size" parameter to clEnqueueNDRangeKernel()? It's very useful when your kernel uses local memory. In your example, you want local_work_size to be the same as TILE_DIM, so that work-size and local memory variables always match.

    Notice that in your example where you are adding two vectors together, local memory cannot improve performance because each work-item reads from two different places and writes into one. There's no overlap between work-items, so there's no benefit from local memory.

    Does a[x+TILE_DIM] not trying to write to same location localTemp1[lx], but it seems like it work ?
    Yes, you could do everything with a single variable but for what purpose? The code would look like this:

    Code :
    #define TILE_DIM 16
     
    __kernel void localTest(__global const unsigned char *a,
                            __global const unsigned char *b,
                            __global unsigned char *c)
    {
        int x = get_global_id(0);
        int lx = get_local_id(0);
     
        __local unsigned char myBigLocalVariable[TILE_DIM + TILE_DIM];
     
        myBigLocalVariable[lx] = a[x];
        myBigLocalVariable[TILE_DIM+lx] = b[x];
     
       c[x] = myBigLocalVariable[lx] + myBigLocalVariable[TILE_DIM+lx];
    }
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  5. #5

    Re: First local memory work

    Thank you very much for taking time to answer.

    [wisdom]
    I'll meditate it, and take time to practice.
    [/wisdom]

  6. #6

    Re: First local memory work

    I agree with you (I have no choice, you're right : ),and I agree that there is no interest to really use local memory in this case.

    For now, my questions are :
    Theoric:
    -> How to handle local memory in work-groups?
    -> How to handle local memory size versus group size?
    Technical:
    -> Is that all work-items is assigned a local id modulo of work-group size or global work size ?

    Here is an other exemple done after your explanation with using of local memory:
    Description :
    'a' buffer (size = 36):
    0 1 2 3 4 5
    6 7 8 9 10 11
    12 13 14 15 16 17
    18 19 20 21 22 23
    24 25 26 27 28 29
    30 31 32 33 34 35

    let's try to do : a[x] = a[x-1] + a[x] + a[x+1], except extremities.
    for exemple : a[5] = a[4] + a[5]+a[6] = 4+5+6 = 15.
    waiting result :
    'c ' buffer :
    0 3 6 9 12 15
    18 21 24 27 30 33
    36 39 42 45 48 51
    54 57 60 63 66 69
    72 75 78 81 84 87
    90 93 96 99 102 35

    code :
    Code :
     
    #define LOCAL_MEM_SIZE 4  
     
    __kernel void localTest(__global const short *a,
                            __global short *c)
    {
          unsigned int x = get_global_id(0);
        int lx = get_local_id(0);
     
     
        __local short myBigLocalVariable[LOCAL_MEM_SIZE * LOCAL_MEM_SIZE];
    //    barrier(CLK_LOCAL_MEM_FENCE);   
        unsigned int i;
     
     
        if(lx >0 && lx < LOCAL_MEM_SIZE * LOCAL_MEM_SIZE-1)
        {
            c[x] = myBigLocalVariable[lx] + myBigLocalVariable[lx-1] + myBigLocalVariable[lx+ 1];
        }
        else
        {
            c[x] = myBigLocalVariable[lx] ;
        }
    }

    This code returns : c =
    0 3 6 9 12 15
    18 21 24 27 30 33
    36 39 42
    15 16 17
    18 19 20 21 22 23
    24 25 26 27 28 29
    30 31 32 33 34 35

    Of course, it works well if LOCAL_MEM_SIZE 6, but I wanted to illustrate this situation where all local variable array are used and need to reassign values to local memory.

    Hope I managed to explain.

  7. #7
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: First local memory work

    How to handle local memory in work-groups?
    Sorry. I don't understand this question.

    How to handle local memory size versus group size?
    As a first step, try to use all your local memory and try to make your work-groups larger. Once you start fine tuning you will have to measure the performance you get with different work-group sizes. Not sure this answers your question.

    Is that all work-items is assigned a local id modulo of work-group size or global work size ?
    That is not technically necessary. However, for typical algorithms that's what you want to do. That way, each work-item is responsible for loading/storing one "piece" of local memory.

    Of course, it works well if LOCAL_MEM_SIZE 6, but I wanted to illustrate this situation where all local variable array are used and need to reassign values to local memory.
    I think there's something missing from the code you posted since myBigLocalVariable is never initialized. I don't see any difficulty making this work with local memory. Look at this:

    Code :
    #define LOCAL_MEM_SIZE 4  
     
    __kernel void localTest(__global const short *a,
                            __global short *c)
    {
        int x  = get_global_id(0);
        int lx = get_local_id(0);
     
     
        __local short localblock[LOCAL_MEM_SIZE];
     
        // Copy a portion of global memory into local memory.
        // Notice that each work-group will copy a different piece of local memory
        localblock[lx] = a[x];
     
        // Synchronize all work-items so that they all
        // see the latest state of "localblock"
        barrier(CLK_LOCAL_MEM_FENCE);   
     
        if(lx >0 && lx < LOCAL_MEM_SIZE -1)
        {
            // Fast case: all the desired data is already in local memory :)
            c[x] = localblock[lx] + localblock[lx-1] + localblock[lx+ 1];
        }
        else
        {
            // Slow case: one of the samples is not in local memory.
            // ...so we pick the data directly from global memory.
     
            // Handle extreme cases
            if(x > 0 && x < get_global_size(0)-1)
            {
                c[x] = a[x-1] + a[x] + a[x+1];
            }
            else
            {
                c[x] = a[x];
            }
        }
    }
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  8. #8

    Re: First local memory work

    Yes, I had comments to delete and I delete initalizing line too, sorry:
    Code :
      ...
      unsigned int i;
      myBigLocalVariable[lx] = a[x];
     
      if(if(lx >0 && lx < LOCAL_MEM_SIZE * LOCAL_MEM_SIZE-1))
      {}
      .....

    You wrote good example for my question(s) : when I see fast and slow cases on your code, I understand that, in one work group(tell me if I'm right), you compute as much as possible in local memory (that's the goal), and the rest in global memory.
    -> Has all work-groups load only 4 datas in their local memory : for example if work-group = 512, 4 datas are compute in local memory and 508 on global memory?
    -> Is it like this that we should do ?Compute datas that are in local memory,and the rest in global memory ?
    -> Can't/shouldn't we reload/reassign remaining datas in local memory for work with ?
    -> Or should assign it before, one time, intelligently?

    And maybe the real question is
    -> Am I boring?

  9. #9
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: First local memory work

    Has all work-groups load only 4 datas in their local memory : for example if work-group = 512, 4 datas are compute in local memory and 508 on global memory?
    I'm not sure I understand the question. Generally what you do is to reduce the size of your work-groups until the point that you have enough local memory to do all your computations with it and you (almost) don't need to access global memory. Of course, if this means that your work-groups become very small then it's time to re-think the algorithm (small work-groups means less performance).

    In my opinion the best way to learn about this is to look at the examples you get fron nVidia's or AMD's SDK. Start with the most simple examples you can find and you will see how they use local memory in clever ways.

    Don't worry, you ask very good questions
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  10. #10

    Re: First local memory work

    clEnqueueNDRangeKernel :
    ....
    The work-group size to be used for kernel can also be specified in the program source using the __attribute__((reqd_work_group_size(X, Y, Z)))qualifier. In this case the size of work group specified by local_work_size must match the value specified by the reqd_work_group_size __attribute__ qualifier.
    ...
    -> I guess that is what I should use for define work-groups size? Is it the only way, or the right way?
    -> If I assign local work size, does work-group size define from this?
    For example, in this :
    Code :
    	const size_t global_work_size[1] = {1024};
    	const size_t local_work_size[1] = {256};
    -> How work-group (or work-group size) are assigned?

Page 1 of 2 12 LastLast

Similar Threads

  1. Replies: 1
    Last Post: 11-18-2011, 10:05 AM
  2. Local work size!
    By Atmapuri in forum OpenCL
    Replies: 6
    Last Post: 05-21-2011, 08:15 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •