Page 3 of 3
Results 21 to 29 of 29

Thread: Regarding async_work_group_copy(global to local)

  1. #21

    Re: Regarding async_work_group_copy(global to local)

    If val and wgt fit in local memory they almost certainly fit in constant memory as well. Have you tried that? The only difference for you is that instead of declaring them as __global you declare them as __constant.
    Although it's a good idea to put val and wgt into the __constant address space, I am not sure it helps with performance. At best, I have come across an Nvidia article (or post) which says that the constant cache is faster than the global cache. Or are you trying to say something else here? I did try this change and swept across the range of variables with no good result.

    What is number_of_values_fetched? Is it the same as NUM_ITEMS? If data is read only once from global memory then there will be no benefit in using local memory.
    The number_of_values_fetched actually varies with each kernel invocation, between... say 10 and 2000. I will try to give a better picture of the kernel here.

    Code :
     
    //Shared memory
     
    __kernel void mckp(__global value_t *val,
                       __global weight_t *wgt,
                       __global value_t *soln,
                       __global int *keep,
                       const int i,
                       const unsigned int capacity,
                       const unsigned int num_choices,
                       const unsigned int offset)
    {
      __local value_t locval[SB_SIZE];
      __local weight_t locwgt[SB_SIZE];
     
      event_t copydone;
     
      copydone = async_work_group_copy(locval, val + offset,
                                       num_choices, (event_t)0);
     
      copydone = async_work_group_copy(locwgt, wgt + offset,
                                       num_choices, copydone);
     
      // Some irrelevant code
     
      wait_group_events(1, &copydone);
     
      for (unsigned int k = 0; k < num_choices; ++k) {
        // Code which accesses locval[k] once and locwgt[k] once
        // More irrelevant code
      }
    }

    Code :
    //Without shared memory
    __kernel void mckp(__constant value_t *val,
                       __constant weight_t *wgt,
                       __global value_t *soln,
                       __global int *keep,
                       const int i,
                       const unsigned int capacity,
                       const unsigned int num_choices,
                       const unsigned int offset)
    {
      // Some code
     
      for (unsigned int k = 0; k < num_choices; ++k) {
        // Code accessing val[offset+k] once and wgt[offset+k] once
        // More irrelevant code
      }
    }

    Where local memory is a win is where the kernel would fetch from the same global memory over and over.
    Well, although a single thread uses a particular value only once, taking the bigger perspective of the whole work-group leads me to think...

    Without shared memory:
    Total number of load requests = 2 * num_choices * work_group_size (Global), with at least 2 * num_choices cache misses.

    With shared memory:
    Total number of load requests = 2 * num_choices (Global) +
    2 * work_group_size * num_choices (Local / Shared)


    But I might be wrong with the above *thoughts*.

    I am not sure if this is the right place for the question, but is there any way to check how the GPU time inside the kernel was spent? I am not too sure if the Nvidia profiler helps me there.

    -- Bharath

  2. #22
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    OK, I get it. Yes, the greater num_choices is, the more benefit you will see from local memory. Your analysis of the memory accesses looks right. It's possible that for smallish values of num_choices local memory will not be worth it.

    but is there any way to check how the GPU time inside the kernel was spent?
    That's a great question. Unfortunately there's no standard way to do this. You will have to use proprietary profilers. I'm quite sure that NVidia supports this feature -- try asking in their forums
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  3. #23

    Re: Regarding async_work_group_copy(global to local)

    That's a great question. Unfortunately there's no standard way to do this. You will have to use proprietary profilers. I'm quite sure that NVidia supports this feature -- try asking in their forums
    Although I didn't get a response from the Nvidia forums, I figured out that there is a way to get some information from inside the kernel using a profiler trigger, but I don't think I will be able to use it: (1) the kernel compiler does not recognise the function, and (2) I am not sure whether the trigger works with Compute Capability 2.0, since the documentation does not specify this.

    But apart from these, I have tried out a few things which gave me results that are tough to understand, but mostly good.

    I tried my hand at CUDA, with both global and shared memory implementations. The shared memory was used like...

    Code :
    // somewhere inside the kernel
     
    // locbuf / globalbuf are placeholder names for the local and global buffers
    const size_t lid = get_local_id(0);
    if (lid < num_choices)
      locbuf[lid] = globalbuf[lid];  // each work-item fetches the lid-th item from global memory
     
    // use the shared (local) memory

    I could clearly see that the shared memory gives a good speed-up even with not-so-large values of num_choices.

    With this result, I tried doing the same with OpenCL, i.e. without using async_work_group_copy and wait_group_events. Surprisingly, it was at least 20% faster than the async_work_group_copy version, which was as fast as or slower than the global memory version. And it was functionally correct too.

    I then realised I had forgotten to synchronise the threads after copying the data from global to shared memory (in both OpenCL and CUDA), so I added a barrier (barrier(CLK_LOCAL_MEM_FENCE) for OpenCL, __syncthreads() for CUDA) before using the shared memory. With the barrier, the OpenCL timings went up, beyond even the global memory implementation, while the impact of the barrier was small in the CUDA case. To sum up...
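    For concreteness, the fix is just a barrier between the cooperative fetch and the first use of the local buffers. A kernel-side sketch in OpenCL C (locval, locwgt, val, wgt, offset and num_choices as in the kernel posted earlier):

```
const size_t lid = get_local_id(0);
if (lid < num_choices) {
  locval[lid] = val[offset + lid];   // cooperative fetch, one item per work-item
  locwgt[lid] = wgt[offset + lid];
}
barrier(CLK_LOCAL_MEM_FENCE);        // make all copies visible before first use

// ... use locval[k] / locwgt[k] ...
```

    Without the barrier, a work-item may read a local slot before the work-item responsible for filling it has done so, so the speed-up of the barrier-free version comes at the cost of correctness in general.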

    Slowest to fastest:

    CPU: 158629.718750
    GPU-CUDA-GLOBAL: 919.958984
    GPU-CUDA-SHARED-WITH-BARRIER: 790.421021
    GPU-CUDA-SHARED-WITHOUT-BARRIER: 779.372986
    GPU-OPENCL-SHARED-WITH-BARRIER: 701.237000

    GPU-OPENCL-SHARED-WITH-ASYNC-WITH-GROUP-WAIT: 700.955017
    GPU-OPENCL-SHARED-WITH-ASYNC-WITHOUT-GROUP-WAIT: 698.851990
    GPU-OPENCL-GLOBAL: 680.328003
    GPU-OPENCL-SHARED-WITHOUT-BARRIER: 607.349976

    For the runs, the number of items to be fetched varied from 200 to 300, and the number of threads in a work-group (thread-block in CUDA) was 512.

    I cannot in particular make a remark about the implementation since I don't have any other reference, but in general...

    - async_work_group_copy seems to be slower than program guided fetching
    - Barriers are damn costly!! Is this the case?

    Any comments on these?

    -- Bharath

  4. #24
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    - Barriers are damn costly!! Is this the case?
    Yes, they are rather costly. They are also necessary.

    async_work_group_copy seems to be slower than program guided fetching
    That's not very surprising either. async_work_group_copy has to support odd ratios between work-items and the amount of memory to copy so there's some overhead to support all cases. In your kernel each work-item copies at most a single item from global memory, right?

  5. #25

    Re: Regarding async_work_group_copy(global to local)

    Yes, they are rather costly. They are also necessary.
    So that would mean we need to have enough work with the shared data to overcome the prefetch+sync time. I see that the CUDA implementation does better here, but I guess that depends on the implementation.

    That's not very surprising either. async_work_group_copy has to support odd ratios between work-items and the amount of memory to copy so there's some overhead to support all cases. In your kernel each work-item copies at most a single item from global memory, right?
    Yes, that is right, at most one. Ah, and I understand that async_work_group_copy has to handle many cases, including those where amount-of-data > number-of-threads.

    -- Bharath

  6. #26

    Re: Regarding async_work_group_copy(global to local)

    Regarding barriers, does the cost of using a barrier depend on the workgroup size? Like, lesser number of threads per workgroup, lesser is the time?

    -- Bharath

  7. #27
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    Like, lesser number of threads per workgroup, lesser is the time?
    Yeah, I think that's a fair assumption. However, overall small work-groups are typically less efficient than large work-groups. At least on GPUs.

  8. #28
    Junior Member
    Join Date
    May 2011
    Posts
    1

    Re: Regarding async_work_group_copy(global to local)

    Bharath, thanks for asking these questions. I'm a complete newbie, so this was a great read.

  9. #29
    Junior Member
    Join Date
    May 2011
    Posts
    1

    Re: Regarding async_work_group_copy(global to local)

    That thread was useful for me too. Thanks, guys.


