Page 2 of 3 (Results 11 to 20 of 29)

Thread: Regarding async_work_group_copy(global to local)

  1. #11

    Re: Regarding async_work_group_copy(global to local)

    If you are only copying something like a single int, then it's not worth putting that piece of data in local memory. And you are right, in that case a single warp would do all the work and the rest would be idle... assuming that your hardware doesn't use a DMA engine for global->local copies.
    How about this case?

    - Each thread needs, let's say, 1000 elements to complete its work
    - Number of threads in 1 work-group = 1024

    Even in this case, the first thread or the first warp would already have brought in all 1000 elements.

    Somehow it doesn't make sense to me that the threads from the other warps also execute the async copy when the data is already there in shared memory.

    -- Bharath

  2. #12
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    How about this case?

    - Each thread needs, let's say, 1000 elements to complete its work
    - Number of threads in 1 work-group = 1024

    Even in this case, the first thread or the first warp would already have brought in all 1000 elements.
    In most implementations, that's not true. If you have to copy 1000 elements and your work-group size is 1024, the first 1000 work-items will copy one element each and the last 24 work-items will not do any work.

    Again, this is somewhat hardware-dependent.

    Somehow it doesn't make sense to me that the threads from the other warps also execute the async copy when the data is already there in shared memory.
    Each work-item only does a small part of the copy. When you put together all the pieces copied by all the work-items you get the full copy. I don't know how to explain this any better. At the end of the day you will have to trust that the people who implemented async_work_group_copy() knew what they were doing.
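    To make the cooperative split concrete, here is a plain C sketch (not actual OpenCL; the exact work distribution is implementation-dependent) of one plausible strided scheme: with a 1000-element copy and 1024 work-items, work-item wid copies elements wid, wid + 1024, ... so the first 1000 work-items copy one element each and the last 24 copy none, yet together they cover the full array.

    ```c
    #include <assert.h>
    #include <stdio.h>

    #define GROUP_SIZE 1024
    #define NUM_ELEMS  1000

    /* Hypothetical strided distribution one implementation of
     * async_work_group_copy() might use: work-item `wid` copies
     * elements wid, wid + GROUP_SIZE, wid + 2*GROUP_SIZE, ... */
    static int elems_copied_by(int wid)
    {
        int count = 0;
        for (int i = wid; i < NUM_ELEMS; i += GROUP_SIZE)
            count++;
        return count;
    }

    int main(void)
    {
        int total = 0;
        for (int wid = 0; wid < GROUP_SIZE; wid++) {
            int c = elems_copied_by(wid);
            /* first 1000 work-items copy one element each ... */
            if (wid < NUM_ELEMS) assert(c == 1);
            /* ... and the last 24 copy nothing */
            else                 assert(c == 0);
            total += c;
        }
        assert(total == NUM_ELEMS); /* the pieces add up to the full copy */
        printf("all %d elements copied cooperatively\n", total);
        return 0;
    }
    ```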
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  3. #13

    Re: Regarding async_work_group_copy(global to local)

    Each work-item only does a small part of the copy. When you put together all the pieces copied by all the work-items you get the full copy. I don't know how to explain this any better. At the end of the day you will have to trust that the people who implemented async_work_group_copy() knew what they were doing.
    I do get the point regarding how the contents are brought from global to local memory by separate threads. But I would still stick to my point: when every thread depends on the complete set of data being fetched, the thread with local ID 1 will be stalled until the thread with local ID 1000 has (at least) executed the async copy function, which might be much later.

    But yes, I also understand that OpenCL was not tailored for my application.

    -- Bharath

  4. #14
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    But I would still stick to my point: when every thread depends on the complete set of data being fetched, the thread with local ID 1 will be stalled until the thread with local ID 1000 has (at least) executed the async copy function, which might be much later.
    I honestly don't understand where the problem is. When you put some data in local memory it's because you want all work-items in the work-group to access all of that data. In that case the cost of copying the data from global to local memory is usually negligible compared to the alternative of fetching from global memory over and over. If each work-item is only going to access a small piece, then local memory is not needed.
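    A back-of-the-envelope count (plain C, with made-up sizes for illustration) of why the copy cost is usually negligible when the data is reused by the whole group:

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical sizes for illustration only. */
        const long group_size = 1024; /* work-items per work-group     */
        const long elems      = 1000; /* elements each work-item reads */

        /* Without local memory: every work-item fetches every element
         * from global memory itself. */
        const long global_only = group_size * elems;

        /* With local memory: the group cooperatively fetches each element
         * from global memory once; all later reads hit local memory. */
        const long with_local = elems;

        printf("global fetches: %ld vs %ld (%ldx fewer)\n",
               global_only, with_local, global_only / with_local);
        assert(global_only / with_local == group_size);
        return 0;
    }
    ```

    The one-time copy is amortized across every work-item's reuse of the data; when there is no reuse, the copy buys nothing.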

    Perhaps it would be a good idea to share with us what your algorithm looks like so that we can give advice on how to adapt it to OpenCL.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  5. #15

    Re: Regarding async_work_group_copy(global to local)

    Perhaps it would be a good idea to share with us what your algorithm looks like so that we can give advice on how to adapt it to OpenCL.
    Actually, I am solving a knapsack problem.

    We'd have N items with values V(0)..V(N-1) and weights W(0)..W(N-1), and a bag of capacity C. I am currently using the dynamic programming technique, and the kernel would look like

    Code :
    For i=0:N-1
      For j=1:C
        //some code - trivial arithmetic using V[i] and W[i]
      endFor
    endFor
    What goes into the OpenCL kernel is "//some code", and I launch C threads at a time and the kernel is enqueued N times (corresponding to the outer loop).

    During the ith call to the kernel, the code uses the ith element of the V array and the W array.
    I am currently getting some speedup using OpenCL (global memory) for good values of N and C, but I am wondering if I could use the shared memory to improve the performance significantly.
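    For reference, one common form of the 0/1 knapsack recurrence that the "//some code" step might correspond to, sketched in plain C with made-up item data (the actual kernel body isn't shown in the thread): each capacity j plays the role of one thread, and each i-iteration corresponds to one kernel enqueue.

    ```c
    #include <assert.h>

    #define N_ITEMS 3
    #define CAP     50

    /* Hypothetical item data for illustration. */
    static const int V[N_ITEMS] = {60, 100, 120};
    static const int W[N_ITEMS] = {10, 20, 30};

    /* One i-iteration of the DP: what each of the C "threads" (one per
     * capacity j) would compute in the kernel body. dp_prev[j] holds the
     * best value achievable with capacity j using items 0..i-1. */
    static void dp_step(int i, const int *dp_prev, int *dp_next)
    {
        for (int j = 0; j <= CAP; j++) {         /* j plays the thread ID */
            int best = dp_prev[j];               /* skip item i           */
            if (j >= W[i] && dp_prev[j - W[i]] + V[i] > best)
                best = dp_prev[j - W[i]] + V[i]; /* take item i           */
            dp_next[j] = best;
        }
    }

    int knapsack(void)
    {
        int a[CAP + 1] = {0}, b[CAP + 1];
        int *prev = a, *next = b;
        for (int i = 0; i < N_ITEMS; i++) {  /* the N host-side enqueues */
            dp_step(i, prev, next);
            int *t = prev; prev = next; next = t;
        }
        return prev[CAP];
    }

    int main(void)
    {
        assert(knapsack() == 220); /* best: items of weight 20 and 30 */
        return 0;
    }
    ```

    Note that each i-iteration reads only V[i] and W[i], which is what makes passing them as scalar kernel arguments attractive here.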

    -- Bharath

  6. #16
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    Ah, I see. Interesting.

    If all you need in each kernel execution is the value V[i] and W[i] then why not pass them directly to the kernel? The following is easy to implement and puts v_i and w_i in private memory, which is almost synonymous with "in a register".

    Code :
    __kernel void knapsack(..., float v_i, float w_i)
    {
        // ...
    }

    That said, I would recommend reading about parallel solutions to the knapsack problem. I know nothing about the topic, but Google shows quite a few hits.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  7. #17

    Re: Regarding async_work_group_copy(global to local)

    A bit late on this; I was held up debugging a similar implementation.

    If all you need in each kernel execution is the value V[i] and W[i] then why not pass them directly to the kernel? The following is easy to implement and puts v_i and w_i in private memory, which is almost synonymous with "in a register".
    This helped a bit. But I will have to move back to the shared memory usage when the number of elements required by one thread is "many".

    I did manage to understand and make use of the async copy in another similar context, although I did not quite get the speed up initially expected. Realized there was another bottleneck.

    That said, I would recommend reading about parallel solutions to the knapsack problem. I know nothing about the topic, but Google shows quite a few hits.
    Thanks for this suggestion. It will take a while before I digest these.

    -- Bharath

  8. #18

    Re: Regarding async_work_group_copy(global to local)

    Well, I am back with a few more questions.

    Previously, all the threads in a work-group used a single pair of values, val[i] and wgt[i].

    Currently, I am working on a variant of the knapsack problem, called the multiple-choice knapsack problem. For this, each of the threads would need access to the complete val and wgt arrays. I thought it would be appropriate to use the shared memory for this. So, I fetch the whole of the val and wgt arrays into the shared memory. Something like...

    Code :
    __kernel ... (__global value_t *val...)
    {
        __local value_t localvals[NUM_ITEMS];
        // Fetch global->local cooperatively, then wait for completion
        event_t ev = async_work_group_copy(localvals, val, NUM_ITEMS, 0);
        wait_group_events(1, &ev);
        for (int k = 0; k < number_of_values_fetched; ++k) {
            // Work using the values fetched: use localvals instead of val
        }
    }

    I see a good decrease in the number of global load requests (OK) but the amount of GPU time increases when compared to the global memory implementation (val and wgt are in global memory).

    IMO, the performance due to the shared memory implementation should increase as the number_of_values_fetched increases. Am I right in thinking so?

    I can also see the output from the profiler (from Nvidia), but I cannot tell from it where I am losing the time gained from the shared memory accesses.

    I guess I am being a bit vague, but any suggestions on which numbers I could look at to understand what is happening?

    -- Bharath

  9. #19

    Re: Regarding async_work_group_copy(global to local)

    To add to what I have said, I see that the number of branches and divergent branches has increased in the shared memory implementation. Do async_work_group_copy or wait_group_events contribute to the branches in any way? In the rest of the kernel, the conditional branches remain the same for both the shared and global memory implementations.

    -- Bharath

  10. #20
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    I thought it would be appropriate to use the shared memory for this. So, I fetch the whole of the val and wgt array into the shared memory.
    If val and wgt fit in local memory they almost certainly fit in constant memory as well. Have you tried that? The only difference for you is that instead of declaring them as __global you declare them as __constant.

    IMO, the performance due to the shared memory implementation should increase as the number_of_values_fetched increases.
    What is number_of_values_fetched? Is it the same as NUM_ITEMS? If data is read only once from global memory then there will be no benefit in using local memory.

    Where local memory is a win is when the kernel would otherwise fetch the same global memory over and over.

    Do async_work_group_copy or wait_group_events contribute to the branches, in any way?
    Sure they can. It's implementation-dependent.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

