
Thread: Regarding async_work_group_copy(global to local)

  1. #1

    Regarding async_work_group_copy(global to local)

    Hi folks,

    I have a kernel in which every thread reads a particular element (of a data structure) from global memory.
    In other words, all the threads executing the kernel use the data at the same address in global memory.

    I am trying to use async_work_group_copy to bring the data into shared (local) memory as the first thing in the kernel. Also, as per the OpenCL specification, async_work_group_copy is executed by all threads.

    Is the following possible?

    One thread executes the async copy function to bring the data into shared memory, and the rest of the threads in the work-group use the data brought in by that one thread.

    Or is it better to allow the cache to handle the data accesses in this case?

    -- Bharath

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    All work-items from the same work-group share the same local memory. async_work_group_copy() is a function that loads data from global memory into local memory and it is executed by all work-items in a work-group. In other words, all work-items in the work-group must call async_work_group_copy() with the same arguments.

    After async_work_group_copy() has finished performing the memory transfer, all work-items in the work-group can read from local memory to access the data that was transferred.
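
    A minimal sketch of the usual pattern might look like the following; the kernel name, buffer names, and the work-group size of 64 are assumptions for illustration only:

    // illustrative only; assumes a 1D NDRange with a work-group size of 64
    __kernel void use_async_copy(__global const float *src, __global float *dst)
    {
        __local float tile[64];                                    // one copy per work-group

        // Every work-item in the work-group makes the same call with the same arguments.
        event_t evt = async_work_group_copy(tile, src + get_group_id(0) * 64, 64, (event_t)0);

        // All work-items wait until the transfer has finished before touching the data.
        wait_group_events(1, &evt);

        dst[get_global_id(0)] = tile[get_local_id(0)] * 2.0f;     // now read from local memory
    }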
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3

    Re: Regarding async_work_group_copy(global to local)

    I am not sure if my understanding of the local memory and async copy is correct. If I may ask a few questions...

    I would like to know why the requirement of "same arguments" comes in.

    E.g., the kernel has the following lines...

    __local char temp;
    // copy a single char; async_work_group_copy also takes the number of elements,
    // and the returned event must be waited on before temp is read
    event_t evt = async_work_group_copy(&temp, (__global char *)globalvar, 1, (event_t)0);
    wait_group_events(1, &evt);

    Assuming a work-group has 100 threads, how many variables are present in local memory due to the declaration "__local char temp"? Put another way, if I were able to print the value of &temp, would it be the same for all threads?

    -- Bharath

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    I would like to know why the requirement of "same arguments" comes in.
    Short answer: because the OpenCL specification requires it.

    Long answer: because all work-items in the work-group will perform the copy together. It's not a single thread doing the work. All threads collaborate.

    Assuming a work-group has 100 threads, how many variables are present in local memory due to the declaration "__local char temp"?
    Only one variable (one byte).

    Put another way, if I were able to print the value of &temp, would it be the same for all threads?
    Yes, it will be exactly the same.
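
    As a hypothetical sketch of what that means (the kernel and argument names are made up), one work-item can store into the single local byte and every other work-item in the group can read it back from the same address:

    // illustrative only; names are made up
    __kernel void one_local_byte(__global const char *globalvar, __global char *out)
    {
        __local char temp;                  // exactly one byte per work-group, shared by all work-items

        if (get_local_id(0) == 0)
            temp = globalvar[0];            // a single work-item fills it in

        barrier(CLK_LOCAL_MEM_FENCE);       // make the store visible to the whole work-group

        out[get_global_id(0)] = temp;       // every work-item reads the same byte at the same address
    }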
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  5. #5

    Re: Regarding async_work_group_copy(global to local)

    I get the point. But when several threads try to access the global memory, wouldn't there be clashes that further increase the time to complete the copy?

    Also,

    Assuming the number of threads in a work-group is 512 and 32 (e.g., a warp/wavefront) are scheduled at a time, it would be sufficient for the first 32 (actually, only 1, IMO) to perform the global-to-local copy. Am I right in thinking so?

  6. #6
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    But when several threads try to access the global memory, wouldn't there be clashes that further increase the time to complete the copy?
    There should be no issue.

    Assuming the number of threads in a work-group is 512 and 32 (e.g., a warp/wavefront) are scheduled at a time, it would be sufficient for the first 32 (actually, only 1, IMO) to perform the global-to-local copy. Am I right in thinking so?
    Are you asking whether the copy is performed by a single warp? That doesn't have a single answer. For instance, I would expect some hardware to use a DMA engine for this while other designs would not.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  7. #7

    Re: Regarding async_work_group_copy(global to local)

    Are you asking whether the copy is performed by a single warp? That doesn't have a single answer. For instance, I would expect some hardware to use a DMA engine for this while other designs would not.
    You got my question right, but I don't think I understand the explanation. If the first warp that was scheduled has already brought the required data into local memory, why would the later ones be required to do the same, since the required data is already present?

    I was hoping for something close to prefetch, but into the shared (local) memory.

    Off topic: I am guessing that the global cache (L2 cache?) to which prefetch brings the data is slower than the local (shared) memory. Is this right?
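
    To show what I mean, here is a rough sketch of prefetch versus the async copy (the kernel and argument names are made up); as I understand it, prefetch only hints the data into the global cache, if there is one, while async_work_group_copy places it in local memory:

    // illustrative only; names are made up
    __kernel void compare_prefetch(__global const char *src, __global char *out)
    {
        // per-work-item hint; the data stays in global memory / its cache
        prefetch(src, 1);

        // collective copy into local memory, shared by the whole work-group
        __local char temp;
        event_t evt = async_work_group_copy(&temp, src, 1, (event_t)0);
        wait_group_events(1, &evt);

        out[get_global_id(0)] = temp;
    }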

    -- Bharath

  8. #8
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    If the first warp that was scheduled has already brought the required data into local memory, why would the later ones be required to do the same, since the required data is already present?
    Because each warp will only do part of the copy. Again, this will be done differently in different hardware.

    Off topic: I am guessing that the global cache (L2 cache?) to which prefetch brings the data is slower than the local (shared) memory. Is this right?
    I suggest referring to your hardware vendor's documentation. Some hardware doesn't even have a global memory cache.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  9. #9

    Re: Regarding async_work_group_copy(global to local)

    Because each warp will only do part of the copy. Again, this will be done differently in different hardware.
    So there is no point in having all the threads execute the async copy unless they fetch different data, is there? What about the cases where the number of elements to be fetched is at most the size of a warp? Worse, what if the number of elements is just one, as in my case?

    -- Bharath

  10. #10
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Regarding async_work_group_copy(global to local)

    So there is no point in having all the threads execute the async copy unless they fetch different data, is there? What about the cases where the number of elements to be fetched is at most the size of a warp? Worse, what if the number of elements is just one, as in my case?
    If you are only copying something like a single int, then it's not worth putting that piece of data in local memory. And you are right, in that case a single warp would do all the work and the rest would be idle... assuming that your hardware doesn't use a DMA engine for global->local copies.
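
    A minimal sketch of that suggestion (the kernel and buffer names are made up): skip local memory entirely, let every work-item read the single value straight from global memory, and rely on whatever caching the hardware provides.

    // illustrative only; names are made up
    __kernel void read_single_value(__global const int *single, __global int *out)
    {
        int v = *single;                        // every work-item reads the same global address
        out[get_global_id(0)] = v + 1;          // use it directly; no local copy, no barrier
    }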
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.
