Thread: async_work_group_strided_copy

  1. #1
    Member
    Join Date
    Jul 2011
    Location
    Moscow, Russia
    Posts
    41

    async_work_group_strided_copy

    Is anyone using this function, or does anyone understand what exactly it does?

    I have a matrix in global memory:
    Code :
    ooooooooooo
    ooooooooooo
    ooooXXXXXoo
    ooooXXXXXoo
    ooooXXXXXoo
    ooooooooooo

    I need to copy a subregion of the matrix into local memory:
    Code :
    XXXXX
    XXXXX
    XXXXX

    For the time being I manually calculate how many element copy operations each work-item within a work-group should perform. The code is not very simple, and it will become much more complex once the number of dimensions of the "matrix" becomes variable (more than 2).

    But I know the initial offset, the number of contiguous regions I need to copy, and the "distance" in the global buffer between these regions. Can I somehow use async_work_group_strided_copy efficiently here instead of the manual calculations?
    Blog (in russian)
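For readers who want to see what the "manual" approach in the question might look like, here is a minimal sketch in plain C. It is an assumption about the scheme, not the poster's actual code: the rectangle is flattened into `w * h` elements and each work-item takes every `lsize`-th one. The names `load_tile_one_item` and `load_tile_group` are hypothetical, and `lid` / `lsize` stand in for `get_local_id(0)` / `get_local_size(0)`; the work-group is emulated on the host with an ordinary loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the work done by ONE work-item.
 * lid and lsize stand in for get_local_id(0) and get_local_size(0). */
static void load_tile_one_item(float *tile, const float *src,
                               size_t pitch, size_t x0, size_t y0,
                               size_t w, size_t h,
                               size_t lid, size_t lsize)
{
    /* Walk the flattened w*h rectangle with a stride of lsize. */
    for (size_t i = lid; i < w * h; i += lsize) {
        size_t row = i / w;   /* row inside the sub-matrix    */
        size_t col = i % w;   /* column inside the sub-matrix */
        tile[i] = src[(y0 + row) * pitch + (x0 + col)];
    }
}

/* Host-side emulation of a whole work-group: run every work-item in turn.
 * In a real kernel the items run in parallel and a barrier follows. */
static void load_tile_group(float *tile, const float *src,
                            size_t pitch, size_t x0, size_t y0,
                            size_t w, size_t h, size_t lsize)
{
    assert(lsize > 0);
    for (size_t lid = 0; lid < lsize; ++lid)
        load_tile_one_item(tile, src, pitch, x0, y0, w, h, lid, lsize);
}
```

For the 6 x 11 matrix drawn in the question, the call would be `load_tile_group(tile, src, 11, 4, 2, 5, 3, lsize)`: pitch 11, top-left corner at column 4 of row 2, and a 5 x 3 tile.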

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: async_work_group_strided_copy

    async_work_group_strided_copy is useful when you have an array of structures (AoS) and you want to transform it into a structure of arrays (SoA), or more specifically, when you have an array of structures and want to extract one of the struct fields.

    In your example, the "width" of your sub-matrix would need to be a builtin CL type, like an int, or a float4.

    If you want to do a rectangular copy, I recommend executing async_work_group_copy() in a loop. Each iteration of the loop would copy one row of the sub-matrix into local memory. The number of iterations of the loop would match the height of the sub-matrix.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.
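The row-by-row suggestion above can be sketched as follows. This is a host-side emulation in plain C, not kernel code: `emu_work_group_copy` is a hypothetical stand-in for `async_work_group_copy` followed by a wait, and `copy_rect_rowwise` is an illustrative name. The comment inside the loop shows roughly what the call would look like in a kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side stand-in for async_work_group_copy plus
 * wait_group_events: copies n contiguous floats. */
static void emu_work_group_copy(float *dst, const float *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Copy a w x h rectangle whose top-left corner is (x0, y0) out of a
 * row-major matrix with `pitch` elements per row, one row per call. */
static void copy_rect_rowwise(float *tile, const float *src,
                              size_t pitch, size_t x0, size_t y0,
                              size_t w, size_t h)
{
    assert(x0 + w <= pitch);
    for (size_t r = 0; r < h; ++r) {
        /* In a kernel this would be roughly:
         *   e = async_work_group_copy(tile + r * w,
         *                             src + (y0 + r) * pitch + x0, w, e);
         * with a single wait_group_events(1, &e) after the loop. */
        emu_work_group_copy(tile + r * w, src + (y0 + r) * pitch + x0, w);
    }
}
```

The number of loop iterations equals the height of the sub-matrix, and each iteration moves one contiguous row of `w` elements.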

  3. #3
    Member
    Join Date
    Jul 2011
    Location
    Moscow, Russia
    Posts
    41

    Re: async_work_group_strided_copy

    Quote Originally Posted by david.garcia
    async_work_group_strided_copy is useful when you have an array of structures (AoS) and you want to transform it into a structure of arrays (SoA), or more specifically, when you have an array of structures and want to extract one of the struct fields.

    In your example, the "width" of your sub-matrix would need to be a builtin CL type, like an int, or a float4.

    If you want to do a rectangular copy, I recommend executing async_work_group_copy() in a loop. Each iteration of the loop would copy one row of the sub-matrix into local memory. The number of iterations of the loop would match the height of the sub-matrix.
    Thanks a lot! My submatrix width is variable; right now it is 5 in one kernel and 6 in another. I don't think there are built-in types with such a width.

    I already tried calling async_work_group_copy in a loop. It is slow. I guess this is because the width is much smaller than the local work size, so a lot of work-items just do nothing. I end up with several times more memory requests per wavefront than when I organize the load manually.

    Thanks again, I am now confident that I am using the best approach.
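The guess about idle work-items can be put into rough numbers. The sketch below is only a counting model under a stated assumption: every copy call costs at least one full pass of the work-group, however few elements it moves. Real implementations of async_work_group_copy may behave differently. The function names and the work-group size of 64 used in the example are illustrative, not taken from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Integer division rounded up. */
static size_t ceil_div(size_t a, size_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Assumed model: one copy call per row, each needing at least one pass
 * of the whole work-group even if the row is much shorter than lsize. */
static size_t passes_rowwise(size_t w, size_t h, size_t lsize)
{
    return h * ceil_div(w, lsize);
}

/* Flattened manual load: all w*h elements are shared among the
 * work-items, so passes are only needed once per lsize elements. */
static size_t passes_flat(size_t w, size_t h, size_t lsize)
{
    return ceil_div(w * h, lsize);
}
```

With a 5 x 3 tile and 64 work-items, the model gives 3 passes for the row-wise loop and 1 for the flattened load, which is consistent with the "several times more" requests reported above.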

  4. #4
    Member
    Join Date
    Oct 2010
    Location
    Vancouver, Canada
    Posts
    66

    Re: async_work_group_strided_copy

    Quote Originally Posted by david.garcia
    async_work_group_strided_copy is useful when you have an array of structures (AoS) and you want to transform it into a structure of arrays (SoA)
    Sorry David, but I have to quibble. async_work_group_strided_copy is not especially useful for an AoS <-> SoA transformation. If it were useful for that, it would take this:

    Code :
    ********XYZW********
    ********XYZW********
    ********XYZW********
    ********XYZW********
    and transform it into

    Code :
    XXXX YYYY ZZZZ WWWW

    But it does not... it instead produces:

    Code :
    XYZW XYZW XYZW XYZW

    It can't even claim to be transposing across a work-group, because it is a global <-> local memory copy rather than a copy into private kernel variables, so the transpose is merely deferred until local memory is read. Unfortunately, it also isn't especially useful for rectangular copies: it can extract only one gentype per row, so your rectangle can be at most 16 elements wide. It is a fixed-stride gather/scatter function (perhaps better called a pack/unpack function?), which limits its utility.
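The packing behaviour described above can be checked with a small host-side emulation of the global-to-local variant. This is a sketch, not the built-in itself: `f4` is a stand-in for `float4`, `emu_strided_copy` is a hypothetical name, and the stride is counted in gentype elements, as in the OpenCL C built-in.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the OpenCL float4 vector type. */
typedef struct { float x, y, z, w; } f4;

/* Emulates the global -> local form of async_work_group_strided_copy:
 * dst is written densely, src is read every src_stride elements. */
static void emu_strided_copy(f4 *dst, const f4 *src,
                             size_t num_gentypes, size_t src_stride)
{
    assert(src_stride > 0);
    for (size_t i = 0; i < num_gentypes; ++i)
        dst[i] = src[i * src_stride];   /* one whole f4 per source row */
}
```

For the diagram above, each row holds five `f4` values with XYZW in the third slot, so the call is `emu_strided_copy(dst, src + 2, 4, 5)`. The destination ends up as four intact XYZW elements back to back, i.e. the "XYZW XYZW XYZW XYZW" layout rather than "XXXX YYYY ZZZZ WWWW".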

  5. #5
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: async_work_group_strided_copy

    Andrew is right. Thanks for the correction.
