Thread: vload4 vs four buffer accesses for local memory buffer

  1. #1
    Junior Member
    Join Date
    Mar 2014
    Posts
    10

    Does vload4 have any advantage over four individual buffer accesses for a local memory buffer?

    For example:

    ////////////////////////////////////////////////////////////
    __local int FOO[256];

    // case 1: one vector load of FOO[0..3]
    int4 pixel = vload4(0, FOO);

    // case 2: four scalar loads of the same elements
    pixel.x = FOO[0];
    pixel.y = FOO[1];
    pixel.z = FOO[2];
    pixel.w = FOO[3];

    ////////////////////////////////////////////////////////////

    Also, does vload4 execute in one kernel clock cycle (assuming no bank conflicts)?

    Thanks!
    Aaron

  2. #2
    A compiler could theoretically tell that case 1 and case 2 are essentially the same. I have seen compilers do this in similar cases, but I can't speak for all compilers. As such, I typically prefer the vload over separate loads so that I'm not relying on compiler tricks.

    As to your second question, nothing in the spec makes clock-level performance guarantees about any operation. Implementation by carrier pigeon would be completely legal. If you have questions about the behavior on a specific platform, I suggest you talk to the hardware vendor of the device you are using.
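
    To make the "essentially the same" point concrete, here is a minimal host-side sketch in plain C of what the two cases read. It is not OpenCL kernel code: the names int4_model, vload4_model and scalar_load4 are hypothetical stand-ins introduced only for illustration. It relies on the documented behavior that vload4(offset, p) reads the four elements starting at p + offset * 4, so the offset counts whole vectors, not elements. It says nothing about speed on any device.

```c
#include <stddef.h>

/* Hypothetical host-side stand-in for OpenCL's int4. */
typedef struct { int x, y, z, w; } int4_model;

/* Models vload4(offset, p): reads the four ints starting at p + offset * 4. */
int4_model vload4_model(size_t offset, const int *p)
{
    const int *base = p + offset * 4;
    int4_model v = { base[0], base[1], base[2], base[3] };
    return v;
}

/* Models case 2: four scalar loads starting at element index `first`. */
int4_model scalar_load4(const int *p, size_t first)
{
    int4_model v;
    v.x = p[first + 0];
    v.y = p[first + 1];
    v.z = p[first + 2];
    v.w = p[first + 3];
    return v;
}
```

    In this model, vload4_model(0, FOO) and scalar_load4(FOO, 0) return the same four values. A read that starts at element 1 has to be written as vload4_model(0, FOO + 1), because an offset of 1 would start at element 4.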

  3. #3
    Junior Member
    Join Date
    Mar 2014
    Posts
    10
    Thanks kunze. Now, what about bank conflicts? If work item one issues memory reads from address 0 to address 4, and
    the next work item reads from address 1 to address 5, then the individual reads would not exhibit a bank conflict. However,
    if vload is used, then it is possible that vload #1 would conflict with vload #2.

  4. #4
    Again, the answer here is architecture-dependent. On the architecture I use, one memory access with four lanes trying to access the same bank is no worse than four memory accesses with no bank conflicts. This should be pretty easy to verify empirically on whatever you're using.
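
    Before measuring on real hardware, the access pattern from post #3 can be reasoned about with a toy bank model. The sketch below is plain host-side C, not kernel code, and everything in it is an assumption made for illustration: 32 banks, each one 32-bit word wide, with word address modulo 32 selecting the bank, and reads of the same address counted once as if broadcast. The names MODEL_NUM_BANKS, conflict_degree and scalar_step_addrs are hypothetical. Real devices may map banks and split vector loads differently.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 32 banks, each one 32-bit word wide. */
#define MODEL_NUM_BANKS 32u

/* For one batch of simultaneous reads, returns the largest number of
 * distinct word addresses that map to the same bank. A result of 1 means
 * conflict-free in this model. Repeated addresses are counted once
 * (assumed to be served by a broadcast). */
unsigned conflict_degree(const unsigned *addrs, size_t n)
{
    unsigned per_bank[MODEL_NUM_BANKS] = {0};
    unsigned worst = 0;
    assert(addrs != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        int seen = 0;
        for (size_t j = 0; j < i; ++j) {
            if (addrs[j] == addrs[i]) { seen = 1; break; }
        }
        if (seen) continue;  /* same address already counted */
        unsigned bank = addrs[i] % MODEL_NUM_BANKS;
        if (++per_bank[bank] > worst) worst = per_bank[bank];
    }
    return worst;
}

/* Overlapping pattern from post #3: at scalar step k, work item i reads
 * element i + k. Fills out[0..n_items-1] with those addresses. */
void scalar_step_addrs(unsigned *out, size_t n_items, unsigned k)
{
    for (size_t i = 0; i < n_items; ++i) out[i] = (unsigned)i + k;
}
```

    In this model each scalar step of the overlapping pattern touches consecutive addresses, so every step is conflict-free. To explore the vload case, pass all four component addresses of every work item in one batch. Whether a device really issues a vector load that way is hardware-specific, so treat the result as a hint for what to benchmark rather than a prediction.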

  5. #5
    Junior Member
    Join Date
    Mar 2014
    Posts
    10
    I tried this out on an HD 7700 series GPU: the best performance came from individual loads, not vloadn.

  6. #6
    Senior Member
    Join Date
    Dec 2011
    Posts
    204
    With that amount of overlapped reads (work items re-reading the same memory other work items just read), this is a good candidate for work-group shared local memory. Make those global memory reads just once, then read the data from local memory as often as you need inside the work items. That will be faster than either individual loads or vloadn. You can code the copy yourself or use async_work_group_copy.
