
Thread: minimal efficient workgroup size

  1. #1

    minimal efficient workgroup size

    Hello,

    I'm working on Mac OS 10.7, with an AMD Radeon 6750M.
    I wrote an OpenCL kernel, declared with the following attributes:

    Code :
    __kernel
    __attribute__((vec_type_hint(float4)))
    __attribute__((reqd_work_group_size(1, WG_SIZE, 1)))
    void my_kernel(...)
    {
      // do something with float4 pixels
      local float shared_res;
      local float tmp[WG_SIZE];
      for (int i = 1; i < N; i++)
      {
         float4 v = read_imagef(...);
         tmp[get_local_id(1)] = foo(v);
         barrier(CLK_LOCAL_MEM_FENCE); // local barrier
         sum(tmp, &shared_res); // sum tmp and write the result to shared_res
         if (shared_res > SOME_VALUE) break;
      }
    }

    As far as I understand, each work-group runs on one warp (wavefront).
    On AMD hardware the wavefront size is 64, so there should generally be no benefit from having more than 16 work-items in each workgroup if the vec_type_hint is float4 (and the compiler actually uses this hint).

    However, setting WG_SIZE to 64 rather than 16 gives roughly a 4x improvement in the kernel's running time.
    I suspect that the compiler ignores the vec_type_hint(float4) hint and compiles the code without vectorizing the float4 operations (i.e., it runs them one by one, leaving 75% of the wavefront idle).

    In my specific case, I would like to use a minimal but still efficient work-group size, because I have a branch in the kernel that lets me stop the workgroup's job early and save time (it saves ~80% of the time in my CPU implementation). Since the break happens in all work-items of the group together, it should not hurt performance (am I right?).

    How can I check my hypothesis, or otherwise understand what is going on and why a larger workgroup size gives better performance?

    Thanks in advance,
    Yoav

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: minimal efficient workgroup size

    As far as understand, each work-group runs on one warp (wavefront).
    You may want to look again at your vendor's OpenCL programming guide. One work-group typically contains multiple warps/wavefronts/waves.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3

    Re: minimal efficient workgroup size

    Right, but does the GPU run more than WARPSIZE work-items from the same workgroup at a time?

  4. #4
    Junior Member
    Join Date
    Jan 2012
    Posts
    2

    Re: minimal efficient workgroup size

    The work-group size and the underlying compute unit size do not have to match. That said, you can write kernels that take advantage of knowledge of the underlying architecture. In this case, you can have a macro that gets passed to the kernel compiler to indicate the warp/wavefront size, and then use local memory to do some work across the workgroup. If your workgroup size matches the compute unit size, the compiler may be able to optimize away things like barriers.

    BTW, the workgroup AND compute unit size have nothing to do with the size of the memory buffer you pass into the kernel. It simply indicates how many cores will be used concurrently. Each core could operate on 1, 2, 3,... etc. bytes from the buffer.
    The exception is when you have vector core architectures, like AMD's VLIW or Intel/AMD's SSE/AVX - then using float4, for example, helps the compiler vectorize the work.

  5. #5
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: minimal efficient workgroup size

    Right, but does the gpu runs more than WARPSIZE work items from the same workgroup at a time?
    Reading your hardware vendor's programming guide you may have noticed that GPU hardware typically supports multiple warps/wavefronts/waves in each compute unit. Hardware vendors spend silicon supporting this because it is highly beneficial to performance, even though at any point in time the compute unit only executes a single wave. The performance boost comes from the fact that when a wave is executing an expensive operation, such as a global memory load, the hardware switches to a different wave instead of waiting for the memory transaction to complete.

    In other words, if you want a kernel to run fast on a GPU, the work-group size you choose must be significantly larger than the wave size.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  6. #6

    Re: minimal efficient workgroup size

    Thanks for the detailed replies!

    In other words, if you want a kernel to run fast on a GPU, the work-group size you choose must be significantly larger than the wave size.
    Can't the GPU run another workgroup in parallel on the same compute unit to hide latency?

  7. #7
    Junior Member
    Join Date
    Dec 2011
    Posts
    15

    Re: minimal efficient workgroup size

    Quote Originally Posted by yoavhacohen
    Thanks for the detailed replies!

    In other words, if you want a kernel to run fast on a GPU, the work-group size you choose must be significantly larger than the wave size.
    Can't the GPU run another workgroup in parallel on the same compute unit to hide latency?
    I believe that is the case, but typically a workgroup will have many times more threads than would fill a single warp/wavefront. This may be because the hardware scheduler is more efficient at managing many threads within a workgroup as opposed to managing many blocks; I'm not sure of the details.

    --
    jason

  8. #8

    Re: minimal efficient workgroup size

    Quote Originally Posted by yoavhacohen
    In other words, if you want a kernel to run fast on a GPU, the work-group size you choose must be significantly larger than the wave size.
    Can't the GPU run another workgroup in parallel on the same compute unit to hide latency?
    Answering my own question:
    The GPU can run other workgroups in parallel to hide latency, but only if (the kernel requirements)*(# active workgroups) do not exceed the GPU resources.

    If the kernel requirements are high, then the number of active workgroups will be low, and the GPU will not be able to hide latency well. This is measured by "occupancy"; NVIDIA provides an occupancy calculator for their devices.

  9. #9

    Re: minimal efficient workgroup size

    Quote Originally Posted by yoavhacohen
    If the kernel requirements are high, than the number of active workgroups will be low, and the GPU will not be able to hide latency well. This is measured by "occupancy". NVidia has Occupancy calculator for their devices.
    The profiler found in the AMD APP SDK offers an occupancy calculator as well. By launching something like
    Code :
    sprofile -o results.csv -O ./yourapp
    you will get the performance counters in results.csv and the occupancy analysis in results.occupancy. The occupancy analysis from the AMD profiler tells you how many wavefronts per compute unit you can get, what is limiting your wavefronts/CU number (workgroup size, kernel requirements for registers or LDS, etc), and the % occupancy of the CUs.
