
Thread: minimal efficient workgroup size

  1. #11

    Re: minimal efficient workgroup size

    Quote Originally Posted by Bilog
    The profiler found in the AMD APP SDK offers an occupancy calculator as well.
    Is there a Mac version of the AMD profiler?

  2. #12
    Senior Member
    Join Date
    Mar 2011
    Location
    Seoul
    Posts
    118

    Re: minimal efficient workgroup size

    There is a Linux version that is command-line only. It's worth a try.

  3. #13
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: minimal efficient workgroup size

    I know it's a bit late, but here's some info on specific questions.

    Quote Originally Posted by yoavhacohen
    Hello,
    As far as I understand, each work-group runs on one warp (wavefront).
    In AMD the wavefront size is 64. Hence, there will be generally no benefit from having more than 16 work-items in each workgroup if the vec_type_hint is float4 (and the compiler uses this hint).

    However, it seems that a WG_SIZE of 64 rather than 16 gives a ~4x speedup in the running time of the kernel.
    I suspect that the compiler ignores the vec_type_hint(float4) hint and compiles the code without vectorizing the float4 operations (i.e. running them one by one, leaving 75% of the warp empty).
    Yes, AFAIK all the GPU implementations ignore vec_type_hint, and the programming model as far as OpenCL is concerned is entirely scalar-per-thread. On AMD hardware until the latest iteration, it is implemented using a 4- or 5-wide instruction, but each thread gets its own 4- or 5-wide ALU. The vec hint is just a way to help a CPU access SIMD units, as CPUs have a very limited number of 'threads'; GPUs don't need such a hint as they're already highly parallel.

    (Note that each thread's ALU on pre-GCN AMD hardware is VLIW, not SIMD: SIMD enforces vectorised algorithms, but AMD gets parallelism from scalar code based purely on data dependencies - so a vec_type hint isn't going to be very useful.)
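To make the VLIW-vs-SIMD distinction concrete, here is a toy sketch (pure illustration; the greedy packer and 4-wide bundle model are assumptions, nothing like the real AMD compiler) of how a VLIW compiler can fill wide issue slots from *scalar* code using only data dependencies:

```python
# Toy VLIW bundle packer: each op is (dest, [source deps]).
# Independent scalar ops pack into one wide bundle; a dependent
# chain is forced to serialize - no vectorised source code needed.

def pack_vliw(ops, width=4):
    """Greedily bundle ops; an op waits until all its inputs exist."""
    ready_at = {}   # value name -> bundle index after which it's available
    bundles = []
    for dst, srcs in ops:
        # earliest bundle after every input has been produced
        i = max((ready_at.get(s, 0) for s in srcs), default=0)
        while True:
            while len(bundles) <= i:
                bundles.append([])
            if len(bundles[i]) < width:   # free issue slot here?
                break
            i += 1
        bundles[i].append(dst)
        ready_at[dst] = i + 1
    return bundles

# Four independent scalar adds fit in one bundle; a chain cannot.
independent = [("a", []), ("b", []), ("c", []), ("d", [])]
chain = [("a", []), ("b", ["a"]), ("c", ["b"]), ("d", ["c"])]
print(len(pack_vliw(independent)))  # 1 bundle
print(len(pack_vliw(chain)))        # 4 bundles
```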

    The AMD doco makes it clear that 64 is the minimum size you want for efficiency. I find it works pretty well as a baseline for most algorithms. If you have small kernels (small register usage, small local memory usage) and a lot of jobs, they can be scheduled on the same processor core to hide latencies; so the optimum work size depends on the code being run and the size of the problem.
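The "small kernels can hide latencies" point is essentially an occupancy calculation. A rough sketch (all the device limits below - registers, LDS, max waves - are made-up illustrative numbers; use the vendor's occupancy calculator for real ones):

```python
# Occupancy sketch: how many wavefronts can a compute unit keep
# resident, given each kernel's register and local-memory appetite?
# More resident wavefronts = more memory latency hidden.
# All hardware limits here are ASSUMED values for illustration only.

def waves_per_cu(regs_per_thread, lds_per_group, group_size,
                 total_regs=16384, total_lds=32768, max_waves=24,
                 wavefront=64):
    """Estimate resident wavefronts per compute unit (toy model)."""
    waves_per_group = max(1, -(-group_size // wavefront))  # ceil div
    by_regs = total_regs // (regs_per_thread * wavefront)
    if lds_per_group:
        by_lds = (total_lds // lds_per_group) * waves_per_group
    else:
        by_lds = max_waves
    return min(by_regs, by_lds, max_waves)

# A lean kernel keeps the CU full; a fat one starves it:
print(waves_per_cu(regs_per_thread=8, lds_per_group=0, group_size=64))
print(waves_per_cu(regs_per_thread=64, lds_per_group=16384, group_size=64))
```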
    In my specific case, I would like to use a minimal but efficient work-group size, as I have a branch in the kernel that allows me to stop the workgroup's job and save some time (it saves ~80% of the time in my CPU implementation). As the break happens in all work-items of the group together, this should not make the performance worse (am I right?).
    You might have to re-think that. Branches that 'save work' can often result in slower code, particularly in an inner loop where any extra work evaluating a terminal condition can add up. But it depends a lot on the algorithm. Except for specific circumstances, all threads execute all paths of all branches; they just mask out results in inactive branches. The specific circumstance is that since the processor executes a wavefront in groups of 16 (AFAIK; maybe it's groups of 64) in sequence, if all threads beyond those completed are terminated then they can avoid being executed at all. So if you're terminating random threads across the wavefront, you will gain nothing but the cost of testing whether they're done.
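A toy lockstep model makes this visible (pure illustration, not real hardware): a wavefront keeps issuing until its *slowest* thread finishes, so threads that exit early are merely masked off.

```python
# Lockstep wavefront model: the wavefront's cost is the max over
# its threads' iteration counts, because early-exiting threads are
# masked out rather than freed.

def wavefront_cycles(iters_per_thread):
    """Iterations a lockstep wavefront actually issues."""
    return max(iters_per_thread)

uniform = [100] * 64          # every thread runs 100 iterations
ragged  = [1] * 63 + [100]    # 63 threads exit early, one straggler
all_early = [1] * 64          # the whole wavefront exits together

print(wavefront_cycles(uniform))    # 100
print(wavefront_cycles(ragged))     # still 100 - early exits saved nothing
print(wavefront_cycles(all_early))  # 1 - only a group-wide exit saves work
```

This is why the poster's case (the whole group tests one shared value and exits together) is the favourable one, while per-thread random exits are not.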

    So long as branches aren't in the innermost loop the cost is small. AMD hardware has some overhead implementing a branch, but branches can often be removed by using branchless logic, i.e. select(), ?:, etc.
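For reference, OpenCL's select(a, b, c) picks b where the condition is set and a otherwise, with no branch. A Python model of the underlying mask trick (illustration only; in a real kernel you'd just call the built-in select() or use ?:):

```python
# Branchless select on 32-bit integers: build an all-ones or
# all-zeros mask from the condition, then blend with AND/OR.

MASK32 = 0xFFFFFFFF

def branchless_select(a, b, cond):
    """Return b if cond else a, using masks instead of a branch."""
    m = -int(bool(cond)) & MASK32      # True -> 0xFFFFFFFF, False -> 0
    return (b & m) | (a & ~m & MASK32)

print(branchless_select(7, 42, True))   # 42
print(branchless_select(7, 42, False))  # 7
```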
    How can I check my hypothesis or understand what's going on there and why does a larger workgroup size gives better performance?
    As suggested: read the vendor documentation. The AMD stuff is quite comprehensive (the AMD APP OpenCL Programming Guide, chapter 4, is all about performance). Some of the magazine articles on the hardware (AnandTech, Tom's Hardware, and so on) are also good for an overview.

  4. #14

    Re: minimal efficient workgroup size

    Thanks a lot for the detailed reply!

    The branching is done by testing some shared value, so all threads in the wavefront should terminate together. And yes, the branching is done in the outer loop, so it's not that expensive.

    I'm developing on Mac, and I have not found anything about automatically splitting kernels across different threads using the vec hint, so thanks for the information.
    BTW, the NVIDIA implementation for Windows explicitly outputs a warning that the vec hint is ignored.

    Why do GPU implementations ignore the vec hint? Is it a real limitation, or just because of the assumption that when you have a lot of threads it would be more efficient to ignore it?

    I could have implemented the intended effect of this vec hint myself if I could write the float4 result of read_imagef directly to the private registers of four threads, but this is not possible in OpenCL without passing it through local memory (right?).
    Is this a hardware limitation or an OpenCL language limitation? (i.e. can the different GPU hardware load image2d_t pixels into the registers of 4 physical threads?)

  5. #15
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: minimal efficient workgroup size

    Quote Originally Posted by yoavhacohen
    Thanks a lot for the detailed reply!
    ...
    Why do GPU implementations ignore the vec hint? Is it a real limitation, or just because of the assumption that when you have a lot of threads it would be more efficient to ignore it?
    It's just not necessary. From the documentation http://www.khronos.org/registry/cl/sdk/ ... fiers.html it's basically a way to utilise a wide SIMD unit; but GPUs don't have such SIMD units, so the hint just isn't appropriate for them.
    I could have implemented the intended effect of this vec hint myself if I could write the float4 result of read_imagef directly to the private registers of four threads, but this is not possible in OpenCL without passing it through local memory (right?).
    Is this a hardware limitation or an OpenCL language limitation? (i.e. can the different GPU hardware load image2d_t pixels into the registers of 4 physical threads?)
    Actually, that isn't what the vec hint is for. It would be more like taking a routine that works on float4 and making it run in a single thread with float8. The hint helps the compiler combine multiple 'OpenCL threads' into single 'CPU threads', not the other way around.
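A sketch of that "multiple OpenCL threads into one CPU thread" direction (the names and the 4-wide lane model are illustrative assumptions, not a real compiler transform):

```python
# On a CPU, the compiler can batch several scalar work-items into
# one thread's SIMD lanes: each pass over `width` work-items stands
# in for one wide instruction. The per-item result is unchanged.

def scalar_kernel(x):
    """One work-item's scalar body: y = 2*x + 1."""
    return 2 * x + 1

def merged_cpu_thread(xs, width=4):
    """Work-items executed width-at-a-time, SIMD-style, in one thread."""
    out = []
    for i in range(0, len(xs), width):
        lanes = xs[i:i + width]                 # 4 work-items -> 4 lanes
        out.extend(2 * x + 1 for x in lanes)    # one "wide" op per pass
    return out

work_items = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
assert merged_cpu_thread(work_items) == [scalar_kernel(x) for x in work_items]
print(merged_cpu_thread(work_items))
```

Note the direction of the merge: many work-items into one wide CPU thread - the opposite of splitting one float4 across four GPU threads, which is what the question asked about.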

    But as GPUs are optimised for float4 (the memory system as well as the ALUs), trying to do that would almost certainly result in slower code: so just stick to float4.


