Thread: Global workgroup size and performance

  1. #1
    Junior Member
    Join Date
    Oct 2011
    Posts
    26

    Global workgroup size and performance

    So I'm playing around with a many-body system to test out what kind of processing power I can get out of my FX 3800.

    I can get a decent framerate with 10000 particles or less when the global work-group size is the same as the particle count (or slightly larger to be exact, 10240, as the local size is 512).

    Now in the kernel I must loop through every other particle and compute the attraction/repulsion and so on, and this loop is what uses most of the computing power.

    Since the GPU is capable of a much larger work-group size than 10k, I tried using a multiple of the particle count as the global work-group size. For example:
    Code :
    global_ws = local_ws*static_cast<size_t>(ceil(static_cast<float>(N*M)/static_cast<float>(local_ws)));
    Where N is the particle count and M is the multiple. In the kernel I would then divide the particles into M groups and use
    Code :
    get_global_id(0)%N
    To determine which particle I'm working on and
    Code :
    (get_global_id(0) - get_global_id(0)%N)/N
    to determine which subset of the particles to compute the interaction from.
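
    The index arithmetic above can be modelled on the host to check it. This is only a sketch: the names `WorkItem`, `decode` and `padded_global_size` are made up for illustration and are not from the original kernel. It shows that `(gid - gid%N)/N` is the same as plain integer division `gid/N`, and that the padded work-items past N*M need a guard so they do not touch a particle that does not exist.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical host-side model of what one work-item would compute
// from its global ID (the real kernel would call get_global_id(0)).
struct WorkItem {
    bool active;           // false for padding work-items with gid >= N*M
    std::size_t particle;  // which particle this work-item updates
    std::size_t subset;    // which of the M subsets it interacts with
};

WorkItem decode(std::size_t gid, std::size_t N, std::size_t M) {
    if (gid >= N * M) {
        return {false, 0, 0};  // padding added by rounding up to local_ws
    }
    // gid / N equals (gid - gid % N) / N under integer division.
    return {true, gid % N, gid / N};
}

// Same rounding formula as in the post, wrapped in a function.
std::size_t padded_global_size(std::size_t N, std::size_t M, std::size_t local_ws) {
    return local_ws * static_cast<std::size_t>(
        std::ceil(static_cast<float>(N * M) / static_cast<float>(local_ws)));
}
```

    With N = 10000, M = 2 and a local size of 512, the padded global size is 20480, so the last 480 work-items fall outside N*M and should return early.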

    However it seems the performance gain is much lower than I'd expected. At M = 2 it is slightly faster but if I set M = 10, or higher, it is slower.

    So does anyone have a clue as to why this would happen? Should not more work groups lead to more parallel computations and thus make it faster?

  2. #2
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: Global workgroup size and performance

    You've already got the hardware pretty busy with that much work, so there probably isn't much extra to be gained. Many more threads just increase the overhead and might even cause the job to be batched across multiple runs (== much slower). Extra threads can really only hide memory latency, so if memory bandwidth is already saturated or you are ALU bound they can't help.

    If you are able to set the size arbitrarily, try to match the hardware details: make the work-group size some multiple of 'Kernel Preferred work group size multiple', and make (total global work size / local work size) some integer multiple of 'Max Compute Units'. But once the problem is over a certain size this won't make much difference either (at least going by a few tests).
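
    A rough sketch of that sizing rule, assuming the two hardware values have already been queried (in OpenCL they come from `clGetKernelWorkGroupInfo` with `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE` and `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS`). The names `LaunchSize`, `round_up` and `pick_launch_size` are hypothetical, and the numbers in the note below are illustrative, not measured on any particular card.

```cpp
#include <cassert>
#include <cstddef>

struct LaunchSize {
    std::size_t local_ws;
    std::size_t global_ws;
};

// Round value up to the next multiple of step (step must be > 0).
std::size_t round_up(std::size_t value, std::size_t step) {
    return ((value + step - 1) / step) * step;
}

// Pick a local size that is a multiple of the preferred multiple, and a
// global size whose work-group count is a multiple of the compute units.
// The caller must still check local_ws against the kernel's maximum
// work-group size, which this sketch does not do.
LaunchSize pick_launch_size(std::size_t work_items, std::size_t wanted_local,
                            std::size_t preferred_multiple,
                            std::size_t compute_units) {
    std::size_t local_ws = round_up(wanted_local, preferred_multiple);
    std::size_t groups = round_up(work_items, local_ws) / local_ws;
    groups = round_up(groups, compute_units);
    return {local_ws, groups * local_ws};
}
```

    For example, with 10000 work-items, a wanted local size of 128, an assumed preferred multiple of 32 and an assumed 24 compute units, this gives 96 work-groups and a global size of 12288. The kernel then needs a bounds check for the padded work-items.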

    Depending on the problem and how the solution is implemented, reducing the LWS (local work size) to 64 or 128 might have more of an impact than increasing the global work size and having each work-group do less work.

  3. #3
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: Global workgroup size and performance

    Quote Originally Posted by Peccable
    Code :
    global_ws = local_ws*static_cast<size_t>(ceil(static_cast<float>(N*M)/static_cast<float>(local_ws)));
    OT: Do you really have to do all those acrobatics for simple arithmetic in C plus plus? Damn. I missed a bullet there ...

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Global workgroup size and performance

    OT: Do you really have to do all those acrobatics for simple arithmetic in C plus plus? Damn. I missed a bullet there ...
    There are several ways to write that expression in C++. For example, you can use constructors instead of static casts:

    Code :
    global_ws = local_ws*size_t(ceil(float(N*M)/local_ws));
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  5. #5
    Junior Member
    Join Date
    Oct 2011
    Posts
    26

    Re: Global workgroup size and performance

    Quote Originally Posted by notzed
    You've already got the hardware pretty busy with that much work, so there probably isn't much extra to be gained. Many more threads just increase the overhead and might even cause the job to be batched across multiple runs (== much slower). Extra threads can really only hide memory latency, so if memory bandwidth is already saturated or you are ALU bound they can't help.

    If you are able to set the size arbitrarily, try to match the hardware details: make the work-group size some multiple of 'Kernel Preferred work group size multiple', and make (total global work size / local work size) some integer multiple of 'Max Compute Units'. But once the problem is over a certain size this won't make much difference either (at least going by a few tests).

    Depending on the problem and how the solution is implemented, reducing the LWS (local work size) to 64 or 128 might have more of an impact than increasing the global work size and having each work-group do less work.
    Thanks, you are right. Reducing local work-size to 128 more than halved the time used for computations.
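
    Since the best local size seems to depend on the kernel and the card, one way to find it is simply to time a few candidates. The following is a sketch only: `fastest_local_size` and the `run_once` callback are hypothetical names, and `run_once` is assumed to launch the kernel with the given local size and block until it has finished.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Time run_once(local_ws) for each candidate and return the fastest one.
// run_once is supplied by the caller; it must not return before the
// work it started has actually completed, or the timing is meaningless.
std::size_t fastest_local_size(
        const std::vector<std::size_t>& candidates,
        const std::function<void(std::size_t)>& run_once) {
    assert(!candidates.empty());
    using Clock = std::chrono::steady_clock;
    std::size_t best = candidates.front();
    Clock::duration best_time = Clock::duration::max();
    for (std::size_t local_ws : candidates) {
        Clock::time_point start = Clock::now();
        run_once(local_ws);
        Clock::duration elapsed = Clock::now() - start;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = local_ws;
        }
    }
    return best;
}
```

    In real host code `run_once` would typically wrap `clEnqueueNDRangeKernel` followed by `clFinish`. A single run per candidate is noisy, so averaging several runs and discarding the first would probably give a more trustworthy answer.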

  6. #6
    Junior Member
    Join Date
    Oct 2011
    Posts
    26

    Re: Global workgroup size and performance

    Quote Originally Posted by david.garcia
    There are several ways to write that expression in C++. For example, you can use constructors instead of static casts:
    Code :
    global_ws = local_ws*size_t(ceil(float(N*M)/local_ws));
    Could also be done like this (at the risk of having one superfluous multiple of local_ws):
    Code :
    global_ws = local_ws*((N*M)/local_ws + 1);
    However, shorter code isn't always better or clearer, I'd say.
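
    For comparison, a small sketch of the usual integer round-up idiom, which avoids both the float casts and the superfluous multiple. The function names are hypothetical, and `total` stands for N*M.

```cpp
#include <cassert>
#include <cstddef>

// The "+ 1" version from the post: always adds a work-group, so it
// overshoots by local_ws when total is already an exact multiple.
std::size_t global_size_plus_one(std::size_t total, std::size_t local_ws) {
    return local_ws * (total / local_ws + 1);
}

// Integer ceiling division: rounds up only when there is a remainder.
std::size_t global_size_exact(std::size_t total, std::size_t local_ws) {
    return local_ws * ((total + local_ws - 1) / local_ws);
}
```

    For total = 10000 and local_ws = 512 both return 10240. For total = 10240 the exact version returns 10240, while the "+ 1" version returns 10752.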
