So I'm playing around with a many-body (N-body) simulation to see how much processing power I can get out of my FX 3800.

I can get a decent framerate with 10000 particles or fewer when the global work size matches the particle count, rounded up to the next multiple of the local work-group size (10240 to be exact, since the local size is 512).
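(In other words, the global work size is the particle count rounded up to the next multiple of the local size; in integer arithmetic that's roughly the line below, with variable names matching my code further down.)
Code :
// round N up to the next multiple of local_ws: 10000 -> 10240 for local_ws = 512
global_ws = ((N + local_ws - 1) / local_ws) * local_ws;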

Now in the kernel, each work-item must loop over all the other particles and compute the attraction/repulsion and so on, and this loop is what consumes most of the computing power.
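For reference, the loop looks roughly like this (a simplified sketch; the buffer names, the softening term, and the exact force law are placeholders, not my real code):
Code :
__kernel void nbody(__global const float4* pos,   // particle positions
                    __global float4* acc,         // acceleration written per particle
                    const int N)                  // particle count
{
    int i = get_global_id(0);
    if (i >= N) return;                           // padding work-items do nothing

    float4 p = pos[i];
    float4 a = (float4)(0.0f);
    for (int j = 0; j < N; ++j) {                 // loop over all the other particles
        if (j == i) continue;
        float4 d = pos[j] - p;
        float r2 = d.x*d.x + d.y*d.y + d.z*d.z + 1e-6f;  // softened distance^2
        a += d * native_rsqrt(r2 * r2 * r2);             // ~ d / r^3
    }
    acc[i] = a;
}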

Since the GPU can handle a much larger global work size than 10k, I tried using a multiple of the particle count as the global work size. For example:
Code :
global_ws = local_ws * static_cast<size_t>(ceil(static_cast<float>(N * M) / static_cast<float>(local_ws)));
where N is the particle count and M is the multiple. In the kernel I would then divide the particles into M groups and use
Code :
get_global_id(0) % N
to determine which particle I'm working on, and
Code :
(get_global_id(0) - get_global_id(0) % N) / N   // equivalent to get_global_id(0) / N, since integer division truncates
to determine which subset of the particles to compute the interaction from.
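Concretely, the indexing works something like the sketch below (the partial-result buffer, the chunking of j, and the follow-up reduction are assumptions for illustration, not necessarily how my real kernel does it):
Code :
__kernel void nbody_split(__global const float4* pos,         // particle positions
                          __global float4* partial_acc,       // N*M partial sums (hypothetical buffer)
                          const int N,                        // particle count
                          const int M)                        // the multiple
{
    size_t gid = get_global_id(0);
    if (gid >= (size_t)(N * M)) return;       // padding work-items from the round-up

    int i     = (int)(gid % N);               // which particle this work-item updates
    int chunk = (int)(gid / N);               // which subset of the other particles it sums over

    int begin = chunk * (N / M);              // illustrative chunking; the real split may differ
    int end   = (chunk == M - 1) ? N : begin + N / M;

    float4 p = pos[i];
    float4 a = (float4)(0.0f);
    for (int j = begin; j < end; ++j) {
        if (j == i) continue;
        float4 d = pos[j] - p;
        float r2 = d.x*d.x + d.y*d.y + d.z*d.z + 1e-6f;
        a += d * native_rsqrt(r2 * r2 * r2);
    }
    partial_acc[chunk * N + i] = a;           // still needs a reduction over chunks afterwards
}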

However, the performance gain is much smaller than I expected: at M = 2 it is slightly faster, but at M = 10 or higher it is actually slower.

So does anyone have a clue why this happens? Shouldn't more work-groups lead to more parallel computation and thus better performance?