So I'm playing around with a many-body system to test out what kind of processing power I can get out of my FX 3800.
I can get a decent framerate with 10000 particles or fewer when the global work size equals the particle count (or, to be exact, slightly larger: 10240, since the local size is 512 and the global size must be a multiple of it).
Now in the kernel I must loop over every other particle and compute the attraction/repulsion and so on, and this loop is what consumes most of the computing power.
Since the GPU is capable of a much larger global work size than 10k, I tried using a multiple of the particle count as the global work size. For example

Code:
global_ws = local_ws*static_cast<size_t>(ceil(static_cast<float>(N*M)/static_cast<float>(local_ws)));

where N is the particle count and M is the multiple. In the kernel I would then divide the particles into M groups and use

Code:
get_global_id(0)%N

to determine which particle I'm working on, and

Code:
(get_global_id(0) - get_global_id(0)%N)/N

to determine which subset of the particles to compute the interaction from.
However, the performance gain is much smaller than I expected: at M = 2 it is slightly faster, but at M = 10 or higher it is actually slower.
So does anyone have a clue as to why this happens? Shouldn't launching more work-groups lead to more parallel computation and thus make it faster?