
View Full Version : Global Barriers?



guillona
02-14-2010, 09:28 PM
Currently I'm writing an algorithm where I need a single (very quick) global barrier, after which processing resumes in parallel as before... so basically I have a large amount of parallel work, then all work-items should hit a barrier... one work-item proceeds past it and does some very quick work... then all work-items resume past the barrier.

I don't see that this is possible with OpenCL. The barrier() function is specified to synchronize only within a work-group. That isn't good enough, because I need synchronization at the global_id level.

The other thing to do is to break my kernel into three kernels... kernel_1 does everything in parallel up to the barrier... kernel_2 runs a single work-item doing very little work (a huge waste of a kernel launch, but required for the algorithm), and finally kernel_3 again works in parallel. Obviously I want to avoid the host-side management where I can, because it adds overhead that isn't required.
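For reference, here's roughly what that three-kernel split looks like on the host side (queue, kernel_1/2/3, and n are placeholders; an in-order command queue is assumed, since the queue ordering is what actually provides the barrier between launches):

```c
size_t global = n, one = 1;
// kernel_1: all the parallel work up to the barrier
clEnqueueNDRangeKernel(queue, kernel_1, 1, NULL, &global, NULL, 0, NULL, NULL);
// kernel_2: a single work-item doing the quick sequential step
clEnqueueNDRangeKernel(queue, kernel_2, 1, NULL, &one, NULL, 0, NULL, NULL);
// kernel_3: resume the parallel work
clEnqueueNDRangeKernel(queue, kernel_3, 1, NULL, &global, NULL, 0, NULL, NULL);
```

Each kernel boundary forces all work-items of the previous launch to complete, which is the global barrier I want — but it costs two extra launches.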

Normally I wouldn't care... but this is part of a very time-critical algorithm, and I want to ensure this part is as fast as possible.

Thanks!

dominik
02-16-2010, 03:20 AM
OpenCL only supports synchronization within work-groups. The official way to get global synchronization is to use multiple kernels, as you pointed out. But rather than three kernels I think you only need two: in the first kernel you do all the work up to the barrier, and only one work-item (say the one with global_id 0) does the sequential work. Then in the second kernel you do the remaining parallel work.
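A sketch of what the two kernels might look like (names and the helper functions are illustrative, not from any real code; do_parallel_work, do_sequential_work and resume_parallel_work stand in for your algorithm):

```c
// Kernel 1: the bulk parallel work, plus the quick sequential step
// done by a single work-item. Caveat: this is only safe if the
// sequential step depends on data that is already valid when this
// work-item runs -- there is no guarantee here that other
// work-groups have finished yet.
__kernel void phase1(__global float *data, __global float *result)
{
    size_t gid = get_global_id(0);
    data[gid] = do_parallel_work(data[gid]);
    if (gid == 0)
        *result = do_sequential_work(data[0]);
}

// Kernel 2: enqueued after phase1 on an in-order queue, so the
// kernel boundary acts as the global barrier.
__kernel void phase2(__global float *data, __global const float *result)
{
    size_t gid = get_global_id(0);
    data[gid] = resume_parallel_work(data[gid], *result);
}
```

If the sequential step needs to see results from *all* work-groups, it has to move to its own tiny launch (the original three-kernel plan), since nothing inside phase1 can wait for other work-groups.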

There's a paper at this year's CC conference called "Automatic C-to-CUDA Code Generation for Affine Programs". They say they use

a "single-writer multiple-reader" technique to achieve synchronization across thread blocks using the global memory space

They don't discuss the performance of this technique, though...

dbs2
02-20-2010, 04:58 AM
The "single-writer multiple-reader" thing sounds a lot like one work-item writing a flag while the others spin-wait on it. That may work, but without guarantees about how the hardware schedules work-groups it might also never complete: if the writer's work-group is never scheduled because the spinning work-groups occupy the whole device, the kernel deadlocks. (I've heard that it tends to work on Nvidia hardware.)