I'm writing an algorithm that needs a single, very quick global barrier, after which processing resumes in parallel as before. Basically, I have a large amount of parallel work, then all work-items should hit a barrier, one work-item proceeds past it and does some very quick work, and then all work-items resume past the barrier.

I don't see how this is possible in OpenCL. The barrier() function only synchronizes the work-items within a single work-group. That isn't good enough, because I need to synchronize across the whole global range (every global_id), not just within one work-group.

The alternative is to split the work into three kernels: kernel_1 does everything in parallel up to the barrier; kernel_2 runs a single work-item that does very little work (a huge waste to launch, but required by the algorithm); and kernel_3 resumes the parallel work. Obviously I want to avoid host-side (CPU) management where I can, because it adds overhead that isn't required.
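
To make the three-kernel split concrete, here is a rough sketch of what I mean. The kernel bodies, the argument names (data, shared), and the placeholder arithmetic are made up for illustration; only the kernel_1 / kernel_2 / kernel_3 structure reflects my actual algorithm. I'm assuming the three kernels are enqueued one after another on an in-order command queue, with kernel_2 launched with a global size of 1, so that each kernel boundary acts as the global barrier.

```c
// Hypothetical kernels: names, arguments, and the placeholder work are illustrative only.

// Phase 1: every work-item does its parallel work up to the barrier point.
__kernel void kernel_1(__global float *data)
{
    size_t gid = get_global_id(0);
    data[gid] = data[gid] * 2.0f;   // placeholder for the real parallel work
}

// Phase 2: assumed to be launched with a global size of 1, so one work-item runs.
__kernel void kernel_2(__global const float *data, __global float *shared)
{
    shared[0] = data[0];            // placeholder for the quick serial step
}

// Phase 3: parallel work resumes and can read the result of the serial step.
__kernel void kernel_3(__global float *data, __global const float *shared)
{
    size_t gid = get_global_id(0);
    data[gid] += shared[0];         // placeholder for the remaining parallel work
}
```

The part I would like to avoid is the extra launch for kernel_2, since it exists only to run one work-item between the two parallel phases.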

Normally I wouldn't care, but this is part of a very time-critical algorithm, and I want this part to be as fast as possible.

Thanks!