with openCL I am computing the normalized cross correlation between two images for each 16x16 patches. So my global worksize is 640x512 and my local worksize is 16x16. The algorithm and the openCL code works fine so far. But sometimes the output image has "missing patches" (see image below).
The distribution of the black quads differs from time to time and sometimes the output image is completely correct.
I guess that the reason of the missing patches are reading the buffer by the host code before the calculation is completely done. So a synchronization should solve the problem. In the handbook I found the options:
- Use clFinish
- Use clEnqueueBarrier
- Use events
I don't know which is the best and I was not able to determine the difference between clFinish and clEnqueueBarrier.
Can please someone explain to me the difference of clFinish and clEnqueueBarrier (with other words than used in the Specification ) and give me are general explanation in which situation which method above is the best.
Thank you very much.
If you use only one command queue and this command queue uses in-order execution mode, and if you always use blocking read/write to access images and buffers, you generally don't have to bother about synchronization between host and device. OpenCL will handle it for you.
If you have several command queues and/or use out-of-order execution mode, then you have to handle synchronization manually.
As for your question about clFinish and clEnqueueBarrier, clFinish executes all the pending commands in the command queue and returns when they are finished. So in fact the host waits until all commands pending in the command queue are executed by the device.
clEnqueueBarrier is totally different and tells the device that the pending commands in the command queue before the barrier will have to all be finished before starting executing the pending commands after the barrier. This command is meaningful mostly in out-of-order execution mode.
For instance, suppose that you have three kernels to execute: K1, K2 and K3, and that K3 execution depends on the result of K1 and K2, but that K1 and K2 are independent. In out-of-order execution mode, K1 and K2 can be executed in parallel, but K3 must be executed only once K1 and K2 have finished.
So you can enqueue K1 and K2. Then enqueue a barrier, then enqueue K3 to ensure that K1 and K2 will have finished before K3 starts. Notice that synchronization is enforced here by the device, not by the host (the barrier is on the device-side). The host is not notified when the barrier is "crossed".
Start by using only one in-order command queue. Enqueue the data upload, the compute, and the data download. Only the download needs to use the "blocking" flag set to TRUE. You don't need clFinish, clFlush, or events. Only when you get this working should you attempt multiple command queues and events.
Thank you for your answers.
In the meantime I learned a lot about openCL, in-order and out-of-order, and your answers helped me to get a better understanding. So I understand now, that in most cases clFinish and clEnqueueBarrier makes sense when using out-of-order queues.
In my special case, I figured out that I had a mistake within the kernel. I mixed barrier(CLK_LOCAL_MEM_FENCE) with barrier(CLK_GLOBAL_MEM_FENCE);