My kernels take about 5 seconds to run when I call clFinish() after enqueuing each one. When I remove all the clFinish() calls, the run takes only 2.2 seconds and the results are exactly the same. I use only a single command queue. In this case, do I have to call clFinish() or clFlush() at all?

The spec doesn't seem to explain in detail how a command queue works. According to it, although clEnqueueReadBuffer performs an implicit flush, there is no guarantee that the queued commands will have completed when clFlush returns. To me, that suggests clFinish() still has to be called to ensure all the tasks in the queue are finished before calling clEnqueueReadBuffer to transfer the data back to the CPU.

So could anyone tell me why I still get correct results after removing all the clFinish() calls? Is it just an accident, or is this the right way to use OpenCL?

Thanks in advance.