Hi. I am testing a simple heterogeneous computing program with OpenCL using one CPU and one GPU.

The CPU job is one NDRangeKernel() call.
The GPU job is two WriteBuffer() calls, one NDRangeKernel() call, and one ReadBuffer() call.

Let's say:
CPU_time = NDRangeKernel(),
GPU_time = 2*WriteBuffer() + NDRangeKernel() + ReadBuffer().

The CPU and GPU jobs are completely independent.
So if the CPU and the GPU run concurrently, I expected the total elapsed time to be max(CPU_time, GPU_time).

But the actual results were close to (CPU_time + GPU_time), which suggests that the CPU and the GPU are not executing in parallel.
So I used a profiler to find out what was wrong.
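
To make the comparison concrete, here is a minimal sketch in C of the two timing models. The function names are made up for illustration; they are not part of my program or of the OpenCL API.

```c
#include <assert.h>

/* Hypothetical helper: elapsed time if both devices really run concurrently.
   The total is bounded by the slower of the two jobs. */
double concurrent_time(double cpu_time, double gpu_time)
{
    assert(cpu_time >= 0.0 && gpu_time >= 0.0);
    return cpu_time > gpu_time ? cpu_time : gpu_time;
}

/* Hypothetical helper: elapsed time if one job only starts after the other
   has finished, which is roughly what I am measuring. */
double serialized_time(double cpu_time, double gpu_time)
{
    assert(cpu_time >= 0.0 && gpu_time >= 0.0);
    return cpu_time + gpu_time;
}
```

A measured total close to serialized_time() rather than concurrent_time() is what makes me think the two devices are not overlapping.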

I observed one strange thing when the CPU has a heavy job and the GPU has a small one (real data: the CPU takes 0.05 sec and the GPU takes 0.01 sec).
It seems that because the CPU is busy with its own computation, the first WriteBuffer() operation is delayed until the CPU kernel completes.
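
One way to check this against profiler data is to compare event timestamps. Below is a minimal sketch, assuming nanosecond timestamps such as those clGetEventProfilingInfo() reports for CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_START, and CL_PROFILING_COMMAND_END. It also assumes the CPU and GPU timestamps share a common time base, which may not hold across devices. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how long a command waited between being enqueued
   and actually starting on the device. */
uint64_t start_delay_ns(uint64_t queued_ns, uint64_t start_ns)
{
    assert(start_ns >= queued_ns);
    return start_ns - queued_ns;
}

/* Hypothetical helper: length of the overlap between two intervals
   [a_start, a_end] and [b_start, b_end]; 0 if they do not overlap. */
uint64_t overlap_ns(uint64_t a_start, uint64_t a_end,
                    uint64_t b_start, uint64_t b_end)
{
    assert(a_end >= a_start && b_end >= b_start);
    uint64_t lo = a_start > b_start ? a_start : b_start; /* later start  */
    uint64_t hi = a_end < b_end ? a_end : b_end;         /* earlier end  */
    return hi > lo ? hi - lo : 0;
}
```

If the first WriteBuffer() shows a start delay of roughly the CPU kernel's duration, and the overlap between the CPU kernel interval and the GPU command intervals is zero, that would match what I am seeing.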

Has anyone had this problem before?