I have a function Run() that enqueues two kernels:
// I'm using the C++ bindings
queue->enqueueNDRangeKernel(*kernelRow, cl::NullRange, *globalRangeRow, *localRangeRow, NULL, eventRow);
queue->enqueueNDRangeKernel(*kernelColumn, cl::NullRange, *globalRangeCol, *localRangeCol, NULL, eventCol);
// As you can see, I'm passing events (eventRow, eventCol) for profiling.
How expensive, in terms of time, is a call to enqueueNDRangeKernel (or clEnqueueNDRangeKernel)?
With the NVIDIA OpenCL profiler, I got a total execution time (on the GPU) of 351 ms, but when I measured the running time of the Run() method
I got 622 ms.
Why is this difference so large?
I tested on an NVIDIA GT240.
I also tested on an ATI HD 5670, where the difference is much smaller.
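For reference, here is a minimal sketch of how the two numbers could be compared on the host side. It assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that the host measurement waits for the queue to finish, since enqueueNDRangeKernel only submits work and returns. The helper names deviceMs and hostMs are hypothetical, and the OpenCL calls appear only in the trailing comment so the snippet compiles without an OpenCL SDK.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Device-side duration in milliseconds from two profiling timestamps.
// OpenCL reports CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END
// in nanoseconds.
inline double deviceMs(std::uint64_t startNs, std::uint64_t endNs) {
    assert(endNs >= startNs);
    return static_cast<double>(endNs - startNs) / 1e6;
}

// Host-side wall-clock duration of an arbitrary callable, in milliseconds.
template <typename F>
double hostMs(F&& work) {
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical usage with the objects from the question (not compiled here):
//   double host = hostMs([&] { Run(); queue->finish(); });
//   double row  = deviceMs(
//       eventRow->getProfilingInfo<CL_PROFILING_COMMAND_START>(),
//       eventRow->getProfilingInfo<CL_PROFILING_COMMAND_END>());
//   // Repeat for eventCol; host minus (row + col) is time spent outside
//   // kernel execution (submission, scheduling, transfers, waiting).
```

The gap between the host figure and the summed event durations is everything the profiler's kernel time does not count, which is the part the question is about.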
When is the data transferred to the GPU: when clEnqueueNDRangeKernel is called, or when the buffer is created (clCreateBuffer)?