Sometimes I see a huge overhead when running an OpenCL kernel.
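I measure the kernel time in two ways: with OpenCL event profiling (gpu_profiling_time) and with host-side wall-clock timing around the enqueue (gpu_total_time).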
```cpp
double start_time_total = get_time();

cl::Event event;
queue.enqueueNDRangeKernel(kernel,
                           cl::NullRange,  // offset
                           global,
                           cl::NullRange,  // local
                           NULL,           // pre-requisite events
                           &event);
event.wait();  // profiling info is only available once the command has completed

double gpu_profiling_time =
    event.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
    event.getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();
gpu_profiling_time *= 1.0e-9;  // Convert nanoseconds to seconds

double end_time_total = get_time();
gpu_total_time = end_time_total - start_time_total;
```

Here, get_time() uses gettimeofday() to return the current time in seconds as a double.
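The post does not show get_time() itself. A minimal sketch of what it presumably looks like, assuming it simply combines the seconds and microseconds fields of gettimeofday() (the body below is a reconstruction, not the poster's code):

```cpp
#include <cassert>
#include <sys/time.h>  // gettimeofday (POSIX)

// Hypothetical reconstruction of get_time(): wall-clock time in seconds as a
// double, combining the tv_sec and tv_usec fields filled by gettimeofday().
double get_time() {
    struct timeval tv;
    int rc = gettimeofday(&tv, nullptr);
    assert(rc == 0 && "gettimeofday failed");
    (void)rc;  // avoid an unused-variable warning when NDEBUG is defined
    return static_cast<double>(tv.tv_sec) +
           static_cast<double>(tv.tv_usec) * 1.0e-6;
}
```

gettimeofday() is a wall-clock source and can jump if the system time is adjusted; std::chrono::steady_clock (C++11) is a monotonic alternative for measuring intervals.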
When the CPU is used as the OpenCL device, the difference between gpu_total_time and gpu_profiling_time is reasonable.
However, when I use my GPU (AMD 6750M, on a MacBook Pro), the overhead is sometimes huge: 0.000619 s from profiling compared to 0.032589 s measured from the host side, i.e. about 50 times slower.
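OpenCL reports four profiling timestamps per command, in nanoseconds: CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. The code above only uses END minus QUEUED; splitting that interval into stages can show whether the time goes to waiting or to execution. A minimal sketch with hypothetical helper names (ProfilingStamps, breakdown, slowdown) that does the arithmetic on raw timestamps, so it compiles without OpenCL headers:

```cpp
#include <cassert>
#include <cstdint>

// Raw profiling timestamps in nanoseconds, as returned for
// CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END.
// (This struct is a hypothetical helper, not part of the OpenCL API.)
struct ProfilingStamps {
    std::uint64_t queued, submit, start, end;
};

// Per-stage durations in seconds.
struct StageTimes {
    double queued_to_submit;  // command waiting in the queue before submission
    double submit_to_start;   // submitted to the device but not yet running
    double execution;         // START to END: the kernel itself
    double total;             // QUEUED to END: what gpu_profiling_time measures
};

// Signed difference in seconds, so an out-of-order pair shows up as negative
// instead of wrapping around to a huge unsigned value.
double seconds_between(std::uint64_t from, std::uint64_t to) {
    return static_cast<double>(static_cast<std::int64_t>(to - from)) * 1.0e-9;
}

StageTimes breakdown(const ProfilingStamps& s) {
    return {seconds_between(s.queued, s.submit),
            seconds_between(s.submit, s.start),
            seconds_between(s.start, s.end),
            seconds_between(s.queued, s.end)};
}

// Ratio of the host-side measurement to the profiled one.
double slowdown(double host_total, double profiled) {
    assert(profiled > 0.0);
    return host_total / profiled;
}
```

With the numbers above, slowdown(0.032589, 0.000619) is about 52.6. In the real program, the four fields would be filled with event.getProfilingInfo<...>() after the event has completed.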
The problem occurs consistently with specific kernels.
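One way to check whether the extra host-side time is a one-off cost (for example on the first launch) or is paid on every launch is to time several launches and compare the first run with the steady state. A minimal sketch, assuming a hypothetical callable run that stands in for "enqueue the kernel and wait for its event":

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Summary statistics over repeated runs, in seconds.
struct TimingSummary {
    double first;   // first run, may include one-off costs
    double min;     // best run
    double median;  // typical run (upper median when n is even)
};

// Times `run` n times with a monotonic clock.
TimingSummary time_repeated(const std::function<void()>& run, int n) {
    assert(n > 0);
    std::vector<double> samples;
    samples.reserve(n);
    for (int i = 0; i < n; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        run();
        auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    double first = samples.front();  // record before sorting reorders samples
    std::sort(samples.begin(), samples.end());
    return {first, samples.front(), samples[samples.size() / 2]};
}
```

In the real program, run would enqueue the kernel and call event.wait() (or queue.finish()); a first value far above the median would point to a one-off cost rather than a per-launch one.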
Here is the prototype of the kernel if it helps:
```c
kernel void resize(__read_only image2d_t src,
                   __write_only image2d_t dst,
                   int width,
                   int height,
                   float scale_x,
                   float scale_y)
```
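The post does not say what width, height, scale_x and scale_y mean. Assuming, hypothetically, that width and height are the destination size, that scale_x and scale_y map destination pixels back to source pixels, and that there is one work-item per destination pixel, the host-side argument values could be computed like this (ResizeArgs and make_resize_args are illustrative names):

```cpp
#include <cassert>

// Hypothetical host-side values for the resize kernel's scalar arguments.
// Assumption: width/height are the destination size, and scale_x/scale_y are
// source-to-destination ratios (src = dst * scale).
struct ResizeArgs {
    int width, height;
    float scale_x, scale_y;
};

ResizeArgs make_resize_args(int src_w, int src_h, int dst_w, int dst_h) {
    assert(src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
    return {dst_w, dst_h,
            static_cast<float>(src_w) / static_cast<float>(dst_w),
            static_cast<float>(src_h) / static_cast<float>(dst_h)};
}
```

Under the same assumption, global in the enqueue call would be cl::NDRange(args.width, args.height).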
Note that the problem does not occur on Windows with NVIDIA hardware (at least on the specific device I tried).
Any ideas for a solution?
Thanks in advance!