I am having a weird issue with my OpenCL code, which I am profiling with NVIDIA Nsight Visual Studio Edition. At first, the profiler reported a large duration (on the order of seconds) for the "clEnqueueNDRangeKernel" command. After I changed my kernel code to use loop unrolling, that duration dropped to the order of milliseconds. However, the duration of the "clEnqueueReadBuffer" command that follows "clEnqueueNDRangeKernel" has now increased to the order of seconds, where it previously took milliseconds. I didn't change anything in my kernel related to data transfer that should affect the "clEnqueueReadBuffer" command.
What is more surprising is that when I comment out that "clEnqueueReadBuffer" command, the duration of the "clEnqueueWriteBuffer" command that came after it increases instead.
The long duration keeps moving to the next command in this way, all the way down to the "clFinish" call on the command queue.
Does anyone know where the problem could be? Is it in the kernel or in the host code?
Any help is appreciated.

Note: The extra duration grows with the size of the data passed.
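
One possibility I cannot rule out is that the enqueue calls return before the device has finished, so the profiler charges the kernel's run time to whichever later call actually waits. Below is a minimal toy model of that assumption in plain C. It is not my real host code and uses no OpenCL API: `ToyQueue`, `toy_enqueue_async`, `toy_enqueue_blocking`, `toy_finish` and the 0.01 ms submission cost are all hypothetical names and numbers chosen for illustration.

```c
/* Toy model of an in-order command queue. Hypothetical names only;
 * nothing here is part of the OpenCL API. All times are in ms. */
typedef struct {
    double host_ms;        /* current time as seen by the host thread */
    double device_free_ms; /* time at which the device finishes all queued work */
} ToyQueue;

static double later(double a, double b) { return a > b ? a : b; }

/* Append work behind everything already queued and return the time
 * at which the device would complete it. */
static double toy_submit(ToyQueue *q, double work_ms) {
    double begin = later(q->host_ms, q->device_free_ms);
    q->device_free_ms = begin + work_ms;
    return q->device_free_ms;
}

/* Non-blocking enqueue: the host only pays an assumed 0.01 ms
 * submission cost. Returns the duration measured on the host. */
double toy_enqueue_async(ToyQueue *q, double work_ms) {
    double start = q->host_ms;
    toy_submit(q, work_ms);
    q->host_ms += 0.01;
    return q->host_ms - start;
}

/* Blocking enqueue: the host waits until this command, and therefore
 * everything queued before it, has completed. */
double toy_enqueue_blocking(ToyQueue *q, double work_ms) {
    double start = q->host_ms;
    q->host_ms = toy_submit(q, work_ms);
    return q->host_ms - start;
}

/* Finish: the host waits for the queue to drain. */
double toy_finish(ToyQueue *q) {
    double start = q->host_ms;
    q->host_ms = later(q->host_ms, q->device_free_ms);
    return q->host_ms - start;
}
```

In this model, starting from a queue initialised to `{0.0, 0.0}`, `toy_enqueue_async(&q, 2000.0)` followed by `toy_enqueue_blocking(&q, 5.0)` reports about 0.01 ms for the first call and roughly 2005 ms for the second, even though the second command itself only does 5 ms of work. If the blocking call is removed, the wait shows up in the next blocking call or in `toy_finish` instead, which resembles what I see in the profiler.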