While looking at the OpenCL sample programs in the Nvidia SDK, I noticed that for profiling they don't measure the runtime of the first run. I'm familiar with that concept from CPU benchmarking, where e.g. the caches get populated during the first run.
I'm now trying to understand what happens on the GPU when a kernel is executed for the first time. I noticed that when using clCreateBuffer with CL_MEM_COPY_HOST_PTR, the startup time is a lot higher than with clCreateBuffer followed by clEnqueueWriteBuffer (e.g. 46 ms vs. 5 ms for BlackScholes).
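For reference, the two allocation paths I'm comparing look roughly like this (error checking omitted; `ctx`, `queue`, `host_data`, and `size` are set up elsewhere, so this is only a sketch, not a complete program):

```c
#include <CL/cl.h>

cl_int err;

/* Path 1: allocate and copy in one call. As far as I can tell, the
   actual host-to-device transfer may be deferred until the runtime
   knows which device the buffer will actually be used on. */
cl_mem buf1 = clCreateBuffer(ctx,
                             CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             size, host_data, &err);

/* Path 2: allocate first, then copy explicitly. CL_TRUE makes the
   write blocking, so the data should be on the device (or at least
   out of the host buffer) when the call returns. */
cl_mem buf2 = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);
err = clEnqueueWriteBuffer(queue, buf2, CL_TRUE, 0, size,
                           host_data, 0, NULL, NULL);
```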
I guess that's because when using CL_MEM_COPY_HOST_PTR, the memory is only copied once the kernel is launched on a specific device, since until that point it is not known which device the data should go to (a context can contain multiple devices).
But even with a blocking clEnqueueWriteBuffer, the first kernel execution still takes longer than subsequent ones. I then suspected that the kernel binary needs to be copied to the device, but there seems to be no correlation between the binary size and the kernel startup overhead.
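This is roughly how I time the individual launches, using OpenCL event profiling (it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and that `queue`, `kernel`, and `global` exist; again just a sketch):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Time each launch with device-side timestamps; on my setup the
   i == 0 iteration consistently reports a higher time. */
for (int i = 0; i < 5; ++i) {
    cl_event ev;
    cl_ulong start, end;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           0, NULL, &ev);
    clWaitForEvents(1, &ev);

    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    printf("run %d: %.3f ms\n", i, (end - start) * 1e-6);

    clReleaseEvent(ev);
}
```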
Does anyone have more insight into what happens when a kernel is executed for the first time on a device?
I'm using an Nvidia Tesla T10 with the Nvidia driver.