I thought that by creating two queues for one device, it would be possible to overlap data transfers and kernel execution.
I transfer the data for the first kernel to the device. Once it's there, I start the first kernel. In the meantime I transfer the data for the second kernel to the device (to a different part of the device memory, of course).
In theory, it should be possible to overlap the execution of the first kernel with the data transfer for the second kernel, right?
I wrote a small test program and used profiling to see when the commands are actually executed. But even when the first kernel runs for quite a while, the second data transfer to the device starts only after the kernel execution has finished. Is that a limitation of the hardware? Or of the Nvidia implementation? Or am I missing something?
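A minimal sketch of how the profiling data could be checked for overlap, assuming the API in question is OpenCL and that each command's start and end timestamps have already been read as nanosecond values with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END). The names cmd_interval and overlap_ns are hypothetical and only serve the illustration. It also assumes that timestamps from both queues come from the same device clock and can be compared directly.

```c
#include <assert.h>
#include <stdint.h>

/* Execution window of one command in nanoseconds (hypothetical type).
   The values are assumed to come from the profiling info of the
   command's event, i.e. its START and END timestamps. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} cmd_interval;

/* Returns how many nanoseconds the two commands were executing at the
   same time. A result of 0 means they ran strictly one after the other. */
uint64_t overlap_ns(cmd_interval a, cmd_interval b)
{
    assert(a.end_ns >= a.start_ns);
    assert(b.end_ns >= b.start_ns);

    /* The shared window runs from the later start to the earlier end. */
    uint64_t latest_start = a.start_ns > b.start_ns ? a.start_ns : b.start_ns;
    uint64_t earliest_end = a.end_ns < b.end_ns ? a.end_ns : b.end_ns;

    return earliest_end > latest_start ? earliest_end - latest_start : 0;
}
```

Passing the window of the first kernel and the window of the second transfer to overlap_ns would give 0 for the behaviour described above, and a positive value if the transfer really ran while the kernel was still executing.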