I have an NVIDIA GPU with Compute Capability 3.0, so it should support up to 16 concurrent kernels. I launch 10 kernels by calling clEnqueueNDRangeKernel 10 times in a loop, with each kernel enqueued on a different command queue. How can I tell whether the kernels are actually executing concurrently?

One approach I have thought of is to record the time before and after each clEnqueueNDRangeKernel call. I would probably have to use events to make sure each kernel has actually finished executing. But I still suspect that the loop launches the kernels sequentially. Is this the right way to launch concurrent kernels?
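To make the timing idea concrete, here is a rough sketch of the check I have in mind. It assumes each queue is created with CL_QUEUE_PROFILING_ENABLE and that I have already read each kernel's start and end timestamps from its event via clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END). The names `kernel_span` and `max_overlap` are hypothetical helpers of my own, not part of OpenCL, and the OpenCL calls themselves are left out so the snippet compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: one kernel's execution interval in nanoseconds,
   assumed to be filled in from the kernel event's profiling info. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} kernel_span;

/* Returns the largest number of spans that are running at the same instant.
   The maximum is always reached at the start of some span, so it is enough
   to count the running spans at each start time. */
size_t max_overlap(const kernel_span *spans, size_t n)
{
    size_t best = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t running = 0;
        for (size_t j = 0; j < n; ++j) {
            /* Span j is running when span i starts. */
            if (spans[j].start_ns <= spans[i].start_ns &&
                spans[i].start_ns < spans[j].end_ns) {
                running++;
            }
        }
        if (running > best) {
            best = running;
        }
    }
    return best;
}
```

If this returns 1 for my 10 kernels, their execution intervals never overlapped. Anything greater than 1 would suggest that at least some of them ran at the same time, assuming the device timestamps are comparable across queues.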

Also, what happens if I launch more than 16 kernels (say 20)? Will they be executed in batches of 16, i.e. the first 16 in one batch and the remaining 4 in the next?