dbs2 wrote:
Queues should be asynchronous. On Mac OS X, at any rate, the non-blocking commands will return virtually instantly while the blocking ones will wait. If Nvidia's driver is not working that way then they have a serious performance bug.

You're right. I had the opportunity to run my program on a MacBook, and the non-blocking memory commands returned immediately there. With the Nvidia implementation, however, the blocking parameter seems to be ignored...
Kernel computation and data transfers weren't overlapped on the MacBook either, though that may just be because it didn't have a dedicated graphics card.
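In case it helps anyone reproduce this, here is a rough sketch of how overlap could be checked. It assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that the start and end timestamps of the transfer event and the kernel event were read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END). The `cmd_window` type and both function names are made up for illustration and are not part of OpenCL. The timestamps are passed in as plain integers so the snippet compiles without the OpenCL headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device-side execution window of one command, in nanoseconds.
 * Hypothetical helper type: in a real program the two fields would be
 * filled from clGetEventProfilingInfo using CL_PROFILING_COMMAND_START
 * and CL_PROFILING_COMMAND_END. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} cmd_window;

/* Number of nanoseconds during which both commands were executing.
 * Returns 0 if one finished before (or exactly when) the other started. */
uint64_t overlap_ns(cmd_window a, cmd_window b)
{
    uint64_t start = a.start_ns > b.start_ns ? a.start_ns : b.start_ns;
    uint64_t end   = a.end_ns   < b.end_ns   ? a.end_ns   : b.end_ns;
    return end > start ? end - start : 0;
}

/* True if the two execution windows share any time at all. */
bool commands_overlapped(cmd_window a, cmd_window b)
{
    return overlap_ns(a, b) > 0;
}
```

If the transfer and the kernel are enqueued on the same in-order queue they will always come out serialized, so for this check to mean anything they would have to be on two separate queues (or on an out-of-order queue).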
dbs2 wrote:
A lot of cards have DMA engines that can support this, but I don't know of any vendors that are actually using this.
What exactly do you mean by "using"? Do you mean in terms of OpenCL or in general? If the cards have DMA engines, why shouldn't they be used?