
Thread: overlapping data transfers and kernel execution

  1. #1
Member, Scotland (joined Nov 2009, 72 posts)

    overlapping data transfers and kernel execution

    Hi there,

    I thought that by creating two queues for one device, it would be possible to overlap data transfers and kernel execution.
    An example:
    I transfer the data for the first kernel to the device. Once it's there, I start the first kernel. In the meantime I transfer the data for the second kernel to the device (to a different part of the device memory, of course).
    In theory, it should be possible to overlap the execution of the first kernel with the data transfer for the second kernel, right?

    I wrote a small test program and used profiling to see when the commands are executed. But even if the first kernel is running for quite a while, the second data transfer to the device only starts when the kernel execution has finished. Is that a limitation of the hardware? Or of the Nvidia implementation? Or am I missing something?

    Cheers
    Dominik
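The two-queue pattern described above could be sketched roughly like this (a fragment, not from the original post; `ctx`, `dev`, the kernels, buffers, host pointers, and sizes are assumed to already exist):

```c
/* Sketch: two in-order queues on one device, hoping the runtime
 * overlaps the transfer on q2 with the kernel running on q1.
 * Error checking omitted; all names are placeholders. */
cl_int err;
cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);
cl_command_queue q2 = clCreateCommandQueue(ctx, dev, 0, &err);

/* 1. Transfer input for kernel 1, then run it, on queue 1. */
clEnqueueWriteBuffer(q1, buf1, CL_FALSE, 0, size1, host1, 0, NULL, NULL);
clEnqueueNDRangeKernel(q1, kernel1, 1, NULL, &global, NULL, 0, NULL, NULL);

/* 2. Meanwhile, transfer input for kernel 2 on queue 2
 *    (into a different buffer, so there is no dependency). */
clEnqueueWriteBuffer(q2, buf2, CL_FALSE, 0, size2, host2, 0, NULL, NULL);
clEnqueueNDRangeKernel(q2, kernel2, 1, NULL, &global, NULL, 0, NULL, NULL);

clFinish(q1);
clFinish(q2);
```

Whether the write on q2 actually overlaps with the kernel on q1 is, as discussed below, entirely up to the implementation; the two queues only make the overlap *possible*.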

  2. #2

    Re: overlapping data transfers and kernel execution

    I just realized that enqueue commands apparently aren't executed asynchronously (i.e. non-blocking): a clEnqueueWriteBuffer() call takes the same time whether blocking is set or not, and enqueuing a kernel execution also only seems to return after the kernel has finished...

    I'm using the NVidia SDK, so I guess it's a limitation of that. Has anyone else had the same problem?

    Thanks
    Dominik
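For reference, the blocking behaviour is controlled by the third argument to clEnqueueWriteBuffer(). A non-blocking call should return as soon as the command is enqueued (a sketch; `queue`, `buf`, `host_ptr`, and `size` are placeholders):

```c
/* Blocking write: returns only after the data has been copied. */
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);

/* Non-blocking write: should return immediately. host_ptr must stay
 * valid until the command completes, so wait on the event (or call
 * clFinish) before reusing or freeing it. */
cl_event ev;
clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, size, host_ptr, 0, NULL, &ev);
/* ... do other host-side work here ... */
clWaitForEvents(1, &ev);
clReleaseEvent(ev);
```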

  3. #3

    Re: overlapping data transfers and kernel execution

    What "execution ordering" did you specify when you created your command queue (clCreateCommandQueue)? That is, did you pass CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE or not?
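For reference, the flag goes in the properties argument of clCreateCommandQueue() (a sketch; `ctx` and `dev` are assumed to exist, and CL_QUEUE_PROFILING_ENABLE is OR'ed in since the original poster was profiling):

```c
cl_int err;
cl_command_queue queue = clCreateCommandQueue(
    ctx, dev,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE,
    &err);
/* Without CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE the queue is in-order:
 * each command implicitly waits for the previous one to complete. */
```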

  4. #4

    Re: overlapping data transfers and kernel execution

    I didn't specify anything. As I understand it, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE means that commands WITHIN a queue can be reordered. What I want, however, is for commands in DIFFERENT queues to overlap, while commands within a single queue are still executed in order.

  5. #5
Senior Member, Northern Europe (joined Jul 2009, 311 posts)

    Re: overlapping data transfers and kernel execution

    Queues should be asynchronous. On Mac OS X, at any rate, the non-blocking commands will return virtually instantly while the blocking ones will wait. If Nvidia's driver is not working that way then they have a serious performance bug. :(

    You should certainly be able to get data movement and computation to be scheduled together by using two queues, but whether the runtime will actually overlap them depends entirely on the implementation. A lot of cards have DMA engines that can support this, but I don't know of any vendors that are actually using this. If you use an out-of-order queue, the runtime should be able to do the same thing.

  6. #6

    Re: overlapping data transfers and kernel execution

    Quote Originally Posted by dbs2
    Queues should be asynchronous. On Mac OS X, at any rate, the non-blocking commands will return virtually instantly while the blocking ones will wait. If Nvidia's driver is not working that way then they have a serious performance bug.
    You're right. I had the opportunity to run my program on a MacBook, and non-blocking memory commands returned immediately there. With the Nvidia implementation, however, the blocking parameter seems to be ignored...
    That said, kernel computation and data transfers weren't overlapped on the MacBook either, but that may just be because it didn't have a dedicated graphics card.

    Quote Originally Posted by dbs2
    A lot of cards have DMA engines that can support this, but I don't know of any vendors that are actually using this.
    What exactly do you mean by "using"? Do you mean in terms of OpenCL, or in general? If the cards have DMA engines, why shouldn't they be used?

  7. #7

    Re: overlapping data transfers and kernel execution

    What I mean by not using DMA engines is that the only way that I'm aware of to overlap compute and transfer on current generation cards is to use one of the DMA engines on the card to do the transfer while the kernel is running. There are, unfortunately, a lot of limitations on how these can be used since they were really designed for efficient graphics. I don't know of any implementations today that use them to allow you to overlap transfers and compute. If the Nvidia OpenCL driver is so broken as to not allow non-blocking commands they obviously aren't doing this.

    However, you may not need to have separate queues to future-proof your design. If you have an out-of-order command queue, the runtime should be free to optimize the scheduling of the commands as best it can. So your best bet would be to just try to use an out-of-order queue if it's available, and hope the runtime does the right thing. (I.e., if it doesn't do the right thing with an out-of-order queue, I doubt having two queues is going to make a difference.)
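With a single out-of-order queue, any ordering that the queue no longer guarantees has to be expressed through events instead. A sketch of the same workload as before (placeholder names, not from the original post):

```c
/* One out-of-order queue: the runtime may reorder and overlap
 * commands, so each kernel explicitly waits on its own transfer. */
cl_event w1, w2;
clEnqueueWriteBuffer(q, buf1, CL_FALSE, 0, size1, host1, 0, NULL, &w1);
clEnqueueWriteBuffer(q, buf2, CL_FALSE, 0, size2, host2, 0, NULL, &w2);

/* kernel1 depends only on the first transfer, kernel2 only on the
 * second, so the runtime is free to overlap kernel1 with the second
 * transfer if the hardware supports it. */
clEnqueueNDRangeKernel(q, kernel1, 1, NULL, &global, NULL, 1, &w1, NULL);
clEnqueueNDRangeKernel(q, kernel2, 1, NULL, &global, NULL, 1, &w2, NULL);

clFinish(q);
clReleaseEvent(w1);
clReleaseEvent(w2);
```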

  8. #8

    Re: overlapping data transfers and kernel execution

    OK, I see. Thanks a lot for your reply!

  9. #9

    Re: overlapping data transfers and kernel execution

    I believe part of the issue is that the path of least resistance to trying to overlap transfers and execution is to create two command queues to a single device. Then spawn a CPU thread to issue commands to each queue. Out of order command queues, while clever, are much more difficult to program in the host code.

  10. #10

    Re: overlapping data transfers and kernel execution

    I agree that using two queues, each driven by its own CPU thread, is probably a safe approach, because it doesn't rely on the OpenCL implementation supporting things like non-blocking reads/writes.
    However, compared to CUDA, I think the OpenCL approach of using queues, and thereby avoiding the need for multiple CPU threads (e.g. when using more than one device), is quite elegant. It actually makes the host code easier to write, because you don't have to worry about synchronization etc.


