
Thread: non-blocking call to clEnqueueNDRangeKernel?

  1. #1
    Junior Member
    Join Date
    Aug 2009
    Posts
    9

    non-blocking call to clEnqueueNDRangeKernel?

    Hi!

    With Nvidia's current OpenCL release, the clEnqueueNDRangeKernel function blocks the calling thread until the kernel has completed. The OpenCL specification is not explicit about the blocking/non-blocking behavior of clEnqueueNDRangeKernel, but I think a non-blocking version would be much more useful. So my question is whether clEnqueueNDRangeKernel is meant to be non-blocking (in which case the Nvidia implementation is buggy), or whether the implementor may choose blocking or non-blocking behavior (in which case the Nvidia implementation is less useful, but valid). Can somebody involved in the OpenCL specification please comment on that?
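
    For reference, this is the kind of pattern I would expect to be possible (just a sketch; 'queue' and 'kernel' stand in for an already set-up program):

    Code:
    #include <CL/cl.h>

    /* Sketch of the pattern I'd expect: enqueue the kernel, do unrelated
       host work while it runs, then wait explicitly.  'queue' and 'kernel'
       are assumed to be set up already. */
    void launch_and_overlap(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global_size = 1024 * 1024;
        cl_event done;
        cl_int err;

        /* Should return as soon as the command is queued... */
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &global_size, NULL, 0, NULL, &done);
        if (err != CL_SUCCESS)
            return;
        clFlush(queue);             /* make sure the command is submitted */

        /* ...so the host could do other work here while the kernel runs. */

        clWaitForEvents(1, &done);  /* explicit synchronisation point */
        clReleaseEvent(done);
    }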

    Thanks & kind regards,
    Markus

  2. #2
    Senior Member
    Join Date
    Sep 2002
    Location
    Santa Clara
    Posts
    105

    Re: non-blocking call to clEnqueueNDRangeKernel?

    clEnqueueNDRangeKernel only enqueues a kernel for execution on a device. There is no requirement that the enqueued kernel finish execution before the call returns. The clEnqueue prefix should be the hint.

  3. #3
    Junior Member
    Join Date
    Aug 2009
    Posts
    9

    Re: non-blocking call to clEnqueueNDRangeKernel?

    Quote Originally Posted by affie
    clEnqueueNDRangeKernel only enqueues a kernel for execution on a device. There is no requirement that the enqueued kernel finish execution before the call returns. The clEnqueue prefix should be the hint.
    That's what I expected (although, for example, the read/write/map buffer functions have an explicit blocking option), so I was surprised to see that the clEnqueueNDRangeKernel function returns only after the kernel has finished execution. This looks like a bug to me, or am I overlooking something?

    Kind regards,
    Markus

  4. #4
    Senior Member
    Join Date
    Sep 2002
    Location
    Santa Clara
    Posts
    105

    Re: non-blocking call to clEnqueueNDRangeKernel?

    The clEnqueue{Read|Write}* API calls have a blocking option because of the user data ptr specified in these calls. The blocking / non-blocking flag controls when the user data ptr can be reused by the application for a write, or when the appropriate data becomes available in the user ptr for a read. clEnqueueNDRangeKernel does not need a blocking / non-blocking option.
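
    For example (a minimal sketch; 'queue', 'buf' and the host array are assumed to exist already):

    Code:
    #include <CL/cl.h>

    void read_examples(cl_command_queue queue, cl_mem buf)
    {
        float out[256];
        cl_event read_done;

        /* Blocking read: when this returns, 'out' already holds the data. */
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(out), out,
                            0, NULL, NULL);

        /* Non-blocking read: 'out' must not be touched until the event completes. */
        clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, sizeof(out), out,
                            0, NULL, &read_done);
        clWaitForEvents(1, &read_done);
        clReleaseEvent(read_done);
    }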

    An implementor is free to choose blocking or non-blocking behavior, but in general blocking behavior will not be a very performant way to do things, so I do not believe any implementation will consider this a real option. I suggest that you work with NVIDIA to resolve this issue.

  5. #5
    Junior Member
    Join Date
    Apr 2010
    Location
    Perth, WA
    Posts
    27

    Re: non-blocking call to clEnqueueNDRangeKernel?

    Quote Originally Posted by affie
    The clEnqueue{Read|Write}* API calls have a blocking option because of the user data ptr specified in these calls. The blocking / non-blocking flag controls when the user data ptr can be reused by the application for a write, or when the appropriate data becomes available in the user ptr for a read. clEnqueueNDRangeKernel does not need a blocking / non-blocking option.

    An implementor is free to choose blocking or non-blocking behavior, but in general blocking behavior will not be a very performant way to do things, so I do not believe any implementation will consider this a real option. I suggest that you work with NVIDIA to resolve this issue.
    Hi Markus,

    As per the suggestion above, did you get any feedback from NVidia on this issue? I think that the blocking nature of clEnqueueNDRangeKernel() is ridiculous as it stops a single CPU thread from scheduling concurrent execution of kernels on different devices. Interestingly, the oclSimpleMultiGPU sample supplied by NVidia suffers from this problem - it is trying to show off multi-GPU support, but achieves no concurrent computation because clEnqueueNDRangeKernel() blocks.
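
    To illustrate what I mean, this is roughly what I'd like a single host thread to be able to do (a sketch only; the queues and kernels, one per device, are assumed to already exist):

    Code:
    #include <CL/cl.h>

    /* With a blocking clEnqueueNDRangeKernel, device i+1 cannot start
       until device i has finished, so there is no concurrency. */
    void run_on_all_devices(cl_command_queue *queues, cl_kernel *kernels,
                            int num_devices, size_t global_size)
    {
        int i;
        for (i = 0; i < num_devices; ++i) {
            clEnqueueNDRangeKernel(queues[i], kernels[i], 1, NULL,
                                   &global_size, NULL, 0, NULL, NULL);
            clFlush(queues[i]);   /* submit without waiting */
        }
        for (i = 0; i < num_devices; ++i)
            clFinish(queues[i]);  /* wait for all devices at the end */
    }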

    I've posted to the NVidia forums about this here.

    Cheers,

    Dan
    Daniel Paull
    Real Engineers Think Bottom Up.

  6. #6
    Junior Member
    Join Date
    Aug 2009
    Posts
    9

    Re: non-blocking call to clEnqueueNDRangeKernel?

    Quote Originally Posted by monarodan
    As per the suggestion above, did you get any feedback from NVidia on this issue?
    Since I currently don't have access to Nvidia hardware, I have not investigated this further. However, I'm looking forward to receiving a Fermi card as soon as it becomes available (whenever that will be) and will check again then.

    Kind regards,
    Markus

  7. #7

    Re: non-blocking call to clEnqueueNDRangeKernel?

    In my experience with the NVidia implementation, the best way to achieve parallelism across devices is to launch a CPU thread for each device. This worked and scaled very well for me on an S1070 (a 1U 4-GPU server). The tricky part was that NVidia and I at first didn't agree on what thread safety meant. The NVidia implementation crashes if multiple host threads access the same cl_context (though the standard says this should be possible).
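
    Roughly the structure I ended up with (a sketch using pthreads; error checking and kernel set-up are omitted):

    Code:
    #include <pthread.h>
    #include <CL/cl.h>

    typedef struct {
        cl_command_queue queue;
        cl_kernel        kernel;
        size_t           global_size;
    } worker_args;

    static void *device_worker(void *p)
    {
        worker_args *a = (worker_args *)p;
        clEnqueueNDRangeKernel(a->queue, a->kernel, 1, NULL,
                               &a->global_size, NULL, 0, NULL, NULL);
        clFinish(a->queue);   /* a blocking enqueue only stalls this thread */
        return NULL;
    }

    void run_threaded(worker_args *args, int num_devices)
    {
        pthread_t threads[16];   /* assumes num_devices <= 16 */
        int i;
        for (i = 0; i < num_devices; ++i)
            pthread_create(&threads[i], NULL, device_worker, &args[i]);
        for (i = 0; i < num_devices; ++i)
            pthread_join(threads[i], NULL);
    }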

    I'm curious whether you see the same behavior on Fermi. I know from talking with an NVidia employee that the only way to take advantage of Fermi's concurrent kernel execution is if all those kernels belong to the same context.

  8. #8
    Junior Member
    Join Date
    Apr 2010
    Location
    Perth, WA
    Posts
    27

    Re: non-blocking call to clEnqueueNDRangeKernel?

    Quote Originally Posted by coleb
    In my experience with the NVidia implementation, the best way to achieve parallelism across devices is to launch a CPU thread for each device. This worked and scaled very well for me on an S1070 (a 1U 4-GPU server). The tricky part was that NVidia and I at first didn't agree on what thread safety meant. The NVidia implementation crashes if multiple host threads access the same cl_context (though the standard says this should be possible).

    I'm curious whether you see the same behavior on Fermi. I know from talking with an NVidia employee that the only way to take advantage of Fermi's concurrent kernel execution is if all those kernels belong to the same context.
    I've taken the same approach of having a host thread per device, and this is working well, though it does introduce some overheads.

    My concern is with thread safety. I am using a single context with multiple devices (currently 3 devices) and have not had problems with multiple threads accessing the OpenCL API (yet) with v3.0 of the CUDA SDK and driver version 197.16. However, what I am doing is strictly in violation of the OpenCL 1.0 spec and may not be portable! In Appendix A (A.2) the spec states:

    The OpenCL implementation is thread-safe for API calls that create, retain and release objects
    such as a context, command-queue, program, kernel and memory objects. OpenCL API calls
    that queue commands to a command-queue or change the state of OpenCL objects such as
    command-queue objects, memory objects, program and kernel objects are not thread-safe.
    If you follow the spec, then you should really serialise access by host threads to all OpenCL methods except for "API calls that create, retain and release objects". This means that you should only have one host thread calling clEnqueueNDRangeKernel() at a time. But, since that method blocks on the NVidia implementation, you would hold the lock for the duration of the calculation. The result being no concurrent execution of kernels across devices!
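
    To illustrate, something like this is what strict compliance seems to force (a sketch only; the pthreads mutex is just my choice of locking primitive):

    Code:
    #include <pthread.h>
    #include <CL/cl.h>

    static pthread_mutex_t enqueue_lock = PTHREAD_MUTEX_INITIALIZER;

    void serialised_enqueue(cl_command_queue queue, cl_kernel kernel,
                            size_t global_size)
    {
        pthread_mutex_lock(&enqueue_lock);
        /* If this call blocks until the kernel completes, the mutex is held
           for the whole kernel run and the other devices sit idle. */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
        clFlush(queue);
        pthread_mutex_unlock(&enqueue_lock);
    }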

    I have anecdotal evidence that it is ok to call the various clEnqueue* commands concurrently from host threads for different command queues. I have also shown that calling clEnqueueNDRangeKernel() concurrently from two threads for the same queue results in crashes and the like.

    I think that the spec should say a lot more about thread safety, otherwise I have no idea if my code is going to be portable across different OpenCL implementations.

    Perhaps to give NVidia a nudge in the right direction, the spec states the following in 3.2.1:

    The command-queue schedules commands for execution on a device. These execute
    asynchronously between the host and the device.
    Given that, isn't a blocking clEnqueueNDRangeKernel() implementation in violation of the spec? Do NVidia reps read these forums, or should I go post this to their message boards? I'd really like to know NVidia's stance on this matter.

    Cheers,

    Dan
    Daniel Paull
    Real Engineers Think Bottom Up.

  9. #9

    Re: non-blocking call to clEnqueueNDRangeKernel?

    Quote Originally Posted by monarodan
    I think that the spec should say a lot more about thread safety, otherwise I have no idea if my code is going to be portable across different OpenCL implementations.
    The next version of the spec (OpenCL 1.1) has a lot more detail in the thread safety section. Since we're a new member of the Khronos Group, I'm not sure how much more I'm allowed to say.

    Quote Originally Posted by monarodan
    Given that, isn't a blocking clEnqueueNDRangeKernel() implementation in violation of the spec? Do NVidia reps read these forums, or should I go post this to their message boards? I'd really like to know NVidia's stance on this matter.
    I'm working with an NVidia engineer on my current project. His suggestion was "use separate contexts for each host thread/device". I had to entirely re-write how my multi-threaded code worked to get around this, since it was crashing the NVidia OpenCL beta, though I haven't gone back and tried this branch of code on the released 3.0 driver.
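
    In outline, that set-up looks something like this (a sketch; device discovery and error handling are omitted):

    Code:
    #include <CL/cl.h>

    /* One context and one queue per device; each host thread then only
       ever touches its own context. */
    typedef struct {
        cl_context       context;
        cl_command_queue queue;
    } per_device;

    void setup_per_device(cl_device_id *devices, int num_devices, per_device *out)
    {
        int i;
        for (i = 0; i < num_devices; ++i) {
            cl_int err;
            out[i].context = clCreateContext(NULL, 1, &devices[i],
                                             NULL, NULL, &err);
            out[i].queue   = clCreateCommandQueue(out[i].context, devices[i],
                                                  0, &err);
        }
    }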
