
View Full Version : non-blocking call to clEnqueueNDRangeKernel?



grabner
08-04-2009, 03:00 AM
Hi!

With Nvidia's current OpenCL release, the clEnqueueNDRangeKernel function blocks the calling process until the kernel has completed. The OpenCL specification is not explicit about the blocking/non-blocking behavior of clEnqueueNDRangeKernel, but I think a non-blocking version would be much more useful. So my question is whether the clEnqueueNDRangeKernel function is meant to be non-blocking (in which case the Nvidia implementation is buggy), or whether the implementor may choose blocking or non-blocking behavior (in which case the Nvidia implementation is less useful, but valid). Can somebody involved in the OpenCL specification please comment on that?

Thanks & kind regards,
Markus

affie
08-04-2009, 09:01 PM
clEnqueueNDRangeKernel only enqueues a kernel for execution on a device. There is no requirement that the kernel being enqueued for execution must finish execution before returning. The prefix clEnqueue should be the hint.
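To make the enqueue semantics concrete, here is a minimal host-side sketch (not from the thread; the queue, kernel, and work size are assumed to have been created earlier) of how a non-blocking enqueue is meant to be used, with an event providing the explicit synchronization point:

```c
/* Sketch, assuming `queue`, `kernel`, and `global_size` were set up earlier.
 * Per the clEnqueue prefix, the enqueue call should return as soon as the
 * command is queued, not when the kernel finishes. */
#include <CL/cl.h>

cl_int launch_and_wait(cl_command_queue queue, cl_kernel kernel,
                       size_t global_size)
{
    cl_event done;
    cl_int err;

    /* Should not block: the kernel merely enters the command queue. */
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                 &global_size, NULL, 0, NULL, &done);
    if (err != CL_SUCCESS)
        return err;

    /* ... the host is free to do other work here ... */

    /* Explicit synchronization point, only when the result is needed. */
    err = clWaitForEvents(1, &done);
    clReleaseEvent(done);
    return err;
}
```

On a blocking implementation such as the one described above, the "other work" section never overlaps with kernel execution, which is exactly the complaint in this thread.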

grabner
08-05-2009, 04:30 AM
clEnqueueNDRangeKernel only enqueues a kernel for execution on a device. There is no requirement that the kernel being enqueued for execution must finish execution before returning. The prefix clEnqueue should be the hint.
That's what I expected (although, for example, the read/write/map buffer functions have an explicit blocking option), so I was surprised to see that the clEnqueueNDRangeKernel function returns only after the kernel has finished execution. Seems like a bug to me, or am I overlooking something?

Kind regards,
Markus

affie
08-05-2009, 04:44 PM
The clEnqueue{Read|Write}* API calls have a blocking option because of the user data ptr specified in these calls. The blocking / non-blocking flag controls when the user data ptr can be reused by the application for a write, or when the appropriate data becomes available in the user ptr for a read. clEnqueueNDRangeKernel does not need a blocking / non-blocking option.

An implementor is free to choose blocking or non-blocking behavior, but in general blocking behavior will not be a very performant way to do things, so I do not believe any implementation will consider this a real option. I suggest that you work with NVIDIA to resolve this issue.
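To illustrate the point about the blocking flag on the read/write calls (all names here are hypothetical; `queue` and `dev_buf` are assumed to exist already), the same read call behaves very differently depending on the third argument:

```c
/* Sketch: the blocking_read flag controls when host_buf may be touched. */
#include <CL/cl.h>

void read_both_ways(cl_command_queue queue, cl_mem dev_buf,
                    void *host_buf, size_t nbytes)
{
    /* Blocking read: when this call returns, host_buf holds the data. */
    clEnqueueReadBuffer(queue, dev_buf, CL_TRUE, 0, nbytes,
                        host_buf, 0, NULL, NULL);

    /* Non-blocking read: the call returns immediately; host_buf must not
     * be read (or reused) until the event completes. */
    cl_event evt;
    clEnqueueReadBuffer(queue, dev_buf, CL_FALSE, 0, nbytes,
                        host_buf, 0, NULL, &evt);
    clWaitForEvents(1, &evt);
    clReleaseEvent(evt);
}
```

Kernel arguments, by contrast, are captured at enqueue time, which is why clEnqueueNDRangeKernel has no such flag.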

monarodan
05-04-2010, 11:20 PM
The clEnqueue{Read|Write}* API calls have a blocking option because of the user data ptr specified in these calls. The blocking / non-blocking flag controls when the user data ptr can be reused by the application for a write, or when the appropriate data becomes available in the user ptr for a read. clEnqueueNDRangeKernel does not need a blocking / non-blocking option.

An implementor is free to choose blocking or non-blocking behavior, but in general blocking behavior will not be a very performant way to do things, so I do not believe any implementation will consider this a real option. I suggest that you work with NVIDIA to resolve this issue.

Hi Markus,

As per the suggestion above, did you get any feedback from NVidia on this issue? I think that the blocking nature of clEnqueueNDRangeKernel() is ridiculous as it stops a single CPU thread from scheduling concurrent execution of kernels on different devices. Interestingly, the oclSimpleMultiGPU sample supplied by NVidia suffers from this problem - it is trying to show off multi-GPU support, but achieves no concurrent computation because clEnqueueNDRangeKernel() blocks.

I've posted to the NVidia forums about this here (http://forums.nvidia.com/index.php?showtopic=103476&st=0&gopid=1050604).

Cheers,

Dan

grabner
05-05-2010, 01:43 AM
As per the suggestion above, did you get any feedback from NVidia on this issue?
Since I currently don't have access to Nvidia hardware, I did not further investigate this. However, I'm looking forward to receiving a Fermi card as soon as it becomes available (whenever this will be :-) and will check again then.

Kind regards,
Markus

coleb
05-05-2010, 11:13 AM
In my experience with the NVidia implementation, the best way to achieve parallelism across devices is to launch a CPU thread for each device. This worked and scaled very well for me on an S1070 (1U 4 GPU server). The tricky part was that NVidia and I at first didn't agree on what thread safety meant. The NVidia implementation crashes if multiple host threads access the same cl_context (though the standard says this should be possible).

I'm curious whether you see the same behavior in Fermi. I know from talking with an NVidia employee that the only way to take advantage of Fermi's concurrent kernel execution is if all those kernels belong to the same context.
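The thread-per-device workaround described above can be sketched as follows (a hypothetical outline, assuming one command queue per device has already been created; error handling omitted). Each host thread owns one queue, so even a blocking clEnqueueNDRangeKernel only stalls its own thread:

```c
/* Sketch: one host thread per device, so a blocking enqueue on one
 * device does not prevent the other devices from being fed. */
#include <CL/cl.h>
#include <pthread.h>

typedef struct {
    cl_command_queue queue;   /* this device's queue (assumed created) */
    cl_kernel        kernel;  /* kernel to launch (assumed created)   */
    size_t           global_size;
} device_job;

static void *run_device(void *arg)
{
    device_job *job = (device_job *)arg;
    clEnqueueNDRangeKernel(job->queue, job->kernel, 1, NULL,
                           &job->global_size, NULL, 0, NULL, NULL);
    clFinish(job->queue);   /* wait for this device only */
    return NULL;
}

void run_all(device_job *jobs, int ndev)
{
    pthread_t tid[16];      /* assumes ndev <= 16 for this sketch */
    for (int i = 0; i < ndev; ++i)
        pthread_create(&tid[i], NULL, run_device, &jobs[i]);
    for (int i = 0; i < ndev; ++i)
        pthread_join(tid[i], NULL);
}
```

As discussed later in the thread, whether these queues may safely share one cl_context is exactly where the implementations and the 1.0 spec disagreed.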

monarodan
05-11-2010, 07:36 PM
In my experience with the NVidia implementation, the best way to achieve parallelism across devices is to launch a CPU thread for each device. This worked and scaled very well for me on an S1070 (1U 4 GPU server). The tricky part was that NVidia and I at first didn't agree on what thread safety meant. The NVidia implementation crashes if multiple host threads access the same cl_context (though the standard says this should be possible).

I'm curious whether you see the same behavior in Fermi. I know from talking with an NVidia employee that the only way to take advantage of Fermi's concurrent kernel execution is if all those kernels belong to the same context.

I've taken the same approach of having a host thread per device, and this is working well, though it does introduce some overheads.

My concern is with thread safety. I am using a single context with multiple devices (currently 3 devices) and have not had problems with multiple threads accessing the OpenCL API (yet) with v3.0 of the CUDA SDK and driver version 197.16. However, what I am doing is strictly in violation of the OpenCL 1.0 spec and may not be portable! In Appendix A (A.2) the spec states:


The OpenCL implementation is thread-safe for API calls that create, retain and release objects
such as a context, command-queue, program, kernel and memory objects. OpenCL API calls
that queue commands to a command-queue or change the state of OpenCL objects such as
command-queue objects, memory objects, program and kernel objects are not thread-safe.

If you follow the spec, then you should really serialise access by host threads to all OpenCL methods except for "API calls that create, retain and release objects". This means that you should only have one host thread calling clEnqueueNDRangeKernel() at a time. But since that method blocks on the NVidia implementation, you would hold the lock for the duration of the calculation, with the result that there is no concurrent execution of kernels across devices!

I have anecdotal evidence that it is ok to call the various clEnqueue* commands concurrently from host threads for different command queues. I have also shown that calling clEnqueueNDRangeKernel() concurrently from two threads for the same queue results in crashes and the like.
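The serialisation problem described above can be made concrete with a sketch (hypothetical names; not from the thread). If every clEnqueue* call must go through one lock, a blocking clEnqueueNDRangeKernel holds that lock for the entire kernel run:

```c
/* Sketch: enqueue calls treated as not thread-safe per Appendix A.2,
 * so they are guarded by a single mutex. */
#include <CL/cl.h>
#include <pthread.h>

static pthread_mutex_t cl_lock = PTHREAD_MUTEX_INITIALIZER;

cl_int locked_enqueue(cl_command_queue q, cl_kernel k, size_t gsz)
{
    pthread_mutex_lock(&cl_lock);
    /* If this call blocks until kernel completion, no other host thread
     * can enqueue on any other device in the meantime -- the devices end
     * up running one after another instead of concurrently. */
    cl_int err = clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz,
                                        NULL, 0, NULL, NULL);
    pthread_mutex_unlock(&cl_lock);
    return err;
}
```

With a genuinely non-blocking enqueue, the lock would be held only for the (short) time it takes to queue the command, and the devices could still overlap.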

I think that the spec should say a lot more about thread safety, otherwise I have no idea if my code is going to be portable across different OpenCL implementations.

Perhaps to give NVidia a nudge in the right direction, the spec states the following in 3.2.1:


The command-queue schedules commands for execution on a device. These execute
asynchronously between the host and the device.

Given that, isn't a blocking clEnqueueNDRangeKernel() implementation in violation of the spec? Do NVidia reps read these forums, or should I go post this to their message boards? I'd really like to know NVidia's stance on this matter.

Cheers,

Dan

coleb
05-12-2010, 09:21 AM
I think that the spec should say a lot more about thread safety, otherwise I have no idea if my code is going to be portable across different OpenCL implementations.


The next version of the spec (OpenCL 1.1) has a lot more detail in the thread safety section. Since we're a new member of the Khronos Group, I'm not sure how much more I'm allowed to say.



Given that, isn't a blocking clEnqueueNDRangeKernel() implementation in violation of the spec? Do NVidia reps read these forums, or should I go post this to their message boards? I'd really like to know NVidia's stance on this matter.


I'm working with an NVidia engineer on my current project. His suggestion was "use separate contexts for each host thread/device". I had to entirely re-write how my multi-threaded code worked to get around this, since it was crashing the NVidia OpenCL beta, though I haven't gone back and tried this branch of code on the released 3.0 driver.
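The "separate contexts per host thread/device" suggestion amounts to each thread building its own context and queue from a single device, so that no OpenCL object is ever shared between threads. A minimal sketch (hypothetical helper; error handling omitted):

```c
/* Sketch: called once per host thread, with that thread's device.
 * Everything created from *ctx_out stays private to the thread. */
#include <CL/cl.h>

cl_command_queue make_private_queue(cl_platform_id platform,
                                    cl_device_id dev,
                                    cl_context *ctx_out)
{
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0
    };
    cl_int err;
    *ctx_out = clCreateContext(props, 1, &dev, NULL, NULL, &err);
    return clCreateCommandQueue(*ctx_out, dev, 0, &err);
}
```

The trade-off, as noted earlier in the thread, is that buffers cannot be shared across contexts, and on Fermi concurrent kernel execution reportedly requires the kernels to belong to the same context.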

Klear
09-01-2011, 12:32 AM
I'm curious whether you see the same behavior in Fermi. I know from talking with an NVidia employee that the only way to take advantage of Fermi's concurrent kernel execution is if all those kernels belong to the same context.