
View Full Version : Device affinity for command queues and buffers seems at odds



monarodan
04-29-2010, 10:33 PM
Hi All,

To create a cl_mem object, one calls clCreateBuffer(), which takes a cl_context as an argument. I assume that this means that the cl_mem object has affinity with the cl_context used to create it and that it is an error to use it in any other context (the standard does not seem to state this explicitly). Since the cl_context was created with a set of cl_device_ids, I assume that it is valid to use a cl_mem object with any of the devices used to create the cl_context that was passed to clCreateBuffer().

To read data out of a cl_mem object, one uses the clEnqueueReadBuffer() method, which takes a cl_command_queue as an argument. A cl_command_queue is created for a specific cl_device_id. It seems very strange that I need to specify a device when reading from a cl_mem object, as it does not have device affinity.

This certainly lacks symmetry with creating a buffer with the CL_MEM_COPY_HOST_PTR flag as no device is passed to the clCreateBuffer() method. I've seen it said in other posts that the following are equivalent:



cl_mem buf = clCreateBuffer( context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size, ptr, NULL );




cl_mem buf = clCreateBuffer( context, CL_MEM_READ_ONLY, size, NULL, NULL );
clEnqueueWriteBuffer( queue, buf, CL_TRUE, 0, size, ptr, 0, NULL, NULL );


However, there is one distinction - the second case requires you to nominate a device (needed to create the queue) while the first does not!

Can someone clarify what is going on here? I am working within a context that has multiple devices and want to read data out of a cl_mem object using clEnqueueReadBuffer() - what device should the cl_command_queue that I use be associated with? Does it not matter?

Thanks in advance,

Dan

matrem
04-30-2010, 02:17 AM
... the standard does not seem to state this explicitly ...
CL_INVALID_CONTEXT is used for this case, and the specification explicitly says when it is raised. For example, for the clEnqueueReadBuffer command:

CL_INVALID_CONTEXT if the context associated with command_queue and buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.

matrem
04-30-2010, 02:29 AM
Memory objects are often cached on a device.

As I see it, there are several differences between merely creating the buffer (with a copy) and creating it and then enqueuing a write:
- the first is synchronous, while the second can be asynchronous;
- the second lets the driver cache the memory on the right device sooner; with the first, caching is more likely to happen only when the clEnqueueNDRangeKernel command executes.

So the second method gives the developer more freedom to optimize when caching happens, and allows the host to do something else while the write is in flight.
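The asynchronous variant described above might be sketched as follows. This is only an illustration; the queue, buffer, and pointer names are placeholders assumed to be set up as in the earlier snippets, and error handling is omitted.

```c
#include <CL/cl.h>

/* Sketch: a non-blocking write lets the host keep working while the driver
 * stages data on the target device. All names are illustrative. */
void write_async(cl_command_queue queue, cl_mem buf, size_t size, const void *ptr)
{
    cl_event write_done;
    clEnqueueWriteBuffer(queue, buf, CL_FALSE /* non-blocking */,
                         0, size, ptr, 0, NULL, &write_done);
    /* ... do other host work while the transfer is in flight ... */
    clWaitForEvents(1, &write_done); /* block only when the data must be ready */
    clReleaseEvent(write_done);
}
```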

monarodan
04-30-2010, 02:39 AM
... the standard does not seem to state this explicitly ...
CL_INVALID_CONTEXT is used for this case, and the specification explicitly says when it is raised. For example, for the clEnqueueReadBuffer command:

CL_INVALID_CONTEXT if the context associated with command_queue and buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.

My apologies - I restructured my sentences and that comment was left out of context! I meant to say that the standard doesn't seem to explicitly state that a buffer can be used on any device associated with a context (that is, that there is no device affinity). And yet, to copy data to/from the buffer I need to talk about a specific device.

I'm sure that in practice you generally queue up commands on a given device following a pattern along the lines of:

1) Copy from host
2) Execute kernel
3) Copy to host

And it just works out nicely. However, I still find it very strange that you cannot copy data to the host without queuing a command for a particular device.
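For reference, that three-step pattern on a single in-order queue might look something like the following sketch. All names are placeholders, `kernel` is assumed to already have `buf` bound as its argument, and error handling is omitted.

```c
#include <CL/cl.h>

/* The copy-in / execute / copy-out pattern against one in-order queue. */
void run_once(cl_command_queue queue, cl_kernel kernel, cl_mem buf,
              size_t size, const void *host_in, void *host_out, size_t global)
{
    /* 1) Copy from host */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, host_in, 0, NULL, NULL);
    /* 2) Execute kernel */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    /* 3) Copy to host */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_out, 0, NULL, NULL);
}
```

Because the queue is in-order, the three commands execute in sequence without any explicit event dependencies.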

david.garcia
05-07-2010, 03:51 PM
My apologies - I restructured my sentences and that comment was left out of context! I meant to say that the standard doesn't seem to explicitly state that a buffer can be used on any device associated with a context (that is, that there is no device affinity)

See the glossary on page 14:
Context: The environment within which the kernels execute and the domain in which
synchronization and memory management is defined.

See also Appendix A:

OpenCL memory objects, program objects and kernel objects are created using a context and can
be shared across multiple command-queues created using the same context. Event objects can be
created when a command is queued to a command-queue. These event objects can be shared
across multiple command-queues created using the same context.

monarodan
05-07-2010, 05:39 PM
See the glossary on page 14:
Context: The environment within which the kernels execute and the domain in which
synchronization and memory management is defined.


Fine, but this seems irrelevant. Consider the concept of thread-local storage - the allocations still happen in the context of the process's heap, but the memory has thread affinity.



See also Appendix A:

OpenCL memory objects, program objects and kernel objects are created using a context and can
be shared across multiple command-queues created using the same context. Event objects can be
created when a command is queued to a command-queue. These event objects can be shared
across multiple command-queues created using the same context.


To say "can be shared" is very, very weak. In what way can they be shared, and what of concurrent access or usage is allowed?

Back to the original question then - if memory objects do not have device affinity, why is there no function to copy a buffer from device memory to host memory without enqueuing a command for a specific device?

I'm not impressed that you had to quote from the glossary and appendix, rather than the standard proper, to try to answer my question. It seems that detail is being buried in the wrong places.

Cheers,

Dan

david.garcia
05-08-2010, 06:23 AM
I understand your frustration.


To say "can be shared" is very, very weak. In what way can they be shared, and what of concurrent access or usage is allowed?

That is defined in Appendix A. The quote I provided is only an excerpt.


Back to the original question then - if memory objects do not have device affinity, why is there no function to copy a buffer from device memory to host memory without enqueuing a command for a specific device?

Some device has to perform the data copy. OpenCL allows the application to choose any of the devices in the context to do the operation. Arguably this is better than leaving it up to the driver to decide which of the devices to use.


I'm not impressed that you had to quote from the glossary and appendix, rather than the standard proper

While the glossary is not normative, the appendix is.

monarodan
05-08-2010, 07:48 AM
To say "can be shared" is very, very weak. In what way can they be shared, and what of concurrent access or usage is allowed?

That is defined in Appendix A. The quote I provided is only an excerpt.


It is not defined nor discussed in any great detail in Appendix A. Perhaps there is little detail required as OpenCL does not promise much - to quote the last sentence in A.1, "The results of modifying a shared resource in one command-queue while it is being used by another command-queue are undefined."



Some device has to perform the data copy. OpenCL allows the application to choose any of the devices in the context to do the operation. Arguably this is better than leaving it up to the driver to decide which of the devices to use.


I find it strange that an OpenCL device would be performing the copy between global and host memory. I had assumed some sort of direct memory access transfer would be used.

I can imagine that providing a target device for a copy from host to device is useful as a hint as to which device is going to use the buffer, so preemptive caching in the device's physical memory may occur.

If some device has to perform the data copy, which one does it when you call clCreateBuffer() with CL_MEM_COPY_HOST_PTR? Why are you not forced to, or even allowed to, specify a device when using this flag? Why is there no symmetrical way to copy data from the device to the host? Something just isn't right with this API.

Cheers,

Dan

david.garcia
05-08-2010, 02:38 PM
Perhaps there is little detail required as OpenCL does not promise much - to quote the last sentence in A.1, "The results of modifying a shared resource in one command-queue while it is being used by another command-queue are undefined."

That sentence from the spec is stating something that should be expected anyway: modifying a resource in one queue while another queue is making use of it is going to cause trouble. The way to avoid any problems is by establishing dependencies between commands appropriately and by using clFlush() when there are dependencies across command queues.

As long as you use dependencies correctly, sharing resources between different command queues inside the same context is straightforward. I suggest searching the term "synchronization point" in the spec.
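A minimal sketch of that advice: share a buffer between two queues in the same context by expressing the dependency with an event and flushing the producer queue. The kernel and queue names are illustrative, and error handling is omitted.

```c
#include <CL/cl.h>

/* `producer` writes a buffer on queue q_a; `consumer` reads it on q_b.
 * The event expresses the dependency; clFlush() ensures the producer
 * command is actually submitted before the other queue waits on it. */
void produce_then_consume(cl_command_queue q_a, cl_command_queue q_b,
                          cl_kernel producer, cl_kernel consumer, size_t global)
{
    cl_event produced;
    clEnqueueNDRangeKernel(q_a, producer, 1, NULL, &global, NULL,
                           0, NULL, &produced);
    clFlush(q_a); /* flush when dependencies cross command queues */

    /* The consumer waits on the event rather than on clFinish(q_a). */
    clEnqueueNDRangeKernel(q_b, consumer, 1, NULL, &global, NULL,
                           1, &produced, NULL);
    clReleaseEvent(produced);
}
```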


I find it strange that an OpenCL device would be performing the copy between global and host memory. I had assumed some sort of direct memory access transfer would be used.

That will depend on each particular implementation. Remember that OpenCL serves a very wide range of computing devices.


If some device has to perform the data copy, which one does it when you call clCreateBuffer() with CL_MEM_COPY_HOST_PTR? Why are you not forced to, or even allowed to, specify a device when using this flag? Why is there no symmetrical way to copy data from the device to the host? Something just isn't right with this API.

Any standard API will be some sort of compromise of the alternatives suggested by multiple people from different companies. It is not possible to design an API or a language that will satisfy everybody.

Generally speaking, mapping memory objects into the host's address space and writing the data directly into the given pointer instead of copying it around will give better performance than using CL_MEM_COPY_HOST_PTR. This is only a general rule. YMMV.
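The map-based alternative mentioned above might be sketched like this, assuming a buffer created without CL_MEM_COPY_HOST_PTR; all names are placeholders and error handling is omitted.

```c
#include <CL/cl.h>
#include <string.h>

/* Map the buffer into the host address space, write into the returned
 * pointer directly, then unmap. Avoids the extra host-side copy that
 * CL_MEM_COPY_HOST_PTR implies. */
void fill_via_map(cl_command_queue queue, cl_mem buf, size_t size,
                  const void *src)
{
    cl_int err;
    void *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE /* blocking map */,
                                      CL_MAP_WRITE, 0, size, 0, NULL, NULL, &err);
    memcpy(mapped, src, size);
    clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);
}
```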

monarodan
05-08-2010, 07:08 PM
Perhaps there is little detail required as OpenCL does not promise much - to quote the last sentence in A.1, "The results of modifying a shared resource in one command-queue while it is being used by another command-queue are undefined."
That sentence from the spec is stating something that should be expected anyway: modifying a resource in one queue while another queue is making use of it is going to cause trouble.


Should be expected? Why is that? The spec could make this as tight as it likes.

The OpenCL spec gives a fair amount of detail on memory fences and barriers so it is well defined what happens when a memory object is concurrently accessed and mutated by multiple compute units. However, they decide to stop there and just leave cross-command queue synchronisation very loose. The best you can do for synchronisation across command queues is to stop and wait just in case. That being said, this has nothing to do with my original query.



As long as you use dependencies correctly, sharing resources between different command queues inside the same context is straightforward.


Agreed, this is indeed straightforward.



Any standard API will be some sort of compromise of the alternatives suggested by multiple people from different companies. It is not possible to design an API or a language that will satisfy everybody.

I'm not looking for satisfaction, merely explanation of why it is like it is. Who knows, my tirade on the non-symmetry in the API regarding reading and writing memory objects may lead to changes in the spec, or I might just be rehashing an age-old argument, or there might be a very good reason for why it is like it is.

I agree that design-by-committee is a less than ideal way to work, but every decision made should be justified and I would hope that those making the decisions are involved or represented in this community and would be willing to share such justifications here.

Cheers,

Dan

david.garcia
05-09-2010, 02:12 PM
That sentence from the spec is stating something that should be expected anyway: modifying a resource in one queue while another queue is making use of it is going to cause trouble.

Should be expected? Why is that?

It should be expected because the same requirement exists when you modify resources within a single command queue.


The OpenCL spec gives a fair amount of detail on memory fences and barriers so it is well defined what happens when a memory object is concurrently accessed and mutated by multiple compute units. However, they decide to stop there and just leave cross-command queue synchronisation very loose.

The only difference between inter-queue synchronization and intra-queue synchronization is that when you switch between queues you must use clFlush()/clFinish() to make sure that all commands are flushed. I agree that the spec could explain this in more detail.


I'm not looking for satisfaction, merely explanation of why it is like it is.

To be honest I don't remember this particular topic being discussed in the meetings; I don't think anybody saw this asymmetry as a big problem. Leaving aesthetics aside, do you see this being a real-world performance bottleneck? If performance is a concern I would suggest first looking at redesigning the application to avoid using CL_MEM_COPY_HOST_PTR and using clEnqueueMapBuffer() instead.

monarodan
05-09-2010, 05:42 PM
I'm not looking for satisfaction, merely explanation of why it is like it is.
To be honest I don't remember this particular topic being discussed in the meetings; I don't think anybody saw this asymmetry as a big problem. Leaving aesthetics aside, do you see this being a real-world performance bottleneck? If performance is a concern I would suggest first looking at redesigning the application to avoid using CL_MEM_COPY_HOST_PTR and using clEnqueueMapBuffer() instead.

What I was doing was wrapping OpenCL to some degree to fit the way I want to use it. When writing my wrapper of the buffer, I first wrote various constructors to match the methods of creation supported by OpenCL. Next I tried to write a method to retrieve buffer contents so I could do some simple testing. I was surprised to find that I could set the contents of the buffer without my class being coupled to a command queue (and, transitively, to a specific device), but could not read the contents. Further reading and reasoning just led to confusion. For example, with the only cross-device synchronisation available being mutual exclusion, it is difficult to think of a buffer as "shared" across devices, except in the case of a read-only buffer.

I am not currently using CL_MEM_COPY_HOST_PTR in my application and do not know if it represents a performance bottleneck (though I have some anecdotal evidence that CL_MEM_COPY_HOST_PTR is slow on NVidia hardware). I would, however, be in favor of removing CL_MEM_COPY_HOST_PTR from the spec and providing a utility method to create a buffer and perform a blocking copy from host memory instead. I think OpenCL needs something akin to GLU for OpenGL.
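The utility function proposed above might look like the following sketch. This is a hypothetical helper, not part of OpenCL or any real library; it simply makes the target queue (and hence the device) explicit when populating a new buffer.

```c
#include <CL/cl.h>
#include <stddef.h>

/* Hypothetical GLU-style helper: create a buffer in `ctx` and perform a
 * blocking copy of `size` bytes from `ptr`, staged via `queue`. Returns
 * NULL on failure. */
cl_mem create_buffer_with_data(cl_context ctx, cl_command_queue queue,
                               cl_mem_flags flags, size_t size, const void *ptr)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, flags, size, NULL, &err);
    if (err != CL_SUCCESS)
        return NULL;
    err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, size, ptr, 0, NULL, NULL);
    if (err != CL_SUCCESS) {
        clReleaseMemObject(buf);
        return NULL;
    }
    return buf;
}
```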

Personally, I feel that the real problem is with the command queue. What is the reasoning for a command queue being bound to a single device? I think that a command queue that is bound to the context (and hence, can contain commands that are bound to different devices) is much more flexible, simpler to use and would probably address my original issue as commands to copy memory to/from the host would not need to nominate a device.

Cheers,

Dan

david.garcia
05-09-2010, 06:45 PM
For example, with the only cross-device synchronisation available being mutual exclusion, it is difficult to think of a buffer as "shared" across devices, except in the case of a read-only buffer.

True, but you have the same problem trying to share a buffer between multiple NDRange kernel executions inside the same device. I.e., you can't. Different kernels must work on the same buffer with mutual exclusion. It would be very odd if it was allowed for multiple devices to modify a buffer simultaneously if we don't allow multiple kernels in the same device to do the same.


What is the reasoning for a command queue being bound to a single device?

Because it maps very well into how GPUs work and OpenCL is a low-level API. GPUs are asynchronous devices: there's a command FIFO between the host CPU and the GPU. The host puts commands in it and the GPU fetches them. That's basically what a CL command queue represents. For CPUs the story is different but it's preferable to have the same abstraction for both CPUs and GPUs.

Having a single queue sending commands to multiple devices would require the driver to do some kind of dynamic load balancing, which is nontrivial. In addition, multiple devices in the same context may have different capabilities, such as support for images, and kernels that may run on one device may not run on the other device.

OpenCL is close to the metal and, yes, that makes it a bit hard to use at times. The benefit is that it takes a lot of guesswork away from the driver and puts most of the control in the hands of the application. It will take some time for third party developers to create higher-level languages and abstractions on top of OpenCL that will make it easier to write portable and reasonably performant code.

Finally, in case there's any doubt: I am not speaking on behalf of the OpenCL working group. I don't even participate in the meetings nowadays.

monarodan
05-09-2010, 07:10 PM
For example, with the only cross-device synchronisation available being mutual exclusion, it is difficult to think of a buffer as "shared" across devices, except in the case of a read-only buffer.

True, but you have the same problem trying to share a buffer between multiple NDRange kernel executions inside the same device. I.e., you can't. Different kernels must work on the same buffer with mutual exclusion. It would be very odd if it was allowed for multiple devices to modify a buffer simultaneously if we don't allow multiple kernels in the same device to do the same.

I hope that you do not think that I am advocating change at this point, rather I am merely pointing out things that have either helped or hindered my understanding of the OpenCL API.

One question on what you've stated. Let's say that I have a NVidia Fermi card which, to my understanding, can concurrently execute multiple kernels on the one device. Are you suggesting that memory fences using CLK_GLOBAL_MEM_FENCE are not guaranteed to order reads and/or writes across the concurrently executing kernels?




What is the reasoning for a command queue being bound to a single device?

Because it maps very well into how GPUs work and OpenCL is a low-level API.

Sounds like a leaky abstraction to me!



Having a single queue sending commands to multiple devices would require the driver to do some kind of dynamic load balancing

I had not assumed that OpenCL would do any load balancing. Specifying which device a kernel would execute on would be part of the function call to enqueue a kernel (same goes for any other commands that require the programmer to nominate a device). Given that, wouldn't you agree that a command queue should be device agnostic? If a device-specific command queue is indeed a useful concept, then this could be written as a layer on top of the device-agnostic command queue. Again, a utility library akin to GLU would be useful here.

Cheers,

Dan

david.garcia
05-10-2010, 04:55 AM
Are you suggesting that memory fences using CLK_GLOBAL_MEM_FENCE are not guaranteed to order reads and/or writes across the concurrently executing kernels?

I cannot answer questions regarding hardware from other vendors. If the question is about standard OpenCL in general, the answer is: good question! I don't see any language in the spec talking explicitly about this scenario. Instead of guessing I'll try to bring this up with the group, but be aware that it will take time to get an answer.


Given that, wouldn't you agree that a command queue should be device agnostic?

I have trouble following you. What is a device-agnostic queue? One that is not bound to a particular device? Then, when you enqueue a command in this queue, what device runs it? Who chooses that device? How is explicitly selecting a device to run the command any different from having to explicitly select a queue, given that each queue is only associated with a single device?

david.garcia
05-10-2010, 05:22 AM
Now I realize I had misread section 3.3.1. I think it answers your question:


OpenCL uses a relaxed consistency memory model; i.e. the state of memory visible to a work-item
is not guaranteed to be consistent across the collection of work-items at all times.

Within a work-item memory has load / store consistency. Local memory is consistent across
work-items in a single work-group at a work-group barrier. Global memory is consistent across
work-items in a single work-group at a work-group barrier, but there are no guarantees of
memory consistency between different work-groups executing a kernel.

Memory consistency for memory objects shared between enqueued commands is enforced at a
synchronization point.

I will try to get some extra assurance from the group anyway since the text above only deals with execution barriers and not explicitly with memory fences.

monarodan
05-10-2010, 07:35 PM
I have trouble following you. What is a device-agnostic queue? One that is not bound to a particular device? Then, when you enqueue a command in this queue, what device runs it? Who chooses that device? How is explicitly selecting a device to run the command any different from having to explicitly select a queue, given that each queue is only associated with a single device?

Yes, by device agnostic I mean a queue that is not bound to one (and only one) device. I could have gone the other way and called it promiscuous as you could say that it services many devices.

I don't think the idea that a "device runs a queue" is right. Does the spec talk about who "pumps the queue"? I had expected that the OpenCL runtime, possibly executing on the host, actually enqueues, dequeues and executes commands. It just so happens that some commands execute code on the OpenCL devices' compute units. Perhaps this view of the world is completely incorrect? What I mean is that I would not expect the compute units on the device to be running code to pump the queue, yet OpenCL defines a Device to be nothing more than a collection of compute units, so what does it mean for a device to "run a queue"?

It should be noted that some commands have nothing to do with the OpenCL device. For example, clEnqueueWaitForEvents() is an interesting animal. All it does is wait for a list of events to be set before it sets its own event. This requires no interaction with the device associated with the queue in which the "wait for events" command is enqueued. In fact, the list of events that it waits on could be events of commands enqueued for other devices.

If the command queue were not bound to a device, then the signature of some API functions, such as clEnqueueNDRangeKernel() would have to change to include the target device to execute the kernel on as this is no longer implied by the queue. Relating this back to the original question - I wonder if the commands to copy memory would require a device to be specified when being enqueued...

I think I have a very different view of what the command queue represents. I feel that the command and events concept is very good, not just as a synchronisation mechanism, but rather, to capture dependencies between commands. I see enqueued OpenCL commands as forming a graph (well, strictly a DAG) through these dependencies and the OpenCL runtime is able to execute only those commands that have all of their dependencies satisfied - that is, all of the events that it waits on have been set. I find the declarative nature of merely stating dependencies to be a better approach than thinking procedurally as you must do with an in-order queue.
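Within the current API, that DAG view can be expressed on an out-of-order queue. The sketch below assumes hypothetical kernels `a` and `b` that are independent, plus a `join` kernel depending on both; error handling is omitted.

```c
#include <CL/cl.h>

/* On a queue created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, `a` and
 * `b` may run concurrently; `join` waits on both via its event wait list.
 * The events encode the edges of the dependency DAG. */
void run_dag(cl_command_queue ooo_queue, cl_kernel a, cl_kernel b,
             cl_kernel join, size_t global)
{
    cl_event done[2];
    clEnqueueNDRangeKernel(ooo_queue, a, 1, NULL, &global, NULL,
                           0, NULL, &done[0]);
    clEnqueueNDRangeKernel(ooo_queue, b, 1, NULL, &global, NULL,
                           0, NULL, &done[1]);
    /* `join` becomes runnable only once both predecessor events are set. */
    clEnqueueNDRangeKernel(ooo_queue, join, 1, NULL, &global, NULL,
                           2, done, NULL);
    clReleaseEvent(done[0]);
    clReleaseEvent(done[1]);
}
```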

I see no need for a "queue" concept at this fundamental level. There are just commands and dependencies. However, an in-order queue can be implemented trivially by making a linear list of commands, irrespective of the real dependencies. Other queues, such as a "priority queue", could also be layered on top of the DAG. The in-order and priority queues could be implemented in a utility library rather than in the OpenCL core API.

On a side note, the whole idea of out-of-order execution falls apart (at least on the current NVidia OpenCL implementation on Windows) because clEnqueueNDRangeKernel() is a blocking call. This makes it impossible for a single host thread to enqueue kernels for concurrent execution on multiple devices! I have resorted to having a host thread per device and my own work queues in the host to work around this. I would encourage the powers that be to change the OpenCL spec to state that clEnqueueNDRangeKernel() must not block (or add a "blocking" flag as per the memory transfer functions).

This discussion has been very interesting and made me go back to the OpenCL spec a number of times, and each time I seem to be getting more and more confused about the details. I guess some examples (just for reference - need not comment unless you feel compelled to) include:


- In 3.2.1 it is stated that, "The command-queue schedules commands for execution on a device." But earlier it defined the command queue as a data structure, which is passive - how can the command queue schedule anything? This leads to the question, who pumps the command queue?
- Still in 3.2.1 it is stated that, "These (commands) execute asynchronously between the host and the device." Yet clEnqueueNDRangeKernel() blocks on the NVidia platform and I have read in other threads that this behavior is not considered to be an error with respect to the spec.
- Again in 3.2.1 it is stated that, "It is possible to associate multiple queues with a single context. These queues run concurrently and independently with no explicit mechanisms within OpenCL to synchronize between them." Yet A.1 states that event objects can be shared across command queues - isn't this an "explicit mechanism" to synchronise between command queues?

On that last point, where the spec states, "It is possible to associate multiple queues with a single context", you get the feeling that the spec is leading you to not create multiple command queues, however, you must use multiple command queues if you have multiple devices in the context!

Cheers,

Dan

david.garcia
05-11-2010, 12:23 PM
I don't think the idea that a "device runs a queue" is right.

I should have said "what device runs the commands in the queue" (my bad).

Re. clEnqueueWaitForEvents() and related examples where the command doesn't really have any tie to a particular device, I think I agree with you that those APIs should probably apply to context objects instead of queue objects (or device-agnostic queues if you prefer).

Your reflection about the DAG of dependencies is spot-on :)

I am extremely surprised to hear that clEnqueueNDRangeKernel() is blocking on some implementations. I doubt that it will stay that way for long. The spec already says that certain calls are non-blocking, but there's no way to enforce it, so in practice it's like the spec didn't say anything. Market forces hopefully will take care of the problem.

In summary I think you make some very good points. I wish the group had your input two years ago.

monarodan
05-11-2010, 07:12 PM
I don't think the idea that a "device runs a queue" is right.
I should have said "what device runs the commands in the queue" (my bad).


I wasn't trying to nitpick - your language actually aligns quite well with what the spec says. In this regard, I think that the spec is not specific or detailed (or maybe consistent?) enough to form an understanding of the OpenCL design without having been involved in its inception.


Your reflection about the DAG of dependencies is spot-on :)
Excellent - if the powers that be agree, then perhaps we can push OpenCL in that direction for future releases.


I am extremely surprised to hear that clEnqueueNDRangeKernel() is blocking on some implementations.
I was extremely surprised by it too, and a little caught out; I had quite a clear picture of how I was going to use OpenCL (I am undertaking my first OpenCL project currently) but this threw a spanner in my works. I'm just about to post in another thread (http://www.khronos.org/message_boards/viewtopic.php?f=28&t=1990) about this, so tune in if you're interested.


In summary I think you make some very good points. I wish the group had your input two years ago.
Thanks David - that's awfully flattering. I'd be happy to contribute more formally than just venting in these forums, so perhaps pull me in if/when appropriate.

Cheers,

Dan

david.garcia
05-12-2010, 02:46 PM
Excellent - if the powers that be agree, then perhaps we can push OpenCL in that direction for future releases.

Sorry, but you lost me here. What do you mean by that? The dependencies already form a DAG and if your device supports out-of-order execution it may take advantage of the structure of the DAG to improve performance.


Thanks David - that's awfully flattering. I'd be happy to contribute more formally than just venting in these forums, so perhaps pull me in if/when appropriate.


Unfortunately that's not in my hands. You need to be a member of Khronos in order to contribute to the discussions for future specs. See http://www.khronos.org/members/. Look in particular at the academic membership if you think you might apply.

monarodan
05-12-2010, 05:15 PM
Excellent - if the powers that be agree, then perhaps we can push OpenCL in that direction for future releases.

Sorry, but you lost me here. What do you mean by that? The dependencies already form a DAG and if your device supports out-of-order execution it may take advantage of the structure of the DAG to improve performance.


Ummm... no, they form queues and this strange out-of-order concept is employed to allow independent commands to run concurrently. I'm suggesting:

1) That you call it "a DAG comprised of commands that have dependencies" rather than "queues comprised of commands and events". The specification and API should reflect this.
2) That the DAG be bound to the context rather than bound to a specific device. I would go as far as to say that the DAG is a singleton for the context (after all, since the DAG can have multiple roots, what would multiple DAGs give you over a single one?)
3) That queues of various sorts (FIFO, Priority, etc) can be implemented in terms of the DAG as a separate layer and do not form part of the OpenCL spec.

One motivation is to allow some level of extensibility in scheduling commands for execution. Currently there are two protocols (in-order and out-of-order) munged into the one API. Taking a more layered approach allows the API to remain clean and invariant as new scheduling policies are devised by the OpenCL community - the community would own the upper layers and they need not be standardised.

Perhaps to give an example, how could I possibly implement a priority-queue (a common concept in work queuing APIs) for OpenCL commands with the current OpenCL spec?

Cheers,

Dan

monarodan
05-12-2010, 06:43 PM
Thanks David - that's awfully flattering. I'd be happy to contribute more formally than just venting in these forums, so perhaps pull me in if/when appropriate.

Unfortunately that's not in my hands. You need to be a member of Khronos in order to contribute to the discussions for future specs. See http://www.khronos.org/members/. Look in particular at the academic membership if you think you might apply.

The Academic Contributor Membership is "for accredited academic institutions only", which I am not. The next level up is a Contributor Membership, which would set me back USD$10,000 annually. This is beyond the reach of most small businesses, including mine. So I guess the best I can do is make noise here and hope to influence someone.

I wonder how many would-be-contributors are excluded merely because of the high membership fees...

david.garcia
05-12-2010, 08:10 PM
Ummm... no, they form queues and this strange out-of-order concept is employed to allow independent commands to run concurrently.

Remember that most commands need to be executed on a particular device, hence the convenience of having queues tied to particular devices.

In-order queues are great while you are prototyping your application because they are very easy to use. Once you have the algorithm running and producing the desired results, then you can enable out-of-order execution, set up the command dependencies accordingly and get some performance improvements. There's nothing strange in OoO queues IMO.


how could I possibly implement a priority-queue (a common concept in work queuing APIs) for OpenCL commands with the current OpenCL spec?

Supporting priorities on GPUs is much more involved than you may think, and this has nothing to do with the architecture of OpenCL.


I wonder how many would-be-contributors are excluded merely because of the high membership fees

Many. On the other hand, too many cooks spoil the broth. It's difficult to find a balance.

monarodan
05-12-2010, 08:45 PM
Ummm... no, they form queues and this strange out-of-order concept is employed to allow independent commands to run concurrently.

Remember that most commands need to be executed on a particular device, hence the convenience of having queues tied to particular devices.

In-order queues are great while you are prototyping your application because they are very easy to use. Once you have the algorithm running and producing the desired results, then you can enable out-of-order execution, set up the command dependencies accordingly and get some performance improvements. There's nothing strange in OoO queues IMO.

I haven't suggested that the idea of in-order queues be dropped at all, nor that the API does not let you specify which device to execute a command on. I do think, however, that "convenience" should be built into higher layers.

I find it strange to define a DAG in terms of a queue, whilst I find it natural to define a queue in terms of a DAG. OpenCL is doing the former while I suggest moving to the latter.



I wonder how many would-be-contributors are excluded merely because of the high membership fees

Many. On the other hand, too many cooks spoil the broth. It's difficult to find a balance.

I'm not sure the best way of striking that balance is to just exclude those without deep pockets. I don't mind the W3C model, where concessions on fees are made based on the annual turnover of the member.

Cheers,

Dan

affie
05-12-2010, 10:23 PM
Hi Daniel,

This thread is quite long and I've tried to read the back & forth between you and David. Let me try to answer the question you raised in the original post which is - why does a command-queue need to be associated with a device and why do memory objects have no device affinity?

A command-queue is associated with a device because it is the mechanism used to dispatch commands for execution on that device. This way, an application has complete control over what work is done on what device. For a CPU OpenCL device, the command-queue typically maps to a pthread work-queue, or to a dispatch queue in GCD on Mac OS X. For GPUs, there is a command-processor in front which processes commands from the command-queue and queues them to the appropriate blocks.

As to why command-queues support both in-order & out-of-order execution: most GPUs today only support in-order queues, and an in-order queue is very easy for most developers to understand and use. However, it is certainly possible that a device may be able to read, write and copy memory objects in parallel with executing kernels - for instance, if the device has one or more DMA engines to do the read/write/copy operations. This is why the spec allows read, write & map operations to be blocking or non-blocking irrespective of the command-queue order.

Why are out-of-order queues supported (optionally)?
1) Provides the developer more control in maximizing performance, by allowing them to specify the dependencies which control when a command can be executed.

2) Really needed if you are enqueuing data-parallel (clEnqueueNDRangeKernel) and task-parallel (clEnqueueTask) kernels.

NOTE: As far as your comment on being surprised that clEnqueueNDRangeKernel on some implementations is blocking, I would suggest that you file a bug and work with the vendor in question to resolve this issue. It is certainly the intent of the spec and I know more than one implementation where this is not the case.

Now onto the question of why memory objects do not have a device affinity. This was a very long discussion in the working group, and I was one of the main proponents of not specifying a device affinity. As defined in CL, memory objects are associated with the context and can therefore be used by any device associated with that context. By doing this, an application does not have to worry about managing memory objects across devices.

The device memory can be viewed as just a cache, where only the memory objects needed by commands executing on a device need to be allocated. This way, the actual amount of physical memory available on the device does not limit how many memory objects you can create - it only limits the amount of memory needed by memory objects used by a command. For example, the host memory can be used as a backing store for memory objects instead.

In addition, managing memory objects across devices becomes simpler, in the sense that all the application has to do is ensure correct event dependencies between commands that use the same memory object and have been enqueued to queue A on device A and queue B on device B; the OpenCL implementation figures out how best to transfer data from the device which has the latest (dirty) copy to the device that needs it. For example, some devices may be able to do a direct PCIe-to-PCIe transfer that a user could not invoke directly. The user does not have to worry about making the appropriate copies depending on which devices will use the memory object - we move this responsibility to the OpenCL implementation. Since the device caches the data for the memory object, enqueuing a command to a device allows you (the app) to determine which device will have the latest bits.

In the example you give in the first post

cl_mem buf = clCreateBuffer( context, CL_MEM_READ_ONLY, size, 0, 0 );
clEnqueueWriteBuffer( queue, buf, true, 0, size, ptr, 0, 0, 0);

you ask which device the command queue should be associated with. In this case, it should be the device where the kernels that use this memory object are going to be enqueued for execution. As for copying data to the host, I recommend that you use clEnqueueMapBuffer to map the region you want to read or write instead of copying.

I will stop here as the response is already too long and hopefully I was able to answer a few of your questions.

monarodan
05-12-2010, 11:43 PM
Hello Affie,

Your explanation certainly covers many of the motivations behind the design, however, I still think it is very confusing.

Given your explanation, I find it very strange that there is no mechanism to transfer the data from a cl_mem object to the host without specifying a device. After all, the cl_mem is global to the context and the OpenCL runtime knows where the current version of the data is stored - can't it just give me the data?

Let's take a contrived example to try and show up potential problems - say I have two devices, A and B. I create a cl_mem object and enqueue a command on device A's command queue to copy data from the host to the cl_mem object. The data ends up in A's memory, but not in B's. I then enqueue a command on device B's queue to read the cl_mem object back into the host's memory. What happens? I expect that the OpenCL runtime will first copy the cl_mem data from A to B and then copy it from B to the host. Clearly it would have been more efficient to just copy directly from A... If the OpenCL runtime is smart enough and able to do the copy directly from A to the host (even though the command was queued on B's command queue) then we can deduce a couple of things:

1) It is irrelevant which command queue you use for commands that transfer data to the host.
2) Not all commands enqueued on a given command queue actually execute on the command queue's device.


This way, the actual amount of physical memory available on the device does not limit how many memory objects you can create - it only limits the amount of memory needed by memory objects used by a command. For example, the host memory can be used as a backing store for memory objects instead.

I do like this philosophy (out of interest, do any OpenCL implementations you know of work this way in practice?). However, this scheme seems contrary to the statement that, "Since the device caches the data for the memory object, enqueuing a command to a device allows you (the app) to determine which device will have the latest bits." There is no determinism if the OpenCL runtime is able to shuffle cl_mem objects around between devices or even out to host memory. If you truly think of device memory as a cache, then you have to expect evictions, cache misses and the lack of determinism that follows.

Let's take another contrived example. Say I want to run two different algorithms on the same data, and I happen to have two devices in my context. What I want to do is transfer the data from the host to the context and then execute two different kernels on two different devices. Which command queue should I use for the memory transfer? Does it matter? I expect the answer to be that it doesn't matter - you could use either command queue and the result should be the same. Kinda makes you think that being forced to pick a device is wrong...

I can understand that specifying which device a cl_mem object is expected to be used on when copying data from the host can lead to better performance, as eager caching of that data can happen on the device. However, I would consider this merely a hint that should be optional in the API rather than required. In the example above (one cl_mem object, two kernels), I would hint that both devices are expecting to use the cl_mem object, so eager caching on both devices may follow. How would you achieve this eager caching on multiple (let's say n out of m) devices with the current API?

Cheers,

Dan

inam
05-27-2010, 07:56 PM
All the calls that take a context are blocking. Non-blocking calls need to be enqueued on a queue, which is associated with a particular device. Suppose you want to issue a read command but don't want to wait for it, so that if the device has DMA capability it can start the transfer while you do other work on the main thread; by the time that other work is done and you actually need the data, it might already be there. You need an event back from the read command so that you can query whether the read has finished, and events are associated with particular queues, NOT the context. Only enqueueing calls return events, and they all take the queue on which to enqueue, associating the event with that queue. If you made the read command per context rather than per device (queue), it would have to be blocking, and you couldn't do anything on the main thread until the call was done.

Regarding your second contrived example ...
I assume you mean that you "read" from this mem object in your kernels on both devices. If you use CL_MEM_COPY_HOST_PTR when creating the mem object then yes, both devices will see the same data when their respective kernels execute. But if you issue a non-blocking write on one device, then it is the application's responsibility to take care of the cross-queue dependency: get the event back from clEnqueueWriteBuffer on one device and pass it in the event-wait list of clEnqueueNDRangeKernel on the other device. This gives great flexibility - reads/writes on one device and execution on one or both devices become asynchronous, freeing up the CPU for more useful work and giving the underlying scheduler the freedom to schedule however it wants for best efficiency, as long as it honours the dependencies expressed by cross-queue events. All this is possible because you make read/write "enqueued", i.e. non-blocking, and that is only possible because you enqueue the read/write on a queue/device. If you did it on a context, these operations would have to be blocking because, even though the mem object knows who the current dirty owner is, the context has no queue on which to issue a read/async DMA.

monarodan
05-27-2010, 08:34 PM
Regarding your second contrived example ...
I assume you mean that you "read" from this mem object in your kernels on both devices. If you use CL_MEM_COPY_HOST_PTR when creating the mem object then yes, both devices will see the same data when their respective kernels execute.

Let's change the scenario a little - let's say I have N devices and want to make a buffer available to M of those devices, where 1 < M < N. How, in the OpenCL API, can I ensure that all M devices (and only those M) have the buffer pushed to them eagerly? If I use CL_MEM_COPY_HOST_PTR then it seems that an OpenCL implementation will probably push it to all N devices, which is inefficient when M << N. If I use asynchronous commands then I am forced to either perform the copy M times, or perform the copy once (for one particular device) and leave the copying of buffers from one device to another up to OpenCL - no doubt this would hurt performance, as an OpenCL implementation would copy on demand rather than preemptively.

Perhaps to state my opinion more clearly - memory buffer operations in the context of a single device are not (in general) useful. My opinion is that the command queue should never have been associated with a particular device. Commands that pertain to a particular device (such as executing a kernel) should take the target device in the API call that enqueues the command. Some commands would benefit from a "hint" as to which device(s) to target - for example, a command to copy from the host to OpenCL could provide a list of devices (say, my M out of N above) that OpenCL can use to guide the way resources are allocated, caches are preemptively populated, etc.



All the calls that take a context are blocking. Non-blocking calls need to be enqueued on a queue, which is associated with a particular device.


Yes, I know. The tail end of your sentence is exactly the problem I am talking about.



... even though the mem object knows who the current dirty owner is, the context has no queue on which to issue a read/async DMA.


Right - case in point. Even though the context knows where the buffer is, I can't just ask the context to give it to me without nominating some device's queue to perform the copy or mapping to the host. That makes absolutely no sense and is the exact scenario that started this thread.

Cheers,

Dan

axeoth
07-21-2011, 09:35 AM
I just wanted to support the ideas put forward by Dan.

The statements below express personal opinions.

For both symmetry and practicality (see the scenarios he described), the link between the command queue and the device should be revised - either weakened or removed.

Devices should be specified, if required, when commands/kernels are queued.

Amnon
07-23-2011, 01:53 PM
Solution 1: Allow NULL as the queue parameter in clEnqueueReadBuffer(), clEnqueueWriteBuffer(), etc.
Solution 2: Have functions clReadBuffer(), clWriteBuffer(), etc.
Solution 3: Use a dummy queue, perhaps on a CPU device that is otherwise unused.

In any case, I do assume that using a queue on a different device should still work correctly. If that device is busy, then I would expect the SDK to silently wait until it is not.

monarodan
07-23-2011, 06:40 PM
In any case, I do assume that using a queue on a different device should still work correctly. If that device is busy, then I would expect the SDK to silently wait until it is not.

I guess it depends on what you think is the correct behavior. If I am trying to minimise latency or have deterministic behavior then silently waiting is behavior that I would need to avoid - but there is no way to avoid it with the current OpenCL API.

Cheers,

Dan