Thread: Device affinity for command queues and buffers seems at odds

  1. #1
    Junior Member
    Join Date
    Apr 2010
    Location
    Perth, WA
    Posts
    27

    Device affinity for command queues and buffers seems at odds

    Hi All,

    To create a cl_mem object, one calls clCreateBuffer(), which takes a cl_context as an argument. I assume that this means that the cl_mem object has affinity with the cl_context used to create it and that it is an error to use it in any other context (the standard does not seem to state this explicitly). Since the cl_context was created with a set of cl_device_ids, I assume that it is valid to use the cl_mem object with any of the devices used to create the cl_context that was passed to clCreateBuffer().

    To read data out of a cl_mem object, one uses the clEnqueueReadBuffer() function, which takes a cl_command_queue as an argument. A cl_command_queue is created for a specific cl_device_id. It seems very strange that I need to specify a device when reading from a cl_mem object, as it does not have device affinity.

    This certainly lacks symmetry with creating a buffer with the CL_MEM_COPY_HOST_PTR flag as no device is passed to the clCreateBuffer() method. I've seen it said in other posts that the following are equivalent:

    Code :
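    /* Create the buffer and copy `size` bytes from `ptr` in one call; no queue or device is named. */
    cl_mem buf = clCreateBuffer( context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, size, ptr, NULL );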

    Code :
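    /* Create an uninitialised buffer, then do a blocking write through a device's queue. */
    cl_mem buf = clCreateBuffer( context, CL_MEM_READ_ONLY, size, NULL, NULL );
    clEnqueueWriteBuffer( queue, buf, CL_TRUE, 0, size, ptr, 0, NULL, NULL );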

    However, there is one distinction - the second case requires you to nominate a device (needed to create the queue) while the first does not!

    Can someone clarify what is going on here? I am working within a context that has multiple devices and want to read data out of a cl_mem object using clEnqueueReadBuffer() - what device should the cl_command_queue that I use be associated with? Does it not matter?

    Thanks in advance,

    Dan
    Daniel Paull
    Real Engineers Think Bottom Up.

  2. #2
    Senior Member
    Join Date
    Nov 2009
    Posts
    118

    Re: Device affinity for command queues and buffers seems at odds

    Quote Originally Posted by monarodan
    ... the standard does not seem to state this explicitly ...
    CL_INVALID_CONTEXT is used for this case, and the specification explicitly says when it is raised. For example, for the clEnqueueReadBuffer command:
    Quote Originally Posted by spec
    CL_INVALID_CONTEXT if the context associated with command_queue and buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.

  3. #3
    Senior Member
    Join Date
    Nov 2009
    Posts
    118

    Re: Device affinity for command queues and buffers seems at odds

    Memory objects are often cached on a device.

    As I see it, there are several differences between creating the buffer with a copy and creating it then enqueuing a write:
    - the first is synchronous, while the second can be asynchronous;
    - the second lets the driver cache the memory on the right device sooner; with the first, the caching is more likely to happen only when the clEnqueueNDRangeKernel command executes.

    So the second method gives the developer more freedom to control when caching happens, and lets the host do something else while the write is in progress.

  4. #4
    Junior Member
    Join Date
    Apr 2010
    Location
    Perth, WA
    Posts
    27

    Re: Device affinity for command queues and buffers seems at odds

    Quote Originally Posted by matrem
    Quote Originally Posted by monarodan
    ... the standard does not seem to state this explicitly ...
    CL_INVALID_CONTEXT is used for this case, and the specification explicitly says when it is raised. For example, for the clEnqueueReadBuffer command:
    Quote Originally Posted by spec
    CL_INVALID_CONTEXT if the context associated with command_queue and buffer are not the same or if the context associated with command_queue and events in event_wait_list are not the same.
    My apologies - I restructured my sentences and that comment was left out of context! I meant to say that the standard doesn't seem to explicitly state that a buffer can be used on any device associated with a context (that is, that there is no device affinity). And yet, to copy data to/from the buffer I need to talk about a specific device.

    I'm sure that in practice you generally queue up commands on a given device following a pattern along the lines of:

    1) Copy from host
    2) Execute kernel
    3) Copy to host

    And it just works out nicely. However, I still find it very strange that you cannot copy data to the host without queuing a command for a particular device.
    Daniel Paull
    Real Engineers Think Bottom Up.

  5. #5
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Device affinity for command queues and buffers seems at odds

    My apologies - I restructured my sentences and that comment was left out of context! I meant to say that the standard doesn't seem to explicitly state that a buffer can be used on any device associated with a context (that is, that there is no device affinity)
    See the glossary on page 14:
    Context: The environment within which the kernels execute and the domain in which
    synchronization and memory management is defined.
    See also Appendix A:
    OpenCL memory objects, program objects and kernel objects are created using a context and can
    be shared across multiple command-queues created using the same context. Event objects can be
    created when a command is queued to a command-queue. These event objects can be shared
    across multiple command-queues created using the same context.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  6. #6
    Junior Member
    Join Date
    Apr 2010
    Location
    Perth, WA
    Posts
    27

    Re: Device affinity for command queues and buffers seems at odds

    Quote Originally Posted by david.garcia
    See the glossary on page 14:
    Context: The environment within which the kernels execute and the domain in which
    synchronization and memory management is defined.
    Fine, but this seems irrelevant. Consider the concept of thread local storage - the allocations still happen in the context of the process's heap, but the memory has thread affinity.

    Quote Originally Posted by david.garcia
    See also Appendix A:
    OpenCL memory objects, program objects and kernel objects are created using a context and can
    be shared across multiple command-queues created using the same context. Event objects can be
    created when a command is queued to a command-queue. These event objects can be shared
    across multiple command-queues created using the same context.
    To say "can be shared" is very, very weak. In what way can they be shared, and what of concurrent access or usage is allowed?

    Back to the original question then - if memory objects do not have device affinity, why is there no function to copy a buffer from device memory to host memory without enqueuing a command for a specific device?

    I'm not impressed that you had to quote from the glossary and appendix, rather than the standard proper, to try and answer my question. It seems that detail is being buried in the wrong places.

    Cheers,

    Dan
    Daniel Paull
    Real Engineers Think Bottom Up.

  7. #7
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Device affinity for command queues and buffers seems at odds

    I understand your frustration.

    To say "can be shared" is very, very weak. In what way can they be shared, and what of concurrent access or usage is allowed?
    That is defined in Appendix A. The quote I provided is only an excerpt.

    Back to the original question then - if memory objects do not have device affinity, why is there no function to copy a buffer from device memory to host memory without enqueuing a command for a specific device?
    Some device has to perform the data copy. OpenCL allows the application to choose any of the devices in the context to do the operation. Arguably this is better than leaving it up to the driver to decide which of the devices to use.

    I'm not impressed that you had to quote from the glossary and appendix, rather than the standard proper
    While the glossary is not normative, the appendix is.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  8. #8
    Junior Member
    Join Date
    Apr 2010
    Location
    Perth, WA
    Posts
    27

    Re: Device affinity for command queues and buffers seems at odds

    Quote Originally Posted by david.garcia

    To say "can be shared" is very, very weak. In what way can they be shared, and what of concurrent access or usage is allowed?
    That is defined in Appendix A. The quote I provided is only an excerpt.
    It is not defined nor discussed in any great detail in Appendix A. Perhaps there is little detail required as OpenCL does not promise much - to quote the last sentence in A.1, "The results of modifying a shared resource in one command-queue while it is being used by another command-queue are undefined."

    Quote Originally Posted by david.garcia
    Some device has to perform the data copy. OpenCL allows the application to choose any of the devices in the context to do the operation. Arguably this is better than leaving it up to the driver to decide which of the devices to use.
    I find it strange that an OpenCL device would be performing the copy between global and host memory. I had assumed some sort of direct memory access transfer would be used.

    I can imagine that providing a target device for a copy from host to device is useful as a hint as to which device is going to use the buffer, so preemptive caching in the device's physical memory may occur.

    If some device has to perform the data copy, which one does it when you call clCreateBuffer() with CL_MEM_COPY_HOST_PTR? Why are you not forced to, or even allowed to, specify a device when using this flag? Why is there no symmetrical way to copy data from the device to the host? Something just isn't right with this API.

    Cheers,

    Dan
    Daniel Paull
    Real Engineers Think Bottom Up.

  9. #9
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Device affinity for command queues and buffers seems at odds

    Perhaps there is little detail required as OpenCL does not promise much - to quote the last sentence in A.1, "The results of modifying a shared resource in one command-queue while it is being used by another command-queue are undefined."
    That sentence from the spec is stating something that should be expected anyway: modifying a resource in one queue while another queue is making use of it is going to cause trouble. The way to avoid any problems is by establishing dependencies between commands appropriately and by using clFlush() when there are dependencies across command queues.

    As long as you use dependencies correctly, sharing resources between different command queues inside the same context is straightforward. I suggest searching the term "synchronization point" in the spec.

    I find it strange that an OpenCL device would be performing the copy between global and host memory. I had assumed some sort of direct memory access transfer would be used.
    That will depend on each particular implementation. Remember that OpenCL serves a very wide range of computing devices.

    If some device has to perform the data copy, which one does it when you call clCreateBuffer() with CL_MEM_COPY_HOST_PTR? Why are you not forced to, or even allowed to, specify a device when using this flag? Why is there no symmetrical way to copy data from the device to the host? Something just isn't right with this API.
    Any standard API will be some sort of compromise of the alternatives suggested by multiple people from different companies. It is not possible to design an API or a language that will satisfy everybody.

    Generally speaking, mapping memory objects into the host's address space and writing the data directly into the given pointer instead of copying it around will give better performance than using CL_MEM_COPY_HOST_PTR. This is only a general rule. YMMV.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  10. #10
    Junior Member
    Join Date
    Apr 2010
    Location
    Perth, WA
    Posts
    27

    Re: Device affinity for command queues and buffers seems at odds

    Quote Originally Posted by david.garcia
    Perhaps there is little detail required as OpenCL does not promise much - to quote the last sentence in A.1, "The results of modifying a shared resource in one command-queue while it is being used by another command-queue are undefined."
    That sentence from the spec is stating something that should be expected anyway: modifying a resource in one queue while another queue is making use of it is going to cause trouble.
    Should be expected? Why is that? The spec could make this as tight as it likes.

    The OpenCL spec gives a fair amount of detail on memory fences and barriers so it is well defined what happens when a memory object is concurrently accessed and mutated by multiple compute units. However, they decide to stop there and just leave cross-command queue synchronisation very loose. The best you can do for synchronisation across command queues is to stop and wait just in case. That being said, this has nothing to do with my original query.

    Quote Originally Posted by david.garcia
    As long as you use dependencies correctly, sharing resources between different command queues inside the same context is straightforward.
    Agreed, this is indeed straightforward.

    Quote Originally Posted by david.garcia
    Any standard API will be some sort of compromise of the alternatives suggested by multiple people from different companies. It is not possible to design an API or a language that will satisfy everybody.
    I'm not looking for satisfaction, merely an explanation of why it is like it is. Who knows, my tirade on the non-symmetry in the API regarding reading and writing memory objects may lead to changes in the spec, or I might just be rehashing an age-old argument, or there might be a very good reason for why it is like it is.

    I agree that design-by-committee is a less than ideal way to work, but every decision made should be justified and I would hope that those making the decisions are involved or represented in this community and would be willing to share such justifications here.

    Cheers,

    Dan
    Daniel Paull
    Real Engineers Think Bottom Up.
