
Thread: Questions about the usage of clEnqueueMapBuffer and CL_MEM_USE_HOST_PTR

  1. #1
    Junior Member
    Join Date
    Aug 2014
    Posts
    7

    Questions about the usage of clEnqueueMapBuffer and CL_MEM_USE_HOST_PTR

    Hi all,
    What I'm trying to do is use OpenCL to do some preprocessing on a picture on the GPU and then send the processed image to a video analytics module on the CPU.
    I want to avoid any kind of memory copy, for performance reasons.

    What I can think of is to allocate some memory on the CPU side and create a cl_mem for the GPU with CL_MEM_USE_HOST_PTR.
    Before the CPU can use that memory, I need to call clEnqueueMapBuffer() first.
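    Roughly, the pattern I have in mind looks like this (just a sketch, error checking omitted; the function name, kernel and frame size are placeholders):

    Code:
    #include <CL/cl.h>
    #include <stdlib.h>

    /* Sketch of the intended flow; error checking omitted. */
    void *process_frame(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                        size_t width, size_t height,
                        cl_mem *out_buf /* must be released... but when? */)
    {
        cl_int err;
        size_t size = width * height * 4;      /* e.g. one RGBA8 frame        */
        void *host_ptr = malloc(size);         /* CPU-side memory I allocate  */

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    size, host_ptr, &err);

        size_t gws[2] = { width, height };
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, gws, NULL, 0, NULL, NULL);

        /* Blocking map so the CPU analytics module can read the processed frame. */
        void *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                          0, size, 0, NULL, NULL, &err);

        *out_buf = buf;   /* when is it safe to call clReleaseMemObject(buf)? */
        return mapped;    /* handed to the CPU-side analytics module          */
    }
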
    But I don't know when the CPU side will finish its processing, so when should I call clReleaseMemObject()?
    I don't want to block waiting there or add a callback function.
    Since the processed data is already in system memory, is there a function that can simply tell OpenCL to "release" that part of memory?

    If I call clReleaseMemObject(), the data in the corresponding system memory becomes invalid.
    If I never call clReleaseMemObject(), I guess I will leak a lot of memory.

    Thanks

  2. #2
    Senior Member
    Join Date
    Oct 2012
    Posts
    115
    In your scenario, you can use clEnqueueReadBuffer() with blocking_read set to CL_TRUE and ptr set to the host memory pointer you passed at buffer creation.
    This will synchronize the (host) buffer with the GPU cache. You can then release the OpenCL memory object.
    The user-allocated buffer is still valid and contains the result of the GPU computation.
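    Something along these lines (sketch only, error checking omitted; host_ptr and size are the pointer and size you passed to clCreateBuffer):

    Code:
    #include <CL/cl.h>

    /* Sketch: blocking read back into the original host pointer, then release. */
    void finish_and_release(cl_command_queue queue, cl_mem buf,
                            void *host_ptr, size_t size)
    {
        /* Blocking read: synchronizes host memory with the device copy/cache. */
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);

        /* The cl_mem can now be released; host_ptr still holds the GPU result. */
        clReleaseMemObject(buf);

        /* The CPU side can keep using host_ptr for as long as it wants and
           free() it whenever it is done, independently of OpenCL. */
    }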

  3. #3
    If you call clEnqueueMapBuffer (with blocking_map == CL_TRUE), then immediately call clEnqueueUnmapMemObject and clReleaseMemObject, that should leave you with valid data in system memory. Does this sequence not work for you? It might be better than calling clEnqueueReadBuffer because on many platforms clEnqueueMapBuffer on a buffer allocated with CL_MEM_USE_HOST_PTR will not perform any copies, whereas clEnqueueReadBuffer will always produce a copy.
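    As a sketch (error checking omitted, names are only illustrative):

    Code:
    #include <CL/cl.h>

    /* Map (blocking), unmap, release; the user-allocated memory keeps the result. */
    void sync_and_release(cl_command_queue queue, cl_mem buf, size_t size)
    {
        cl_int err;

        /* Blocking map: for a CL_MEM_USE_HOST_PTR buffer this guarantees the host
           pointer holds the latest data, often without an extra copy. */
        void *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                          0, size, 0, NULL, NULL, &err);

        /* Unmap immediately and make sure the unmap has completed. */
        clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);
        clFinish(queue);

        /* Release the cl_mem; the data stays valid in system memory. */
        clReleaseMemObject(buf);
    }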

  4. #4
    Junior Member
    Join Date
    Aug 2014
    Posts
    7
    Quote Originally Posted by utnapishtim View Post
    In your scenario, you can use clEnqueueReadBuffer() with blocking_read set to CL_TRUE and ptr set to the host memory pointer you passed at buffer creation.
    This will synchronize the (host) buffer with the GPU cache. You can then release the OpenCL memory object.
    The user-allocated buffer is still valid and contains the result of the GPU computation.
    I think this will introduce an extra memory copy (copy 1 from cl_mem_in to cl_mem_out, copy 2 from cl_mem_out to the user-allocated host buffer).
    Since I'm processing a decoded video sequence, the cost of those memory copies is significant, so I want to reduce copying as much as possible.

  5. #5
    Junior Member
    Join Date
    Aug 2014
    Posts
    7
    Quote Originally Posted by kunze View Post
    If you call clEnqueueMapBuffer (with blocking_map == CL_TRUE), then immediately call clEnqueueUnmapMemObject and clReleaseMemObject, that should leave you with valid data in system memory. Does this sequence not work for you? It might be better than calling clEnqueueReadBuffer because on many platforms clEnqueueMapBuffer on a buffer allocated with CL_MEM_USE_HOST_PTR will not perform any copies, whereas clEnqueueReadBuffer will always produce a copy.
    Oh, I haven't tried this sequence yet. Is it valid on all platforms, or is it platform-dependent?

  6. #6
    Senior Member
    Join Date
    Oct 2012
    Posts
    115
    Quote Originally Posted by lance0010 View Post
    I think this will introduce an extra memory copy (copy 1 from cl_mem_in to cl_mem_out, copy 2 from cl_mem_out to the user-allocated host buffer).
    Since I'm processing a decoded video sequence, the cost of those memory copies is significant, so I want to reduce copying as much as possible.
    Calling clEnqueueReadBuffer() with the pointer used to create the host-allocated buffer won't make any redundant copy; it will only synchronize memory between GPU and CPU if needed (hence the special requirements for this kind of call, described in the note in section 5.2.2 of the OpenCL spec).

  7. #7
    Junior Member
    Join Date
    Aug 2014
    Posts
    7
    Quote Originally Posted by utnapishtim View Post
    Calling clEnqueueReadBuffer() with the pointer used to create the host-allocated buffer won't make any redundant copy; it will only synchronize memory between GPU and CPU if needed (hence the special requirements for this kind of call, described in the note in section 5.2.2 of the OpenCL spec).
    Is the "Calling clEnqueueReadBuffer() with the pointer used to create the host-allocated buffer won't make any redundant copy but will only synchronize memory between GPU and CPU if needed" defined somewhere in the spec explicitly? I do agree with you on this but it's better if it's pointed out or hinted somewhere in the spec.

  8. #8
    Senior Member
    Join Date
    Oct 2012
    Posts
    115
    Quote Originally Posted by lance0010 View Post
    Is the "Calling clEnqueueReadBuffer() with the pointer used to create the host-allocated buffer won't make any redundant copy but will only synchronize memory between GPU and CPU if needed" defined somewhere in the spec explicitly? I do agree with you on this but it's better if it's pointed out or hinted somewhere in the spec.
    Honestly, if your buffer is to be accessed by a GPU kernel, you shouldn't use a host buffer and expect all transfers to be magically optimized.
    If you need a buffer for your GPU kernel, create a device-allocated buffer and handle the transfers between host and device manually. That way you have full control over what happens.

    For instance, on an AMD device, if your memory pointer is correctly aligned, a CL_MEM_USE_HOST_PTR buffer may be pre-pinned, giving fast transfers between host and device. However, once the buffer has been used by a device kernel, it is no longer pre-pinned and the transfer from device to host will be slow (as will all subsequent transfers).
    On NVIDIA devices, AFAIK only CL_MEM_ALLOC_HOST_PTR buffers are pinned; CL_MEM_USE_HOST_PTR buffers always take slow transfer paths (this may have changed, but once again it depends on the alignment of the pointer and the device being used).

    Because most modern GPUs can execute a kernel and copy memory between host and device at the same time, if the computation time of your kernel and the memory transfer time are of the same order of magnitude, you could split the computation into two halves: transfer the input for the second half while the first half is being computed, and transfer the result of the first half while the second half is being computed.
    This requires two command queues and careful synchronization, though.
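    A rough sketch of the idea, assuming two in-order queues, one pair of device buffers per chunk, and a kernel that takes the input and output buffers as its first two arguments (all names are illustrative, error checking omitted):

    Code:
    #include <CL/cl.h>

    /* Overlap the transfer of one chunk with the computation of the other.
       Each chunk has its own device buffers so copy and compute never touch
       the same cl_mem at the same time. */
    void process_frame_in_two_chunks(cl_command_queue q_copy, cl_command_queue q_exec,
                                     cl_kernel kernel,
                                     cl_mem dev_in[2], cl_mem dev_out[2],
                                     unsigned char *host_in, unsigned char *host_out,
                                     size_t chunk_bytes, size_t chunk_items)
    {
        cl_event uploaded[2], computed[2];

        for (int i = 0; i < 2; ++i) {
            /* Upload chunk i asynchronously on the copy queue. */
            clEnqueueWriteBuffer(q_copy, dev_in[i], CL_FALSE, 0, chunk_bytes,
                                 host_in + i * chunk_bytes, 0, NULL, &uploaded[i]);

            /* Compute chunk i on the execute queue as soon as its upload is done;
               meanwhile the copy queue is free to move the other chunk. */
            clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_in[i]);
            clSetKernelArg(kernel, 1, sizeof(cl_mem), &dev_out[i]);
            clEnqueueNDRangeKernel(q_exec, kernel, 1, NULL, &chunk_items, NULL,
                                   1, &uploaded[i], &computed[i]);

            /* Download chunk i once its kernel has finished, overlapping with
               the computation of the other chunk. */
            clEnqueueReadBuffer(q_copy, dev_out[i], CL_FALSE, 0, chunk_bytes,
                                host_out + i * chunk_bytes, 1, &computed[i], NULL);
        }

        clFinish(q_exec);
        clFinish(q_copy);
        for (int i = 0; i < 2; ++i) {
            clReleaseEvent(uploaded[i]);
            clReleaseEvent(computed[i]);
        }
    }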

  9. #9
    Junior Member
    Join Date
    Aug 2014
    Posts
    7
    Quote Originally Posted by utnapishtim View Post
    Honestly, if your buffer is to be accessed by a GPU kernel, you shouldn't use a host buffer and expect all transfers to be magically optimized.
    If you need a buffer for your GPU kernel, create a device-allocated buffer and handle the transfers between host and device manually. That way you have full control over what happens.

    For instance, on an AMD device, if your memory pointer is correctly aligned, a CL_MEM_USE_HOST_PTR buffer may be pre-pinned, giving fast transfers between host and device. However, once the buffer has been used by a device kernel, it is no longer pre-pinned and the transfer from device to host will be slow (as will all subsequent transfers).
    On NVIDIA devices, AFAIK only CL_MEM_ALLOC_HOST_PTR buffers are pinned; CL_MEM_USE_HOST_PTR buffers always take slow transfer paths (this may have changed, but once again it depends on the alignment of the pointer and the device being used).

    Because most modern GPUs can execute a kernel and copy memory between host and device at the same time, if the computation time of your kernel and the memory transfer time are of the same order of magnitude, you could split the computation into two halves: transfer the input for the second half while the first half is being computed, and transfer the result of the first half while the second half is being computed.
    This requires two command queues and careful synchronization, though.
    The reason I want to use CL_MEM_USE_HOST_PTR is that it seems to require fewer memory copies, but it may end up slower than CL_MEM_ALLOC_HOST_PTR plus an explicit copy (depending on the implementation), right?

  10. #10
    Senior Member
    Join Date
    Oct 2012
    Posts
    115
    To put it simply, a kernel should never access a host memory buffer.

    If a kernel does access one, the OpenCL implementation will either:

    - make a copy of the buffer between host and device before the kernel starts
    - or (if supported) give the device direct access to host memory through the PCIe interconnect, at roughly 10x slower speed than a full copy.

    The main benefit of host memory buffers is that they allow memory transfers at full speed between host and device with clEnqueueCopyBuffer(), while keeping the CPU and GPU free to work during that transfer thanks to DMA.
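    For illustration, that staging pattern could look roughly like this (error checking omitted; in real code the staging buffer would be created once and reused for many frames):

    Code:
    #include <CL/cl.h>
    #include <string.h>

    /* Upload through a pinned host staging buffer; the host-to-device leg is a
       full-speed DMA copy performed by clEnqueueCopyBuffer(). */
    void upload_via_pinned_staging(cl_context ctx, cl_command_queue queue,
                                   cl_mem dev_buf, const void *src, size_t size)
    {
        cl_int err;

        /* Host-memory buffer (pinned on most implementations). */
        cl_mem staging = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

        /* Map it to get a CPU pointer and fill it with the frame data. */
        void *p = clEnqueueMapBuffer(queue, staging, CL_TRUE, CL_MAP_WRITE,
                                     0, size, 0, NULL, NULL, &err);
        memcpy(p, src, size);
        clEnqueueUnmapMemObject(queue, staging, p, 0, NULL, NULL);

        /* Full-speed transfer to the device buffer; CPU and GPU stay free. */
        clEnqueueCopyBuffer(queue, staging, dev_buf, 0, 0, size, 0, NULL, NULL);
        clFinish(queue);

        clReleaseMemObject(staging);   /* in real code, keep it and reuse it */
    }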

    Pinning memory (with CL_MEM_USE_HOST_PTR) or allocating pinned memory (with CL_MEM_ALLOC_HOST_PTR) is a slow process (as are unpinning and deallocating pinned memory), so using a host memory buffer is only worthwhile if you intend to reuse it many times (lots of kernel launches).

    If you only intend to create a host buffer, call a kernel, and delete the buffer immediately afterwards, you are better off creating a device buffer, filling it with clEnqueueWriteBuffer(), calling the kernel, and reading the result back with clEnqueueReadBuffer(). You will save a lot of overhead.
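    In other words, something like this (minimal sketch, error checking omitted; the kernel and sizes are placeholders):

    Code:
    #include <CL/cl.h>

    /* Simple one-shot path: device buffer + explicit write/read. */
    void one_shot(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                  const void *in, void *out, size_t size, size_t items)
    {
        cl_int err;
        cl_mem dev = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

        clEnqueueWriteBuffer(queue, dev, CL_FALSE, 0, size, in, 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &items, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, dev, CL_TRUE, 0, size, out, 0, NULL, NULL);

        clReleaseMemObject(dev);
    }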
