
Thread: Questions about the usage of clEnqueueMapBuffer and CL_MEM_USE_HOST_PTR

  1. #11 · lance0010 (Junior Member · joined Aug 2014 · 7 posts)
    Quote Originally Posted by utnapishtim:
    To put it simply, a kernel should never access a host memory buffer.

    In that case, the OpenCL implementation will either:

    - make a copy of the buffer between host and device before the kernel starts
    - or (if supported) give the device direct access to host memory through the PCIe interconnect, roughly 10x slower than a full copy.


    The main advantage of host memory buffers is that they allow memory transfers at full speed between host and device with clEnqueueCopyBuffer(), while DMA keeps the CPU and GPU free to do other work during the transfer.

    Pinning memory (with CL_MEM_USE_HOST_PTR) or allocating pinned memory (with CL_MEM_ALLOC_HOST_PTR) is a slow process (as well as unpinning and deallocating pinned memory), so using a host memory buffer is meaningful only if you intend to use it many times (lots of kernel launches).

    If you only intend to create a host buffer, call a kernel and immediately after delete this buffer, you'd better create a device buffer, fill it with clEnqueueWriteBuffer(), call the kernel, and get the result with clEnqueueReadBuffer(). You will save a lot of overhead.
    Regarding possible implementations, is there a third option here? As far as I know, Intel CPUs and GPUs are on the same die and share the last-level cache, so both can read/write through the cache and flush it to memory when needed (perhaps via clEnqueueMapBuffer?). In such a case, wouldn't mapping be more efficient?
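    [Editor's note: the device-buffer flow recommended in the quote above can be sketched roughly as follows. This is a minimal sketch only: `ctx`, `queue`, `kernel`, and host arrays `src`/`dst` of `N` floats are assumed to already exist, and most error handling is omitted.]

    ```c
    /* Sketch of the write / launch / read pattern described above.
     * Assumes ctx, queue, kernel, src, dst, and N already exist. */
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                N * sizeof(float), NULL, &err);

    /* Host -> device copy (blocking here for simplicity) */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                         N * sizeof(float), src, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t gws = N;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);

    /* Device -> host copy of the result */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0,
                        N * sizeof(float), dst, 0, NULL, NULL);

    clReleaseMemObject(buf);
    ```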

  2. #12 · utnapishtim (Senior Member · joined Oct 2012 · 115 posts)
    Quote Originally Posted by lance0010:
    Regarding possible implementations, is there a third option here? As far as I know, Intel CPUs and GPUs are on the same die and share the last-level cache, so both can read/write through the cache and flush it to memory when needed (perhaps via clEnqueueMapBuffer?). In such a case, wouldn't mapping be more efficient?
    Intel CPU and GPU share physical memory so mapping a buffer is very efficient if the following conditions are fulfilled:

    - The buffer is created with CL_MEM_ALLOC_HOST_PTR, or with CL_MEM_USE_HOST_PTR with the pointer aligned on 4KB
    - Use buffers instead of images (images cannot be mapped efficiently)
    - Use clEnqueueMapBuffer() and clEnqueueUnmapMemObject() instead of clEnqueueReadBuffer() and clEnqueueWriteBuffer()
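    [Editor's note: on such shared-memory platforms, the zero-copy pattern described by these three conditions might look like the sketch below. `ctx` and `queue` are assumed to exist, and error checking is trimmed.]

    ```c
    /* Zero-copy access on a shared-memory (e.g. Intel) platform.
     * Letting the runtime allocate with CL_MEM_ALLOC_HOST_PTR
     * guarantees suitable alignment. */
    cl_int err;
    size_t size = 1024 * sizeof(float);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                size, NULL, &err);

    /* Map: on shared-memory hardware this should return a pointer into
     * the same physical pages the GPU sees, so no copy is made. */
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, size, 0, NULL, NULL, &err);
    for (size_t i = 0; i < 1024; ++i)
        p[i] = (float)i;                  /* fill the buffer in place */

    /* Unmap hands ownership back to the device before a kernel launch. */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
    ```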

  3. #13 · lance0010 (Junior Member · joined Aug 2014 · 7 posts)
    Quote Originally Posted by utnapishtim:
    Intel CPU and GPU share physical memory so mapping a buffer is very efficient if the following conditions are fulfilled:

    - The buffer is created with CL_MEM_ALLOC_HOST_PTR, or with CL_MEM_USE_HOST_PTR with the pointer aligned on 4KB
    - Use buffers instead of images (images cannot be mapped efficiently)
    - Use clEnqueueMapBuffer() and clEnqueueUnmapMemObject() instead of clEnqueueReadBuffer() and clEnqueueWriteBuffer()

    I'll do some verification on Intel CPUs.
    Thanks, utnapishtim.
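    [Editor's note: for completeness, the pinned-staging transfer described earlier in the thread (full-speed DMA via clEnqueueCopyBuffer()) might be sketched as below. Assumptions: `ctx`, `queue`, and a host array `src` of `size` bytes already exist; error handling is trimmed, and the pinned buffer would normally be reused across many transfers to amortize the cost of pinning.]

    ```c
    /* Pinned staging buffer + DMA copy, per the earlier description. */
    cl_int err;
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                                   size, NULL, &err);
    cl_mem device_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

    /* Map the pinned buffer to fill it from the host side. */
    void *host_ptr = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                        0, size, 0, NULL, NULL, &err);
    memcpy(host_ptr, src, size);
    clEnqueueUnmapMemObject(queue, pinned, host_ptr, 0, NULL, NULL);

    /* Full-speed transfer; DMA leaves the CPU and GPU free meanwhile. */
    clEnqueueCopyBuffer(queue, pinned, device_buf, 0, 0, size, 0, NULL, NULL);
    ```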

