
Thread: Pinned Memory Again

  1. #1

    Pinned Memory Again

    Dear all,

    I'd like to clear up the pinned-memory question once and for all.
    The specification is both vague and overly complicated, so I have
    a number of points I'd like to get out of the way.

    The background of the question is: I'd like to create CUDA pinned
    memory semantics in OpenCL.

    Pinned (page-locked) memory is host memory allocated in a special
    way, with certain properties, that can result in faster-than-usual
    transfers between host and device and vice versa.

    In CUDA the API is really simple:

    float *ph, *d;
    cudaHostAlloc((void **)&ph, 200, cudaHostAllocDefault);  // pinned host memory
    cudaMalloc((void **)&d, 200);                            // device memory
    cudaMemcpy(d, ph, 200, cudaMemcpyHostToDevice);          // fast DMA copy

    In OpenCL this does not exist (unfortunately). However, there is
    something that might give similar behavior in terms of performance.
    From here on, everything becomes unclear and I'd like you to
    correct me or reaffirm my conclusions:

    • in OpenCL we can allocate a buffer on a device that has a
      corresponding block of memory on the host (these are the
      CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR flags)
    • CL_MEM_ALLOC_HOST_PTR will allocate this corresponding
      block of memory on the host
    • CL_MEM_USE_HOST_PTR will use an existing block of
      memory on the host
    • we get access to this block of memory on the host by
      calling clEnqueueMapBuffer()
    • under certain circumstances, this block of memory will
      behave similarly to CUDA pinned memory in terms of
      performance
    • to get this performance, clEnqueueUnmapMemObject() must
      eventually be called; it takes the pointer returned by
      clEnqueueMapBuffer() (the block of memory on the host) and
      the original cl_mem object created by the clCreateBuffer()
      call (the buffer on the device)
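
    In code, the sequence I have in mind looks roughly like this (a
    sketch only; `ctx` and `queue` are assumed to be a valid context
    and command queue, and error handling is omitted):

```c
// Sketch of the pinned-memory pattern described above (assumed
// context `ctx` and command queue `queue`; no error handling).
cl_int err;
size_t size = 200;

// Buffer whose backing store the implementation may place in
// pinned (page-locked) host memory:
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            size, NULL, &err);

// Map it to obtain a host pointer we can use directly:
void *host_ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                    0, size, 0, NULL, NULL, &err);

// ... fill host_ptr like ordinary host memory ...

// Unmap to hand the contents back to the device side:
clEnqueueUnmapMemObject(queue, buf, host_ptr, 0, NULL, NULL);
```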


    Is this correct, so far? Here are a few more questions:

    • can transfers be done between unrelated host and device
      memory blocks, i.e. ones that were not created by matching
      clCreateBuffer() and clEnqueueMapBuffer() calls?
    • how does clEnqueueReadBuffer() come into play here? Can the
      pointer obtained from clEnqueueMapBuffer() be used with
      clEnqueueReadBuffer() or clEnqueueWriteBuffer()?


    Thanks for reading
    Sebastian

  2. #2
    Dithermaster (Senior Member)
    You can use any host memory with clEnqueueRead/WriteBuffer. On NVIDIA hardware, the operations will go faster if the source or destination memory was allocated as pinned memory (using clCreateBuffer with CL_MEM_ALLOC_HOST_PTR). Also, they say that is the only way the operation can participate in overlapped copy and compute (which also requires multiple command queues). Check the NVIDIA overlap copy/compute example which shows how to allocate pinned memory. Also, the NVIDIA OpenCL programming guide discusses how to do it.
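
    The core of the overlap pattern is two command queues working on
    alternating chunks. A simplified sketch (not the actual sample
    code; `queue0`, `queue1`, `devBuf0`, `devBuf1`, `kernel`, and a
    pinned `src` pointer are assumed to be set up already):

```c
// Simplified copy/compute overlap sketch (not the NVIDIA sample
// itself). Assumes two command queues, two device buffers, a kernel,
// and `src` pointing into pinned host memory.
for (int i = 0; i < num_chunks; ++i) {
    cl_command_queue q  = (i % 2 == 0) ? queue0  : queue1;
    cl_mem          buf = (i % 2 == 0) ? devBuf0 : devBuf1;

    // A non-blocking write from pinned memory can overlap with the
    // kernel still running on the other queue:
    clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, chunk_size,
                         src + i * chunk_size, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
}
clFinish(queue0);
clFinish(queue1);
```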

    With AMD and Intel, there is no read/write buffer advantage using pinned memory as your source/destination that I know of. For AMD discrete GPUs, the fastest DMA is achieved using clEnqueueMapBuffer. For AMD APU and Intel HD Graphics, you can get zero-copy (instant) mapping of device buffers if you use clEnqueueMapBuffer (and use the right allocation flags; check the respective vendor programming guides).

    Finally, both NVIDIA and AMD discrete GPUs have ways of accessing host memory from a kernel, which effectively combines the copy with the compute (the kernel runs slower but there is no copy operation).
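
    The direct-access variant looks roughly like this (a sketch;
    whether the buffer is really accessed in place depends on the
    implementation, so check the vendor guides):

```c
// Sketch: let the kernel read host memory directly instead of
// copying it to the device first. `ctx`, `queue`, `kernel`, and a
// valid `host_data` allocation of `size` bytes are assumed; some
// implementations impose alignment/size rules (see vendor guides).
cl_int err;
cl_mem hostBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                size, host_data, &err);

// No clEnqueueWriteBuffer: the kernel reaches across the bus into
// host memory (slower per access, but the separate copy disappears).
clSetKernelArg(kernel, 0, sizeof(cl_mem), &hostBuf);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       0, NULL, NULL);
```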

  3. #3
    Dithermaster, thanks a lot for answering! Your advice was great.
    From the oclCopyComputeOverlap sample I get the following:

    devBuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
    hostBufPinned = clCreateBuffer(ctx, CL_MEM_READ_WRITE |
                                   CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);
    ptrHostBuf = clEnqueueMapBuffer(queue, hostBufPinned, CL_TRUE,
                                    CL_MAP_WRITE, 0, size, 0, NULL, NULL, &err);
    // use ptrHostBuf like a regular pointer
    clEnqueueWriteBuffer(queue, devBuf, CL_FALSE, 0, size, ptrHostBuf,
                         0, NULL, NULL);

    It is nice that this works, but I wonder if this was intended by
    the OpenCL spec. clEnqueueUnmapMemObject() is pointless in this
    example.

    The entire mapping business makes a lot more sense with APUs and
    Intel HD Graphics (due to zero-copy). For discrete cards, I am still
    unclear about where memory is allocated and when it is transferred.
    And I suspect it differs between implementers.

    Do you know of similar code examples from Intel and AMD? Code
    seems to be the only thing we can rely on since the specification
    is so vague. I think this is a big disadvantage. The specification
    should be clear enough to not allow major differences in functionality
    across implementers.

    Sebastian

  4. #4
    Dithermaster (Senior Member)
    The pinned memory read/write thing is unique to NVIDIA. Check on the Intel and AMD sites for their best practices / programming guidelines. A good order of operations is to understand OpenCL based on the spec and books, get something working, and then look to the vendor guidelines for optimization techniques.
