
Thread: Transfers between host and device memory

  1. #1
    Junior Member
    Join Date
    Nov 2011
    Posts
    8

    Transfers between host and device memory

    Hi there, I have two questions:

    First question: I need to transfer data from GPU to CPU and CPU to GPU. To compute the transfer rate I'm timing the transfers using OpenCL Events; It looks like the transfer from GPU to CPU is faster than the transfer from CPU to GPU (12.2GB/s vs 11GB/s). I read somewhere that this behavior is normal, but don't know why: is it because restrictions imposed by the PCIe or the GPU ?. Any explanation and links will be useful. BTW: I'm using a NVidia C2070 GPU and a PCIe x16 2nd Generation; and the buffer at the host is pinned memory

    Second question is: What I actually need is to transfer data from GPU1 to GPU2, so I'm transferring by doing 2 transfers: GPU-CPU and then CPU-GPU using pinned memory. Is there any way to transfer GPU-GPU directly ?. Both GPUs are C2070.

    Thanks.

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Transfers between host and device memory

    Is there any way to transfer GPU-GPU directly ?
    You may want to read about clEnqueueCopyBuffer() and clEnqueueCopyImage() to perform an explicit copy.

    Alternatively, you can just create a context with the two GPUs in it and let the OpenCL runtime move data from one device to the other automatically for you. All you need is one command queue for GPU 1 and another command queue for device 2. When you enqueue an NDRange on queue 1, all necessary data will be transferred to GPU 1 automatically if it was not already there.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3
    Junior Member
    Join Date
    Nov 2011
    Posts
    8

    Re: Transfers between host and device memory

    Quote Originally Posted by david.garcia
    Is there any way to transfer GPU-GPU directly ?
    You may want to read about clEnqueueCopyBuffer() and clEnqueueCopyImage() to perform an explicit copy.
    I actually tried using clEnqueueCopyBuffer(), however the performance was not good:

    queue[0] = clCreateCommandQueue(context, device[0], QUEUE_OPTS , &result);
    queue[1] = clCreateCommandQueue(context, device[1], QUEUE_OPTS , &result);
    // Kernel creation and argument passing:
    // - coefx[0], coefy[0] and res[0] "were created in" queue[0] using clEnqueueCopyBuffer(queue[0],...)
    // - coefx[1], coefy[1] and res[1] "were created in" queue[1] using clEnqueueCopyBuffer(queue[1],...)
    kernel[0] = clCreateKernel(OpenCLProgram, kernel, &err)
    err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), (void*)&(coefx[0]));
    err = clSetKernelArg(kernel[0], 1, sizeof(cl_mem), (void*)&(coefy[0]));
    err = clSetKernelArg(kernel[0], 2, sizeof(cl_mem), (void*)&(res[0]));
    // variables with index 1 were created in queue[1]
    kernel[1] = clCreateKernel(OpenCLProgram, kernel, &err);
    err = clSetKernelArg(kernel[1], 0, sizeof(cl_mem), (void*)&(coefx[1]));
    err = clSetKernelArg(kernel[1], 1, sizeof(cl_mem), (void*)&(coefy[1]));
    err = clSetKernelArg(kernel[1], 2, sizeof(cl_mem), (void*)&(res[1]));

    err = clEnqueueNDRangeKernel(queue[0], kernel[0], 1, NULL, WSize, GSize, 0, NULL, NULL);
    err = clEnqueueNDRangeKernel(queue[1], kernel[1], 1, NULL, WSize, GSize, 0, NULL, NULL);
    size = Ntotal * sizeof(float);
    offset = size;
    // Barriers before timing
    clFinish(queue[0]);
    clFinish(queue[1]);
    start = gettimeofday();
    // res[0] and res[1] were created in queue[0] and queue[1] respectively
    err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &eventTmp, &event);
    clWaitForEvents(1, &event);
    finish = gettimeofday();
    wallTime = finish start;
    openCLTime = clGetEventProfilingInfo(event);

    In the above code the wallTime is the time that the copy takes, which is bigger than the time using a transfer GPU0-CPU and then CPU-GPU1

    Alternatively, you can just create a context with the two GPUs in it and let the OpenCL runtime move data from one device to the other automatically for you. All you need is one command queue for GPU 1 and another command queue for device 2. When you enqueue an NDRange on queue 1, all necessary data will be transferred to GPU 1 automatically if it was not already there.
    Not really sure if I this will actually work, because the variable (say RES in the above code) exists in both GPUs. The main ideain the example above is that each GPU does the same work, but it works on a different piece of data (i.e the same array with different offset), so I'm guessing that the OpenCL runtime will believe that the data is already there, so it won't try to automatically copy the data, correct ?

    Thanks.

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Transfers between host and device memory

    Can you elaborate on this part?

    Code :
    // - coefx[0], coefy[0] and res[0] "were created in" queue[0] using clEnqueueCopyBuffer(queue[0],...)

    First, coefx, coefy and res are buffer objects, and buffer objects belong to a context, not to a queue. Why do you call clEnqueueCopyBuffer() at all here? It seems like you simply load data into coefx and coefy and then let the GPU do some computations. I don't see why clEnqueueCopyBuffer() would be necessary.

    Code :
    clFinish(queue[0]);
    clFinish(queue[1]);

    This may not be a very good idea. What you want is both devices running simultaneously. If you call clFinish() on each device separately, it's possible that the second GPU will not be doing anything while clFinish(queue[0]) is waiting for the first GPU to finish.

    Instead, you may want to call clWaitForEvents() on two events, one from each queue.

    Separately, I don't quite get this either:

    Code :
    // res[0] and res[1] were created in queue[0] and queue[1] respectively
    err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &eventTmp, &event);

    Is it truly necessary for your algorithm to copy the results into a single buffer? If the transfer is not strictly required, you may want to keep the two buffers separate.

    In the above code the wallTime is the time that the copy takes, which is bigger than the time using a transfer GPU0-CPU and then CPU-GPU1
    Can you show us the two alternative versions of the code for comparison?

    Not really sure if I this will actually work, because the variable (say RES in the above code) exists in both GPUs. The main ideain the example above is that each GPU does the same work, but it works on a different piece of data (i.e the same array with different offset), so I'm guessing that the OpenCL runtime will believe that the data is already there, so it won't try to automatically copy the data, correct ?
    I understand that you want each device to do part of the work. I don't quite get the rest of the statement. There is not a single variable "res". Instead, you have already divided it into res[0] and res[1]. The OpenCL runtime is fully aware at all times of where each buffer is located, whether it's on the first or the second GPU and it will use that knowledge to minimize data transfers.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  5. #5
    Junior Member
    Join Date
    Nov 2011
    Posts
    8

    Re: Transfers between host and device memory

    In a simpler way, the kernel takes 2 inputs (coefx and coefy), perform some computations and updates the output (res).

    I'm partitioning the kernel execution into my 2 available GPUs in such a way that each GPU performs part of the processing and outputs half of the array "res". Once the kernel execution in both GPUs is done, the array res[1] in the GPU1 is copied into the array res[0] in the GPU0, obviously no data is overwritten, the data are copied into the unused part of the array.

    Quote Originally Posted by david.garcia
    Can you elaborate on this part?
    Code :
    // - coefx[0], coefy[0] and res[0] "were created in" queue[0] using clEnqueueCopyBuffer(queue[0],...)
    When the variables coefx, coefy and res are populated before kernel execution, they are tied to a specific queue by doing: (notice that it will be the same for coefx[.] and coefy[.])
    Code :
    res[0] = clCreateBuffer(context, mode, size, NULL, &err);
    // h_data is the host buffer with the information to be copied into res[0]
    // This line "ties" the variable res[0] to queue[0] and hence to GPU0
    err = clEnqueueCopyBuffer(queue[0], h_Data, res[0], 0, 0, size, 0, NULL, NULL);

    First, coefx, coefy and res are buffer objects, and buffer objects belong to a context, not to a queue. Why do you call clEnqueueCopyBuffer() at all here? It seems like you simply load data into coefx and coefy and then let the GPU do some computations.
    In the above code I specified which queue (and hence GPU) to perform the copy, therefore the array will belong to that GPU

    I don't see why clEnqueueCopyBuffer() would be necessary.
    Because I want to specify where to "store" the buffer by specifying the queue (GPU)

    Code :
    clFinish(queue[0]);
    clFinish(queue[1]);

    This may not be a very good idea. What you want is both devices running simultaneously. If you call clFinish() on each device separately, it's possible that the second GPU will not be doing anything while clFinish(queue[0]) is waiting for the first GPU to finish.

    Instead, you may want to call clWaitForEvents() on two events, one from each queue.
    Yes, you are right; in fact I use in my actual code clWaitForEvents()

    Separately, I don't quite get this either:

    Code :
    // res[0] and res[1] were created in queue[0] and queue[1] respectively
    err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &eventTmp, &event);
    Hopefully the above explanations/code clarifies this as well.

    Is it truly necessary for your algorithm to copy the results into a single buffer? If the transfer is not strictly required, you may want to keep the two buffers separate.
    Yes, it is necessary to copy the results back into a single buffer.

    [quote:2syi6k24]In the above code the wallTime is the time that the copy takes, which is bigger than the time using a transfer GPU0-CPU and then CPU-GPU1
    Can you show us the two alternative versions of the code for comparison?
    [/quote:2syi6k24]
    This piece of code copies from GPU1 to CPU and then from CPU to GPU0
    Code :
    // Defines size and offset
    size = N/2;
    offset = size;
    // Creates pinned host buffer
    cl_mem PinnedBuf = NULL;
    PinnedBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size, NULL, NULL);
    float *temp  = NULL; 
    temp = (float *) clEnqueueMapBuffer(queue[1], PinnedBuf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, size, 0, NULL, NULL, NULL);
    // GPU1 to CPU 
    err = clEnqueueReadBuffer(queue[1], res[1], CL_FALSE, 0, size, temp, 0, NULL, &event);
    // CPU to GPU0
    err = clEnqueueWriteBuffer(queue[0], res[0], CL_FALSE, offset, size, temp, 1, &event, NULL);
    This piece of code copies directly from GPU1 to GPU0
    Code :
    // GPU1 to GPU0
    // Assumes that eventTmp is the event tied to the kernel execution on queue[1]
    err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &eventTmp, NULL);

    I understand that you want each device to do part of the work. I don't quite get the rest of the statement. There is not a single variable "res". Instead, you have already divided it into res[0] and res[1].
    Correct

    The OpenCL runtime is fully aware at all times of where each buffer is located, whether it's on the first or the second GPU and it will use that knowledge to minimize data transfers.
    Say, if I defined res[0] of size N/2 to belong to the queue[0] (having the array indexes from 0 to N/2 - 1) and res[1] of size N/2 to belong to queue[1] (having the array indexes from N/2 to N - 1) and at some point in the kernel[0] running on queue[0] there is a reference to the location index "N-1", does the OpenCL runtime 1) "bring" the correct data, or 2) have some undefined behavior or, 3) seg fault ?

  6. #6
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Transfers between host and device memory

    // This line "ties" the variable res[0] to queue[0] and hence to GPU0
    err = clEnqueueCopyBuffer(queue[0], h_Data, res[0], 0, 0, size, 0, NULL, NULL);
    That line is not necessary to "tie" the variable to a device. In fact it's causing an additional data transfer that is not needed in the first place.

    In the above code I specified which queue (and hence GPU) to perform the copy, therefore the array will belong to that GPU
    That's not necessary.

    Because I want to specify where to "store" the buffer by specifying the queue (GPU)
    Not necessary.

    Yes, it is necessary to copy the results back into a single buffer.
    OK, do that at the end. Don't call clEnqueueCopyBuffer() at the beginning.

    temp = (float *) clEnqueueMapBuffer(queue[1], PinnedBuf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, size, 0, NULL, NULL, NULL);
    // GPU1 to CPU
    err = clEnqueueReadBuffer(queue[1], res[1], CL_FALSE, 0, size, temp, 0, NULL, &event);
    // CPU to GPU0
    err = clEnqueueWriteBuffer(queue[0], res[0], CL_FALSE, offset, size, temp, 1, &event, NULL);
    That code may or may not work in all implementations.

    First, OpenCL 1.1. doesn't clarify whether a pointer returned by clEnqueueMapBuffer() may be used in other APIs such as clEnqueueReadBuffer().

    Second, you didn't call clEnqueueUnmapMemObject(). Without a call to clEnqueueUnmapMemObject() there is no guarantee that the data that is read from PinnedBuf will match the data that was written into it (you can search the specification for the term "synchronization point").

    In other words, the code above is non-portable.

    Say, if I defined res[0] of size N/2 to belong to the queue[0] (having the array indexes from 0 to N/2 - 1) and res[1] of size N/2 to belong to queue[1] (having the array indexes from N/2 to N - 1) and at some point in the kernel[0] running on queue[0] there is a reference to the location index "N-1", does the OpenCL runtime 1) "bring" the correct data, or 2) have some undefined behavior or, 3) seg fault ?
    res[0] is one buffer object. res[1] is an entirely separate buffer object. They both belong to the same context. They do not belong to any particular device. If you access an index out of bounds in res[0] or res[1] in any device results are undefined.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

Similar Threads

  1. copying a variable from host memory to device memory
    By shahsaurabh1990 in forum OpenCL
    Replies: 4
    Last Post: 03-26-2013, 01:10 AM
  2. Device-host memory communication
    By jbasic in forum OpenCL
    Replies: 2
    Last Post: 10-07-2009, 10:48 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •