View Full Version : CreateBuffer for Multiple Devices.



Seshadri
03-28-2011, 06:39 AM
Hi,

I want to create an OpenCL context with which I want to use multiple OpenCL devices.

Since the CreateBuffer API accepts only an OpenCL context as its parameter, with no device-related parameters such as command-queue or device IDs, I am wondering how CreateBuffer works in a multiple-device scenario. In other words, if a context is associated with multiple devices, for which device does CreateBuffer allocate memory?

Also, Appendix A of the OpenCL 1.0 spec says that memory objects created using a context are shared across multiple command queues, and hence across multiple devices.

So it is not clear to me how CreateBuffer works in a context with which multiple devices are associated. In other words, it is not clear to me for which device (among those multiple devices) the cl_mem object gets allocated.

Can anyone shed some light on this issue?

Thanks
Seshadri

david.garcia
03-28-2011, 02:59 PM
I am wondering how CreateBuffer works in a multiple-device scenario. In other words, if a context is associated with multiple devices, for which device does CreateBuffer allocate memory?

The OpenCL driver will take care of that. From the point of view of the application, the buffer is shared among all devices in that context.
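A minimal host-side sketch of what this looks like in practice, assuming a platform that exposes at least two GPU devices (error checking omitted for brevity): one cl_mem is created against the context, then written through one device's queue and read through the other's, with an event ordering the two commands. The driver moves the data between devices if it has to.

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[2];
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, devices, NULL);

    /* One context spanning both devices. */
    cl_context ctx = clCreateContext(NULL, 2, devices, NULL, NULL, &err);

    /* The buffer belongs to the context, not to any single device. */
    float host[1024] = {0};
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                sizeof(host), NULL, &err);

    /* One command queue per device. */
    cl_command_queue q0 = clCreateCommandQueue(ctx, devices[0], 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, devices[1], 0, &err);

    /* Write through device 0's queue ... */
    cl_event wrote;
    clEnqueueWriteBuffer(q0, buf, CL_FALSE, 0, sizeof(host), host,
                         0, NULL, &wrote);

    /* ... and read the same cl_mem through device 1's queue.
     * The event wait list orders the two commands across queues. */
    clEnqueueReadBuffer(q1, buf, CL_TRUE, 0, sizeof(host), host,
                        1, &wrote, NULL);

    clReleaseEvent(wrote);
    clReleaseMemObject(buf);
    clReleaseCommandQueue(q0);
    clReleaseCommandQueue(q1);
    clReleaseContext(ctx);
    return 0;
}
```

The application never says which device's memory backs `buf`; it only expresses ordering between the commands that touch it.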

Seshadri
03-28-2011, 10:08 PM
Thanks, David, for the prompt response.

Just to add some observations in the meantime: I found the following after searching the internet once I had posted.

It is not defined clearly in the OpenCL specification, hence it is implementation-specific, as confirmed on the NVIDIA forum:

http://forums.nvidia.com/index.php?showtopic=192398&st=0&gopid=1192390&#entry1192390

This is how it works on NVIDIA devices:

“On NVIDIA GPUs, the actual memory to hold the buffer in device memory is not allocated until the device is specifically addressed to use the data. For read-only buffers, this is when a clEnqueueWrite* command is issued to that device's command queue. For write-only buffers, it is even trickier: the actual allocation takes place on the first execution of a kernel for which the buffer was set as an argument, or at the first clEnqueueRead* command for that buffer on a command queue associated with the device.”

So we can assume that the allocation simply does not take place at CreateBuffer time on NVIDIA GPUs.

“OpenCL does not assume that data can be transferred directly between devices within the same context, so such behavior is implementation-specific. Technically, you need to explicitly transfer the data from one device to the other by issuing a clEnqueueRead* command on the command queue attached to the first device, and then a synchronized clEnqueueWrite* command on the command queue of the second device. This of course transfers the data through the host. The same cl_mem object is used in both commands.”

david.garcia
03-29-2011, 05:31 AM
“OpenCL does not assume that data can be transferred directly between devices within the same context, so such behavior is implementation-specific. Technically, you need to explicitly transfer the data from one device to the other by issuing a clEnqueueRead* command on the command queue attached to the first device, and then a synchronized clEnqueueWrite* command on the command queue of the second device. This of course transfers the data through the host. The same cl_mem object is used in both commands.”

The person who wrote the quote above is unfortunately mistaken. It is the driver's responsibility to transparently transfer data between devices within a context (if necessary). Memory objects are available to all devices in the same context as if they were shared.

If you need further reassurance I can try to summon the OpenCL spec editor, but I'd rather not bother him.

This other quote, however, is true as far as I know:


“On NVIDIA GPUs, the actual memory to hold the buffer in device memory is not allocated until the device is specifically addressed to use the data. For read-only buffers, this is when a clEnqueueWrite* command is issued to that device's command queue. For write-only buffers, it is even trickier: the actual allocation takes place on the first execution of a kernel for which the buffer was set as an argument, or at the first clEnqueueRead* command for that buffer on a command queue associated with the device.”

This behavior is allowed in the specification.

atlemann
05-06-2011, 02:44 AM
The person who wrote the quote above is unfortunately mistaken. It is the driver's responsibility to transparently transfer data between devices within a context (if necessary). Memory objects are available to all devices in the same context as if they were shared.


Here is my scenario:
- Dataset too big for 1 device
- 4 devices
- Dataset split into 4 buffers in a 3D matrix decomposition
- Border data must be exchanged between the devices in each iteration.

How do I do that? What commands should I use?

Should I use enqueueWrite/Read/Copy between the buffers?

Should I have some small extra border buffers and extra kernels to copy data from the border buffers to the main buffer? Would the extra border buffers automagically be copied from one device to the other if I run a write kernel on one device and a read kernel on another?

I have posted this question in multiple forums, but I am not getting any clear answers.

- Atle

david.garcia
05-06-2011, 02:30 PM
Should I use enqueueWrite/Read/Copy between the buffers?

Should I have some small extra border buffers and extra kernels to copy data from the border buffers to the main buffer? Would the extra border buffers automagically be copied from one device to the other if I run a write kernel on one device and a read kernel on another?

Both are valid solutions. I would probably use clEnqueueCopyBuffer() to propagate border information across devices since that probably makes the kernel source code more readable. I don't think there would be a big performance difference between the two solutions you suggest.

This is not really anything special about OpenCL if you think about it. It's the same problem you would need to solve if you were programming in C or MPI.
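A hedged sketch of the clEnqueueCopyBuffer() approach for the border exchange, assuming a 1-D slab split along z where each device's buffer holds its interior planes plus one halo plane at each end (the layout, the `exchange_halo` name, and the parameters are illustrative; error checking omitted):

```c
#include <stddef.h>
#include <CL/cl.h>

/* Byte size of one z-plane of an nx x ny sub-domain cross-section. */
static size_t plane_bytes(size_t nx, size_t ny)
{
    return nx * ny * sizeof(float);
}

/* Buffer layout (per device): plane 0 = lower halo,
 * planes 1..local_nz = interior, plane local_nz+1 = upper halo.
 * Copies the top interior plane of `lower` into the bottom halo of
 * `upper`, and the bottom interior plane of `upper` into the top
 * halo of `lower`. Copy commands may be enqueued on any queue of
 * the context; the driver migrates the data between devices as
 * needed. */
static void exchange_halo(cl_command_queue q,
                          cl_mem lower, cl_mem upper,
                          size_t nx, size_t ny, size_t local_nz,
                          cl_uint nwait, const cl_event *wait,
                          cl_event done[2])
{
    size_t pb = plane_bytes(nx, ny);

    /* lower's last interior plane -> upper's bottom halo (plane 0) */
    clEnqueueCopyBuffer(q, lower, upper,
                        local_nz * pb,       /* src: plane local_nz */
                        0,                   /* dst: plane 0 (halo) */
                        pb, nwait, wait, &done[0]);

    /* upper's first interior plane -> lower's top halo */
    clEnqueueCopyBuffer(q, upper, lower,
                        1 * pb,              /* src: plane 1        */
                        (local_nz + 1) * pb, /* dst: top halo plane */
                        pb, nwait, wait, &done[1]);
}
```

The events in `done` would then gate the next iteration's kernels on each neighbouring device, mirroring the send/recv-then-compute pattern from MPI halo exchange.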