Splitting workload across devices belonging to different platforms
I am a complete newbie to OpenCL programming. I wanted to know whether it is possible to split a workload across devices belonging to different platforms. I am targeting a CPU-GPU system.
Since the GPU is an NVIDIA device, its platform doesn't expose the CPU. Hence, I have two platforms: one from AMD for the CPU and one from NVIDIA for the GPU. I believe I cannot use a single context, as the devices belong to two different platforms. So is there any way to simultaneously execute kernels on both devices, each handling a different portion of the data?
You can't use a single context: a context can only be created over devices that belong to the same platform, and you have two platforms. So you'll have to create two contexts, one on the AMD platform, which contains the CPU only, and one on the NVIDIA platform, which contains the GPU only.
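The two-context setup can be sketched roughly as follows (a minimal sketch, not tested against a real system; which platform index is AMD and which is NVIDIA must be checked via CL_PLATFORM_NAME in real code, and error handling is omitted):

```c
/* Sketch: one context per platform. Assumes exactly two platforms are
   installed; error handling omitted for brevity. */
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[2];
    clGetPlatformIDs(2, platforms, NULL);

    /* In real code, query CL_PLATFORM_NAME to determine which index is
       the AMD platform and which is NVIDIA; the order is unspecified. */
    cl_device_id cpu_dev, gpu_dev;
    clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_CPU, 1, &cpu_dev, NULL);
    clGetDeviceIDs(platforms[1], CL_DEVICE_TYPE_GPU, 1, &gpu_dev, NULL);

    cl_context_properties cpu_props[] =
        { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0], 0 };
    cl_context_properties gpu_props[] =
        { CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[1], 0 };

    cl_context cpu_ctx = clCreateContext(cpu_props, 1, &cpu_dev, NULL, NULL, NULL);
    cl_context gpu_ctx = clCreateContext(gpu_props, 1, &gpu_dev, NULL, NULL, NULL);

    /* Each context needs its own command queue, program build, and kernel
       objects; nothing is shared between the two. */
    clReleaseContext(cpu_ctx);
    clReleaseContext(gpu_ctx);
    return 0;
}
```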
Originally Posted by AnirbanGhose
However, it is possible to have kernels execute simultaneously on both devices, processing different portions of the data. The key is creating the cl_mem objects on each platform with CL_MEM_USE_HOST_PTR, and using the clEnqueueMapBuffer and clEnqueueUnmapMemObject commands (sections 5.2.4 & 5.4.2 in the OpenCL 1.2 specification).
When a memory object is created with CL_MEM_USE_HOST_PTR, the OpenCL runtime uses the pointer the application passed to clCreateBuffer as the host buffer to synchronize with during map/unmap commands. Below is an example of how to split work between the devices (which belong to different contexts/platforms), assuming a single input buffer and a single output buffer (generalize to N buffers where applicable):
1. The application allocates two buffers (one for input and one for output) using a regular allocation call - malloc, new, etc.
2. The application creates two memory objects for the input buffer, one on each context, with CL_MEM_USE_HOST_PTR, passing as the host_ptr a pointer into the input buffer allocated in step 1. For example, the cl_mem created on the GPU context receives a pointer to the beginning of the allocated input buffer, and the cl_mem created on the CPU context receives a pointer offset by 1 MB from the beginning of the input buffer.
3. Repeat step 2 for the output buffer (create two output memory objects, one per context, with CL_MEM_USE_HOST_PTR, passing the offset pointers, etc.).
4. The application writes the required data into the buffer allocated in step 1. (Note that this can happen before or after steps 2-3, but must happen before step 5.)
5. The application calls clEnqueueUnmapMemObject twice, once for each cl_mem object on each context. Note that from this point on, these buffers are considered handed over to the OpenCL runtime, and it may copy their contents to the device-side allocated buffers at any suitable time (so watch out for updating them at this point).
6. The application calls clEnqueueNDRangeKernel on each context's queue to run the kernels on this data.
7. The application calls clEnqueueMapBuffer twice, once for each cl_mem object on each context. Note that you may wish to create some sync mechanism (semaphore or other) on the application side using the events of each command, to ensure that both commands do not execute in parallel. Just as a precaution.
At this point the buffer on the host contains the results of the work done by the two kernels.
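The steps above can be sketched like this, assuming the contexts, queues, and kernels (names like gpu_ctx, cpu_queue, and the 1 MB split point are illustrative, not from the original post) were created earlier; error handling is omitted:

```c
/* Sketch of the map/unmap recipe above, for one input and one output
   buffer split at a 1 MB boundary. The contexts, queues, and kernels
   are assumed to have been created earlier, one set per platform. */
#include <CL/cl.h>
#include <stdlib.h>

#define TOTAL    (2u * 1024 * 1024)  /* whole buffer, in bytes      */
#define GPU_PART (1u * 1024 * 1024)  /* first 1 MB goes to the GPU  */
#define CPU_PART (TOTAL - GPU_PART)

void run_split(cl_context gpu_ctx, cl_context cpu_ctx,
               cl_command_queue gpu_q, cl_command_queue cpu_q,
               cl_kernel gpu_k, cl_kernel cpu_k)
{
    /* Step 1: ordinary host allocations. */
    char *in  = malloc(TOTAL);
    char *out = malloc(TOTAL);

    /* Steps 2-3: per-context cl_mems wrapping disjoint host regions. */
    cl_mem gpu_in  = clCreateBuffer(gpu_ctx, CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR,
                                    GPU_PART, in, NULL);
    cl_mem cpu_in  = clCreateBuffer(cpu_ctx, CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR,
                                    CPU_PART, in + GPU_PART, NULL);
    cl_mem gpu_out = clCreateBuffer(gpu_ctx, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                                    GPU_PART, out, NULL);
    cl_mem cpu_out = clCreateBuffer(cpu_ctx, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                                    CPU_PART, out + GPU_PART, NULL);

    /* Step 4: fill `in` here, before handing the regions to the runtimes. */

    /* Step 5: unmap on each context's queue (strictly, the spec pairs
       each unmap with a prior map of the same pointer). */
    clEnqueueUnmapMemObject(gpu_q, gpu_in, in, 0, NULL, NULL);
    clEnqueueUnmapMemObject(cpu_q, cpu_in, in + GPU_PART, 0, NULL, NULL);

    /* Step 6: one enqueue per device, each over its own share of the data. */
    size_t gpu_gws = GPU_PART, cpu_gws = CPU_PART;
    clSetKernelArg(gpu_k, 0, sizeof(cl_mem), &gpu_in);
    clSetKernelArg(gpu_k, 1, sizeof(cl_mem), &gpu_out);
    clSetKernelArg(cpu_k, 0, sizeof(cl_mem), &cpu_in);
    clSetKernelArg(cpu_k, 1, sizeof(cl_mem), &cpu_out);
    clEnqueueNDRangeKernel(gpu_q, gpu_k, 1, NULL, &gpu_gws, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(cpu_q, cpu_k, 1, NULL, &cpu_gws, NULL, 0, NULL, NULL);

    /* Step 7: blocking maps bring the results back into `out`. */
    clEnqueueMapBuffer(gpu_q, gpu_out, CL_TRUE, CL_MAP_READ,
                       0, GPU_PART, 0, NULL, NULL, NULL);
    clEnqueueMapBuffer(cpu_q, cpu_out, CL_TRUE, CL_MAP_READ,
                       0, CPU_PART, 0, NULL, NULL, NULL);
    /* `out` now holds both devices' results. */
}
```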
Originally Posted by OferRosenberg
Accordingly, should I then be changing the global work size for each clEnqueueNDRangeKernel call when processing the data on each device? Is this step required, or can the OpenCL runtime infer directly from the mapped cl_mem objects how much data it is supposed to process? This is the one concept I am clearly not grasping. I am trying to build a program that, given any percentage, would split the data across the devices accordingly.