I have implemented an algorithm using OpenCL and have several versions: CPU-only, multi-GPU, and multi-GPU + CPU. I am working on Mac OS X 10.6.1 with two NVIDIA GT 120s and two quad-core Intel Xeon CPUs. I am seeing some performance behavior I do not understand and am hoping someone can clarify it. My main question is about how the OpenCL implementation chooses to allocate work to the CPU.

I profiled the CPU path and the multi-GPU path independently on this machine and established a rough ratio of their performance. I use that ratio to decide up front how much work to send down each path.
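To make the up-front split concrete, here is a minimal sketch of how a measured ratio could be turned into per-path work counts. The helper name `split_work` and the idea of expressing the ratio as per-path throughput values are assumptions for illustration, not the code I actually run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split `total` work items across `n` paths in
 * proportion to each path's measured throughput (any consistent unit,
 * e.g. items per second from profiling). Each share is rounded down,
 * and whatever is left over goes to the fastest path. */
static void split_work(size_t total, const double *throughput, size_t n,
                       size_t *share_out)
{
    assert(n > 0);

    double sum = 0.0;
    size_t fastest = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(throughput[i] > 0.0);
        sum += throughput[i];
        if (throughput[i] > throughput[fastest])
            fastest = i;
    }

    size_t assigned = 0;
    for (size_t i = 0; i < n; ++i) {
        share_out[i] = (size_t)((double)total * (throughput[i] / sum));
        assigned += share_out[i];
    }

    /* Hand the rounding remainder to the fastest path. */
    share_out[fastest] += total - assigned;
}
```

With placeholder throughputs of 1, 1 and 2 for GPU 0, GPU 1 and the CPU, 1000 items would be split as 250, 250 and 500. Those numbers are made up; the real ratio comes from profiling.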

The basic usage is:

Generate a workload for each GPU and for the CPU (I have one GPU context with two devices and one CPU context with one device).
Create a thread for each workload (three total), and in each thread:
* Copy the data to the device using OpenCL
* Invoke the kernel repeatedly in a loop, with a host read-back on every iteration
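The threading structure described above can be sketched roughly as follows. This is a skeleton only: the OpenCL calls are replaced by a stand-in function so the example stays self-contained, and `workload_t`, `run_iteration`, `worker` and `run_all` are hypothetical names. The comments mark where the real calls (`clEnqueueWriteBuffer`, `clEnqueueNDRangeKernel`, `clEnqueueReadBuffer`) would go.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { MAX_WORKLOADS = 8 };

/* Hypothetical description of one workload. In the real program this
 * would also carry the cl_command_queue, cl_kernel and cl_mem handles. */
typedef struct {
    const char *device_name; /* label only, e.g. "CPU" or "GPU 0" */
    size_t      work_items;  /* share assigned up front */
    int         iterations;  /* number of kernel launches */
    int         completed;   /* iterations finished, written by the worker */
} workload_t;

/* Stand-in for one iteration. With OpenCL this would be roughly
 * clEnqueueNDRangeKernel() followed by a blocking clEnqueueReadBuffer(),
 * so the host thread waits on the device on every pass. */
static void run_iteration(workload_t *w)
{
    w->completed++;
}

static void *worker(void *arg)
{
    workload_t *w = arg;

    /* Step 1: copy the input to the device once (clEnqueueWriteBuffer). */

    /* Step 2: launch the kernel repeatedly, reading back each time. */
    for (int i = 0; i < w->iterations; ++i)
        run_iteration(w);

    return NULL;
}

/* Start one host thread per workload and wait for all of them.
 * Returns 0 on success, -1 if a thread could not be created. */
static int run_all(workload_t *loads, size_t n)
{
    pthread_t threads[MAX_WORKLOADS];
    size_t started = 0;
    int status = 0;

    assert(n <= MAX_WORKLOADS);

    for (; started < n; ++started) {
        if (pthread_create(&threads[started], NULL, worker,
                           &loads[started]) != 0) {
            status = -1;
            break;
        }
    }

    /* Join only the threads that were actually started. */
    for (size_t i = 0; i < started; ++i)
        pthread_join(threads[i], NULL);

    return status;
}
```

In my case `run_all` would be called with three workloads: one for the CPU context and one for each GPU device.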

Now, what puzzles me is this: if I run the application with a single thread and one OpenCL context on the CPU, I get the best performance. Watching in Activity Monitor, I see the application use approximately 18 threads and consume 1200%+ of the CPU.

However, when I run my three-threaded version, where each thread sends work to one device (thread 1 - CPU, thread 2 - GPU 0, thread 3 - GPU 1), I see the application create approximately 24 threads and use only about 300% of the CPU initially. As soon as the GPU threads retire because they have finished their work, the thread driving the CPU device immediately starts consuming 1200%+ again. In other words, having the GPU threads running massively slows down the thread that drives the CPU device, so overall performance is worse than running on the CPU context alone. I have tried setting thread priorities, and that did not seem to have any impact.

Could someone help me understand this behavior?

Thanks in advance for any help you can provide.