I am trying to get multiple devices running concurrently on Nvidia hardware. I've posted questions about this in various places, but I haven't really gotten the answers I'm looking for.
I managed to get past the blocking behavior of clEnqueueNDRangeKernel on the Nvidia OpenCL implementation (seriously, why would they do that?) by enqueueing from multiple host threads, but now I'm running into a problem with shared buffers.

Basically, I have two buffers: one for input and one for output, created with CL_MEM_READ_ONLY and CL_MEM_WRITE_ONLY, respectively. The output buffer is split into multiple sub-buffers whose sizes depend on the size of the data and on how many devices there are. The input buffer is always the same size.
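A minimal sketch of one way such a split could be computed, in plain C with no OpenCL headers. The names `region_t`, `round_up`, and `split_output` are hypothetical and not taken from the actual code; `region_t` only mirrors the `origin` and `size` fields of `cl_buffer_region`. The alignment parameter is an assumption about how the split is done: clCreateSubBuffer requires each sub-buffer origin to be aligned to the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN, which clGetDeviceInfo reports in bits, so it would have to be divided by 8 before being passed in here.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for cl_buffer_region, which has the same two fields. */
typedef struct {
    size_t origin; /* byte offset into the parent buffer */
    size_t size;   /* length of the region in bytes */
} region_t;

/* Round value up to the next multiple of align (align must be > 0). */
static size_t round_up(size_t value, size_t align)
{
    assert(align > 0);
    return (value + align - 1) / align * align;
}

/*
 * Split total_bytes into at most num_devices contiguous regions whose
 * origins are multiples of align_bytes. Writes the regions to out and
 * returns how many were written; this can be fewer than num_devices
 * when the data is too small to give every device a share.
 */
static size_t split_output(size_t total_bytes, size_t num_devices,
                           size_t align_bytes, region_t *out)
{
    if (num_devices == 0)
        return 0;

    /* Even share per device, rounded up so every origin stays aligned. */
    size_t per_device = (total_bytes + num_devices - 1) / num_devices;
    size_t chunk = round_up(per_device, align_bytes);

    size_t count = 0;
    size_t offset = 0;
    while (count < num_devices && offset < total_bytes) {
        size_t remaining = total_bytes - offset;
        out[count].origin = offset;
        out[count].size = remaining < chunk ? remaining : chunk;
        offset += out[count].size;
        count++;
    }
    return count;
}
```

Each resulting region could then be passed to clCreateSubBuffer with CL_BUFFER_CREATE_TYPE_REGION. Because the chunk size is rounded up to the alignment, the last region may be smaller than the others.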

Now, I thought the input buffer would simply be copied to each device; apparently I was wrong. Instead, the implementation puts it on GPU 0, and when GPU 0 is done with it, moves it to GPU 1, then back to GPU 0, then to GPU 2, then back to GPU 0 again, and so on until the work is done. Obviously this is not optimal: instead of running in parallel, the work is serialized again, now with the overhead of multiple copies on top. I cannot split up the input buffer because each thread needs the data in that buffer, and I would rather not create a separate buffer for each device and copy the data into each one - though if that's the only way, I will do that.

I'm working with 3 Nvidia M2090s. Is this the expected behavior? Is there a workaround? Am I just doing it wrong?