Hello everybody,

I have searched the web for similar questions but couldn't find the answer I need...

I am writing a dynamic scheduler (following AMD's OpenCL Guide) to handle multiple GPUs. However, I am running into trouble with the way OpenCL handles memory.

Basically, I have 5 buffers; I'll just call them A, B, C, D and E.
I am executing two kernels on two devices:

Device 1 : A = f(B,C) [ does not modify B or C ]
Device 2 : D = f(B,E) [ does not modify B or E ]
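
To make the dependency pattern explicit, here is a minimal sketch in plain C (not OpenCL; the Task struct, the BUF_* bits and tasks_conflict are hypothetical names made up for illustration). It shows the hazard rule I would expect to apply: two tasks only need to be ordered if one of them writes a buffer that the other reads or writes. It is not a claim about what any OpenCL runtime actually does internally.

```c
#include <stdbool.h>

/* One bit per buffer; the names mirror the buffers A..E above. */
enum {
    BUF_A = 1 << 0,
    BUF_B = 1 << 1,
    BUF_C = 1 << 2,
    BUF_D = 1 << 3,
    BUF_E = 1 << 4
};

/* Hypothetical task descriptor: which buffers a kernel reads and writes. */
typedef struct {
    unsigned reads;
    unsigned writes;
} Task;

/* Two tasks must be ordered only if one writes a buffer the other touches.
   Two tasks that merely read the same buffer do not conflict. */
static bool tasks_conflict(Task x, Task y)
{
    return (x.writes & (y.reads | y.writes)) != 0
        || (y.writes & (x.reads | x.writes)) != 0;
}

/* Device 1: A = f(B, C) */
static const Task task_dev1 = { BUF_B | BUF_C, BUF_A };

/* Device 2: D = f(B, E) */
static const Task task_dev2 = { BUF_B | BUF_E, BUF_D };
```

With these definitions, tasks_conflict(task_dev1, task_dev2) is false even though both tasks read B, which is why I expected the two kernels to be able to run at the same time.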

I create one host thread per queue, and there is only one queue per device.
The problem is that, if Device 1 executes first, Device 2 does not start its task until B is available (i.e. until Device 1 is done). So, in the end, everything ends up being serialized.
I have tried using READ_ONLY and WRITE_ONLY buffers to tell the OpenCL implementation that B is not modified, but I ran into the same problem.
Is there a way, compatible with both AMD and NVIDIA, to enqueue these two tasks concurrently without having to duplicate B?

Thank you very much!

Edit: my tests were done on an NVIDIA platform.