Let's say I have a simple application with arrays A, B, C, D and RES of the same type, where every element of A is 1, every element of B is 2, every element of C is 3, and D and RES are empty.

I will run a kernel which does Z = X + Y. First I run it on the first device to calculate D = A + B, then on the second device I run RES = D + C.

I expect all elements in RES to be 6.
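To make the expected value concrete, here is a host-only reference sketch of the same two steps. It assumes the element type is `int` (the post only says the arrays share a type), and the names `add_arrays` and `expected_res` are hypothetical helpers, not part of any OpenCL API.

```cpp
#include <cstddef>
#include <vector>

// Element-wise Z = X + Y, the same operation the kernel performs.
// Assumes x and y have the same length.
std::vector<int> add_arrays(const std::vector<int>& x, const std::vector<int>& y) {
    std::vector<int> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        z[i] = x[i] + y[i];
    }
    return z;
}

// Reproduces the two steps on the host: D = A + B, then RES = D + C.
std::vector<int> expected_res(std::size_t n) {
    std::vector<int> a(n, 1), b(n, 2), c(n, 3);
    std::vector<int> d = add_arrays(a, b);   // step that runs on the first device
    return add_arrays(d, c);                 // step that runs on the second device
}
```

Comparing the buffer read back from the device against `expected_res(n)` is a simple way to check the result.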

OK, so I start by creating a context with 2 devices, and then the memory objects. For A, B and C I use CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR; for D and RES I use CL_MEM_READ_WRITE.

I run the first kernel and use its event to wait until it finishes before I start the second kernel. Then I copy RES back to the host using enqueueReadBuffer.
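A minimal sketch of the setup described above, written against the OpenCL C API from C++. It is an illustration under assumptions, not the exact code: the element type `int`, the length `n`, the kernel name `add`, and all variable names are placeholders; it assumes a single platform that exposes at least two devices; and error checking is omitted for brevity.

```cpp
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Kernel source: Z = X + Y (element type int is an assumption).
static const char* kSource =
    "__kernel void add(__global const int* x, __global const int* y,\n"
    "                  __global int* z) {\n"
    "    size_t i = get_global_id(0);\n"
    "    z[i] = x[i] + y[i];\n"
    "}\n";

int main() {
    const size_t n = 1024;  // hypothetical array length
    const size_t bytes = n * sizeof(int);
    std::vector<int> a(n, 1), b(n, 2), c(n, 3), res(n, 0);

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id dev[2];
    cl_uint found = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 2, dev, &found);
    if (found < 2) { std::puts("this sketch needs two devices"); return 1; }

    // One context shared by both devices, one in-order queue per device.
    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 2, dev, nullptr, nullptr, &err);
    cl_command_queue q0 = clCreateCommandQueue(ctx, dev[0], 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev[1], 0, &err);

    const cl_mem_flags in = CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR;
    cl_mem A = clCreateBuffer(ctx, in, bytes, a.data(), &err);
    cl_mem B = clCreateBuffer(ctx, in, bytes, b.data(), &err);
    cl_mem C = clCreateBuffer(ctx, in, bytes, c.data(), &err);
    cl_mem D = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, &err);
    cl_mem RES = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, nullptr, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, &err);
    clBuildProgram(prog, 2, dev, nullptr, nullptr, nullptr);
    cl_kernel k0 = clCreateKernel(prog, "add", &err);  // D = A + B
    cl_kernel k1 = clCreateKernel(prog, "add", &err);  // RES = D + C
    clSetKernelArg(k0, 0, sizeof(cl_mem), &A);
    clSetKernelArg(k0, 1, sizeof(cl_mem), &B);
    clSetKernelArg(k0, 2, sizeof(cl_mem), &D);
    clSetKernelArg(k1, 0, sizeof(cl_mem), &D);
    clSetKernelArg(k1, 1, sizeof(cl_mem), &C);
    clSetKernelArg(k1, 2, sizeof(cl_mem), &RES);

    // First kernel on device 0; its event marks the point where D is written.
    cl_event d_ready;
    clEnqueueNDRangeKernel(q0, k0, 1, nullptr, &n, nullptr, 0, nullptr, &d_ready);
    clFlush(q0);  // submit q0 so a command in another queue can wait on d_ready

    // Second kernel on device 1 waits on the event of the command that wrote D.
    clEnqueueNDRangeKernel(q1, k1, 1, nullptr, &n, nullptr, 1, &d_ready, nullptr);

    // Blocking read on the same in-order queue that produced RES.
    clEnqueueReadBuffer(q1, RES, CL_TRUE, 0, bytes, res.data(), 0, nullptr, nullptr);
    std::printf("res[0] = %d\n", res[0]);

    clReleaseEvent(d_ready);
    clReleaseKernel(k0); clReleaseKernel(k1); clReleaseProgram(prog);
    clReleaseMemObject(A); clReleaseMemObject(B); clReleaseMemObject(C);
    clReleaseMemObject(D); clReleaseMemObject(RES);
    clReleaseCommandQueue(q0); clReleaseCommandQueue(q1);
    clReleaseContext(ctx);
    return 0;
}
```

The sketch passes the first kernel's event in the wait list of the second enqueue instead of blocking the host, which is one way to express the dependency between the two queues.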

Anyway, I see 6 in all elements, which is what I wanted, so no problem there.

So the questions are:

1- Did A and B get copied to the second device even though they weren't used there?

2- The specifications say:

In the command-queue that wants to synchronize to the latest state of a memory object,
commands queued by the application must use the appropriate event objects that represent
commands that modify the state of the shared memory object as event objects to wait on.
I waited for the 1st kernel to finish (it changed the state of the memory object that the 2nd kernel uses) before starting the 2nd kernel. Am I guaranteed to have the updated D object in the second device's memory when running the 2nd kernel? (In other words, does OpenCL automatically sync memory objects between devices?)

3- Suppose that, after calculating D = A + B (and waiting for the events to finish), the host thread changes the elements of C to 4.
a) Do I have to use an enqueueWriteBuffer before queuing RES = D + C?
b) createCommandQueue takes a device_id as an argument. If I write the buffer using a queue attached to a device which is not doing the calculation, is it guaranteed that the other device will be able to use it?

4- Related to 3b: when I read RES using enqueueReadBuffer, does it matter which command queue is used?

Thanks!