I just got 2 new GPUs yesterday; they are both NVIDIA C2070s. I wrote a simple OpenCL program to compare the runtime of using 1 GPU versus 2 GPUs. Surprisingly, 2 GPUs don't give me any speedup. Basically, I have 2 kernels, each with its own independent inputs and outputs. I ran different combinations of contexts and command queues, with the command queues always using in-order execution. These are the results:

1 command queue on 1 device
total time: 558,866 microseconds

2 command queues on 1 context on 1 device
(run kernel A on command queue A; run kernel B on command queue B)
total time: 717,828 microseconds

2 command queues on 1 context on 2 devices
total time: 826,846 microseconds

2 command queues on 2 contexts on 2 devices
(run kernel A on command queue A, which is on context A that includes only device A; run kernel B on command queue B, which is on context B that includes only device B; see the sketch below)
total time: 519,748 microseconds
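
To make the fourth setup concrete, it is built roughly like the sketch below. This is a simplified illustration, not my exact code: error checking, buffer creation, and kernel arguments are omitted, and `src`, `kernelA`/`kernelB`, and the work size are placeholder names.

    #include <CL/cl.h>

    /* Sketch: 2 contexts, 2 devices, 1 in-order queue per context. */
    void run_two_gpus(const char *src, size_t global_size)
    {
        cl_platform_id platform;
        cl_device_id dev[2];
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, dev, NULL);

        cl_context ctx[2];
        cl_command_queue q[2];
        cl_kernel k[2];
        const char *names[2] = { "kernelA", "kernelB" };
        for (int i = 0; i < 2; ++i) {
            /* One context per device, one command queue per context. */
            ctx[i] = clCreateContext(NULL, 1, &dev[i], NULL, NULL, NULL);
            q[i]   = clCreateCommandQueue(ctx[i], dev[i], 0, NULL);
            cl_program prog = clCreateProgramWithSource(ctx[i], 1, &src, NULL, NULL);
            clBuildProgram(prog, 1, &dev[i], NULL, NULL, NULL);
            k[i]   = clCreateKernel(prog, names[i], NULL);
            /* ... create buffers in ctx[i] and set kernel args here ... */
        }

        /* Enqueue both kernels before waiting on either queue, so the two
           devices can work at the same time. */
        clEnqueueNDRangeKernel(q[0], k[0], 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(q[1], k[1], 1, NULL, &global_size, NULL, 0, NULL, NULL);
        clFinish(q[0]);
        clFinish(q[1]);
    }

The other setups differ only in how many contexts and queues are created (e.g. both queues on the same context/device for the first two cases).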

Running a single kernel by itself takes 198,018 microseconds (this is the time from when the kernel starts running on the GPU until it finishes; it does not include anything on the CPU side).
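
By "GPU-only time" I mean the device-side execution interval of the kernel, i.e. something along the lines of this event-profiling helper (a sketch, not my exact measurement code; the function name is made up):

    #include <CL/cl.h>

    /* Device-side kernel time: CL_PROFILING_COMMAND_START to
       CL_PROFILING_COMMAND_END. The queue must be created with
       CL_QUEUE_PROFILING_ENABLE. */
    double kernel_time_us(cl_context ctx, cl_device_id dev,
                          cl_kernel kernel, size_t global_size)
    {
        cl_command_queue q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, NULL);
        cl_event ev;
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global_size, NULL, 0, NULL, &ev);
        clWaitForEvents(1, &ev);

        cl_ulong t_start, t_end;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);

        clReleaseEvent(ev);
        clReleaseCommandQueue(q);
        return (t_end - t_start) / 1000.0;  /* profiling times are in nanoseconds */
    }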

Can anyone explain what's going on? I expected to get some speedup when using 2 GPUs, but apparently I don't.