I posted this on the JavaCL forum a while ago but haven't received a reply yet, so I am asking here as well.

My problem is as follows. I am working on a very simple multi-GPU benchmark application. One of the tests measures the combined host-to-device memory bandwidth using different transfer methods. If I use the normal CLBuffer write method with blocking writes, which corresponds to clEnqueueWriteBuffer, I can reach a maximum of 5 GB/s. If I use the map and unmap methods of CLBuffer instead, I cannot get past 1.9 GB/s. Where am I going wrong?

My test method: I launch a separate thread for each GPU. Each thread repeatedly maps a buffer, copies data into it and then unmaps the buffer; this map/copy/unmap cycle is repeated 1000 times. System.currentTimeMillis() is used to record the start and end times in milliseconds. All buffers are allocated outside of the timed loop.
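To make the measurement structure concrete, here is a minimal sketch of the kind of harness described above. It is not my actual benchmark code and it does not call JavaCL: the `Transfer` interface, the class name and the method names are all hypothetical placeholders, and the stand-in transfer in `main` is just a host-side copy. It also uses System.nanoTime() rather than System.currentTimeMillis(), and the timed region includes thread start-up.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class BandwidthSketch {

    /** Hypothetical stand-in for one host-to-device transfer (e.g. map, copy, unmap). */
    interface Transfer {
        void run(ByteBuffer source);
    }

    /** Converts bytes moved in a given time to GB/s, taking 1 GB = 1e9 bytes. */
    static double gigabytesPerSecond(long bytes, long nanos) {
        // bytes per nanosecond is numerically equal to GB per second
        return (double) bytes / (double) nanos;
    }

    /** Runs one worker thread per device and returns the combined bandwidth in GB/s. */
    static double measureCombined(int deviceCount, int iterations,
                                  int bufferBytes, Transfer transfer) {
        List<Thread> workers = new ArrayList<>();
        for (int d = 0; d < deviceCount; d++) {
            // Host buffer is allocated once, outside the timed loop.
            ByteBuffer source = ByteBuffer.allocateDirect(bufferBytes);
            workers.add(new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    transfer.run(source);
                }
            }));
        }

        long start = System.nanoTime();
        for (Thread t : workers) {
            t.start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        long elapsed = System.nanoTime() - start;

        long totalBytes = (long) deviceCount * iterations * bufferBytes;
        return gigabytesPerSecond(totalBytes, elapsed);
    }

    public static void main(String[] args) {
        // Stand-in transfer: copies the source into a fresh host array.
        // A real test would replace this with the map/copy/unmap calls.
        Transfer hostCopy = src -> {
            ByteBuffer view = src.duplicate();
            view.clear();
            byte[] sink = new byte[view.remaining()];
            view.get(sink);
        };
        double gbps = measureCombined(2, 1000, 1 << 20, hostCopy);
        System.out.printf("combined: %.2f GB/s%n", gbps);
    }
}
```

The combined figure is total bytes moved by all threads divided by the wall-clock time from the first thread starting to the last thread finishing, which is how I arrive at the numbers quoted above.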

My computer: Nvidia GeForce GTX 560 Ti and GTX 260, Intel Core i7-2600K, Asus P8P67 Pro motherboard.

Does anyone have some suggestions? Thanks very much for the help.

BTW, Nvidia's oclBandwidthTest reports 14 GB/s to both cards combined, and 7 GB/s to either card individually.