I'm testing various buffer creation strategies on an APU (Acer Iconia Tab). The algorithm is Saxpy (scaled vector addition, y = a*x + y), run many times with different vector sizes. In particular, I'd like to find out whether on an APU the GPU can perform vector addition faster than the CPU. On a traditional architecture (CPU and GPU not on the same chip) this is practically never worthwhile because of the PCI bus latency.
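For reference, the CPU side of such a comparison could look like the sketch below. It is only an illustration: the names saxpy_cpu and time_saxpy_ms, the fill values, and the use of std::chrono::steady_clock are assumptions and are not taken from the linked sources.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical CPU baseline: y[i] = a * x[i] + y[i] for every element.
void saxpy_cpu(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());  // both vectors must have the same length
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical timing helper: runs saxpy_cpu once on vectors of n elements
// and returns the elapsed wall-clock time in milliseconds.
double time_saxpy_ms(std::size_t n, float a = 2.0f) {
    std::vector<float> x(n, 1.0f);
    std::vector<float> y(n, 3.0f);

    auto start = std::chrono::steady_clock::now();
    saxpy_cpu(a, x, y);
    auto stop = std::chrono::steady_clock::now();

    // Read one result through a volatile so the compiler cannot drop the loop.
    volatile float sink = (n > 0) ? y[n - 1] : 0.0f;
    (void)sink;

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_saxpy_ms for increasing values of n and printing n next to the returned value gives lines in a VECT_SIZE EXEC_TIME layout. A single run is noisy, so averaging several runs per size is advisable.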
Since the RAM is shared between the CPU and the GPU, I expected that creating a buffer with CL_MEM_USE_HOST_PTR and using mapping/unmapping would give much better performance. To check this, I tested both a project where data transfers between buffers and host memory are performed "manually" (i.e. enqueueRead/WriteBuffer) and a project based on mapping/unmapping. In the first case, the GPU execution time drops below the CPU execution time for vectors bigger than about 1 million elements. In the second case, the GPU never "wins" against the CPU: its execution time is always higher than the CPU's. Moreover, the GPU execution time with mapping is lower than the GPU execution time with copy only for "small" vector sizes, and becomes higher for fairly large vectors.
Any idea why this happens? Is my assumption wrong?
The C++ sources of the projects:
http://www.gabrielecocco.it/apu/SaxpyAl ... opyPtr.cpp
The following are the execution timings (GPU with copy and GPU with mapping), in the format: VECT_SIZE EXEC_TIME