I suspect I'm running into memory contention (or at least some memory bottleneck) in the setup below. Do you agree? If so, can I do anything about it?
I send two large arrays to the GPU (in the form of read-only buffers), and each kernel computes an output value by performing many lookups in a sub-area of each input array. I have run the program on an 8-core CPU and on a 240-core GPU, but the CPU is still marginally faster than the GPU. However, if I run an experiment in which I still pass the two large arrays as input but replace the array-lookup code with some purely local computation (no lookups in the arrays), the GPU is much faster than the CPU, as it should be.
So, doesn't this look like a memory-contention problem, given that the numerous array lookups are the only difference (as far as I can see)? If so, can I deal with this contention in some way?
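For context on what a common remedy looks like: since each work-item reads from a sub-area of the input, one standard technique is to have each work-group cooperatively stage its sub-area into fast __local memory once, then do all the repeated lookups against that local copy instead of global memory. Below is a minimal, hypothetical kernel sketch of that pattern; the kernel name, argument names, and the trivial final lookup are all illustrative, not your actual code.

```python
# Hypothetical OpenCL kernel sketch (illustrative names): each work-group
# copies the tile of bs1 it needs into __local memory, synchronizes, and
# then performs its lookups against the local copy.
kernel_src = """
__kernel void lookup(__global const int *bs1,
                     __local int *tile,
                     const int tile_offset,
                     const int tile_size,
                     __global int *out)
{
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    // Cooperative copy: all work-items stride over the tile together,
    // so the global reads are contiguous and coalesced.
    for (int i = lid; i < tile_size; i += lsz)
        tile[i] = bs1[tile_offset + i];
    barrier(CLK_LOCAL_MEM_FENCE);

    int gid = get_global_id(0);
    // Repeated lookups now hit local memory instead of global memory.
    // (Placeholder computation; substitute your real lookup logic.)
    out[gid] = tile[gid % tile_size];
}
"""
```

The local buffer would be passed from PyOpenCL as `pyopencl.LocalMemory(tile_size * 4)`. Whether this helps depends on whether the sub-areas fit in local memory (typically 16-48 KB per work-group) and on whether your current global reads are uncoalesced, which is the usual reason lookup-heavy kernels underperform on GPUs.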
The arrays are transferred like this:
bs1_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(bs1).astype(numpy.int32))
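As a side note on the host side (unrelated to the kernel-time question): `numpy.array(bs1).astype(numpy.int32)` builds the array twice, once at the default dtype and once converted. Using `numpy.asarray` with an explicit dtype produces the same buffer contents in a single pass; a small sketch, assuming `bs1` is a plain Python sequence of integers:

```python
import numpy

bs1 = list(range(10))  # stand-in for the real input data

# Two passes: build a default-dtype array, then convert it.
a = numpy.array(bs1).astype(numpy.int32)

# One pass: build directly at the target dtype.
b = numpy.asarray(bs1, dtype=numpy.int32)

assert a.dtype == b.dtype == numpy.int32
assert (a == b).all()
```

This only affects transfer setup cost, not the per-kernel lookup cost, but it keeps the host-to-device path a little leaner.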