I guess the 16 computation units refer to the 16 cores on your GPU. On current NVidia GPUs, the cores are grouped in sets of 8 into what NVidia calls a streaming multiprocessor (SM).
That's probably why the OpenCL implementation reports two compute units: 16 cores at 8 cores per SM gives 2 SMs.
Interesting. But what does that mean from a parallelization point of view? Will my kernel only be executed on two cores simultaneously, or will it automatically be distributed to all 16 units?
Each work-group is scheduled onto a single compute unit. The work-items of that work-group, however, are distributed across the cores of the compute unit.
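To make the mapping concrete, here is a rough sketch of the arithmetic, not a model of the real hardware scheduler. The function names and the launch sizes are made up for illustration, and it assumes a 1-D launch where each work-group occupies one compute unit at a time:

```python
def num_work_groups(global_size: int, local_size: int) -> int:
    """Number of work-groups in a 1-D launch (ceiling division)."""
    if global_size <= 0 or local_size <= 0:
        raise ValueError("sizes must be positive")
    return -(-global_size // local_size)


def busy_compute_units(global_size: int, local_size: int, compute_units: int) -> int:
    """Upper bound on compute units that can hold a work-group at the same time."""
    return min(num_work_groups(global_size, local_size), compute_units)


# Hypothetical launch sizes on a device that reports 2 compute units.
print(busy_compute_units(1024, 64, 2))  # 16 work-groups, so both units can be busy
print(busy_compute_units(64, 64, 2))    # a single work-group occupies only one unit
```

So a launch with only one work-group could leave a compute unit idle, while the work-items inside each work-group are still spread over the cores of the unit it runs on.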
Can I somehow verify that all 16 compute units (cores) are used? It worries me that OpenCL only returns "2" when I ask for MAX_COMPUTE_UNITS, and my running times also match suspiciously well with a situation where only two cores are used. I would like to verify that this is not the case.
I can't think of a way to verify how many cores are used, but it really is normal for MAX_COMPUTE_UNITS on NVidia GPUs to return the number of SMs rather than the number of cores. On an NVidia Tesla S1070, which has 240 cores, it returns 30, because that's the number of SMs on that chip.
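As a quick sanity check of those numbers, the core count can be recovered from the reported compute units. This is only a sketch: `total_cores` is a made-up helper, and the factor of 8 cores per SM is the figure mentioned in this thread, which differs between GPU generations. In a real program the compute-unit count would come from `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS`.

```python
# Assumption from this thread: 8 cores per SM on these GPUs.
CORES_PER_SM = 8


def total_cores(compute_units: int, cores_per_sm: int = CORES_PER_SM) -> int:
    """Cores implied by the number of compute units (SMs) a device reports."""
    return compute_units * cores_per_sm


print(total_cores(2))   # 16, the GPU from the question
print(total_cores(30))  # 240, the Tesla S1070 figure quoted above
```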
What exactly do you mean by "my running times also match suspiciously well to a situation where only two cores are used"? There can be several reasons why your program doesn't show the expected speedup. For example, it could be limited by memory bandwidth.