Ok, so I have my code running, but I have to say I'm disappointed with the performance.

It's a particle system and I get the following approximate performance statistics:

Scalar version on the CPU: 1 Million particles per second.
GPGPU version using GLSL: 55 Million particles per second.
OpenCL version on CPU: 5 Million particles per second.
OpenCL version on GPU: 4 Million particles per second.

OpenCL on the CPU seems about right. I'm doing calculations on 3-component float vectors (stored in float4s) and I'm on a Core 2 Duo, so two cores. Three vector lanes times two cores makes a six times speed-up my theoretical maximum, and that ignores the fact that some of the calculation is unavoidably scalar. I'm happy with that result.

The problem is the GPU-based OpenCL. It's about 14x slower than my GPGPU implementation (4 million versus 55 million particles per second), and it's even slower than OpenCL on the CPU. Something is clearly going very wrong. I suspect it's down to memory access, but I don't know for sure.
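
For reference, here is a quick sketch of how the ratios work out from the approximate figures above. The dictionary keys and variable names are made up for illustration, and the "3 lanes times 2 cores" ideal is my own assumption about the best case, not something I measured.

```python
# Approximate throughput in millions of particles per second,
# copied from the figures listed earlier in this post.
throughput = {
    "scalar_cpu": 1.0,
    "glsl_gpgpu": 55.0,
    "opencl_cpu": 5.0,
    "opencl_gpu": 4.0,
}


def ratio(a, b):
    """Return how many times faster version `a` is than version `b`."""
    return throughput[a] / throughput[b]


# Assumed ideal CPU speed-up: 3 useful lanes per float4, times 2 cores.
ideal_cpu_speedup = 3 * 2

cpu_speedup = ratio("opencl_cpu", "scalar_cpu")  # OpenCL CPU vs scalar CPU
gpu_gap = ratio("glsl_gpgpu", "opencl_gpu")      # GLSL GPGPU vs OpenCL GPU
gpu_vs_cpu = ratio("opencl_gpu", "opencl_cpu")   # OpenCL GPU vs OpenCL CPU

print("OpenCL CPU speed-up over scalar:", cpu_speedup, "of an ideal", ideal_cpu_speedup)
print("GLSL is faster than OpenCL GPU by a factor of", gpu_gap)
print("OpenCL GPU relative to OpenCL CPU:", gpu_vs_cpu)
```

So the CPU path reaches most of what I'd expect from it, while the GPU path falls well short of the GLSL version on the same hardware.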

How can I find out what is making my code slow?
What profiling tools are there?

I'm currently on Snow Leopard, but I could probably port my code to Linux if there are better tools there.