I recently got an ATI 5870 and recoded my neural network app to run on it. The run time went from 200 sec. to 65 sec. Just for fun, I changed the device to the CPU, and the time dropped to 55 sec. I am very interested in finding out what OpenCL is doing to get this performance on my CPU, and particularly in the threading model it is using. Is it TBB? Pthreads? Where can I find out?
Also, I have code that takes an integer array and uses int4 to grab 4 values at a time. Is this using SSE2 when it runs on the CPU? Again, where can I find out?