Is there any method to compare the execution time between GPU and CPU?

I don't quite understand the question. There are two things you may want to measure from the CPU side. One of them is "how much time does it take OpenCL to enqueue all these commands"; the other one is "how much time passes from when I enqueue the first command until the GPU has completely finished all the work and has written the image back".
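The first one, "how much time it takes to enqueue these commands", usually doesn't matter, because non-blocking enqueue calls return before the GPU has done the work. The second one measures the total time it takes to send all the data to the GPU, run the kernel and read the result back to the CPU, and that is the number to compare against your CPU implementation. Measuring it is simple: call GetTickCount() once before you submit the data to the GPU and call it again after clFinish(cqCommandQueue) returns. Keep in mind that GetTickCount() is a Windows function with a coarse resolution (typically 10 to 16 ms), so time a run that lasts well beyond that, or average over several runs.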
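The same measurement can be sketched in portable C. This is only an illustration, not code from the original application: now_ms(), time_run_ms(), work_fn, struct sum_job and cpu_sum() are hypothetical names, and the C11 timespec_get() call is swapped in for GetTickCount() so the sketch is not tied to Windows. The GPU version of the work function is described in a comment only, since it depends on your own buffers and kernel.

```c
#include <stdio.h>
#include <time.h>

/* Wall-clock time in milliseconds, using C11 timespec_get(). */
static double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}

/* Hypothetical callback type: one complete run of the work to be timed. */
typedef void (*work_fn)(void *ctx);

/* Time one complete run of `work` and return the elapsed milliseconds.
 *
 * For the GPU path, `work` would enqueue the whole pipeline, for example
 * clEnqueueWriteBuffer, clEnqueueNDRangeKernel and clEnqueueReadBuffer,
 * and then call clFinish(cqCommandQueue) before returning. Without the
 * clFinish() only the enqueue time would be measured. */
static double time_run_ms(work_fn work, void *ctx)
{
    double start = now_ms();
    work(ctx);
    return now_ms() - start;
}

/* Stand-in CPU workload (assumption): sum an array of floats. */
struct sum_job {
    const float *data;
    size_t n;
    double result;
};

static void cpu_sum(void *ctx)
{
    struct sum_job *job = ctx;
    double acc = 0.0;
    for (size_t i = 0; i < job->n; ++i)
        acc += job->data[i];
    job->result = acc;
}
```

Calling time_run_ms(cpu_sum, &job) gives the CPU time; passing a GPU work function that ends with clFinish() gives the comparable GPU time, including both transfers.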
The command clBuildProgram() is taking so much time. Why does this happen?

clBuildProgram() is the function that compiles your kernel source for the GPU. It is true that it's usually slow (sometimes it takes seconds). The good thing is that you only need to do the full compile once. After the first build you can use clGetProgramInfo() to read back the compiled GPU program. Look in the spec for CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES. Your application can then store the program binary in a file, and the next time you run the application you can use clCreateProgramWithBinary() instead of clCreateProgramWithSource(). You still call clBuildProgram() on a program created from a binary, but it is much faster.
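The file side of that caching scheme can be sketched in plain C as below. save_binary() and load_binary() are hypothetical helper names, and the OpenCL calls appear only in comments because the exact context and device setup depend on your application. One assumption worth stating: a stored binary is specific to the device and driver that produced it, so it is safest to fall back to building from source if creating or building from the binary fails.

```c
#include <stdio.h>
#include <stdlib.h>

/* First run (sketch of the OpenCL side, not shown as code):
 *   clCreateProgramWithSource() -> clBuildProgram()
 *   clGetProgramInfo(..., CL_PROGRAM_BINARY_SIZES, ...) to get the size
 *   clGetProgramInfo(..., CL_PROGRAM_BINARIES, ...) to get the bytes
 *   save_binary(path, bytes, size)
 * Later runs:
 *   load_binary(path, &size) -> clCreateProgramWithBinary() -> clBuildProgram()
 */

/* Write `size` bytes to `path`. Returns 0 on success, -1 on failure. */
static int save_binary(const char *path, const unsigned char *data, size_t size)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    size_t written = fwrite(data, 1, size, f);
    int close_failed = fclose(f);
    return (written == size && close_failed == 0) ? 0 : -1;
}

/* Read the whole file at `path` into a malloc'd buffer that the caller
 * must free. Returns NULL if the file is missing, empty or unreadable,
 * in which case the caller should build from source instead. */
static unsigned char *load_binary(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;
    if (fseek(f, 0, SEEK_END) != 0) {
        fclose(f);
        return NULL;
    }
    long len = ftell(f);
    if (len <= 0) {
        fclose(f);
        return NULL;
    }
    rewind(f);
    unsigned char *buf = malloc((size_t)len);
    if (buf && fread(buf, 1, (size_t)len, f) != (size_t)len) {
        free(buf);
        buf = NULL;
    }
    fclose(f);
    if (buf)
        *size_out = (size_t)len;
    return buf;
}
```

A NULL return from load_binary() is the signal to take the slow path once: build from source, then save the binary for the next run.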