When I time my application, the parallel version running on the CPU with 2 threads takes 0.89 seconds, but the sequential version takes 7.813 seconds. That is a speedup of roughly 8.8x, far more than the 2x I would expect from 2 threads.
Both versions use exactly the same code and the same data structures.

I have already debugged both versions to make sure that execution is proceeding correctly, and everything runs fine.

To collect the times, I put both the kernel call and the call to the sequential function inside a for loop with 10000 iterations.

To measure only the execution, without compilation time, I call the function below immediately before and after the for loop:
clock_t tempo_execucao_real_inicial = clock();
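
To make the measurement concrete, here is a minimal sketch of the timing loop described above. It is not my real application code: `run_once` and `time_iterations` are placeholder names I made up for this post, and the body of `run_once` is just dummy arithmetic standing in for the function being measured. It only shows where the two `clock()` calls sit relative to the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Written with a volatile store so the compiler cannot drop the dummy work. */
static volatile double sink;

/* Hypothetical stand-in for the function being measured.
 * In the OpenCL variant, the kernel launch (and a wait for it to
 * complete) would go here instead; that part is an assumption. */
static void run_once(void) {
    double acc = 0.0;
    for (int i = 1; i <= 1000; ++i) {
        acc += 1.0 / (double)i;
    }
    sink = acc;
}

/* Calls fn `iterations` times and returns the time reported by clock()
 * across the whole loop, converted to seconds. */
static double time_iterations(void (*fn)(void), int iterations) {
    assert(fn != NULL && iterations > 0);
    clock_t start = clock();   /* taken once, before the loop */
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    clock_t end = clock();     /* taken once, after the loop */
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_iterations(run_once, 10000)` corresponds to the 10000-iteration loop mentioned above. Note that `clock()` reports processor time used by the program, so the value is whatever the C runtime accounts to the process between the two calls.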

Does OpenCL apply some optimization to memory access, or generate optimized code, that would explain this discrepancy?

Thank you very much,