When I run my application on the CPU, no problems occur. I can use 256 threads on the CPU, the execution completes, and all the results at the end are correct.

But when I use more than 48 threads on the GPU, the application enters an infinite loop.
With more than 40 threads (up to 48) it still runs to completion, but the results are not the same as on the CPU.

Could this be a rounding problem? Does the CPU round numbers differently than the GPU?

This is the first thing that came to mind.
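
To illustrate the kind of rounding difference I have in mind, here is a minimal sketch in plain C. It is not my actual code: the function names and the input values are made up for illustration, and it assumes ordinary IEEE 754 single-precision arithmetic. It sums the same floats in two different orders, one the way a single sequential loop would and one the way a tree-style reduction across threads might.

```c
/* Hypothetical helper: sum left to right, as a single sequential loop would. */
static float sum_sequential(const float *v, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) {
        s += v[i];
    }
    return s;
}

/* Hypothetical helper: sum pairwise (requires n >= 1), the order a
 * tree-style reduction split across threads might use. */
static float sum_pairwise(const float *v, int n) {
    if (n == 1) {
        return v[0];
    }
    int half = n / 2;
    return sum_pairwise(v, half) + sum_pairwise(v + half, n - half);
}
```

Calling both functions on the made-up input `{1e8f, 1.0f, -1e8f, 1.0f}` should give different sums, because float addition is not associative: the small values are lost or kept depending on which partial sum they are added to. If my loop's exit condition depends on a value like this, I suppose it could behave differently when the number of threads changes.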