Which is better: to work through all iterations in private memory and copy the result to global memory only once at the end of my algorithm, or to write the intermediate results to global memory in each iteration?
When you measure time, do you take the whole time into account, including compilation and execution of the kernel, or only the kernel's run?
My other doubt is that my workload is simple at first, so it runs fast on both the CPU and the GPU. Therefore I put the kernel call and the clFinish(queue) call inside a loop from 1 to 10000 to collect the time. Is this a correct way to measure when I run my algorithm on the GPU?
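
A minimal sketch of the timing structure described above, in plain C so it can be compiled without an OpenCL installation. The names `launch_fn`, `average_launch_seconds` and `dummy_launch` are hypothetical: `launch_fn` stands in for one kernel enqueue followed by clFinish(queue), and `dummy_launch` only counts calls. The untimed warm-up call is an assumption added for illustration, not something the loop above necessarily does.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict C modes */
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for "enqueue the kernel, then clFinish(queue)".
   In a real program this callback would contain those two calls. */
typedef void (*launch_fn)(void *ctx);

/* Runs `launch` once untimed (warm-up), then `iterations` more times,
   and returns the average wall-clock time per timed launch in seconds. */
static double average_launch_seconds(launch_fn launch, void *ctx, int iterations)
{
    struct timespec start, end;

    assert(iterations > 0);

    launch(ctx);  /* warm-up: assumed here to absorb one-time setup cost */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; ++i) {
        launch(ctx);  /* one "kernel call + clFinish" per iteration */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return elapsed / iterations;
}

/* Placeholder workload: only counts how many times it was launched. */
static void dummy_launch(void *ctx)
{
    int *counter = (int *)ctx;
    ++*counter;
}
```

Calling `average_launch_seconds(dummy_launch, &counter, 10000)` with an `int counter = 0` would mirror the 1-10000 loop. Because the time is taken on the host around the whole loop, the result includes launch and synchronization overhead on every iteration, not only the kernel's own run time.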