The following code is part of my kernel. The rest of the kernel is independent, parallel work that each work item can execute on its own (no data synchronization needed), but this part looks serial (the i-th output needs the updated value from step i-1). So I thought I could have one work item do it while the other work items do nothing during this step. Here is what I wrote, using work item 0 for the computation:
// tid is the local work-item id; tB and m are both pointers to local memory
// Basically, I need to derive array m from array tB; one element of m is computed on each iteration of the outer loop. The values of m are correct when I run the kernel on the CPU but wrong on the GPU. Is that because the synchronization goes wrong on the GPU? Do you have any suggestions for making it work correctly on the GPU? Thank you so much!
for (i=0; i<34; i++)
for(j = i+1; j < 34; j++)
// then I read m back to the host code and check the values
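
What I am unsure about is whether I need barriers around the work item 0 section. Below is a minimal sketch of the pattern as I understand it, assuming a single work-group. The kernel name `serial_step_sketch`, the `out` buffer, the constant `N`, and the loop body are all placeholders made up for illustration, not my real computation. The small shim at the top is also an assumption: it only lets the same source compile as plain C for a single-work-item check on the host.

```c
#ifndef __OPENCL_VERSION__
/* Host-side shim (assumption): emulates one work item so the sketch
   also compiles as plain C. It is skipped when built as OpenCL C. */
#include <stddef.h>
#define __kernel
#define __global
#define __local static
#define CLK_LOCAL_MEM_FENCE 0
static size_t get_local_id(unsigned d)   { (void)d; return 0; }
static size_t get_local_size(unsigned d) { (void)d; return 1; }
static void barrier(int flags)           { (void)flags; }
#endif

#define N 34  /* matches the loop bound in my code */

/* Hypothetical kernel; assumes exactly one work-group. */
__kernel void serial_step_sketch(__global float *out)
{
    __local float tB[N];
    __local float m[N];
    const int tid = (int)get_local_id(0);
    const int lsz = (int)get_local_size(0);

    /* Parallel part: work items fill tB (placeholder values). */
    for (int k = tid; k < N; k += lsz)
        tB[k] = 1.0f;

    /* All writes to tB must be visible before work item 0 reads them. */
    barrier(CLK_LOCAL_MEM_FENCE);

    if (tid == 0) {
        /* Serial part: step i depends on the updates made in step i-1. */
        for (int i = 0; i < N; i++) {
            m[i] = tB[i];                  /* placeholder for the real formula */
            for (int j = i + 1; j < N; j++)
                tB[j] -= 0.5f * m[i];      /* placeholder update */
        }
    }

    /* Placed outside the if so that every work item reaches it;
       after this point all work items may read m. */
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int k = tid; k < N; k += lsz)
        out[k] = m[k];
}
```

Both barriers sit outside the `if (tid == 0)` block, since a barrier has to be reached by every work item in the work-group. If my real kernel is missing one of them, could that explain why the result happens to be right on the CPU and wrong on the GPU?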