I know this sounds ridiculous, and I suspect I did something wrong, because I can't find anything related on Google. Still, I can't reuse a shared (local) memory array on an ATI 5870, while the same program runs fine on an NVIDIA GPU.

Say I have a kernel that looks like this (my real code is not this simple, but the relevant details are the same):

__local float tmp1[16];
__local float tmp2[16];
uint localIdX = get_local_id(0);
float a,b;
// First, I write to tmp1 and read it back into a
tmp1[localIdX]=1;
a=tmp1[localIdX];
barrier(CLK_LOCAL_MEM_FENCE);

// If I then reuse tmp1 for a later calculation, the result is wrong on the ATI 5870, while the NVIDIA result is correct.
// If I use tmp2 instead, the ATI result is also correct.
// Examples below.

If I reuse tmp1:
tmp1[localIdX]=1; // result is wrong on ATI, correct on NVIDIA
b=tmp1[localIdX];

If I use a new array tmp2 instead:
tmp2[localIdX]=2; // result is correct on ATI too
b=tmp2[localIdX];


I make sure there is synchronization before the shared memory is reused. The problem only happens on the ATI 5870; the NVIDIA GTX 260 handles reuse of shared memory correctly with the same code.
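
For reference, the reuse pattern described above can be written as a complete kernel. This is only a minimal sketch, not the real code: the kernel name reuse_test, the out buffer, and the work-group size of at most 16 are hypothetical and chosen just for illustration. It places a barrier after each write and another one between the last read of the first use and the first write of the second use, which is the synchronization the reuse is supposed to rely on.

```c
// Hypothetical minimal kernel (names are made up for illustration).
// Assumes a 1-D work-group of at most 16 work-items.
__kernel void reuse_test(__global float *out)
{
    __local float tmp1[16];
    uint lid = get_local_id(0);

    // First use of tmp1.
    tmp1[lid] = 1.0f;
    barrier(CLK_LOCAL_MEM_FENCE);   // make the writes visible to the work-group
    float a = tmp1[lid];

    // Every work-item must finish reading before any work-item overwrites tmp1.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Second use of the same array.
    tmp1[lid] = 2.0f;
    barrier(CLK_LOCAL_MEM_FENCE);
    float b = tmp1[lid];

    // Each work-item only reads back its own element, so a = 1 and b = 2.
    out[get_global_id(0)] = a + b;
}
```

If a reduced kernel like this gives the same values on both cards, the difference probably comes from something else in the full kernel, for example a barrier that not every work-item in the group reaches, or an index that goes past the end of the local array.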

Maybe there is a problem with how I build the program, or something specific to my card, but I really have no clue at this point.

Any thoughts would be appreciated! Thanks.