I'm working on some OpenCL code and I'm seeing DRASTIC performance differences between these two versions of the same four lines:
c0o.xyzw += (float4)(fa,fb,fc,fd);
c1o.xyzw += (float4)(fa2,fb2,fc2,fd2);
c2o.xyzw += (float4)(fa3,fb3,fc3,fd3);
c3o.xyzw += (float4)(fa4,fb4,fc4,fd4);

versus

c0o.xyzw = (float4)(fa,fb,fc,fd);
c1o.xyzw = (float4)(fa2,fb2,fc2,fd2);
c2o.xyzw = (float4)(fa3,fb3,fc3,fd3);
c3o.xyzw = (float4)(fa4,fb4,fc4,fd4);

The first version (with +=) runs lightning fast (0.01 sec). The second version (with plain =) slows my kernel down to 18 seconds.

Note that c0o .. c3o are uninitialized float4s. Is the compiler just discarding the memory write because it is writing to uninitialized memory? Does OpenCL initialize stack variables at all?