I have tried to optimize my kernel by using "float8" and "float16" instead of "float4".
The system is XP32 with OpenCL 1.2 on an AMD Athlon X2 250 and a Radeon 6750. (Testing on an AMD A6 3450M APU shows the same behavior.)
After a workaround, needed because <vectorname>.s[<index>] is unsupported (why?), I got stuck on the following problem:
I need to add up all components of a vector, so I did it with the following line (part of an n-body simulation):
Code :
barrier(CLK_GLOBAL_MEM_FENCE); // wait until every work-item has finished
vx[tid] += dt * (Fx.s0+Fx.s1+Fx.s2+Fx.s3+Fx.s4+Fx.s5+Fx.s6+Fx.s7+Fx.s8+Fx.s9+Fx.sa+Fx.sb+Fx.sc+Fx.sd+Fx.se+Fx.sf); // add all 16 components
The kernel runs as expected, but very slowly (compared to the "float" version).
After some debugging I changed the line of code to:
Code :
vx[tid] += dt * (Fx.s0+Fx.s1+Fx.s2+Fx.s3+Fx.s4+Fx.s5+Fx.s6+Fx.s7); //+Fx.s8+Fx.s9+Fx.sa+Fx.sb+Fx.sc+Fx.sd+Fx.se+Fx.sf);
This doubles the execution speed of the kernel! Note that above these lines there is a loop calculating millions of sqrts with float16 without any (timing) problems. Why does a "cheap" addition slow the kernel down that much?
Is there any function to add the components of a vector fast(er), or what can I do to avoid this strange behavior?
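To make the question concrete, here is a minimal sketch of one alternative I could try: a pairwise (tree) sum that halves the vector at each step instead of chaining 15 scalar additions. It is written as plain C so it can be checked on the host. The helper name `hsum16` is hypothetical (not an OpenCL built-in), and the comments show what I assume the equivalent OpenCL C would be, using the `.lo`/`.hi` half selectors. Whether this is actually faster on the hardware above is untested.

```c
/* Plain-C model of a pairwise (tree) sum over 16 floats.
   hsum16 is a hypothetical helper name, not an OpenCL built-in.
   The comments show the assumed OpenCL C equivalent for a float16 Fx. */
float hsum16(const float v[16])
{
    float s8[8], s4[4], s2[2];
    int i;

    /* OpenCL C: float8 s8 = Fx.lo + Fx.hi; */
    for (i = 0; i < 8; ++i)
        s8[i] = v[i] + v[i + 8];

    /* OpenCL C: float4 s4 = s8.lo + s8.hi; */
    for (i = 0; i < 4; ++i)
        s4[i] = s8[i] + s8[i + 4];

    /* OpenCL C: float2 s2 = s4.lo + s4.hi; */
    for (i = 0; i < 2; ++i)
        s2[i] = s4[i] + s4[i + 2];

    /* OpenCL C: float sum = s2.x + s2.y; */
    return s2[0] + s2[1];
}
```

In the kernel this would replace the long expression with something like `vx[tid] += dt * sum;`. Note that the additions happen in a different order than in the chained version, so the result can differ in the last bits due to floating-point rounding.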