CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT = 1
clGetDeviceInfo(dev /* device */,
CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT /* param_name */,
sizeof(cl_uint) /* param_value_size */,
&float_size /* param_value */,
&param_size /* param_value_size_ret */);
I ran this on Intel's Core i7 (Ivy Bridge) and got float_size = 1.
Does that make sense?
I know for sure that the CPU in the chip can load four 32-bit words (e.g. floats) in one clock cycle.
Is it possible that the GPU cannot do the same?
The Intel OpenCL C compiler tries to vectorise your kernel automatically, which is why the preferred vector width is always reported as 1. If I remember one of their webinars correctly, the compiler will try to group work items along dimension 0 of the N-D range used to launch the kernel. The build log generated when building the kernel will tell you whether it was actually vectorised. You can use the Intel Kernel Builder to build your kernels offline and examine the build log easily.
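To make the idea of grouping work items along dimension 0 more concrete, here is a rough plain-C emulation. It is not what the Intel compiler actually emits: the kernel body (2*x + 1), the width LANES = 4, and the names run_scalar and run_packed are all made up for illustration. It only shows that a kernel written for one float per work item can be executed four adjacent work items at a time with the same result.

```c
#include <stddef.h>

#define LANES 4  /* assumed SIMD width: four 32-bit floats per register */

/* Hypothetical scalar kernel body: the work done by one work item. */
static float scale_add_item(const float *in, size_t gid)
{
    return 2.0f * in[gid] + 1.0f;
}

/* Reference execution: one work item per iteration. */
void run_scalar(const float *in, float *out, size_t global_size)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        out[gid] = scale_add_item(in, gid);
}

/* Emulated implicit vectorisation: LANES adjacent work items along
 * dimension 0 are packed together, so each inner loop stands in for
 * a single vector instruction. */
void run_packed(const float *in, float *out, size_t global_size)
{
    size_t gid = 0;
    for (; gid + LANES <= global_size; gid += LANES) {
        float v[LANES];
        for (int lane = 0; lane < LANES; ++lane)  /* one vector load */
            v[lane] = in[gid + lane];
        for (int lane = 0; lane < LANES; ++lane)  /* one vector multiply-add */
            v[lane] = 2.0f * v[lane] + 1.0f;
        for (int lane = 0; lane < LANES; ++lane)  /* one vector store */
            out[gid + lane] = v[lane];
    }
    /* Leftover work items when global_size is not a multiple of LANES. */
    for (; gid < global_size; ++gid)
        out[gid] = scale_add_item(in, gid);
}
```

Both functions fill out with identical values for any global_size, which is the point: the kernel source stays scalar and the packing happens behind the scenes, so the runtime has no reason to ask you for float4 code.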
As a side note, AMD's CPU runtime on a Sandy Bridge Core i7 reports a preferred vector width of 4 for float, so it probably doesn't vectorise kernels automatically.
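If you want the same host code to behave sensibly on both runtimes, one option is to pick the kernel variant from the reported width. The sketch below is only an assumption about how you might structure that: the kernel names scale_add_scalar and scale_add_float4, the launch_plan struct, and plan_launch are hypothetical, and the width argument is whatever clGetDeviceInfo returned for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT.

```c
#include <stddef.h>

/* Hypothetical launch description: which kernel to enqueue and how
 * many work items to launch along dimension 0. */
typedef struct {
    const char *kernel_name;
    size_t global_size;
} launch_plan;

/* preferred_width: value reported for CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT.
 * n_floats: number of float elements to process. */
launch_plan plan_launch(unsigned preferred_width, size_t n_floats)
{
    launch_plan p;
    if (preferred_width >= 4 && n_floats % 4 == 0) {
        /* Runtime asks for explicit vectors: one float4 per work item. */
        p.kernel_name = "scale_add_float4";
        p.global_size = n_floats / 4;
    } else {
        /* Width 1 (or an awkward size): keep the kernel scalar and
         * leave any vectorisation to the compiler. */
        p.kernel_name = "scale_add_scalar";
        p.global_size = n_floats;
    }
    return p;
}
```

With a reported width of 1 this selects the scalar kernel with one work item per float; with a width of 4 and a size divisible by 4 it selects the float4 variant with a quarter as many work items.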