I can't quite understand the relationship, if any, between vector types and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE (let's shorten that to PWGM). As I understand it, PWGM reflects the number of work-items that can be executed together on the processing elements of a compute unit. For a scalar data type this makes sense: a GPU with 16 SIMD units and VLIW4 can execute 16 work-items per clock, pipelined over four clocks, which gives 64 work-items. What happens with vector types such as float4 or double2? My intuition says that PWGM should decrease by the amount of ILP explicitly requested through the vector type, so if float gave 64 work-items, float4 would give 16. However, every time I query PWGM it returns the same value regardless of the data type (64 work-items in this example).
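To make the arithmetic concrete, here is a minimal sketch of what I expected. The function names are hypothetical, the 16-lane and four-clock figures are just the example above, and the second function encodes my intuition only, not what the OpenCL runtime actually reports.

```c
/* Hypothetical helper: number of work-items issued together when each of
   `simd_lanes` lanes takes one work-item per clock and the instruction is
   pipelined over `clocks` clocks (assumed model from the example above). */
unsigned scalar_multiple(unsigned simd_lanes, unsigned clocks)
{
    return simd_lanes * clocks;
}

/* My intuition only, NOT the value the runtime returns: divide the scalar
   multiple by the width of the vector type (4 for float4, 2 for double2). */
unsigned expected_vector_multiple(unsigned scalar_pwgm, unsigned vector_width)
{
    return scalar_pwgm / vector_width;
}
```

With these, scalar_multiple(16, 4) is 64 and expected_vector_multiple(64, 4) is 16, whereas the actual query keeps returning 64 for the float4 kernel as well.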
This also makes me wonder: if PWGM is independent of the data type, why must I query it through clGetKernelWorkGroupInfo, which can only be called after the program has been built and the kernel created? Shouldn't this value be available from clGetDeviceInfo instead?