Thread: Vector types and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

  1. #1
    Senior Member
    Join Date
    Mar 2011
    Location
    Seoul
    Posts
    118

    Vector types and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

    I can't quite understand the relationship, if any, between vector types and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE (let's shorten that to PWGM). PWGM is related to the number of work-items that can be executed together on the processing elements of a compute unit. For scalar data types this makes sense: a GPU with 16 SIMD units and VLIW4 can execute 16 work-items per clock, pipelined over four clocks, for 16 × 4 = 64 work-items. What happens with vector types such as float4 or double2? My intuition says that PWGM should decrease by the factor of ILP explicitly introduced by the vector type (so if float gives 64 work-items, float4 would give 16). However, every time I query PWGM I get the same result (64 work-items in this example).

    This also leads me to wonder: if PWGM is independent of the data type, why must I query it through clGetKernelWorkGroupInfo, which is only available after building the program and creating the kernel? Shouldn't this query be available from clGetDeviceInfo?

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Vector types and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

    CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE is most closely related to the warp size (the number of work-items in a warp/wavefront), although it's not the same thing.
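
    To make the warp-size connection concrete, here is a minimal sketch in plain C of why work-group sizes are usually chosen as multiples of this value. It assumes, purely for illustration, that the reported multiple equals the hardware warp/wavefront width, which is not guaranteed. The helper names round_up_to_multiple and idle_lanes are hypothetical and not part of OpenCL. In the OpenCL headers the query constant is spelled CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of m. m must be non-zero. */
size_t round_up_to_multiple(size_t n, size_t m)
{
    assert(m > 0);
    return ((n + m - 1) / m) * m;
}

/*
 * Number of lanes left idle in the last hardware group when a work-group
 * of local_size work-items is issued in groups of `multiple` work-items.
 * ASSUMPTION: `multiple` equals the warp/wavefront width. The preferred
 * multiple is only a performance hint and may differ from that width.
 */
size_t idle_lanes(size_t local_size, size_t multiple)
{
    return round_up_to_multiple(local_size, multiple) - local_size;
}
```

    In a real host program the multiple would come from clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), &multiple, NULL). With a multiple of 64, a work-group of 100 work-items would occupy two groups of 64 under this assumption and leave 28 lanes idle, while a work-group of 128 would leave none.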

    Quote:
        This also leads me to wonder: if PWGM is independent of the data type, why must I query it through clGetKernelWorkGroupInfo, which is only available after building the program and creating the kernel? Shouldn't this query be available from clGetDeviceInfo?

    Because, in principle, it depends on the kernel. Say your kernel uses float16 everywhere and has little or no flow control. If your hardware has a native SIMD width of 16, a smart compiler may decide that you have already done all the work of vectorizing the code and that the hardware can run your kernel as-is, one work-item per SIMD unit. In that case CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will be 1.

    The kernel described above is not common. Most kernels you see are written in scalar form, and if your hardware natively runs 16-float-wide SIMD instructions it makes more sense to map 16 work-items to one SIMD unit. In that case CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will be 16.
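
    The two cases above can be sketched as a toy decision rule in plain C. This is only a model of the reasoning in this post, not how any real driver or compiler computes the value. The names toy_mapping, toy_choose_mapping and toy_preferred_multiple are hypothetical, and the rule "vector width equals SIMD width and no divergent flow control" is an assumed stand-in for whatever analysis a real compiler performs.

```c
#include <assert.h>
#include <stddef.h>

/* How work-items are laid out on the hardware in this toy model. */
typedef enum {
    MAP_ITEM_PER_LANE,      /* one work-item per SIMD lane (scalar-style kernel) */
    MAP_ITEM_PER_SIMD_UNIT  /* one work-item fills the whole SIMD unit           */
} toy_mapping;

/*
 * ASSUMPTION: the compiler runs a kernel as-is only when its explicit vector
 * width already matches the native SIMD width and it has no divergent flow
 * control. Any other kernel is spread across lanes, one work-item per lane.
 */
toy_mapping toy_choose_mapping(size_t simd_width, size_t kernel_vector_width,
                               int has_divergent_flow)
{
    assert(simd_width > 0 && kernel_vector_width > 0);
    if (kernel_vector_width == simd_width && !has_divergent_flow)
        return MAP_ITEM_PER_SIMD_UNIT;
    return MAP_ITEM_PER_LANE;
}

/* Preferred work-group size multiple implied by the chosen mapping. */
size_t toy_preferred_multiple(size_t simd_width, size_t kernel_vector_width,
                              int has_divergent_flow)
{
    toy_mapping m = toy_choose_mapping(simd_width, kernel_vector_width,
                                       has_divergent_flow);
    return (m == MAP_ITEM_PER_SIMD_UNIT) ? 1 : simd_width;
}
```

    Under this model a float16 kernel on 16-wide hardware yields 1 and a scalar kernel yields 16. A float4 kernel also yields 16, because it does not fill the SIMD unit on its own, which would be consistent with seeing the same value for float and float4. Since the outcome depends on the compiled kernel, the query belongs to clGetKernelWorkGroupInfo rather than clGetDeviceInfo.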

    I hope this sheds some light on why things work the way they do.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.
