Results 1 to 8 of 8

Thread: Built-in Functions: Work-Item Functions

  1. #1
    Senior Member
    Join Date
    Mar 2011
    Location
    Seoul
    Posts
    118

    Built-in Functions: Work-Item Functions

    Please provide a built-in function that provides the preferred work group multiple from within a kernel, similar to get_global_size(0), get_local_size(0), etc.

    The only work around I see right now requires querying CL_DEVICE_PREFERRED_WORK_GROUP_MULTIPLE and pass the value to a kernel using a preprocessor option before building the program from source.

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Built-in Functions: Work-Item Functions

    Sean, what can you do with that information inside the kernel? Once an NDRange has been enqueued and a kernel is executing, it has a fixed work-group size and querying the preferred size cannot make a difference.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3
    Senior Member
    Join Date
    Mar 2011
    Location
    Seoul
    Posts
    118

    Re: Built-in Functions: Work-Item Functions

    The example that came to mind was a parallel reduction. If the work group size and preferred work group multiple are passed as preprocessor options then the loops can be fully unrolled and improve performance. The preferred work group multiple still seems important inside the kernel because you can potentially take advantage of the SIMD nature of running in lock step within a wavefront, so those unnecessary barriers can be removed and improve performance too. Here's a code snippet of what I had in mind.

    ...

    #if !defined(WORK_GROUP_SIZE) || !defined(PREFERRED_WORK_GROUP_MULTIPLE)
    for (size_t i = local_size; i >= local_multiple; i =>> 1)
    #else
    #pragma unroll
    for (size_t i = WORK_GROUP_SIZE; i >= PREFERRED_WORK_GROUP_MULTIPLE; i =>> 1)
    #endif
    { ; // reduction with a barrier to sync multiple wavefronts }

    #if !defined(WORK_GROUP_SIZE) || !defined(PREFERRED_WORK_GROUP_MULTIPLE)
    for (size_t i = local_multiple >> 1; i >= 1; i =>> 1)
    #else
    #pragma unroll
    for (size_t i = PREFERRED_WORK_GROUP_MULTIPLE >> 1; i >= 1; i =>> 1)
    #endif
    { ; // reduction without a barrier, only one wavefront }

    ...

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Built-in Functions: Work-Item Functions

    Okay, I understand your use case and it does have some appeal. The issue I see is that your code at least is making one big assumption: that you can skip the barrier when you reach the work-group-size-multiple boundary. That is only true in certain hardware and if the returned value of work-group-size-multiple boundary is exactly one warp.

    We would need to take a step back and see whether something like this can be exposed in a more hardware-agnostic way.

    I'll forward this to the committee to see what they think.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  5. #5
    Senior Member
    Join Date
    Mar 2011
    Location
    Seoul
    Posts
    118

    Re: Built-in Functions: Work-Item Functions

    Hi David,

    Do you happen to have any feedback from the committee on this suggested function?

  6. #6
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Built-in Functions: Work-Item Functions

    I could get in trouble if I share what the committee has discussed. Let's say that it didn't make it to CL 1.2.

    I'll ask the spec editor if he wants to make an official comment.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  7. #7
    Senior Member
    Join Date
    Mar 2011
    Location
    Seoul
    Posts
    118

    Re: Built-in Functions: Work-Item Functions

    Thanks David, and I definitely don't want to get you into any trouble.

  8. #8
    Senior Member
    Join Date
    Sep 2002
    Location
    Santa Clara
    Posts
    105

    Re: Built-in Functions: Work-Item Functions

    CL_KERNEL_WORK_GROUP_SIZE_MULTIPLE is supposed to be a performance hint to make sure that the local workgroup size specified to clEnqueueNDRangeKernel is a multiple of this value.

    To do what you are suggesting, the definition of CL_KERNEL_WORK_GROUP_SIZE_MULTIPLE will need to be extended to state that this now refers to the SIMD or SIMT execution size. If we did this, then it is correct that the barrier become a NOP. Another option would be to add a new query to indicate the device's SIMD execution size will have to be added

    I don't believe this specific modification to CL_KERNEL_WORK_GROUP_SIZE_MULTIPLE was discussed by the working group. I'll bring this up to the group for discussion.

Similar Threads

  1. Required atomic built-in functions
    By sean.settle in forum OpenCL
    Replies: 2
    Last Post: 02-07-2012, 04:16 PM
  2. Replies: 4
    Last Post: 06-11-2010, 08:54 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •