Results 1 to 5 of 5

Thread: Early returns and barriers inside the kernel

  1. #1
    Senior Member
    Join Date
    Mar 2011
    Location
    Seoul
    Posts
    118

    Early returns and barriers inside the kernel

    What is the recommended way to convert serial loops into NDRanges. The loop count is a starting value for global_size, but global_size must be a multiple of local_size. So that either requires doing a conditional statement inside the kernel for global_id < loop count, or pad (with appropriate values) whatever array argument before passing it to the kernel so that it becomes a multiple of local_size. The second method removes the condition inside the kernel but takes time to resize array(s). So the first method seems better, but doesn't that preclude one from them using a barrier(CLK_LOCAL_MEM_FENCE)?

    Basically, is it possible to do something like:

    if (get_global_id(0) >= n)
    {
    return;
    }
    else
    {
    ...;
    barrier(CLK_LOCAL_MEM_FENCE);
    ...;
    }

    I recall the OpenCL 1.1 spec saying if any work-item encounters a barrier, then all executing work-items must also encounter that barrier. But if a work-item already returned, then it's not an executing work-item, right?

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Early returns and barriers inside the kernel

    Your analysis of the situation is correct.

    Code :
    if (get_global_id(0) >= n)
    {
    return;
    }
    else
    {
    ...;
    barrier(CLK_LOCAL_MEM_FENCE);
    ...;
    }

    Is not valid.

    The second method removes the condition inside the kernel but takes time to resize array(s)
    Wouldn't it be possible to size the arrays properly from the beginning?
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3
    Senior Member
    Join Date
    Mar 2011
    Location
    Seoul
    Posts
    118

    Re: Early returns and barriers inside the kernel

    I agree doing it from the beginning would be much more efficient. However I'm trying to port a pre-existing library (GNU Scientific Library) to OpenCL so unfortunately I can't assume anything about the input array sizes.

    Another thing is that the particular function has a stride for the array elements, which if calculations are to be done on a GPU (either host side or over PCIe) should be pre-coalesced. In that case I can temporarily pack and pad as desired. What's your opinion in this case?

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Early returns and barriers inside the kernel

    I agree doing it from the beginning would be much more efficient. However I'm trying to port a pre-existing library (GNU Scientific Library) to OpenCL so unfortunately I can't assume anything about the input array sizes.
    I don't understand why that prevents you from padding all buffers at creation time to have a size that is a multiple of the maximum work-group size supported by the device. Surely the CL buffer objects are not visible by the app. If the issue is because you are currently using CL_USE_HOST_PTR what I would do is replace it with CL_ALLOC_HOST_PTR and then enqueue a write command.

    Another thing is that the particular function has a stride for the array elements, which if calculations are to be done on a GPU (either host side or over PCIe) should be pre-coalesced. In that case I can temporarily pack and pad as desired. What's your opinion in this case?
    Packing and padding on the CPU prior to sending through PCIe sounds like a good idea to me.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  5. #5
    Senior Member
    Join Date
    Sep 2002
    Location
    Santa Clara
    Posts
    105

    Re: Early returns and barriers inside the kernel

    You could pad the buffers appropriately and this solution works. If you decide you do not want to pad then the code would need to be something like:

    if (get_global_id(0) < n)
    {
    ....
    }

    barrier(CLK_LOCAL_MEM_FENCE)

    if (get_global_id(0) < n)
    {
    // do work after the barrier
    ...
    }

    Basically the barrier needs to come out of the conditional and must be encountered by all work-items of the work-group.

Similar Threads

  1. Replies: 4
    Last Post: 08-06-2012, 01:18 AM
  2. local declaration inside the kernel
    By Gamingdrake in forum OpenCL
    Replies: 4
    Last Post: 06-27-2011, 12:10 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •