
Thread: OpenCL for histogram (or anything with global memory sharing)

  1. #1
    Junior Member
    Join Date
    May 2011
    Posts
    1

    OpenCL for histogram (or anything with global memory sharing)

    I'm trying to write a histogram kernel in OpenCL to compute the R, G, and B histograms of an input image. My kernel looks like this:

    Code :
    const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP |
                               CLK_FILTER_NEAREST;

    __kernel void computeHistogram(read_only image2d_t input, __global int* rOutput,
                                   __global int* gOutput, __global int* bOutput)
    {
        int2 coords = {get_global_id(0), get_global_id(1)};

        float4 sample = read_imagef(input, mSampler, coords);

        // map each normalized channel value to one of 256 bins
        uchar rbin = floor(sample.x * 255.0f);
        uchar gbin = floor(sample.y * 255.0f);
        uchar bbin = floor(sample.z * 255.0f);

        // plain (unsynchronized) increments of the global counters
        rOutput[rbin]++;
        gOutput[gbin]++;
        bOutput[bbin]++;
    }

    When I run it on a 2100 x 894 image (1,877,400 pixels), I tend to see only around 1,870,000 total values recorded when I sum up the histogram values for each channel, and it's a different number each time. I did expect this, since once in a while two work-items probably grab the same value from the output array and increment it, effectively cancelling out one increment operation (I'm assuming?).

    The 1,870,000 figure is for a {1,1} work-group size (which seems to be what gets set by default if I don't specify otherwise). If I force a larger work-group size like {10,6}, I get a drastically smaller sum in my histogram (proportional to the change in work-group size). This seemed strange to me, but I'm guessing what happens is that all of the work-items in the group increment the output array value at the same time, so it only counts as a single increment?

    Anyway, I've read in the spec that OpenCL has no global memory synchronization, only synchronization within local work-groups using their __local memory. The histogram example by nVidia breaks the workload up into a bunch of subproblems of a specific size, computes their partial histograms, then merges the results into a single histogram afterwards. This doesn't seem like it'll work all that well for images of arbitrary size. I suppose I could pad the image data out with dummy values...

    Being new to OpenCL, I guess I'm wondering if there's a more straightforward way to do this (since it seems like it should be a relatively straightforward GPGPU problem).

    Thanks!

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: OpenCL for histogram (or anything with global memory sharing)

    I did expect this, since once in a while two work-items probably grab the same value from the output array and increment it, effectively cancelling out one increment operation (I'm assuming?).
    Correct. You want the operation to be atomic: "load value; increment value; store value" should all occur as if no other work-items were executing concurrently.

    To support these sorts of operations, OpenCL offers what are called "atomic functions" (search for the term in the spec).
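    For example, here is a minimal sketch of your kernel with the increments made atomic (assuming OpenCL 1.1, where 32-bit global atomics are core; on OpenCL 1.0 you would need the cl_khr_global_int32_base_atomics extension and atom_inc instead):

    Code :
    const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP |
                               CLK_FILTER_NEAREST;

    __kernel void computeHistogram(read_only image2d_t input, __global int* rOutput,
                                   __global int* gOutput, __global int* bOutput)
    {
        int2 coords = {get_global_id(0), get_global_id(1)};
        float4 sample = read_imagef(input, mSampler, coords);

        uchar rbin = floor(sample.x * 255.0f);
        uchar gbin = floor(sample.y * 255.0f);
        uchar bbin = floor(sample.z * 255.0f);

        // atomic_inc makes the load-increment-store indivisible, so
        // concurrent work-items can no longer lose each other's updates
        atomic_inc(&rOutput[rbin]);
        atomic_inc(&gOutput[gbin]);
        atomic_inc(&bOutput[bbin]);
    }

    With this change the three channel sums should each come out to exactly 1,877,400 regardless of the work-group size, but every pixel now contends for the same few hundred memory locations.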

    Unfortunately, atomics are rather expensive, which is why histograms are usually computed with a reduction: compute partial histograms first, then accumulate. This accumulation needs to happen at two levels at least: first accumulate across the work-items within a work-group, then accumulate the results from all the work-groups.
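    As an illustration of the first level (a sketch of the general pattern, not nVidia's actual code, and red channel only for brevity): each work-group builds a 256-bin histogram in __local memory, then writes it into its own slot of a partial-histogram buffer. The kernel and buffer names here are made up; the host allocates partialHist as num_groups * 256 ints. Local atomics are core in OpenCL 1.1 (cl_khr_local_int32_base_atomics on 1.0).

    Code :
    const sampler_t mSampler = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP |
                               CLK_FILTER_NEAREST;

    __kernel void partialHistogram(read_only image2d_t input,
                                   __global int* partialHist)
    {
        __local int localHist[256];

        int lid = get_local_id(1) * get_local_size(0) + get_local_id(0);
        int lsz = get_local_size(0) * get_local_size(1);

        // cooperatively zero the local histogram
        for (int i = lid; i < 256; i += lsz)
            localHist[i] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);

        int2 coords = {get_global_id(0), get_global_id(1)};

        // bounds check: the host can round the global size up to a
        // multiple of the work-group size instead of padding the image
        if (coords.x < get_image_width(input) &&
            coords.y < get_image_height(input))
        {
            float4 sample = read_imagef(input, mSampler, coords);
            uchar rbin = floor(sample.x * 255.0f);
            atomic_inc(&localHist[rbin]); // local atomics are comparatively cheap
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // one plain store per bin into this work-group's private slot
        int group = get_group_id(1) * get_num_groups(0) + get_group_id(0);
        for (int i = lid; i < 256; i += lsz)
            partialHist[group * 256 + i] = localHist[i];
    }

    Note the bounds check: that is what handles images of arbitrary size without padding.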

    This doesn't seem like it'll work all that well for images of arbitrary size.
    Why not? I don't know what nVidia is doing specifically, but in general, with a reduction, all you need to do when the input gets larger is add more rounds of reduction. The number of rounds grows as the logarithm of the input size, which is not bad at all: reducing by a factor of 256 per round, for example, covers your 1,877,400-pixel image in three rounds (256³ ≈ 16.7 million).
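    The second level is then just a sum over the per-group slots. Another sketch (mergeHistogram and numGroups are hypothetical names; launch it with a global size of 256, one work-item per bin):

    Code :
    __kernel void mergeHistogram(__global const int* partialHist,
                                 __global int* histogram,
                                 int numGroups)
    {
        int bin = get_global_id(0); // one work-item per histogram bin

        int sum = 0;
        for (int g = 0; g < numGroups; ++g)
            sum += partialHist[g * 256 + bin];

        histogram[bin] = sum;
    }

    If numGroups itself ever got huge, you would run the same kind of kernel again on its output; that is the extra round of reduction.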

    Being new to OpenCL, I guess I'm wondering if there's a more straightforward way to do this (since it seems like it should be a relatively straightforward GPGPU problem).
    Parallel programming is hardly ever straightforward, no matter what API or language you use.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.
