
Thread: How can kernels that operate on single items be efficient?

  1. #1

    How can kernels that operate on single items be efficient?

    I am new to GPUs, OpenCL, and parallel programming in general. One thing that confuses me is a kernel like this:

    Code :
    __kernel void add_buffers(__global const float *a, __global const float *b, __global float *result) {
        int gid = get_global_id(0);
        result[gid] = a[gid] + b[gid];
    }
    I see this pattern very often in code examples, where the kernel operates on a single item of a large data set. Semantically it makes sense, but I don't see how it can possibly be performant. If a kernel is basically a function call, then as the size of the input grows, the overhead of invoking the function on each individual element should dominate.

    For example, nobody would write traditional multithreaded code for a CPU like this:

    Code :
    void add_buffers(float *a, float *b, float *result) {
        for (int i = 0; i < SIZE; i++) {
            // Pseudocode...
            spawn a thread that adds a[i] to b[i] and stores it to result[i]
        }
    }
    The overhead of spawning all those threads, with their function calls and context switches and whatnot, would eliminate most or all of the speedup gained from doing more than one add operation in parallel. Instead, you'd write something like this:

    Code :
    void add_buffers(float *a, float *b, float *result) {
        int range = SIZE / NUMBER_OF_CPU_CORES;  // Assume it divides evenly
        for (int i = 0; i < NUMBER_OF_CPU_CORES; i++) {
            int start = i * range;
            int end = i * range + range - 1;
            // Pseudocode...
            spawn a thread that adds the numbers a[start to end] to b[start to end] and stores the result in result[start to end]
        }
    }
    Although each thread is now doing more work, this is better because it only spawns as many threads as there are processing elements (cores), thus keeping overhead to a minimum.
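    For concreteness, here is one way the chunked pseudocode above might look as real C with POSIX threads (a sketch only; SIZE and NUMBER_OF_CPU_CORES are illustrative values, and SIZE is assumed to divide evenly):

    ```c
    #include <pthread.h>

    #define SIZE 1024
    #define NUMBER_OF_CPU_CORES 4

    /* Arguments for one worker thread: the buffers and its [start, end] range. */
    struct chunk {
        const float *a, *b;
        float *result;
        int start, end;
    };

    static void *add_chunk(void *arg) {
        struct chunk *c = arg;
        for (int i = c->start; i <= c->end; i++)
            c->result[i] = c->a[i] + c->b[i];
        return NULL;
    }

    void add_buffers(const float *a, const float *b, float *result) {
        pthread_t threads[NUMBER_OF_CPU_CORES];
        struct chunk chunks[NUMBER_OF_CPU_CORES];
        int range = SIZE / NUMBER_OF_CPU_CORES;  /* assume it divides evenly */

        for (int i = 0; i < NUMBER_OF_CPU_CORES; i++) {
            chunks[i] = (struct chunk){ a, b, result,
                                        i * range, i * range + range - 1 };
            pthread_create(&threads[i], NULL, add_chunk, &chunks[i]);
        }
        for (int i = 0; i < NUMBER_OF_CPU_CORES; i++)
            pthread_join(threads[i], NULL);
    }
    ```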

    In fact, even a sequential version of the algorithm running on a single core CPU should beat the OpenCL variation running on a GPU, even for large input sizes, due to the overhead of each kernel invocation handling just one item of the data set. But of course that would defeat the purpose of OpenCL, so my understanding of how a kernel works must be wrong. What am I missing? Thanks.

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: How can kernels that operate on single items be efficient?

    I like how you thought about this.

    The overhead of spawning all those threads, with their function calls and context switches and whatnot, would eliminate most or all of the speedup gained from doing more than one add operation in parallel.
    That is correct. You may be assuming that an OpenCL implementation on a CPU spawns a new thread for each work-item. That's not the case. A typical implementation will actually be more efficient than even the more elaborate code you showed. My understanding is that Apple, for example, implements OpenCL on top of Grand Central Dispatch for their CPUs.
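    A hypothetical sketch of the idea (not any vendor's actual code): the runtime keeps only a few worker threads and has each one loop over a contiguous range of global IDs, invoking the kernel body as a plain function call. The thread-spawn cost is paid once per worker, not once per work-item. Here the workers are simulated with a serial loop to keep the example self-contained:

    ```c
    enum { N = 32, NUM_WORKERS = 4 };  /* illustrative sizes */

    /* The kernel body as an ordinary C function; gid plays the
       role of get_global_id(0). */
    static void kernel_body(int gid, const float *a, const float *b,
                            float *result) {
        result[gid] = a[gid] + b[gid];
    }

    /* Dispatch a 1-D NDRange of N work-items. In a real runtime each
       worker would be its own thread; the point is that each worker
       handles N / NUM_WORKERS work-items via cheap function calls. */
    void dispatch_ndrange(const float *a, const float *b, float *result) {
        int chunk = N / NUM_WORKERS;  /* assume it divides evenly */
        for (int w = 0; w < NUM_WORKERS; w++)
            for (int gid = w * chunk; gid < (w + 1) * chunk; gid++)
                kernel_body(gid, a, b, result);
    }
    ```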

    In fact, even a sequential version of the algorithm running on a single core CPU should beat the OpenCL variation running on a GPU, even for large input sizes, due to the overhead of each kernel invocation handling just one item of the data set.
    GPUs have dedicated hardware threads that are very lightweight to initialize. That's why kernels such as the example you wrote can run at great speed.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3

    Re: How can kernels that operate on single items be efficient?

    Quote Originally Posted by david.garcia
    You may be assuming that an OpenCL implementation on a CPU spawns a new thread for each work-item. It's not the case.
    Right, but let's say we ignore the cost of thread spawning. There's still some amount of overhead in invoking a kernel, just as there's some amount of overhead in invoking a function in a single-threaded program. If that overhead is K, and I'm processing N items, then the total overhead in the single-item-per-kernel model is KN. But if I simply rewrite the kernel to compute, say, 16 work items instead of just one, then the overhead shrinks to KN/16. That's quite a speedup for such a relatively simple change to the kernel, especially as N grows. This is why the prevalence of the single-item-per-kernel design surprises me.
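    The rewrite I have in mind would look something like this (a sketch; the macros at the top are stand-ins that let the OpenCL-style kernel compile and be exercised as plain C for illustration):

    ```c
    /* Stand-ins so the kernel-style code below compiles as ordinary C.
       In real OpenCL these are provided by the compiler/runtime. */
    #define __kernel
    #define __global
    static int current_gid;                    /* simulated global ID */
    static int get_global_id(int dim) { (void)dim; return current_gid; }

    #define ITEMS_PER_WORK_ITEM 16

    /* Each work-item now handles 16 elements, so only N/16 work-items
       are enqueued and the per-invocation overhead K is amortized. */
    __kernel void add_buffers16(__global const float *a,
                                __global const float *b,
                                __global float *result, int n) {
        int base = get_global_id(0) * ITEMS_PER_WORK_ITEM;
        for (int i = 0; i < ITEMS_PER_WORK_ITEM; i++) {
            int idx = base + i;
            if (idx < n)  /* guard the tail when n isn't a multiple of 16 */
                result[idx] = a[idx] + b[idx];
        }
    }
    ```

    Enqueueing would then use a global size of n/16 (rounded up) instead of n.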

    Quote Originally Posted by david.garcia
    GPUs have dedicated hardware threads that are very lightweight to initialize. That's why kernels such as the example you wrote can run at great speed.
    Hmm, perhaps the value of K on a GPU is far lower than I realize. But not all OpenCL code runs on the GPU. For some platforms, it may run on the CPU. Would the overhead matter then?

    In any case, your point highlights a gap in my knowledge. I've tried to locate a good resource for learning about GPU hardware architecture, but there doesn't seem to be much out there. Almost everything I find about GPGPU is from a software perspective and virtually ignores the hardware or treats it very abstractly. (The OpenCL spec is an example of the latter.) What I would love to find is a book of the same scope and caliber as Hennessy and Patterson's venerable "Computer Architecture", but for the GPU instead of the CPU. Does anything remotely like that exist?

    Thanks for any advice.

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: How can kernels that operate on single items be efficient?

    If that overhead is K, and I'm processing N items, then the total overhead in the single-item-per-kernel model is KN. But if I simply rewrite the kernel to compute, say, 16 work items instead of just one, then the overhead shrinks to KN/16. That's quite a speedup for such a relatively simple change to the kernel, especially as N grows.
    Good thinking! Maybe this is something that can be automated ;)

    But not all OpenCL code runs on the GPU. For some platforms, it may run on the CPU. Would the overhead matter then?
    Not really. The people who write OpenCL drivers and compilers have already thought about these issues and solved them pretty well. I can't go into details as this is proprietary information. However, you are already thinking in the right direction.

    I've tried to locate a good resource for learning about GPU hardware architecture, but there doesn't seem to be much out there. [...] What I would love to find is a book of the same scope and caliber as Hennessy and Patterson's venerable "Computer Architecture", but for the GPU instead of the CPU. Does anything remotely like that exist?
    I don't think such a book exists. The only way I know to learn more about what goes on under the hood of GPUs is to work for a company that designs them. The most common way to do that is as a recent graduate, but it's possible to join later as well. We are hiring (hint, hint).
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  5. #5
    Junior Member
    Join Date
    Jul 2011
    Posts
    8

    Re: How can kernels that operate on single items be efficient?

    Quote Originally Posted by david.garcia
    Good thinking! Maybe this is something that can be automated ;)
    Does the smiley at the end mean that OpenCL already scales automatically like this, or is it a suggestion to contribute a way to automate such a thing?

    I've been a bit AFK from OpenCL, but we were having similar thoughts about combining kernels, or other ways of manipulating the hand-written kernel distribution, to adjust the resulting kernels based on the core and thread count.

    So if you are looking for a way to control such automation, I'd be interested in showing what we can do.


    Kalle

