Thread: Huge kernel overhead on Mac

  1. #1

    Huge kernel overhead on Mac

    Hello,
    Sometimes I get huge kernel overhead.
    I measure the kernel time in two ways:
    Code :
    double start_time_total = get_time();
     
    cl::Event event;
    queue.enqueueNDRangeKernel(kernel,
                               cl::NullRange, // offset
                               global,        // global work size
                               cl::NullRange, // local
                               NULL,          // pre-requisite events
                               &event);
     
    double gpu_profiling_time =
            event.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
            event.getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();
    gpu_profiling_time *= 1.0e-9; // Convert nanoseconds to seconds
     
    double end_time_total = get_time();
    double gpu_total_time = end_time_total - start_time_total;
    Where, get_time() uses gettimeofday() to get the current time in seconds as double.
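    For reference, the get_time() helper was not posted, but based on the description it is presumably a thin wrapper over gettimeofday(), something like this sketch (the implementation details are an assumption):

    ```cpp
    #include <sys/time.h>
    #include <cstdio>

    // Hypothetical implementation of the get_time() helper described above:
    // wall-clock time in seconds, as a double (microsecond resolution).
    double get_time()
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main()
    {
        double t0 = get_time();
        // ... code to be timed ...
        double t1 = get_time();
        printf("elapsed: %f s\n", t1 - t0);
        return 0;
    }
    ```

    Note that gettimeofday() reports wall-clock time, which can jump if the system clock is adjusted; for interval measurements clock_gettime(CLOCK_MONOTONIC) is generally preferred on POSIX systems.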

    When the CPU is used as the OpenCL device, the difference between gpu_total_time and gpu_profiling_time makes sense.
    However, when I use my GPU (AMD 6750M, on a MacBook Pro) the overhead is sometimes huge: gpu_profiling_time is 0.000619 s while gpu_total_time is 0.032589 s (~50x slower when measured from the host side).
    The problem occurs consistently with specific kernels.

    Here is the prototype of the kernel if it helps:
    Code :
    kernel void resize(
                       __read_only image2d_t src, 
                       __write_only image2d_t dst, 
                       int width, 
                       int height,
                       float scale_x, 
                       float scale_y)

    Note that the problem does not exist on Windows with NVIDIA hardware (at least for the specific device that I tried).

    Any ideas for a solution?

    Thanks in advance!
    Yoav

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Huge kernel overhead on Mac

    It looks like the way you are measuring time on the host side is incorrect. Starting the clock before calling clEnqueueNDRangeKernel() and stopping it once it returns doesn't measure the same thing as gpu_profiling_time.


    clEnqueueNDRangeKernel() is analogous to ordering a pizza. It takes very little time. However, what you want to know is how long it takes to bake the pizza in the oven. That's what gpu_profiling_time is giving you.

    This has been discussed a few times in the past. I suggest using the search feature to find more information.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  3. #3

    Re: Huge kernel overhead on Mac

    Sorry, I forgot to mention that I do call queue.finish() before the second get_time() call on the host side. I'm aware that clEnqueueNDRangeKernel() is a non-blocking operation.
    However, the problem persists despite the queue.finish(). Something is really strange there.

    Here is the fixed code snippet:
    Code :
    double start_time_total = get_time();
     
    cl::Event event;
    queue.enqueueNDRangeKernel(kernel,
                               cl::NullRange, // offset
                               global,        // global work size
                               cl::NullRange, // local
                               NULL,          // pre-requisite events
                               &event);
    queue.finish(); // block until the kernel has completed
     
    double gpu_profiling_time =
            event.getProfilingInfo<CL_PROFILING_COMMAND_END>() -
            event.getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();
    gpu_profiling_time *= 1.0e-9; // Convert nanoseconds to seconds
     
    double end_time_total = get_time();
    double gpu_total_time = end_time_total - start_time_total;

    I searched for threads about measuring time but couldn't find anything about this problem.

    Thanks in advance,
    Yoav

  4. #4

    Re: Huge kernel overhead on Mac

    I've just noticed that the huge overhead happens only at the first time I run the kernel.

    I do build it in advance and cache the program, but are there additional operations that Apple's implementation does when running a kernel for the first time?

  5. #5
    Junior Member
    Join Date
    Oct 2011
    Location
    Seattle, WA
    Posts
    27

    Re: Huge kernel overhead on Mac

    I see the same thing, and believe this first-time delay is normal and predominantly related to lazy buffer allocation on the compute device.

    You didn't post your kernel code / arguments, but I can speculate.
    If you have an output buffer declared on the device, the implementation has no reason to actually allocate it until you run the kernel for the first time.

    Even the kernel code itself might not be moved over to the device until first use, then cached afterward.

  6. #6

    Re: Huge kernel overhead on Mac

    That makes sense, thanks.
    I think the specification should allow preventing such lazy operations by adding flags to the kernel and buffer constructors.
    It might be the case that the application could do the allocation asynchronously but not run the kernel itself. In such cases, the lazy approach is a waste of time.

  7. #7
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: Huge kernel overhead on Mac

    Quote Originally Posted by yoavhacohen
    That makes sense, thanks.
    I think the specification should allow preventing such lazy operations by adding flags to the kernel and buffer constructors.
    It might be the case that the application could do the allocation asynchronously but not run the kernel itself. In such cases, the lazy approach is a waste of time.
    How is the lazy approach a waste of time? It only happens once, and it has to happen once no matter what. If you're doing any micro-benchmarks, it is a given that you cannot get reliable results unless you let the system 'warm up' first - i.e. do a couple of dummy runs.
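    The warm-up advice can be sketched as a generic benchmarking harness (hypothetical code, not from this thread): run the workload once untimed so that one-time costs such as lazy allocation and code upload are absorbed, then average the timed runs.

    ```cpp
    #include <chrono>
    #include <cstdio>

    // Hypothetical micro-benchmark harness illustrating the warm-up advice:
    // one untimed run absorbs first-use costs, then the remaining runs are averaged.
    template <typename F>
    double timed_average_seconds(F work, int iterations)
    {
        work(); // warm-up run: discarded, absorbs one-time setup costs
        double total = 0.0;
        for (int i = 0; i < iterations; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            work();
            auto t1 = std::chrono::steady_clock::now();
            total += std::chrono::duration<double>(t1 - t0).count();
        }
        return total / iterations;
    }

    int main()
    {
        volatile long sink = 0;
        double avg = timed_average_seconds([&] {
            for (long i = 0; i < 1000000; ++i) sink += i; // stand-in workload
        }, 5);
        printf("average: %f s\n", avg);
        return 0;
    }
    ```

    In the OpenCL case the stand-in workload would be an enqueueNDRangeKernel() followed by queue.finish().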

    BTW, the 'lazy allocation' is at the operating system level, and beyond such a specification's scope. Although a Unix process can be given a big virtual address space, those pages do not exist until they are accessed. This is not something a driver could change.
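    The OS-level demand paging described above is easy to observe directly; here is a small illustrative program (CPU-side mmap, used here only as an analogy for the device-side behavior) where the first touch of each page triggers a page fault while a second pass over already-resident pages does not:

    ```cpp
    #include <sys/mman.h>
    #include <chrono>
    #include <cstdio>

    int main()
    {
        const size_t size = 64 * 1024 * 1024; // 64 MiB of virtual address space
        // mmap hands back address space immediately; physical pages are not
        // allocated until they are first touched (demand paging).
        char *buf = (char *)mmap(NULL, size, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < size; i += 4096) buf[i] = 1; // first touch: page faults
        auto t1 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < size; i += 4096) buf[i] = 2; // second pass: pages resident
        auto t2 = std::chrono::steady_clock::now();

        double first  = std::chrono::duration<double>(t1 - t0).count();
        double second = std::chrono::duration<double>(t2 - t1).count();
        printf("first touch: %f s, second pass: %f s\n", first, second);
        munmap(buf, size);
        return 0;
    }
    ```

    The first pass is typically noticeably slower, which is the same "pay once on first use" pattern being discussed for the kernel launch.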

