Thread: Performance on CPU

  1. #1
    Junior Member
    Join Date
    Apr 2010
    Posts
    6

    Performance on CPU

    First of all, this is my first OpenCL program, so please be kind.

    I am using AMD's OpenCL implementation.
    The command queue is on my CPU (host == device).

    OpenCL time: 978ms
    OpenMP time: 266ms
    One thread time: 279ms

    1. Why is the performance this bad on the same CPU?
    I understand that this workload is limited by memory rather than by the processor, but on the same device I would expect at least the same results.
    I believe there is no buffer copy (I use CL_MEM_USE_HOST_PTR when creating the buffers).

    2. GPUs have restrictions on memory allocation. I want to do very large sparse matrix-vector multiplications for finite element analysis. If I constantly write and read small pieces of the big matrix and vector to and from the GPU, I will pay a performance cost, won't I? (The matrix and vector cannot fit in the GPU's small memory: my ATI Radeon only allows 100MB OpenCL buffer allocations.)

    OpenMP code
    Code :
    #pragma omp parallel for
    for(size_t z = 0; z < SIZE; z++)
    	c[z] = a[z] + b[z];
    OpenCL code:
    Code :
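    // Non-blocking writes: copy a and b into the OpenCL buffers.
    status = clEnqueueWriteBuffer(*cqueue, *ba, CL_FALSE, 0, SIZE * sizeof(float), a, 0, 0, 0);
    status |= clEnqueueWriteBuffer(*cqueue, *bb, CL_FALSE, 0, SIZE * sizeof(float), b, 0, 0, 0);
    // Bind the three buffers to the kernel arguments.
    kernel.setArg(0, *ba);
    kernel.setArg(1, *bb);
    kernel.setArg(2, *bc);
    // One work-item per element; the runtime picks the work-group size.
    size_t dim[1] = { SIZE };
    status |= clEnqueueNDRangeKernel(*cqueue, *kernel, 1, 0, dim, 0, 0, 0, 0);
    // Blocking read: returns once c holds the result.
    status |= clEnqueueReadBuffer(*cqueue, *bc, CL_TRUE, 0, SIZE * sizeof(float), c, 0, 0, 0);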
    Kernel code
    Code :
    __kernel void vector_add(__global float *A, __global float *B, __global float *C)
    {
    	size_t idx = get_global_id(0);
    	C[idx] = A[idx] + B[idx];
    }

  2. #2
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: Performance on CPU

    I'd probably try:
    a) Using float4 as the kernel argument type. The OpenMP compiler can probably vectorise that loop, but I don't think AMD's compiler will do so by itself.
    b) Using clEnqueueMapBuffer/clEnqueueUnmapMemObject rather than clEnqueueWriteBuffer: a write WILL copy the data, because that is what you told it to do. CL_MEM_USE_HOST_PTR is only useful if you already have the data set up and, for a CPU, aligned appropriately.
    c) Or just not timing the copies.

    Still, remember that it is only executing on a CPU, so it isn't magic. OpenMP already spreads the work across cores, and the compiler is probably already vectorising such a simple loop.

    To be honest, if you're only running OpenCL on a CPU and will never move to other hardware, it's hardly worth the hassle. At this point it's more useful as a debugging tool, although CPU and GPU designs look to be converging rapidly.

    If you only have a tiny GPU card, you can't expect to solve huge problems. A GPU kernel can only work on data that is in physical memory on the card at the time it executes. A whole programme can allocate more memory than that, but each kernel must be able to access its data when it runs. Go buy another card or be content with investigating smaller problems: the programming techniques are the same at any rate.
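The constraint that each kernel launch can only see data resident on the card is what forces the piece-by-piece approach asked about in the first post. A minimal sketch of that pattern, using the vector add rather than a sparse matrix-vector product (which would need more careful partitioning): `add_chunk` and `add_in_chunks` are hypothetical names, and `add_chunk` is a plain C++ stand-in for the "write a piece, run the kernel, read the piece back" sequence.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for one device round trip on a piece that fits in device memory.
// In a real program this would be the write / kernel / read sequence.
void add_chunk(const float *a, const float *b, float *c, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        c[i] = a[i] + b[i];
}

// Processes the vectors in pieces of at most max_chunk elements.
// Returns the number of pieces, i.e. the number of round trips paid for.
std::size_t add_in_chunks(const std::vector<float> &a,
                          const std::vector<float> &b,
                          std::vector<float> &c,
                          std::size_t max_chunk)
{
    if (max_chunk == 0)
        return 0;   // nothing sensible to do with an empty chunk size

    std::size_t pieces = 0;
    for (std::size_t offset = 0; offset < a.size(); offset += max_chunk) {
        const std::size_t count = std::min(max_chunk, a.size() - offset);
        add_chunk(a.data() + offset, b.data() + offset,
                  c.data() + offset, count);
        ++pieces;
    }
    return pieces;
}
```

Every piece costs a transfer in each direction, so the smaller the pieces, the more of the run time goes to copying rather than computing.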

  3. #3

    Re: Performance on CPU

    My observations:

    1) Definitely try float4s (or float8s, or float16s) ... I achieved a >50% speedup on an AMD 5870 by going from floats to float4s.

    2) I tried mapping buffers, and for me, for the most part, it only increased throughput where it was already deficient due to variations in other parameters in my tuning process. Still, where my tuning curve is best, it gives 1 to 3 percent. So I use it, but it's not a miracle for me. CL/GL interop may be the miracle I still seek; yet to be investigated....

    3) I don't know OpenMP, but before going to OpenCL I went multi-CPU-core with Xcode's Grand Central Dispatch. It *did* give me increased throughput over single-threaded, but with OpenCL I get (I guess) at least double the CPU-only throughput that I do under GCD. *So*, I'd say that OpenCL may be very worth your while even if you're only using the CPU.

    4) Good luck!

    == Dave

  4. #4
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: Performance on CPU

    Quote Originally Posted by Photovore
    My observations:
    2) I tried mapping buffers, and for me, for the most part, it only increased throughput where it was already deficient due to variations in other parameters in my tuning process. Still, where my tuning curve is best, it gives 1 to 3 percent. SO, I use it, but it's not a miracle for me. CL/GL interop may be the miracle I still seek; yet to be investigated....
    Remember he's talking about a CPU-only speed test, comparing against a CPU loop. It's just timing memory copies, so extra copies are going to add up.

    On a discrete GPU it won't make as much difference, since the data needs to be copied one way or another anyway.
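A back-of-envelope way to see how the copies add up on a CPU device is sketched below. It assumes the run is entirely limited by memory traffic, that each full-size buffer copy reads its source once and writes its destination once, and it ignores caches; the function names are made up for illustration.

```cpp
#include <cstddef>

// Bytes touched by the add itself: read A, read B, write C.
std::size_t kernel_bytes(std::size_t n)
{
    return 3 * n * sizeof(float);
}

// Bytes touched by one full-size buffer copy: read source, write destination.
std::size_t copy_bytes(std::size_t n)
{
    return 2 * n * sizeof(float);
}

// Total traffic when the timed region also contains `copies` buffer copies.
std::size_t timed_bytes(std::size_t n, std::size_t copies)
{
    return kernel_bytes(n) + copies * copy_bytes(n);
}
```

With two writes and one read in the timed region, `timed_bytes(n, 3)` is three times `kernel_bytes(n)`. That is roughly the size of the gap reported in the first post, but it is only a rough model, not a measurement.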
