Thread: CPU vs GPU optimizations

  1. #1
    Junior Member
    Join Date
    Jul 2011
    Posts
    23

    CPU vs GPU optimizations

    Hello

    I have implemented a straightforward, naive matrix multiplication in OpenCL with the AMD SDK. Running it only on the CPU device, I get a speedup of around 16 on an 8-core system. I then applied some popular optimizations: using private and local memory, and grouping my matrix in one dimension so that I use both global and local work sizes. Now I get a speedup of around 24 on the same 8-core CPU.
    First, I am surprised by this much speedup, because with 8 cores I normally get a speedup of about 8 or less, with OpenMP for example. How are figures of 16 and 24 possible?
    Second, I have heard that local and private memory and work-item grouping are optimizations only for GPUs, not for CPUs. So why do I get such a boost when I run only on the CPU?
    Third, how are local memory, private memory, and work-item grouping handled on CPUs, given that they produce a speedup? Are they mapped to caches, processor registers, or something else? Getting this much speedup seems like magic...

    Please help me clarify this. I am new to OpenCL, and it is giving me such a large performance gain that I can't believe it. I have verified the results and they are accurate.
    Thanks in advance

  2. #2

    Re: CPU vs GPU optimizations

    Quote Originally Posted by akhal
    First, I am surprised by this much speedup, because with 8 cores I normally get a speedup of about 8 or less, with OpenMP for example. How are figures of 16 and 24 possible?
    SIMD instructions such as SSE, combined with multithreading.

    Quote Originally Posted by akhal
    Second, I have heard that local and private memory and work-item grouping are optimizations only for GPUs, not for CPUs. So why do I get such a boost when I run only on the CPU?
    Maybe it's how you measure things? Using local memory on a CPU should give you no performance increase, as it is the same as global (host) memory.

    Quote Originally Posted by akhal
    Third, how are local memory, private memory, and work-item grouping handled on CPUs, given that they produce a speedup? Are they mapped to caches, processor registers, or something else? Getting this much speedup seems like magic...
    Of course registers are used as much as possible on CPUs, but besides that you are only left with multithreading and vectorized instructions. As I said before, local and global memory are no different. You can verify this by querying the local memory type (CL_DEVICE_LOCAL_MEM_TYPE via clGetDeviceInfo), which CPU implementations typically report as CL_GLOBAL.
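
    To put rough numbers on the SIMD-plus-multithreading explanation, here is a minimal sketch in C. It assumes single-precision floats, 128-bit SSE registers (4 floats per instruction), and that the reported speedups are measured against a scalar, single-threaded run; none of these is confirmed in the thread. The helper names `peak_speedup` and `fraction_of_peak` are made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper: upper bound on speedup over scalar,
   single-threaded code when every core keeps all SIMD lanes busy. */
int peak_speedup(int cores, int simd_bits, int element_bits)
{
    assert(cores > 0 && element_bits > 0 && simd_bits >= element_bits);
    int lanes = simd_bits / element_bits;  /* e.g. 128 / 32 = 4 floats */
    return cores * lanes;
}

/* Fraction of that bound which a measured speedup reaches. */
double fraction_of_peak(double measured, int peak)
{
    assert(peak > 0);
    return measured / (double)peak;
}
```

    Under these assumptions `peak_speedup(8, 128, 32)` is 32, so the reported speedups of 16 and 24 would sit at 50% and 75% of the bound instead of exceeding what the hardware can do.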

  3. #3
    Junior Member
    Join Date
    Jul 2011
    Posts
    23

    Re: CPU vs GPU optimizations

    Quote Originally Posted by matthiasv
    Maybe it's how you measure things? Using local memory on a CPU should give you no performance increase, as it is the same as global (host) memory.
    I measure it by passing the CL_QUEUE_PROFILING_ENABLE flag when creating the command queue. After enqueuing each kernel, I call clFinish on the queue and then read the time from the corresponding event. That is pretty much the standard way of measuring kernel execution time. I even run it multiple times and average the results, so there is no problem with my timing. So why do I get a many-times speedup from private/local memory plus work-item grouping, compared with the simple kernel that uses only global memory and single work-items, if, as you said, these optimizations are not for CPUs? I really wonder why.

  4. #4

    Re: CPU vs GPU optimizations

    There might be some cache effects due to better alignment of memory accesses, but this is just speculation on my part. Are you using the AMD or the Intel OpenCL implementation? If the latter, you can inspect the compilation result with the Intel Offline Compiler and see what it generates for the simple and the more advanced kernel.

  5. #5
    Junior Member
    Join Date
    Jul 2011
    Posts
    23

    Re: CPU vs GPU optimizations

    I am using the AMD OpenCL implementation...
    Also, is SIMD utilization or auto-vectorization possible if I haven't used OpenCL vector types, for example? And can local/private memory improve speedup on CPUs? I am confused because someone told me that CPU devices have no local memory in OpenCL, so there is no benefit, and that it only improves performance on GPUs...

  6. #6
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: CPU vs GPU optimizations

    Also, is SIMD utilization or auto-vectorization possible if I haven't used OpenCL vector types, for example?
    With a good compiler, yes.
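
    As a rough illustration in plain C (not OpenCL): the loop below is scalar code with independent iterations, much like the per-work-item body of a simple kernel. Compilers such as GCC or Clang may turn a loop like this into SIMD instructions at higher optimization levels. The name `scale_add` is made up, and whether AMD's CPU OpenCL compiler does the same for a particular kernel is not something this sketch confirms.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar code, no vector types. Iterations are independent and the
   restrict qualifiers promise no aliasing, so an optimizing compiler
   is free to process several elements per SIMD instruction. */
void scale_add(size_t n, float a,
               const float *restrict x, const float *restrict y,
               float *restrict out)
{
    assert(n == 0 || (x && y && out));
    for (size_t i = 0; i < n; ++i)
        out[i] = a * x[i] + y[i];
}
```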

    And can local/private memory improve speedup on CPUs? I am confused because someone told me that CPU devices have no local memory in OpenCL, so there is no benefit
    While CPUs do not have actual local memory, writing your algorithm in a way that takes advantage of local memory will often improve cache performance.
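
    A rough sketch of why this can happen, written in plain C instead of OpenCL: a kernel that stages tiles in local memory ends up visiting the matrices block by block, and the same blocking on a CPU lets each block be reused while it is still likely to be in cache. The function names and the tile size are made up for illustration, and this is not the original poster's kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Naive C = A * B for n x n row-major matrices. */
void matmul_naive(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Tiled version: works on tile x tile blocks, the same blocking a
   local-memory kernel uses, so a block of A and B is reused before
   the loops move on to the next one. */
void matmul_tiled(size_t n, size_t tile,
                  const float *a, const float *b, float *c)
{
    assert(tile > 0);
    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += tile)
        for (size_t kk = 0; kk < n; kk += tile)
            for (size_t jj = 0; jj < n; jj += tile) {
                /* Clamp block edges when n is not a multiple of tile. */
                size_t i_end = ii + tile < n ? ii + tile : n;
                size_t k_end = kk + tile < n ? kk + tile : n;
                size_t j_end = jj + tile < n ? jj + tile : n;
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t k = kk; k < k_end; ++k) {
                        float aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

    Both functions compute the same product; only the order of memory accesses differs. How much the tiled order helps depends on the matrix size, the tile size, and the cache hierarchy, so it has to be measured.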
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  7. #7
    Junior Member
    Join Date
    Jul 2011
    Posts
    23

    Re: CPU vs GPU optimizations

    Thank you so much for the kind information...
