
Thread: CPU vs GPU performance

  1. #1
    Newbie
    Join Date
    Oct 2013
    Posts
    2

    CPU vs GPU performance

    Hi
Recently I've been experimenting with OpenCL on a laptop with an i5 3210M processor and an HD 4000 GPU. I am trying to build a picture of which algorithms run faster on which device and why, but some of the results I am getting seem quite strange to me:

    I am testing a very simple kernel - just one floating-point operation and no global/local memory access at all, only registers:

    __kernel void dummy() {
        float f = 2.0f * 2.0f;
    }


    However, the CPU totally wins over the GPU. Here are the results as reported by the Analyze tool in Intel's Kernel Builder for different workspace sizes:

    Work size   CPU        GPU
    10^3        0.049036   0.418
    10^4        0.079028   0.37
    10^5        0.11       0.4
    10^6        0.66       1
    10^7        1.81       7.589
    10^8        5.6        68.18



    The local group sizes are set to Auto, so after several iterations, the optimal is chosen.

    My first hypothesis was that the work per work item is too small and that the thread creation overhead (however cheap it is) outweighs the performance gains.
    My next attempt was to increase the work per work item by adding 50 more floating-point operations, but that did not change the results significantly.

    Could someone explain what I am missing? Thanks.

  2. #2
    Senior Member
    Join Date
    Oct 2012
    Posts
    115
    Your example is far too simple. The compiler will first replace the constant expression 2.0f * 2.0f by the constant value 4.0f, then detect that f is used nowhere and optimize it away.

    As a result your kernel probably does nothing.

    Your kernel should at least write something to an output value, which ensures that computed values are not optimized away.
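    For illustration, a minimal sketch of such a kernel (the kernel and buffer names here are made up for the example): each work item stores its result in a global buffer, so the computation cannot be eliminated as dead code.

    ```c
    // Each work item writes its result to global memory; the store
    // keeps the value live, so the kernel is no longer a no-op.
    __kernel void dummy(__global float *out) {
        int gid = get_global_id(0);
        float f = 2.0f * 2.0f;  // still folded to 4.0f at compile time,
                                // but the write below must be performed
        out[gid] = f;
    }
    ```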

  3. #3
    Newbie
    Join Date
    Oct 2013
    Posts
    2
    Quote Originally Posted by utnapishtim View Post
    Your example is far too simple. The compiler will first replace the constant expression 2.0f * 2.0f by the constant value 4.0f, then detect that f is used nowhere and optimize it away.

    As a result your kernel probably does nothing.

    Your kernel should at least write something to an output value, which ensures that computed values are not optimized away.
    Yes, apparently that is the case.
    The main reason the CPU is faster seems to be that there has to be a meaningful amount of work per work item - perhaps around 100 floating-point operations - in order to hide the thread creation latency.
    I wasn't seeing this behavior for the reason you pointed out: my 'slightly more complicated' kernel was being optimized by the compiler down to the same trivial one. Thanks!
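    For the extra per-item work to survive constant folding, it has to depend on a runtime value such as the global ID. A hedged sketch of what that could look like (kernel name, loop constants, and buffer name are all illustrative):

    ```c
    // ~100 dependent floating-point operations per work item.
    // Because f is seeded from get_global_id(0), the compiler cannot
    // fold the loop into a constant, and the final store keeps it live.
    __kernel void busy(__global float *out) {
        int gid = get_global_id(0);
        float f = (float)gid;
        for (int i = 0; i < 100; i++)
            f = f * 1.000001f + 0.5f;  // each iteration depends on the last
        out[gid] = f;
    }
    ```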
