
Thread: [XF] Float16 vs 16 float

  1. #1
    Junior Member
    Join Date
    Aug 2011
    Posts
    4

    [XF] Float16 vs 16 float

    Hello everybody,

    I'm new to OpenCL. I am trying to illustrate the power of float16, but so far I have failed.
    I built a program that adds two arrays of 1024*1024*16 floats. On the GPU, when I run it with float16, the computation time is 0.03 seconds. On the GPU with 16 separate floats, the computation time is 0.006 seconds. And on the CPU, the computation time is 2 seconds. Why is it slower with float16 than with 16 floats?

    Thanks for your help.

    A part of my code :
    Code :
    File Main.cpp:
     
    // Define an index space (global work size) of threads for execution.
    // A workgroup size (local work size) is not required, but can be used.
    size_t globalWorkSize[1];
    size_t localWorkSize[1];
    // Launch nbKernel/16 work-items in total (one float16 per work-item)
    globalWorkSize[0] = nbKernel/16;
    localWorkSize[0] = 512;
     
    // Execute the kernel.
    // 'globalWorkSize' is the 1D dimension of the work-items
    status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize,
    localWorkSize, 0, NULL, NULL);
     
    clFinish(cmdQueue);
     
    Kernel file (.cl):
     
    __kernel void vecadd(__global float16 const * const A,
                         __global float16 const * const B,
                         __global float16 * const C)
    {
        unsigned int const i = get_global_id(0);
     
        C[i] = A[i] + B[i];
    }

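    The code above just uses clFinish; an alternative way to time the kernel, if more precise numbers are wanted, would be OpenCL event profiling. A minimal sketch, assuming cmdQueue was created with CL_QUEUE_PROFILING_ENABLE (this is not what my code above does):
    Code :
    // Hypothetical variant: time the kernel launch with an OpenCL event
    cl_event evt;
    status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize,
                                    localWorkSize, 0, NULL, &evt);
    clWaitForEvents(1, &evt);
     
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    printf("kernel time: %g s\n", (end - start) * 1e-9);  // timestamps are in nanoseconds
    clReleaseEvent(evt);
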
    Thanks

    Xavier Faure

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: [XF] Float16 vs 16 float

    Can you show us the kernel code for 16 floats -- you only showed the float16 code.

    What is the value of nbKernel in both cases? Have you tried passing NULL instead of localWorkSize?
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  3. #3
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: [XF] Float16 vs 16 float

    Quote Originally Posted by xfaure
    Hello everybody,

    I'm new to OpenCL. I am trying to illustrate the power of float16, but so far I have failed.
    I built a program that adds two arrays of 1024*1024*16 floats. On the GPU, when I run it with float16, the computation time is 0.03 seconds. On the GPU with 16 separate floats, the computation time is 0.006 seconds. And on the CPU, the computation time is 2 seconds. Why is it slower with float16 than with 16 floats?
    There are all sorts of reasons the float16 case might run slower:

    A) The biggest problem is that the memory accesses will not be coalesced. The float16 load will be serialised into a number of smaller loads (float4?), which are spread across the address space. With a plain float type, every thread accesses a single float in a packed range, which can be serviced by a single memory load for each wavefront/warp (see the illustration after this list).
    B) You get less parallelisation of the ALU, i.e. each processing element needs to do more than one operation to implement the arithmetic (whether this is faster or slower depends on the problem; extra parallelism isn't always a win).
    C) You will have 1/16th as many threads running, and that might not be enough to hide the memory latencies, depending on the problem size.
    D) Probably not a problem in this case, but if you did something more complex, you would be using many more registers - this limits how many threads can run concurrently on a given multiprocessor.
    E) Maybe ... if the compiler does a full load, then a full ALU op, then a full store, you don't get such good interleaving of memory and ALU ops, which may prevent the memory latency from being hidden.
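
    A rough illustration of point A, assuming a 32-wide warp/wavefront and 4-byte floats (the exact transaction sizes are hardware dependent):
    Code :
    /* Scalar kernel, C[i] = A[i] + B[i] with i = get_global_id(0):
     *   work-items 0..31 of one warp read A[0]..A[31]
     *   -> bytes 0..127 of A: one contiguous 128-byte block, coalesced into
     *      a single memory transaction.
     *
     * float16 kernel, same expression, but each element is 64 bytes wide:
     *   work-items 0..31 of one warp read A[0]..A[31]
     *   -> bytes 0..2047 of A: 2 KB per warp, which the hardware has to
     *      break into many smaller transactions.
     */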

    I've never used a float16, and I can't imagine where it would be particularly useful for GPU or CPU performance - it might be worth it on Cell though, where instruction-level data pipelining is critical to performance and there are lots of registers.

    GPUs seem to be optimised for float4 (not surprisingly: RGBA, XYZW, etc.).
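
    One way to check this on a particular device is to ask it for the preferred/native float vector width it reports, e.g. (host side; device is assumed to be a valid cl_device_id):
    Code :
    // Query the vector widths the device itself reports for float
    cl_uint preferred = 0, native = 0;
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(preferred), &preferred, NULL);
    clGetDeviceInfo(device, CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT,
                    sizeof(native), &native, NULL);
    printf("preferred float width: %u, native float width: %u\n", preferred, native);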

  4. #4
    Junior Member
    Join Date
    Aug 2011
    Posts
    4

    Re: [XF] Float16 vs 16 float

    Quote Originally Posted by david.garcia
    Can you show us the kernel code for 16 floats -- you only showed the float16 code.
    There are just two differences:

    1) The value of nbKernel:
    nbKernel = 1024*1024*16 (the size of the array) in the case of 16 floats
    nbKernel = 1024*1024 in the case of float16

    2) The declaration of the function:
    __kernel void vecadd(__global float const * const A, __global float const * const B, __global float * const C) in the case of 16 floats
    __kernel void vecadd(__global float16 const * const A, __global float16 const * const B, __global float16 * const C) in the case of float16
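
    Written out in full, the 16-float kernel is therefore just the same body with scalar floats:
    Code :
    __kernel void vecadd(__global float const * const A,
                         __global float const * const B,
                         __global float * const C)
    {
        unsigned int const i = get_global_id(0);
        C[i] = A[i] + B[i];
    }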

    Quote Originally Posted by david.garcia
    What is the value of nbKernel in both cases? Have you tried passing NULL instead of localWorkSize?
    I tried passing NULL but it gave the same result.

    Thanks for your help

  5. #5
    Junior Member
    Join Date
    Aug 2011
    Posts
    4

    Re: [XF] Float16 vs 16 float

    Quote Originally Posted by notzed
    There are all sorts of reasons the float16 case might run slower:
    A) The biggest problem is that the memory accesses will not be coalesced. The float16 load will be serialised into a number of smaller loads (float4?), which are spread across the address space. With a plain float type, every thread accesses a single float in a packed range, which can be serviced by a single memory load for each wavefront/warp.
    I tried a lot of configurations with an 8388608-element float array.
    The two parameters I tried to optimize are the local size (16, 32, 64, 128, ...) and the float vector width (float, float2, float4, ...).
    Here are the results:
    http://fex.insa-lyon.fr/get?k=rGntvoe45QpW32s8IyT
    http://fex.insa-lyon.fr/get?k=sjP7NfgeX22pRJLJUj1
    http://fex.insa-lyon.fr/get?k=b8iNfmjpwRSYdfDb1DR
    http://fex.insa-lyon.fr/get?k=mCLjkJK6sBn2Cq01JpF
    So the float vector width seems to have a bad impact on the computation time.

    Quote Originally Posted by notzed
    B) You get less parallelisation of the ALU, i.e. each processing element needs to do more than one operation to implement the arithmetic (whether this is faster or slower depends on the problem; extra parallelism isn't always a win).
    Does that mean my kernel doesn't have enough work to do?
    Quote Originally Posted by notzed
    C) You will have 1/16th as many threads running, and that might not be enough to hide the memory latencies depending on the problem size.
    I tried with a lot of sizes, and it's the same problem every time.
    Quote Originally Posted by notzed
    D) Probably not a problem in this case, but if you did something more complex, you are using many more registers - this limits how many threads can run concurrently on a given multi-processor.
    You are right: that's not my problem here.
    Quote Originally Posted by notzed
    E) Maybe ... if the compiler does a full load, then a full alu op, then a full store, you don't get such good interleaving of memory + alu ops, which may prevent the memory latency from being hidden.
    I don't understand this answer.
    Quote Originally Posted by notzed
    I've never used a float16, and I can't imagine where it would be particularly useful for GPU or CPU performance - it might be worth it on Cell though, where instruction-level data pipelining is critical to performance and there are lots of registers.

    GPUs seem to be optimised for float4 (not surprisingly: RGBA, XYZW, etc.).
    OK, thanks a lot for your answer, and sorry it took me so long to reply.

    Have a nice day.

    Xavier

  6. #6
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: [XF] Float16 vs 16 float

    Quote Originally Posted by xfaure
    I tried a lot of configurations with an 8388608-element float array.
    The two parameters I tried to optimize are the local size (16, 32, 64, 128, ...) and the float vector width (float, float2, float4, ...).
    Here are the results:
    http://fex.insa-lyon.fr/get?k=rGntvoe45QpW32s8IyT
    http://fex.insa-lyon.fr/get?k=sjP7NfgeX22pRJLJUj1
    http://fex.insa-lyon.fr/get?k=b8iNfmjpwRSYdfDb1DR
    http://fex.insa-lyon.fr/get?k=mCLjkJK6sBn2Cq01JpF
    So the float vector width seems to have a bad impact on the computation time.
    Well ... yes.

    Anyway - you've demonstrated that there is no benefit from using float16, only a negative impact on performance. That is a result in itself.

    The reasons I listed are possible ones - some of the actual hardware details are proprietary, so some of them are only guesses.

    I suspect the main one here is that the memory reads aren't coalesced properly. See the NVIDIA or AMD documentation (the 'programming guide' ones); they cover this pretty well with nice diagrams.

