Page 2 of 3
Results 11 to 20 of 25

Thread: OpenCL Addition Example (Mac OS X 10.6 Snow Leopard)

  1. #11
    Junior Member
    Join Date
    Aug 2009
    Posts
    1

    Re: OpenCL Addition Example (Mac OS X 10.6 Snow Leopard)

    mitchde,
    The performance difference comes mainly from the fact that you are not fully utilizing the ATI architecture. The ATI architecture is a 5-way VLIW, whereas Nvidia is a scalar architecture. If you want to see good performance on ATI hardware, you need to program using the vector types that OpenCL provides.

  2. #12
    Junior Member
    Join Date
    Aug 2009
    Posts
    21

    Re: OpenCL Addition Example (Mac OS X 10.6 Snow Leopard)

    Thanks!

    I will now try float4 instead of float:
    Code :
    /*const char * sProgramSource = 
    "__kernel void vectorAdd(              \n" \
    "__global const float * a,             \n" \
    "__global const float * b,             \n" \
    "__global   float * c)                 \n" \
    "{\n" \
    "   // Vector element index            \n" \
    "   int loop;            \n" \
    "   int test1;            \n" \
    "   int nIndex = get_global_id(0);     \n" \
    "   for (loop=1; loop< 1000; loop++)\n" \
    "{\n" \
    "   c[nIndex] = a[nIndex] + b[nIndex]; \n" \
    "   c[nIndex] = c[nIndex] * (a[nIndex] + b[nIndex]); \n" \
    "   c[nIndex] = c[nIndex] * (a[nIndex] / 2.0 ); \n" \
    "}\n" \
     
    "}\n"; 
     */
     
    const char * sProgramSource = 
     "__kernel void vectorAdd(              \n" \
     "__global const float4 * a,             \n" \
     "__global const float4 * b,             \n" \
     "__global   float4 * c)                 \n" \
     "{\n" \
     "   // Index of the float4 element (four floats per work-item) \n" \
     "   int nIndex = get_global_id(0);     \n" \
     "   c[nIndex] = a[nIndex] + b[nIndex]; \n" \
     "}\n";

    I also changed the OpenCL calls from float to float4:

    hDeviceMemA = clCreateBuffer(hContext,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, cnDimension * sizeof(cl_float4), pA, &err);

    (the same for all three buffers: pA, pB, pC)
    and
    err = clEnqueueReadBuffer(hCmdQueue, hDeviceMemC, CL_TRUE, 0, cnDimension * sizeof(cl_float4),
    pC, 0, NULL, NULL);

    I will report what changed; at least it compiled.
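One thing to watch with the float4 change: each work-item now handles four floats. If cnDimension still counts individual floats, the element count (and the global work size) should shrink by a factor of four, not the buffer grow by a factor of four. Below is a minimal sketch of that bookkeeping in plain C. The helper names and the float4_host type are hypothetical stand-ins (a real host program would use cl_float4 from the OpenCL headers), and it assumes the host arrays are padded to a multiple of four floats.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cl_float4 so the sketch builds without OpenCL headers. */
typedef struct { float s[4]; } float4_host;

/* Number of float4 work-items needed to cover n_floats scalar elements (rounds up). */
size_t float4_work_items(size_t n_floats)
{
    assert(n_floats > 0);
    return (n_floats + 3) / 4;
}

/* Byte count for a buffer holding that many float4 elements. */
size_t float4_buffer_bytes(size_t n_floats)
{
    return float4_work_items(n_floats) * sizeof(float4_host);
}
```

For 1024 floats this gives 256 work-items and 4096 bytes per buffer, the same size as the original float version. Passing cnDimension * sizeof(cl_float4) is only correct if pA, pB and pC really hold cnDimension float4 values.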

  3. #13
    Junior Member
    Join Date
    Aug 2009
    Posts
    21

    Re: OpenCL Addition Example (Mac OS X 10.6 Snow Leopard)

    Oops.
    I have seen that AMD itself uses float or float2, not float4.
    I will now remove that loop and go back to very basic GPU source code for my first steps in OpenCL.

    AMD's own example ( http://ati.amd.com/technology/streamcom ... pencl.html ) also uses float and not float4.



  4. #14

    Re: OpenCL Addition Example (Mac OS X 10.6 Snow Leopard)

    Does it actually use the GPU already?
    I'm quite surprised. The nVidia and ATI OpenCL drivers don't really seem ready, so I actually wonder how Apple could have OpenCL working.

  5. #15
    Junior Member
    Join Date
    Aug 2009
    Posts
    21

    Re: OpenCL Addition Example (Mac OS X 10.6 Snow Leopard)

    Yes indeed. The example, which I modded into a "bench", runs well and gives valid results, at least on Nvidia GPUs under Mac OS X 10.6 Snow Leopard.

    Here are some results that were posted to me:

    NEW V025 test results !

    ATIs (the bench no longer freezes; the ATI 4870 works now):
    Number of OpenCL devices found: 3
    OpenCL Device # 0 = Radeon HD 4870
    Device 0 is an: GPU with max. 750 MHz and 4 units/cores // 4 cores is wrong! //
    Now computing - please be patient....
    time used: 4.126 seconds

    Number of OpenCL devices found: 2
    OpenCL Device # 0 = Radeon HD 4870
    Device 0 is an: GPU with max. 750 MHz and 4 units/cores
    time used: 3.997 seconds
    At least with the current drivers and my benchmark, the ATI results are useless.
    It seems that either OpenCL isn't so universal (the same code running optimized on all GPUs) or there are bugs in the ATI part of OpenCL. Maybe some OpenCL pragma settings must be set for ATI to get better performance.


    NVIDIAs:
    Number of OpenCL devices found: 2
    OpenCL Device # 0 = GeForce GTX 285
    Device 0 is an: GPU with max. 1584 MHz and 240 units/cores
    time used: 0.231 seconds
    OpenCL Device # 1 = Intel® Core™ i7 CPU 920 4,3GHz
    time used: 1.296 seconds

    by grue:
    Number of OpenCL devices found: 3
    OpenCL Device # 0 = GeForce 8800 GT
    Device 0 is an: GPU with max. 1500 MHz and 112 units/cores
    time used: 0.683 seconds
    OpenCL Device # 1 = GeForce GTX 260
    Device 1 is an: GPU with max. 1400 MHz and 216 units/cores
    time used: 0.365 seconds
    OpenCL Device # 2 = Intel® Xeon® CPU X5365 @ 3.00GHz
    time used: 3.094 seconds

    by moondark
    Number of OpenCL devices found: 3
    OpenCL Device # 0 = GeForce 9600M GT
    Device 0 is an: GPU with max. 1250 MHz and 32 units/cores
    time used: 2.798 seconds
    OpenCL Device # 1 = GeForce 9400M
    Device 1 is an: GPU with max. 1100 MHz and 16 units/cores
    time used: 9.549 seconds
    OpenCL Device # 2 = Intel® Core™2 Duo CPU P8600 @ 2.40GHz
    time used: 15.800 seconds

    by antic
    Number of OpenCL devices found: 2
    OpenCL Device # 0 = GeForce 9500 GT
    Device 0 is an: GPU with max. 1350 MHz and 32 units/cores
    time used: 3.053 seconds
    OpenCL Device # 1 = Intel® Core™2 CPU 6600 @ 3.80GHz
    time used: 15.188 seconds

    by ricola
    Number of OpenCL devices found: 2
    OpenCL Device # 0 = GeForce 9400 GT
    Device 0 is an: GPU with max. 1375 MHz and 16 units/cores
    time used: 3.992 seconds
    OpenCL Device # 1 = Intel® Core™2 CPU E7500 @ 3,66 GHz
    time used: 12.048 seconds

  6. #16

    Re: OpenCL Addition Example (Mac OS X 10.6 Snow Leopard)

    Niceeeeee!
    Does it even support the connection with OpenGL?

  7. #17
    Junior Member
    Join Date
    Aug 2009
    Posts
    21

    Re: OpenCL Addition Example (Mac OS X 10.6 Snow Leopard)

    I have now tried an original Apple OpenCL demo.

    It uses OpenCL and OpenGL (GLUT) together, if that is what you mean.

    It looks like this:
    CPU
    http://www.insanelymac.com/forum/ind...=post&id=54948

    GPU
    http://www.insanelymac.com/forum/ind...=post&id=54949

  8. #18
    Junior Member
    Join Date
    Aug 2009
    Posts
    3

    Re: OpenCL Addition Example (Mac OS X 10.6 Snow Leopard)

    static cl_device device;
    > There is no cl_device defined; I must use cl_device_id.

    err = clGetDeviceGroupInfo(device, CL_DEVICE_IDS, &device_id, sizeof(cl_device_id), NULL);
    > The clGetDeviceGroupInfo call doesn't exist, only clGetDeviceInfo; CL_DEVICE_IDS doesn't exist in OpenCL either.

    clGetDeviceConfigInfo(device_id, CL_DEVICE_MAX_THREAD_GROUP_SIZE, &max_threadgroup_size, sizeof(unsigned int), NULL);
    > clGetDeviceConfigInfo doesn't exist in OpenCL, and CL_DEVICE_MAX_THREAD_GROUP_SIZE doesn't exist either, only ..._MAX_WORK_GROUP_....
    These are historical interfaces. I believe they are left over from the WWDC 2008 release. They've been removed from the standard and Apple's OpenCL implementation. Rough translations follow:

    cl_device -> cl_device_id
    clGetDeviceGroupInfo -> clGetContextInfo( CL_CONTEXT_DEVICES)
    clGetDeviceConfigInfo(CL_DEVICE_MAX_THREAD_GROUP_SIZE) -> clGetDeviceInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE)

    The last one might also be translated as clGetKernelWorkGroupInfo, depending on what you are doing.

    All in all, a benchmark that looks at array addition is pretty weak. You are mostly just benchmarking memory bandwidth. Even if you have all the data you need in some equivalent of an L1 cache, it's still 3 LSU ops (two loads and one store) for each arithmetic instruction. You folks should work on some more real-world examples.
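To put a rough number on the memory-bandwidth point, here is a minimal sketch in plain C. It counts the bytes moved per element of c[i] = a[i] + b[i] and derives a lower bound on run time for a purely bandwidth-limited device. The function names are hypothetical, and the bandwidth figure is whatever you measure or look up for your own hardware; nothing here comes from the benchmark above.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved per element of c[i] = a[i] + b[i]: two loads and one store. */
size_t vector_add_bytes_per_element(size_t elem_size)
{
    assert(elem_size > 0);
    return 3 * elem_size;
}

/* Lower bound on run time in seconds if the add is limited only by
   memory bandwidth, given in bytes per second. */
double bandwidth_bound_seconds(size_t n_elements, size_t elem_size,
                               double bytes_per_second)
{
    assert(bytes_per_second > 0.0);
    return (double)(n_elements * vector_add_bytes_per_element(elem_size))
           / bytes_per_second;
}
```

With 4-byte floats that is 12 bytes of traffic for a single addition, so the arithmetic units spend most of their time waiting on memory no matter how many of them the device has.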

  9. #19
    Junior Member
    Join Date
    Aug 2009
    Posts
    21

    Re: OpenCL Addition Example (Mac OS X 10.6 Snow Leopard)

    "All in all, a benchmark that looks at array addition is pretty weak. You are mostly just benchmarking memory bandwidth. Even if you have all the data you need in some equivalent of an L1 cache, it's still 3 LSU ops for each arithmetic instruction. You folks should work on some more real-world examples."

    You are absolutely right, but for getting started with and understanding OpenCL host coding (not the OpenCL source part!), a weak OpenCL source part is OK.

    What do you think about the Apple OpenCL example that computes qJulia on the GPU?
    I think its OpenCL source part is much more "real parallel GPU programming" than a simple vector add, which i7 CPUs can also do really fast.

    I compiled that as well and got around 30 FPS in the starting scenes and 10-60 FPS in the animation, with a fixed 800x800 window.



    Code :
    //
    // File:       qjulia.c
    //
    // Abstract:   This example shows how to use OpenCL to raytrace a 4d Quaternion Julia-Set 
    //             Fractal and intermix the results of a compute kernel with OpenGL for rendering.
    //
    // Version:    <1.0>
    //
    //
    // Copyright ( C ) 2008 Apple Inc. All Rights Reserved.
    //
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    #ifndef WIDTH
    #define WIDTH                       (512)
    #endif
    #ifndef HEIGHT
    #define HEIGHT                      (512)
    #endif
    #define ASPECT                      ((float)WIDTH / (float)HEIGHT)
    #define SQR(x)                      ((x)*(x))
    #define BOUNDING_RADIUS             (2.0f)
    #define BOUNDING_RADIUS_SQR         (SQR(BOUNDING_RADIUS))
    #define ESCAPE_THRESHOLD            (BOUNDING_RADIUS * 1.5f)
    #define DELTA                       (1e-5f)
    #define ITERATIONS                  (10)
    #define EPSILON                     (0.003f)
    #define SHADOWS                     (0)
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    // Note that __float3_SPI is an unsupported vector type.  It is not part of the 
    // OpenCL specification, and is not officially supported by any platform or vendor
    // and it should not be used.
     
    #define FLOAT3_TYPE                 __float3_SPI
    #define FLOAT3_CONSTRUCTOR(x,y,z)   ((__float3_SPI){(x),(y),(z)})
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    #ifndef FLOAT3_TYPE
    #define FLOAT3_TYPE                 float4
    #endif
     
    #ifndef FLOAT3_CONSTRUCTOR
    #define FLOAT3_CONSTRUCTOR(x,y,z)   ((float4){(x),(y),(z),(0.0f)})
    #endif
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    #define float3                      FLOAT3_TYPE
    #define make_float3(x,y,z)          FLOAT3_CONSTRUCTOR(x,y,z)
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    float3 cross3(float3 b, float3 c)
    {
        return make_float3(mad(b.y, c.z,  -b.z * c.y),
                           mad(b.z, c.x,  -b.x * c.z),
                           mad(b.x, c.y,  -b.y * c.x));
     
    }
     
    float3 normalize3(float3 v)
    {
        return v * half_rsqrt(dot(v, v));
    }
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    float4 qmult( float4 q1, float4 q2 )
    {
        float4 r;
        float3 t;
     
        float3 q1yzw = make_float3(q1.y, q1.z, q1.w);
        float3 q2yzw = make_float3(q2.y, q2.z, q2.w);
        float3 c = cross3( q1yzw, q2yzw );
     
        t = q2yzw * q1.x + q1yzw * q2.x + c;
        r.x = q1.x * q2.x - dot( q1yzw, q2yzw );
        r.yzw = t.xyz;
     
        return r;
    }
     
    float4 qsqr( float4 q )
    {
        float4 r;
        float3 t;
     
        float3 qyzw = make_float3(q.y, q.z, q.w);
     
        t     = 2.0f * q.x * qyzw;
        r.x   = q.x * q.x - dot( qyzw, qyzw );
        r.yzw = t.xyz;
     
        return r;
    }
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    float3 
    EstimateNormalQJulia(
        float3 p,
        float4 c,
        int iterations )
    {
        float4 qp = make_float4( p.x, p.y, p.z, 0.0f );
        float4 gx1 = qp - make_float4( DELTA, 0.0f, 0.0f, 0.0f );
        float4 gx2 = qp + make_float4( DELTA, 0.0f, 0.0f, 0.0f );
        float4 gy1 = qp - make_float4( 0.0f, DELTA, 0.0f, 0.0f );
        float4 gy2 = qp + make_float4( 0.0f, DELTA, 0.0f, 0.0f );
        float4 gz1 = qp - make_float4( 0.0f, 0.0f, DELTA, 0.0f );
        float4 gz2 = qp + make_float4( 0.0f, 0.0f, DELTA, 0.0f );
     
        for ( int i = 0; i < iterations; i++ )
        {
            gx1 = qsqr( gx1 ) + c;
            gx2 = qsqr( gx2 ) + c;
            gy1 = qsqr( gy1 ) + c;
            gy2 = qsqr( gy2 ) + c;
            gz1 = qsqr( gz1 ) + c;
            gz2 = qsqr( gz2 ) + c;
        }
     
        float nx = fast_length(gx2) - fast_length(gx1);
        float ny = fast_length(gy2) - fast_length(gy1);
        float nz = fast_length(gz2) - fast_length(gz1);
     
        float3 normal = normalize3(make_float3( nx, ny, nz ));
     
        return normal;
    }
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    float4 
    IntersectQJulia(
        float3 rO,
        float3 rD,
        float4 c,
        float epsilon,
        float escape)
    {
        float rd = 0.0f;
        float dist = epsilon;
        while ( dist >= epsilon && rd < escape)
        {
            float4 z = make_float4( rO.x, rO.y, rO.z, 0.0f );
            float4 zp = make_float4( 1.0f, 0.0f, 0.0f, 0.0f );
            float zd = 0.0f;
            uint count = 0;
            while(zd < escape && count < ITERATIONS)
            {
                zp = 2.0f * qmult(z, zp);
                z = qsqr(z) + c;
                zd = dot(z, z);
                count++;
            }
     
            float normZ = fast_length( z );
            dist = 0.5f * normZ * half_log( normZ ) / fast_length( zp );
            rO += rD * dist;
            rd = dot(rO, rO);
        }
     
        return make_float4(rO.x, rO.y, rO.z, dist);
    }
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    float3
    Phong(
        float3 light,
        float3 eye,
        float3 pt,
        float3 normal,
        float3 base)
    {
        const float SpecularExponent = 10.0f; 
        const float Specularity = 0.45f;
     
        float3 light_dir = normalize3( light - pt );
        float3 eye_dir = normalize3( eye - pt );
        float NdotL = dot( normal, light_dir );
        float3 reflect_dir = light_dir - 2.0f * NdotL * normal;
     
        base += fabs(normal) * 0.5f;
        float3 diffuse = base * fmax(NdotL, 0.0f);
        float3 specular = Specularity * half_powr( fmax( dot(eye_dir, reflect_dir), 0.0f), SpecularExponent );
        return diffuse + specular;
    }
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    float
    IntersectSphere(
        float3 rO,
        float3 rD,
        float radius )
    {
        float fB = 2.0f * dot( rO, rD );
        float fB2 = fB * fB;
        float fC = dot( rO, rO ) - radius;
        float fT = (fB2 - 4.0f * fC);
        if (fT <= 0.0f)
            return 0.0f;
        float fD = half_sqrt( fT );
        float fT0 = ( -fB + fD ) * 0.5f;
        float fT1 = ( -fB - fD ) * 0.5f;
        fT = fmin(fT0, fT1);
        return fT;
    }
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    float4 
    RaytraceQJulia(
        float3 rO,
        float3 rD,
        float4 mu,
        float epsilon,
        float3 eye,
        float3 light,
        float3 diffuse,
        float radius,
        bool shadows,
        int iterations )
    {
        const float4 background = make_float4( 0.15f, 0.15f, 0.15f, 0.0f );
        float4 color = background;
     
        rD = normalize3( rD );
        float t = IntersectSphere( rO, rD, radius );
        if ( t <= 0.0f )
            return color;
     
        rO += rD * t;
        float4 hit = IntersectQJulia( rO, rD, mu, epsilon, ESCAPE_THRESHOLD );
        float dist = hit.w;
        if (dist >= epsilon)
            return color;
     
        rO.xyz = hit.xyz;
        float3 normal = EstimateNormalQJulia( rO, mu, iterations );
     
        float3 rgb = Phong( light, rD, rO, normal, diffuse );
        color.xyz = rgb.xyz;
        color.w = 1.0f;
     
        if (SHADOWS)
        {
            float3 light_dir = normalize3( light - rO );
            rO += normal * epsilon * 2.0f;
            hit = IntersectQJulia( rO, light_dir, mu, epsilon, ESCAPE_THRESHOLD );
            dist = hit.w;
            color.xyz *= (dist < epsilon) ? (0.4f) : (1.0f);
        }
     
        return color;
    }
     
    ///////////////////////////////////////////////////////////////////////////////////////////
     
    float4 
    QJulia(
        float4 coord,
        float4 mu,
        float4 diffuse,
        float epsilon,
        float iterations,
        int shadows,
        uint width,
        uint height)
    {
        float zoom = BOUNDING_RADIUS_SQR;
        float radius = BOUNDING_RADIUS_SQR;
     
        float2 size = make_float2((float)width, (float)height);
        float scale = max(size.x, size.y);
        float2 half = make_float2(0.5f, 0.5f);
        float2 position = (coord.xy - half * size) / scale;
        float2 frame = (position) * zoom;
     
        float3 light = make_float3(1.5f, 0.5f, 4.0f);
        float3 eye = make_float3(0.0f, 0.0f, 4.0f);
        float3 ray = make_float3(frame.x, frame.y, 0.0f);
        float3 base = make_float3(diffuse.x, diffuse.y, diffuse.z);    
     
        float3 rO = eye;
        float3 rD = (ray - rO);
     
        float4 color = RaytraceQJulia( rO, rD, mu, epsilon, eye, light, base, radius, shadows, iterations);
     
        return color;
    }
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
     
    __kernel void
    QJuliaKernel(
        __global uchar4 *result,
        const float4 mu,
        const float4 diffuse,
        const float epsilon)
    {
        int tx = get_global_id(0);
        int ty = get_global_id(1);
        int sx = get_global_size(0);
        int sy = get_global_size(1);
        int index = ty * WIDTH + tx;
        bool valid = (tx < WIDTH) && (ty < HEIGHT);
     
        float4 coord = make_float4((float)tx, (float)ty, 0.0f, 0.0f);
     
        if(valid)
        {
            float4 color = QJulia(coord, mu, diffuse, epsilon, ITERATIONS, SHADOWS, WIDTH, HEIGHT);
            uchar4 output = convert_uchar4_sat_rte(color * 255.0f);
            result[index] = output;
        }
    }
     
    ////////////////////////////////////////////////////////////////////////////////////////////////////
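For anyone who wants to check the quaternion math on the CPU first, here is a minimal plain C sketch of the same qsqr shortcut, cross-checked against the full Hamilton product that qmult computes. This is not part of Apple's sample: the quat type and the _ref function names are hypothetical host-side stand-ins, with x as the real part and y, z, w as the imaginary parts, as in the kernel.

```c
#include <assert.h>

/* Hypothetical host-side stand-in for float4: x is the real part. */
typedef struct { float x, y, z, w; } quat;

/* Same formula as the kernel's qsqr: real part x*x - |yzw|^2,
   imaginary part 2*x*yzw. */
quat qsqr_ref(quat q)
{
    quat r;
    r.x = q.x * q.x - (q.y * q.y + q.z * q.z + q.w * q.w);
    r.y = 2.0f * q.x * q.y;
    r.z = 2.0f * q.x * q.z;
    r.w = 2.0f * q.x * q.w;
    return r;
}

/* General Hamilton product, written out component by component. */
quat qmult_ref(quat a, quat b)
{
    quat r;
    r.x = a.x * b.x - a.y * b.y - a.z * b.z - a.w * b.w;
    r.y = a.x * b.y + a.y * b.x + a.z * b.w - a.w * b.z;
    r.z = a.x * b.z - a.y * b.w + a.z * b.x + a.w * b.y;
    r.w = a.x * b.w + a.y * b.z - a.z * b.y + a.w * b.x;
    return r;
}

/* Aborts if the shortcut and the full product disagree for q.
   Exact comparison is fine for small integer-valued inputs;
   use a tolerance for arbitrary values. */
void check_qsqr(quat q)
{
    quat s = qsqr_ref(q);
    quat p = qmult_ref(q, q);
    assert(s.x == p.x && s.y == p.y && s.z == p.z && s.w == p.w);
}
```

For q = (1, 2, 3, 4) both routes give (-28, 4, 6, 8). The cross terms cancel when a quaternion is multiplied by itself, which is why qsqr can skip the cross product that qmult needs.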
