
Thread: Small matrix operations

  1. #1
    Junior Member
    Join Date
    Oct 2011
    Posts
    15

    Small matrix operations

    After googling a bit it appears as though open source implementations of small matrix operations for OpenCL are not easy to come by.

    I frequently need such functionality, so I have started with 3x3 matrix equations. I've tested pivoted Gaussian elimination with back substitution
    Code :
    // solves the 3x3 system M*x = b in place (rows in s012, s456, s89a; b in x->xyz);
    // returns false if a pivot is smaller than FLOAT_EPSILON
    bool matrixSolve3x3f(float16 *M, float4 *x)
    {
        // find first pivot: move the row with the largest-magnitude column-0 entry to the top
        if(fabs(M->s4) > fabs(M->s0) && fabs(M->s4) > fabs(M->s8))
        {
            float4 t = (float4)(M->s0, M->s1, M->s2, x->x);
            M->s0 = M->s4; M->s1 = M->s5; M->s2 = M->s6; x->x = x->y;
            M->s4 = t.x; M->s5 = t.y; M->s6 = t.z; x->y = t.w;
        }
        else if(fabs(M->s8) > fabs(M->s0))
        {
            float4 t = (float4)(M->s0, M->s1, M->s2, x->x);
            M->s0 = M->s8; M->s1 = M->s9; M->s2 = M->sa; x->x = x->z;
            M->s8 = t.x; M->s9 = t.y; M->sa = t.z; x->z = t.w;
        }
     
        if(fabs(M->s0) < FLOAT_EPSILON)
            return false;
     
        // eliminate
        float u = M->s4/M->s0;
        M->s5 -= M->s1*u; M->s6 -= M->s2*u; x->y -= x->x*u;
     
        u = M->s8/M->s0;
        M->s9 -= M->s1*u; M->sa -= M->s2*u; x->z -= x->x*u;
     
        // find second pivot: swap rows 1 and 2 if the column-1 entry is larger in magnitude there
        if(fabs(M->s5) < fabs(M->s9))
        {
            float4 t = (float4)(M->s5, M->s6, x->y, 0.0f);
            M->s5 = M->s9; M->s6 = M->sa; x->y = x->z;
            M->s9 = t.x; M->sa = t.y; x->z = t.z;
        }
     
        if(fabs(M->s5) < FLOAT_EPSILON)
            return false;
     
        // eliminate
        u = M->s9/M->s5;
        M->sa -= M->s6*u; x->z -= x->y*u;
     
        if(fabs(M->sa) < FLOAT_EPSILON)
            return false;
     
        // back substitution
        x->z /= M->sa;
        x->y = (x->y - M->s6*x->z)/M->s5;
        x->x = (x->x - M->s1*x->y - M->s2*x->z)/M->s0;
     
        return true;
    }
    and Cramer's rule
    Code :
    // solves M*x = b by Cramer's rule; returns false if the determinant is near zero
    bool matrixSolve3x3fKramer(const float16 *M, float4 *x)
    {
        float D = M->s0*(M->s5*M->sa - M->s6*M->s9)
                - M->s1*(M->s4*M->sa - M->s6*M->s8)
                + M->s2*(M->s4*M->s9 - M->s5*M->s8);
     
        if(fabs(D) < FLOAT_EPSILON)
            return false;
     
        float D0 = x->s0*(M->s5*M->sa - M->s6*M->s9)
                 - M->s1*(x->s1*M->sa - M->s6*x->s2)
                 + M->s2*(x->s1*M->s9 - M->s5*x->s2);
     
        float D1 = M->s0*(x->s1*M->sa - M->s6*x->s2)
                 - x->s0*(M->s4*M->sa - M->s6*M->s8)
                 + M->s2*(M->s4*x->s2 - x->s1*M->s8);
     
        float D2 = M->s0*(M->s5*x->s2 - x->s1*M->s9)
                 - M->s1*(M->s4*x->s2 - x->s1*M->s8)
                 + x->s0*(M->s4*M->s9 - M->s5*M->s8);
     
        x->s0 = D0/D;
        x->s1 = D1/D;
        x->s2 = D2/D;
     
        return true;
    }
    I set FLOAT_EPSILON to approximately 2^-32, and matrix multiplication was implemented as follows
    Code :
    float4 matrixMult3x3f(const float16 *M, const float4 *v)
    {
        return (float4)(M->s0*v->x + M->s1*v->y + M->s2*v->z,
                        M->s4*v->x + M->s5*v->y + M->s6*v->z,
                        M->s8*v->x + M->s9*v->y + M->sa*v->z,
                        0.0f);
    }
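    For reference, FLOAT_EPSILON is just a define (2^-32 is well below FLT_EPSILON = 2^-23, the actual machine epsilon for float, so it only rejects pivots extremely close to zero):
    Code :
    #define FLOAT_EPSILON 2.3283064e-10f   // approx 2^-32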
    To test the accuracy I used the following kernel (with random matrices)
    Code :
    __kernel void vecMathTest(__global const float16 *A, __global const float4 *b, __global const int *n, __global float *r)
    {
        int i = get_global_id(0);
     
        if(i < *n)
        {
            float16 M = A[i];
            float4 x = b[i];
            matrixSolve3x3f(&M, &x); // return value ignored; the random test matrices are assumed non-singular
     
            float16 T = A[i];
     
            float4 R = matrixMult3x3f(&T, &x) - b[i];
            r[i] = length(R)/length(b[i]);
        }
    }
    Cramer's rule appears to produce more accurate results, but I expect Gaussian elimination to be faster as it has fewer floating-point operations. I haven't gotten around to timing it though.
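
    When I do, something like the following host-side snippet should work (just a sketch: it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE, and queue, kernel and n stand in for the existing setup):
    Code :
    cl_event ev;
    size_t global = n;   // n = number of systems
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);
     
    // profiling timestamps are in nanoseconds
    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
    printf("kernel time: %.3f us\n", (t1 - t0)*1e-3);
    clReleaseEvent(ev);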

    Now I'm quite new to OpenCL, so I'd be interested in any optimization tips or improvements, as well as thoughts on whether to go for Gauss or Cramer. I'm leaning towards Cramer, but although it is more accurate for random matrices, that is no guarantee it will be so in all cases.

    Also, if anyone notices any bugs I'd appreciate it if you'd point them out.

  2. #2
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: Small matrix operations

    Quote Originally Posted by Peccable
    After googling a bit it appears as though open source implementations of small matrix operations for OpenCL are not easy to come by.
    Cramer's rule appears to produce more accurate results, but I expect Gaussian elimination to be faster as it has fewer floating-point operations. I haven't gotten around to timing it though.

    Now I'm quite new to OpenCL, so I'd be interested in any optimization tips or improvements, as well as thoughts on whether to go for Gauss or Cramer. I'm leaning towards Cramer, but although it is more accurate for random matrices, that is no guarantee it will be so in all cases.

    Also, if anyone notices any bugs I'd appreciate it if you'd point them out.
    Some ideas:

    o (divergent) branches are slow, so that Gaussian elimination implementation will necessarily be slow. At best, branches are implemented by all threads executing all cases and just ignoring the results in the locally untaken branches. At worst, they're executed serially.

    For this problem you probably won't get it very fast, although for larger problems the idea is to parameterise the code so that it doesn't need to branch: the decisions are instead stored in indices and masks. This can lead either to branchless code, or at least to code with reduced control flow.
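
    As a sketch of the idea (hypothetical code, not yours - it assumes each row is packed together with its right-hand-side entry into a single float4), a pivot row swap can be done with select() instead of an if:
    Code :
    // conditionally swap two rows, with no control flow
    void pivotSwap(float4 *r0, float4 *r1)
    {
        // the scalar comparison yields 0 or 1; negate it to get the all-bits mask select() wants
        int4 m = (int4)(-(int)(fabs(r1->x) > fabs(r0->x)));
        float4 t = select(*r0, *r1, m);  // picks *r1 where the mask MSB is set
        *r1 = select(*r1, *r0, m);
        *r0 = t;
    }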

    Even something as innocuous as "if(fabs(M->s4) > fabs(M->s0) && fabs(M->s4) > fabs(M->s8))" has to be implemented as two nested ifs in order to honour the C short-circuit rules for &&.

    Use non-short-circuit logic like this instead:

    if ( (fabs(M->s4) > fabs(M->s0)) & (fabs(M->s4) > fabs(M->s8)) )

    o float multiply and add is really fast. I don't think you'd be able to beat Cramer's rule for this problem, and if you're going for speed you're probably wasting your time trying anything else.
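
    For instance, the determinant in your Cramer code maps straight onto mad() (a sketch - whether the compiler already does this for you is another matter):
    Code :
    // D = s0*(s5*sa - s6*s9) - s1*(s4*sa - s6*s8) + s2*(s4*s9 - s5*s8), as nested mad()s
    float D = mad(M->s0, mad(M->s5, M->sa, -M->s6*M->s9),
              mad(-M->s1, mad(M->s4, M->sa, -M->s6*M->s8),
                   M->s2 * mad(M->s4, M->s9, -M->s5*M->s8)));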

    o global memory access is slow, and much worse if it isn't coalesced. Reading a float16 looks good on paper, but it will be a fairly inefficient access pattern: assuming your typical device will only load a float4 at a time, it is translated into 4x float4 accesses with a per-thread stride of 4.

    One can either re-arrange the data in global memory to be better - but that's usually impractical - or use local memory to re-arrange it. All threads in the workgroup load consecutive float4s at the same time, and then write them out to local memory in a way that can be accessed efficiently later. Local memory has its own caveats though - e.g. threads working with consecutive float4s will cause bank conflicts.

    For small problems like this, though, it may not be worth it, since there are overheads in implementing the software cache too.
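
    A rough sketch of the staging idea (hypothetical and untuned - it assumes the global size is a multiple of the work-group size, and the gather phase still has the bank-conflict issue mentioned above):
    Code :
    __kernel void loadStaged(__global const float4 *A4,  // the float16 array viewed as float4s
                             __local float4 *tile,       // 4 * work-group-size entries
                             __global float16 *out)
    {
        int lid  = get_local_id(0);
        int lsz  = get_local_size(0);
        int base = get_group_id(0) * lsz * 4;
     
        // coalesced phase: consecutive work-items read consecutive float4s
        for(int k = 0; k < 4; ++k)
            tile[k*lsz + lid] = A4[base + k*lsz + lid];
        barrier(CLK_LOCAL_MEM_FENCE);
     
        // gather phase: each work-item assembles its own matrix from local memory
        float16 M;
        M.s0123 = tile[4*lid + 0];
        M.s4567 = tile[4*lid + 1];
        M.s89ab = tile[4*lid + 2];
        M.scdef = tile[4*lid + 3];
     
        out[get_global_id(0)] = M;   // stand-in for the actual solve
    }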

  3. #3
    Junior Member
    Join Date
    Oct 2011
    Posts
    15

    Re: Small matrix operations

    Interesting, thanks for the feedback. It seems quite difficult to avoid branching at times. For example, right now I'm writing a function to test for intersecting tetrahedrons. Getting rid of conditionals here is a challenge, to say the least.

  4. #4
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: Small matrix operations

    Quote Originally Posted by Peccable
    Interesting, thanks for the feedback. It seems quite difficult to avoid branching at times. For example, right now I'm writing a function to test for intersecting tetrahedrons. Getting rid of conditionals here is a challenge, to say the least.
    Well, that's the challenge. Sometimes you need different algorithms, and sometimes you can't avoid it - but the higher memory bandwidth and higher parallelism can usually be used to make the code run acceptably, if not amazingly.

    Do you have a code example of that particular problem?

  5. #5
    Junior Member
    Join Date
    Oct 2011
    Posts
    15

    Re: Small matrix operations

    Actually, it was not too hard to figure out a way:
    Code :
    // does the range [min, max] of a, b, c, d intersect the open interval (0,1)?
    bool sect1d(const float a, const float b, const float c, const float d)
    {
        bool greater = (a >= 1.0f) & (b >= 1.0f) & (c >= 1.0f) & (d >= 1.0f);
        bool lower = (a <= 0.0f) & (b <= 0.0f) & (c <= 0.0f) & (d <= 0.0f);
     
        return !(greater | lower);
    }
     
    // evaluate if two tetrahedrons intersect without branching
    bool intersection(const float4 A0, const float4 B0, const float4 C0, const float4 D0,
                             const float4 A1, const float4 B1, const float4 C1, const float4 D1)
    {
        float16 M;
        M.s048c = B0 - A0;
        M.s159d = C0 - A0;
        M.s26ae = D0 - A0;
     
        float4 V[4] = {A1 - A0, B1 - A0, C1 - A0, D1 - A0};
     
        // solve M*y = V[i] for each of the 4 right-hand sides (a stand-in solver is sketched below)
        matNVecSolve9f(&M, V, 4);
     
        return sect1d(V[0].x, V[1].x, V[2].x, V[3].x)
             & sect1d(V[0].y, V[1].y, V[2].y, V[3].y)
             & sect1d(V[0].x + V[0].y + V[0].z,
                      V[1].x + V[1].y + V[1].z,
                      V[2].x + V[2].y + V[2].z,
                      V[3].x + V[3].y + V[3].z);
    }
    Basically it evaluates all possible cases, even when the intersection result might already be determined.

    Still room for some optimizations, though, I think.
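
    In case anyone wants to run this: matNVecSolve9f isn't shown above. The real one can share work across the right-hand sides, but a minimal stand-in that gives the same results just applies the 3x3 Cramer solver from my first post in a loop:
    Code :
    // stand-in: solve M*y = V[i] in place for each of the n right-hand sides
    void matNVecSolve9f(const float16 *M, float4 *V, const int n)
    {
        for(int i = 0; i < n; ++i)
            matrixSolve3x3fKramer(M, &V[i]);
    }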

  6. #6
    Junior Member
    Join Date
    Nov 2011
    Posts
    4

    Re: Small matrix operations

    Well, small matrix operations are a bit of an edge case if you ask me. Anyway, it's nice to see you come up with an idea that seems to thwart branching in an easier manner. I think once you manage to get clear of conditionals, you're past most of the issues.
