
Thread: Code 10x faster on CPU device than GPU device

  1. #1
    Junior Member (Join Date: Apr 2010, Posts: 11)

    Code 10x faster on CPU device than GPU device

    Hi all,

    I'm using a 1D cellular automaton (CA) to generate random unsigned ints in OpenCL. The CA is an array of uchar, and it evolves in iterations such that ca[i] depends on ca[i] and ca[i-1] from the previous iteration. Each work item evolves 4 uchars in a row and writes the combined uint to our output array after every iteration.
    (Algorithm based on http://home.southernct.edu/~pasqualonia1/ca/report.html.)
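    To make the update rule concrete, here is a plain C, host-side sketch of one CA generation as I read the kernel below. The rule-table contents and the treatment of cell 0's missing left neighbor as 0 (matching the kernel's `(a0>=0 ? ca1[a0] : 0)` guard) are assumptions for illustration:

    ```c
    /* One generation of the 1D CA: next[i] is looked up in a rule table
       indexed by the sum of cur[i-1] and cur[i]. Cell 0 treats its
       missing left neighbor as 0. */
    void ca_step(const unsigned char *cur, unsigned char *next,
                 int len, const unsigned char *rules)
    {
        for (int i = 0; i < len; i++) {
            unsigned char left = (i > 0) ? cur[i - 1] : 0;
            next[i] = rules[left + cur[i]];
        }
    }
    ```

    Each iteration you would swap the `cur` and `next` buffers, the same ping-pong the kernel does with ca1 and ca2.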

    The point of all this is that a relatively small (but necessary) change makes my GPU code an order of magnitude slower but has no real effect on my CPU code. Note also that the CPU runs this MUCH faster than the GPU. I've tried local sizes from 1 to 512.

    In the inner loop of the code below, if we change

    ca2[ar] = locRules[ca1[a3] + ca1[ar]];
    ca2[a3] = locRules[ca1[a2] + ca1[a3]];
    ca2[a2] = locRules[ca1[a1] + ca1[a2]];
    ca2[a1] = locRules[(a0>=0?ca1[a0]:0) + ca1[a1]];


    to

    locRules[ca1[a3] + ca1[ar]];
    locRules[ca1[a2] + ca1[a3]];
    locRules[ca1[a1] + ca1[a2]];
    locRules[(a0>=0?ca1[a0]:0) + ca1[a1]];


    the execution time improves by at least 10x. Any ideas why? Any ideas how to fix it? ca2 is still a __local variable, so the write shouldn't be *that* slow. I'm planning to release this code under the GPL once it is polished. More details after the code.

    /*
    ASSERT: lenCA must be divisible by 4 * local size (each work item updates 4 cells).
    ASSERT: global_size == local_size!
    ASSERT: globCA is initialized to random (by burning-in gen or from global source)
    Generates nNums random 32-bit unsigned ints and stores result in outNums.

    The state of the ith cell at iteration j depends on the state of the
    ith and (i-1)th cell at iteration (j-1).
    */
    __kernel void ca_rng_get_ints (__global uchar *globCA, // the initial state of cellular automaton (ca)
                                   __local uchar *ca1,     // local copies of globCA
                                   __local uchar *ca2,     // we will swap between ca1 and ca2 for efficiency
                                   int lenCA,
                                   __global const uint *rules, // transition rules from one state to next
                                   __local uint *locRules,
                                   int nRules,
                                   int nRNs,               // number of random numbers to generate
                                   __global uint *outNums) {

        __global uint *tgon = outNums;

        int gid = get_global_id(0);
        int nLoc = get_local_size(0);
        int nGlob = get_global_size(0);
        __local int toSave;
        event_t saveRNs;

        // compute this work item's cell indices. Every 4 uchars form one uint
        int cright = (gid+1) * 4 - 1; // right idx.
        int c3 = cright-1;
        int c2 = cright-2;
        int c1 = cright-3;
        int c0 = cright-4; // ca[c0] is READ ONLY

        // load ca and rules into local memory
        event_t loadCA = async_work_group_copy(ca1, globCA, lenCA, 0);
        event_t loadRules = async_work_group_copy(locRules, rules, nRules, 0);

        wait_group_events(1, &loadRules);
        wait_group_events(1, &loadCA);

        // Note that 1 iteration is one generation of CA state.
        int nIters = 1 + nRNs / (lenCA/4); // 4 uchars per output number

        // run CA and generate random numbers
        // idea: bounce back and forth between ca1 and ca2
        for (int iter=0; iter < nIters; iter++){

            // Evolve the CA according to the rules
            for (int offset=0; offset < lenCA; offset+=nLoc*4){
                // update cell states using rule table:
                int a0=c0+offset; int a1=c1+offset; int a2=c2+offset; int a3=c3+offset; int ar=cright+offset;

                // for some reason saving to ca2 takes a lot of time!!
                ca2[ar] = locRules[ca1[a3] + ca1[ar]];
                ca2[a3] = locRules[ca1[a2] + ca1[a3]];
                ca2[a2] = locRules[ca1[a1] + ca1[a2]];
                ca2[a1] = locRules[(a0>=0?ca1[a0]:0) + ca1[a1]];
            }

            // We've finished a complete pass over every CA cell;
            // save results to our output random number array (outNums)
            if (gid==0)
                toSave = min(lenCA/4, nRNs);

            barrier(CLK_LOCAL_MEM_FENCE);

            if (iter>0)
                wait_group_events(1, &saveRNs); // make sure previous iter's save has finished

            saveRNs = async_work_group_copy(outNums, (__local uint*)ca2, toSave, 0);

            // Reposition pointer and remaining count of random numbers to generate
            if (gid==0){
                outNums = &outNums[toSave]; // this is a local variable, so all threads get info
                nRNs -= lenCA/4;
            }

            // swap ca1 and ca2
            if (gid==0){
                __local uchar *t = ca1;
                ca1 = ca2;
                ca2 = t;
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        // remember that we just swapped ca1 with ca2, so ca1 is most up to date
        event_t saveCA = async_work_group_copy(globCA, ca1, lenCA, 0);
        wait_group_events(1, &saveRNs);
        wait_group_events(1, &saveCA);
    }


    CA - 2056 bytes, rules - 256 char rules, nRNs ~ O(million)
    running on Macbook Pro i5 with
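    For scale, plugging the posted sizes into the kernel's own iteration formula gives roughly 2000 generations of the whole CA (a quick host-side check; lenCA = 2056 is from the post, and a round one million outputs is my assumption for "nRNs ~ O(million)"):

    ```c
    #include <stdio.h>

    int main(void) {
        int lenCA  = 2056;              /* CA length in uchars, from the post */
        int nRNs   = 1000000;           /* assumed value for "nRNs ~ O(million)" */
        int perGen = lenCA / 4;         /* uints produced per CA generation: 514 */
        int nIters = 1 + nRNs / perGen; /* same formula as in the kernel: 1946 */
        printf("%d uints per generation, %d generations\n", perGen, nIters);
        return 0;
    }
    ```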

  2. #2
    Senior Member (Join Date: May 2010, Location: Toronto, Canada, Posts: 845)

    Re: Code 10x faster on CPU device than GPU device

    When you eliminate the assignments and leave the code like this:

    Code :
    locRules[ca1[a3] + ca1[ar]]; 
    locRules[ca1[a2] + ca1[a3]]; 
    locRules[ca1[a1] + ca1[a2]]; 
    locRules[(a0>=0?ca1[a0]:0) + ca1[a1]];

    The compiler removes all of those instructions, since they have no effect on the output of the kernel. It's not surprising that it is faster.
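    The same effect can be reproduced in plain C (a sketch of the principle, not the OP's kernel): when the result of the lookup is stored, the loads and table indexing are observable through the output buffer and must be kept; when it is discarded, nothing outside the function can observe it, and an optimizer is free to delete the loads and then the entire loop as dead code.

    ```c
    #include <stddef.h>

    /* With the store: each lookup is observable through 'out',
       so the compiler must keep the loads and the table indexing. */
    void evolve_kept(const unsigned char *cur, unsigned char *out,
                     size_t n, const unsigned char *rules)
    {
        for (size_t i = 1; i < n; i++)
            out[i] = rules[cur[i - 1] + cur[i]];
    }

    /* Without the store: the looked-up value is discarded, so an
       optimizer may delete the loads, and then the loop itself. */
    void evolve_dead(const unsigned char *cur, size_t n,
                     const unsigned char *rules)
    {
        for (size_t i = 1; i < n; i++)
            (void)rules[cur[i - 1] + cur[i]];  /* result unused: dead code */
    }
    ```

    Both functions are "correct", but only the first one does work that survives optimization, which is why removing the assignments makes such a dramatic difference.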

    As for "why so much faster", I don't know since I'm not familiar with the implementation you are using.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  3. #3
    Junior Member (Join Date: Apr 2010, Posts: 11)

    Re: Code 10x faster on CPU device than GPU device

    To clarify: without that code it takes less than a second. With the

    ca2[ar] = ...

    part it takes more like 10 seconds (on the GPU only). It is just a local memory assignment, so the time increase is very unexpected to me.

  4. #4
    Senior Member (Join Date: May 2010, Location: Toronto, Canada, Posts: 845)

    Re: Code 10x faster on CPU device than GPU device

    From the compiler's point of view it's not just a local memory assignment: the store decides whether a whole chain of instructions is live or dead, so removing it saves a ton of instructions. At least that's how it looks from what I can see.

