
Thread: Mysterious kernel behavior. Can anyone explain?

  1. #1 · Junior Member · Join Date: Mar 2010 · Posts: 14

    Mysterious kernel behavior. Can anyone explain?

    The behavior of the test kernel below makes absolutely no sense to me. Any clues as to what's going on will be much appreciated!

    The kernel basically initializes an output array, C, with 1's. It also contains a dummy for-loop that does nothing but initialize two local data structures. If the loop runs only a few iterations, C is correctly filled with 1's, but with many iterations (e.g. 1000) something goes horribly wrong and C is not initialized with 1's. Why does the 1000-iteration loop have this odd effect?

    Bonus info: for some reason the problem does not occur when the input arrays are small. The problem does occur for a configuration with
    C: array of size 11,182,336
    A: array of size 3,101,104
    rowWidths and rowStartIdxs: arrays of size 3,344
    I don't understand why the array sizes have anything to do with this, but they do.

    I would welcome an explanation of the above mysterious phenomenon.

    (FYI: the kernel is a boiled-down version of a larger, more meaningful kernel that suffers from the same "bug")
    Code :
    #define BLOCK_SIZE 16
    __kernel void test(__global int* C, int CSize,  __global int* A, __global int* rowWidths, __global int* rowStartIdxs)
    {
        int bi = get_group_id(0);
        int bj = get_group_id(1);
        int ti = get_local_id(0);
        int tj = get_local_id(1);
        int rowAIdx =  bi * BLOCK_SIZE + ti;
        int rowBIdx =  bj * BLOCK_SIZE + tj;
     
        int cOut = 1;
        for(int x=0; x<1000; x++) {
          __local int As[BLOCK_SIZE][BLOCK_SIZE];
          __local int Bs[BLOCK_SIZE][BLOCK_SIZE];
          As[ti][tj] = 1;
          Bs[ti][tj] = 1;
          barrier(CLK_LOCAL_MEM_FENCE);
        }
        int c = rowBIdx * CSize + rowAIdx;
        C[c] = cOut;
    }

  2. #2 · Senior Member · Join Date: May 2010 · Location: Toronto, Canada · Posts: 845

    Re: Mysterious kernel behavior. Can anyone explain?

    It does look like a bug in the implementation. Can you tell us which OpenCL implementation you are using?

    Did you double and triple-check that the work group size is equal to BLOCK_SIZE x BLOCK_SIZE?

    The difference between a few (how many?) and many loop iterations could be that the implementation is doing some aggressive loop unrolling in the former case.
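    For what it's worth, the usual mismatch is a global size that isn't an exact multiple of the local size. A quick host-side sanity check (plain Python sketch, `sizes_compatible` is my own name, not a pyopencl API) might look like:

    ```python
    # Illustrative check: OpenCL requires every global NDRange dimension to be
    # an exact multiple of the corresponding local (work-group) dimension.
    def sizes_compatible(global_size, local_size):
        return all(g % l == 0 for g, l in zip(global_size, local_size))

    # 3344 = 209 * 16, so a (16, 16) work-group tiles a (3344, 3344) range exactly.
    print(sizes_compatible((3344, 3344), (16, 16)))  # True
    ```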
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer.

  3. #3 · Junior Member · Join Date: Mar 2010 · Posts: 14

    Re: Mysterious kernel behavior. Can anyone explain?

    I access the GPU through the Python interface pyopencl-0.91.4. I'm not sure where to find the OpenCL SDK version, but according to the release notes in the doc/ folder it is "OpenCL R190 Update Release", including "SDK Version 1.2.0.15". Can I somehow verify that this is also the version pyopencl uses?
    I work on a Mac with Snow Leopard 10.6.3.

  4. #4 · Junior Member · Join Date: Mar 2010 · Posts: 14

    Re: Mysterious kernel behavior. Can anyone explain?

    I forgot: yes, I have double-, triple- and quadruple-checked that the work-group size really is BLOCK_SIZE x BLOCK_SIZE.

  5. #5 · Junior Member · Join Date: Mar 2010 · Posts: 14

    Re: Mysterious kernel behavior. Can anyone explain?

    Below is a complete boiled-down example that can be run directly.

    To see if the code works, verify that it outputs (prints) a number larger than zero.
    You can make it work in two ways:
    1) lower the number of loop iterations in the kernel from 1000 to e.g. 10, or
    2) lower the "rows" variable in the Python code from 3344 to e.g. 144.
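    For a sense of scale, here is a rough back-of-envelope in plain Python (assuming every work-item really executes all 1000 iterations, i.e. the compiler doesn't eliminate the dead loop):

    ```python
    # Back-of-envelope: how much work the full-size configuration implies.
    rows, iters = 3344, 1000
    work_items = rows * rows               # one work-item per element of C
    local_stores = work_items * iters * 2  # two __local writes per iteration
    print(work_items)    # 11182336 (matches the size of C above)
    print(local_stores)  # 22364672000
    ```

    Over twenty billion barrier-separated local stores is plausibly enough to make total kernel runtime sensitive to the loop count and array sizes.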

    Code :
    import sys
    import struct
    import pyopencl as cl
    import numpy
     
    block_size = 16
    matrixLength = 3101104
    rows = 3344
     
    row2width = numpy.zeros(rows, numpy.int32)
    row2startIdx = numpy.zeros(rows, numpy.int32)
    matrix = numpy.zeros(matrixLength, numpy.int32)
     
    pl = cl.get_platforms()
    devs = pl[0].get_devices(cl.device_type.GPU)
    if(block_size > devs[0].get_info(cl.device_info.MAX_WORK_GROUP_SIZE)):
       print "Error: block_size is larger than MAX_WORK_GROUP_SIZE..."
       exit(1)
    ctx = cl.Context(devs)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
     
    src = """
    // Thread block size
    #define BLOCK_SIZE 16
     
    __kernel void test(__global int* C, int CSize, __global int* A, __global int* rowWidths, __global int* rowStartIdxs)
    {
        int bi = get_group_id(0);
        int bj = get_group_id(1);
        int ti = get_local_id(0);
        int tj = get_local_id(1);
     
        int rowAIdx =  bi * BLOCK_SIZE + ti;
        int rowBIdx =  bj * BLOCK_SIZE + tj;
     
        int cOut = 1;
        for(int x=0; x<1000; x++) {
          __local int As[BLOCK_SIZE][BLOCK_SIZE];
          __local int Bs[BLOCK_SIZE][BLOCK_SIZE];
          As[ti][tj] = 1;
          Bs[ti][tj] = 1;
          barrier(CLK_LOCAL_MEM_FENCE);
        }
     
        C[rowBIdx * CSize + rowAIdx] = cOut;
    }
    """;
    prg = cl.Program(ctx, src).build();
    matrix_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(matrix).astype(numpy.int32))
    row2width_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(row2width).astype(numpy.int32))
    row2startIdx_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(row2startIdx).astype(numpy.int32))
    o = numpy.zeros(rows * rows).astype(numpy.int32)
    o_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=o)
    w_o_buf = struct.pack("i", rows)
     
    prg.test(queue, [rows, rows], o_buf, w_o_buf, matrix_buf, row2width_buf, row2startIdx_buf, local_size=(block_size, block_size))
    cl.enqueue_read_buffer(queue, o_buf, o).wait()
     
    i = numpy.nonzero(o)
    print len(i[0])

  6. #6 · Senior Member · Join Date: May 2010 · Location: Toronto, Canada · Posts: 845

    Re: Mysterious kernel behavior. Can anyone explain?

    If I were you, the first thing I would do is try to recreate the bug in C instead of using PyOpenCL. That would eliminate the possibility of a problem in the Python bindings.

    Sorry, I don't have my Mac at hand to try this myself.

  7. #7 · Junior Member · Join Date: Mar 2010 · Posts: 14

    Re: Mysterious kernel behavior. Can anyone explain?

    Problem solved! It turns out that NVIDIA hardware that is also used to drive a display enforces a hard limit of about 5 seconds on kernel execution time. After that, the kernel is aborted and an error is returned.

    I solved the problem by splitting it into smaller sub-problems and constructing the final solution from the sub-results.

    Since this 5-second hard limit is essential knowledge for computation-intensive applications (most OpenCL programs, I'd guess), I think it should be more visible in the documentation. IMHO.
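    In case it helps anyone, the splitting can be as simple as launching the kernel once per band of rows instead of once over the whole range. A plain-Python sketch of the host-side partitioning (`band_ranges` is my own name, and the band size is just an example):

    ```python
    def band_ranges(rows, band, block=16):
        """Yield (row_offset, band_height) pairs covering [0, rows), where every
        height is a multiple of the work-group size so each launch tiles cleanly."""
        assert rows % block == 0 and band % block == 0
        for off in range(0, rows, band):
            yield off, min(band, rows - off)

    # 3344 rows split into bands of 416: each launch does ~1/9 of the work,
    # keeping it well under the ~5 s display watchdog.
    tiles = list(band_ranges(3344, 416))
    print(len(tiles))  # 9 launches instead of 1
    ```

    Each (offset, height) pair would then be passed to the kernel (e.g. as an extra row-offset argument) so a launch only writes its own band of C.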

  8. #8

    Re: Mysterious kernel behavior. Can anyone explain?

    I experienced a similar problem some time ago when executing nested for-loops in a kernel, but couldn't find an exact explanation for it at the time.

