Thread: OpenCL CPU issue: execution stop too early

  1. #1
    Junior Member
    Join Date
    Apr 2012
    Posts
    9

    OpenCL CPU issue: execution stop too early

    Hello,

    I am a beginner with OpenCL, and I am trying to run a basic kernel, with the aim of eventually running an image-analysis function on a GPU. For now I just want to execute the following kernel on an Intel Sandy Bridge CPU. I am working in Visual Studio 2010.

    The problem is that my kernel does not seem to be executed on the whole data set. However, while debugging, I found that leaving a printf in the kernel makes it work perfectly x). I assume this is because it slows down the execution...

    Here is the kernel:

    Code :
    __kernel void ker_FLADIntra_sum_c(
    const int stride, 
    __global int* pix,
    __global int* p_flad_sum)
    {
    int i, j;
    j = get_global_id(0);
    i= get_global_id(1);
     
    /* uncommenting this printf makes the kernel work */
    //printf("j=%d\ti=%d\tsum=%d\n", j, i, *p_flad_sum);
     
     *p_flad_sum+= abs(pix[j*stride + i] - pix[(j+1)*stride + i]);
    }

    The host program is mainly based on the OpenCL SDK examples and the Intel tutorial. Here is how I manage memory and run the kernel:

    Code :
        //create OpenCL buffer using input array memory
        g_inputBuffer = clCreateBuffer(g_context, CL_MEM_READ_ONLY, sizeof(cl_int) * arraySize, NULL, NULL);
    	g_flad_sum = clCreateBuffer(g_context, CL_MEM_READ_WRITE, sizeof(cl_int), NULL, NULL);
     
        if (g_inputBuffer == (cl_mem)0)
        {
            printf("ERROR: Failed to create input data Buffer\n");
            return false;
        }
     
    	err = clEnqueueWriteBuffer( g_cmd_queue, g_inputBuffer, CL_TRUE, 0, sizeof(cl_int) * arraySize, inputArray, 0, NULL, NULL );
    	err = clEnqueueWriteBuffer( g_cmd_queue, g_flad_sum, CL_TRUE, 0, sizeof(cl_int), p_flad_sum, 0, NULL, NULL );
     
    	err  = clSetKernelArg(g_kernel, 0, sizeof(cl_int), (void *) &stride);
        err |= clSetKernelArg(g_kernel, 1, sizeof(cl_mem), (void *) &g_inputBuffer);
        err |= clSetKernelArg(g_kernel, 2, sizeof(cl_mem), (void *) &g_flad_sum);
        if (err != CL_SUCCESS)
        {
            printf("ERROR: Failed to set input kernel arguments\n");
            return false;
        }
     
     
        // set work-item dimensions
        size_t global_work_size[2];
        global_work_size[0] = (size_t) stride-1;	// dimension 0 is j in the kernel: stop one short so that pix[(j+1)*stride + i] stays inside the allocation
        global_work_size[1] = (size_t) stride;	// dimension 1 is i in the kernel
        nd=2; // execute kernel (2D)
     
             if (CL_SUCCESS != clEnqueueNDRangeKernel(g_cmd_queue, g_kernel, nd, NULL, global_work_size, NULL, 0, NULL, NULL))
             {
                    printf("ERROR: Failed to execute sorting kernel\n");
                    return false;
             }
     
    	err = clFinish(g_cmd_queue);
     
    	err = clEnqueueReadBuffer( g_cmd_queue, g_flad_sum, CL_TRUE, 0, sizeof(cl_int), p_flad_sum, 0, NULL, NULL );

    I also ran VTune to check the timeline threads behaviour with and without this printf:

    Without printf


    With
    http://img109.imageshack.us/img109/8...fladvtune2.png

    I first thought the problem might be a blocking versus non-blocking buffer read, but the read is already blocking (CL_TRUE).
    What am I missing?

    Thanks.

  2. #2
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: OpenCL CPU issue: execution stop too early

    Quote Originally Posted by chritolb29
    Hello,

    *p_flad_sum+= abs(pix[j*stride + i] - pix[(j+1)*stride + i]);
    This won't work: every work-item adds to the same address concurrently, which is a race condition as soon as multiple threads are running. The read-modify-write on *p_flad_sum is not atomic, so updates from different work-items can overwrite each other and contributions get lost. Search for 'parallel reduction', or look at the SDK examples of similar functions, to see how it needs to be done. If you aren't familiar with race conditions, search for that too; both are very widely known issues.

    Using printf probably just serialises the threads (or otherwise changes their timing), which only makes it appear to work.
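
    To make the 'parallel reduction' idea concrete, here is a minimal sketch. It is plain C rather than a kernel, so it compiles without an OpenCL runtime, and it only emulates the structure a reduction kernel would have. The names flad_term, reduce_group and flad_sum, and the rows and local_size parameters, are made up for illustration and are not from the original post. It also assumes the work-group size is a power of two. Each emulated work-item writes its term to its own scratch slot instead of a shared accumulator, then the group folds the slots together pairwise.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: the value one work-item computes in the original
   kernel. j is the row (0 .. rows-2), i is the column (0 .. stride-1). */
static int flad_term(const int *pix, int stride, int j, int i)
{
    return abs(pix[j * stride + i] - pix[(j + 1) * stride + i]);
}

/* Emulates the reduction inside one work-group. scratch[lid] already holds
   the term of work-item lid. Each pass adds the upper half of the active
   range onto the lower half, so no two work-items ever write the same slot.
   In a real kernel, scratch would be __local memory and every pass would be
   followed by barrier(CLK_LOCAL_MEM_FENCE). */
static int reduce_group(int *scratch, int local_size)
{
    /* Assumption of this sketch: local_size is a power of two. */
    assert(local_size > 0 && (local_size & (local_size - 1)) == 0);
    for (int offset = local_size / 2; offset > 0; offset /= 2) {
        for (int lid = 0; lid < offset; ++lid)
            scratch[lid] += scratch[lid + offset];
    }
    return scratch[0];
}

/* Sum of vertical absolute differences over a rows x stride image.
   Work-items are processed group by group; the per-group partial sums are
   then added up sequentially, as the host would do after reading them back.
   Returns -1 if the scratch buffer cannot be allocated. */
int flad_sum(const int *pix, int stride, int rows, int local_size)
{
    int total_items = (rows - 1) * stride;
    int total = 0;
    int *scratch = malloc(sizeof(int) * (size_t)local_size);
    if (scratch == NULL)
        return -1;

    for (int base = 0; base < total_items; base += local_size) {
        for (int lid = 0; lid < local_size; ++lid) {
            int gid = base + lid;
            /* Work-items past the end of the data contribute zero. */
            scratch[lid] = gid < total_items
                ? flad_term(pix, stride, gid / stride, gid % stride)
                : 0;
        }
        total += reduce_group(scratch, local_size);
    }
    free(scratch);
    return total;
}
```

    In an actual kernel, each work-group would typically write its partial sum to its own element of an output buffer (indexed with get_group_id), and the host or a second kernel pass would add those partial sums together.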

  4. #4
    Junior Member
    Join Date
    Apr 2012
    Posts
    9

    Re: OpenCL CPU issue: execution stop too early

    Thank you for your answer.
    Parallel reduction is indeed a big topic. I rewrote my kernel to take it into account, and it works well now.
