Page 1 of 2 12 LastLast
Results 1 to 10 of 11

Thread: GL-CL interoperability performances

  1. #1
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    GL-CL interoperability performances

    Hi.
    i'm developing a sample app that has to do the following:

    1) render a single frame to an offscreen framebuffer
    2) analyze pixel by pixel the generated image (in different manners).

    i developed a standard openGL app, which draws offscreen, calls the glReadPixels to retrieve rendering result and then do its stuff. this one takes about 0,15 seconds to perform 100 runs on a small rendering (300x300).

    then, i developed an opencl app that:
    1) prepares an opengl context
    2) prepares an opencl buffer from the framebuffer
    3) computes the rendered image on the GPU side (there is no explicit data copy between RAM and VRAM)
    4) retrievs the result of the evaluation from the GPU memory (this is just one float number)

    this app takes about 4 seconds to run!!!!

    a simplified snippet of code:
    Code :
    glutInit(...)
     
    glGenFramebufferEXT(...)
    glBindFramebufferEXT(...)
     
    glGenRenderBufferEXT(1, &colorId);
    glBindRenderbufferEXT(GL_RENDERBUFFER_EXT, colorId);
    glRenderbufferStorageEXT(GL_RENDERBUFFER_EXT, GL_RGBA8, 300, 300)
     
    glFramebufferRenderbufferEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT, GL_RENDERBUFFER_EXT, colorId);
     
    CGLContextObj kCGLContext = CGLGetCurrentContext();              
    CGLShareGroupObj kCGLShareGroup = CGLGetShareGroup(kCGLContext);
     
    cl_context_properties properties[] = { 
    	CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE, 
    	(cl_context_properties)kCGLShareGroup, 
    	0 
    };
     
    // Create a context from a CGL share group
    context = clCreateContext(properties, NULL, NULL, NULL, NULL, &err);
     
    cl_device_id devices[1];
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, devices, NULL);
     
    program = clCreateProgramWithSource(context, 1, strings, lengths, &err);
    queue = clCreateCommandQueue(context, devices[0], NULL, &err);
    m_Kernel = clCreateKernel(program, "Evaluate", &err);
    res_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float), NULL, &err);
    clSetKernelArg(m_Kernel, 1, sizeof(cl_mem), &res_buffer);	
    m_Shared = clCreateFromGLRenderbuffer(context, CL_MEM_READ_ONLY, colorId, &err);
    clSetKernelArg(m_Kernel, 0, sizeof(cl_mem), &m_Shared);	
     
    Draw(); // openGL code to draw to the framebuffer
    glFinish();
     
    clEnqueueAcquireGLObjects(queue, 1, &m_Shared, NULL, NULL, NULL);
    const size_t number = 1;
    clEnqueueNDRangeKernel(queue, m_Kernel, 1, NULL, &number, &number, 0, NULL, NULL);
    clEnqueueReleaseGLObjects(queue, 1, &m_Shared, 0, 0, 0)
     
    float results[1];
    clEnqueueReadBuffer(queue, res_buffer, CL_TRUE, 0, sizeof(float), results, NULL, NULL, NULL);

    the kernel:
    Code :
    __kernel void Evaluate(
        read_only image2d_t framebuffer,
        __global int *results){
     
    int matching = 0;
    for(int i=0;i<3000;i++)
    {
    	for(int j=0;j<3000;j++)
    	{
    		float4 pixel = read_imagef(framebuffer, CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST, (int2)(i,j));
    		if(pixel matches)
                           matching++;
    	}
    results[0] = matching;
    }

    i can't understand why performances are so bad.
    any help would be really appreciated

    if you have any questions, please ask

    thanks a lot,
    christian

  2. #2
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: GL-CL interoperability performances

    It looks like your global work size is 1, which means you're only using 1/16th of 1 of the streaming processors on the GPU. (Which is far slower than 1 core on a CPU.)

    You should set your global work size to 3000, 3000 and remove the for-loops in your kernel. (BTW -- do you really mean 3k by 3k or do you want 300 by 300? The image is only 300x300.) At the end you then use an atomic add to increment the number matching if there is a match. Actually, since the atomic add is going to be very slow, you might want to set your global size to be much smaller (say 100, 100 for a 300x300 image) and then count up to 9 matches per work-item. That would reduce the number of atomic adds you'd have to do.

    Make sure you verify that atomics are supported on the device you're using, though!

  3. #3
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Re: GL-CL interoperability performances

    hi and thank you for the answer.
    actually images are 300x300. i saw i made many mistakes in the last post but i cannot edit it.
    tomorrow i'll surely try what you suggested.
    thanks again

    christian

  4. #4
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Re: GL-CL interoperability performances

    Hi,
    i tryied using several work sizes like suggested by you.
    i got the best performances using small computation areas for each kernel, such as 100 pixels for each kernel for a 300x300 image.
    now i have execution times of about 0,3seconds in openCL vs 0,1 seconds for the CPU algorithm which uses getPixels to retrieve the whole framebuffer from the VRAM.
    do you think this is right or not?

  5. #5
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: GL-CL interoperability performances

    You should be able to execute this (depending of course on the details of "pixel matches") *far* faster on the GPU. There's something wrong.

    Can you post your new kernel for doing the matching pixels counts? Are you using a global atomic add to have each parallel count added together?

  6. #6
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Re: GL-CL interoperability performances

    this is the kernel:
    Code :
    __kernel void Evaluate(
        read_only image2d_t framebuffer,
    	int width,
    	int height,
    	int block_size, 
    	__global int *results
    	){
     
    		int index = get_global_id(0);
     
    		int reds = 0;
    		int a = index*block_size;
    		int b = a+block_size;
    		for(int i=a;i<b;i+=1)
    		{
    			int row = trunc((float)i / (float)width);
    			int column = i - (row * width);
     
    			float4 pixel = read_imagef(framebuffer, CLK_ADDRESS_NONE | CLK_FILTER_NEAREST, (int2)(row,column));
    			if(pixel.x == 1 && pixel.y == 0 && pixel.z == 0)
    			{
    				reds+=1;
    			}
    		}
    		results[index] = reds;
    	}

    as you can see, each work item save in a different location in the results array, so there is no atomic code in the kernels.
    the final sum is made in CPU land, after reading back data.
    note that this is not actually a bottleneck (even with this part of code commented out performances are mostly the same).

  7. #7
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: GL-CL interoperability performances

    What is your global size? In general you need 1000+ global work-items to hide memory latencies.

  8. #8
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Re: GL-CL interoperability performances

    by now i set global work items number to 900 and local size to 1.
    what should i set?

  9. #9
    Junior Member
    Join Date
    Feb 2010
    Posts
    1

    Re: GL-CL interoperability performances

    Quote Originally Posted by whites11
    by now i set global work items number to 900 and local size to 1.
    what should i set?
    Local work size seems pretty small to me. On NVidia architecture, it must be a multiple of 32 for maximum efficiency. When no collaboration is required inside work group threads (ie no communication through shared mem), the local work size should be optimized using CUDA occupancy calculator (an excel spreadsheet available part of the SDK). What you want to do is maximize multiprocessor occupancy factor.

    I hope it helps...

    P-O

  10. #10
    Junior Member
    Join Date
    Feb 2010
    Posts
    10

    Re: GL-CL interoperability performances

    hi and thanks for your answer.
    so, let's say i need a total amount of 900 work items.
    which should be the global and local numbers in your opinion?

Page 1 of 2 12 LastLast

Similar Threads

  1. Replies: 1
    Last Post: 12-13-2012, 03:13 AM
  2. CL GL interoperability
    By sajis997 in forum OpenCL
    Replies: 0
    Last Post: 10-18-2012, 05:34 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •