
Thread: performance comparison between OpenCL & DirectCompute

  1. #1

    performance comparison between OpenCL & DirectCompute

    Hi, all
    I implemented the same algorithm in OpenCL and DirectCompute, using the same total number of threads and the same number of groups. I expected roughly equal performance, but I was wrong: DirectCompute is about 2 times faster than OpenCL. I don't know why. Can anybody tell me the reason?
    I am using Windows 7 and an Nvidia 8600 GT. Thanks!

  2. #2

    Re: performance comparison between OpenCL & DirectCompute

    Take a 1000*1000 image, for example.
    I set the global size to 1000*64 and the local size to 64,
    for both OpenCL and DirectCompute.
    The result: DirectCompute is 2 times faster than OpenCL on the Nvidia 8600 GT.

  3. #3
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: performance comparison between OpenCL & DirectCompute

    We cannot give any advice unless we study the source code for both applications. The first place I would look at would be memory transfers since I've seen people misusing buffers (particularly buffer initialization) frequently.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  4. #4

    Re: performance comparison between OpenCL & DirectCompute

    Quote Originally Posted by david.garcia
    We cannot give any advice unless we study the source code for both applications. The first place I would look at would be memory transfers since I've seen people misusing buffers (particularly buffer initialization) frequently.
    In theory, should the same algorithm implemented in OpenCL and in DirectCompute have roughly the same performance? Is that right?

  5. #5
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: performance comparison between OpenCL & DirectCompute

    Yes, in principle executing the same algorithm in OpenCL and DX Compute on the same hardware will have the same performance.

  6. #6

    Re: performance comparison between OpenCL & DirectCompute

    I tried my best to optimize the OpenCL code, but it did not help. The OpenCL version is almost 2.5 times slower than the DirectCompute version. I am posting the OpenCL code here and asking for your help to optimize it.
    The algorithm is very simple: apply a Gaussian blur to the image pixels.
    Code :
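    #define NUMPIXELPERGROUP 256
    const sampler_t RowSampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;

    // Horizontal Gaussian blur. Each work-group produces NUMPIXELPERGROUP output pixels of one row
    // and stages those pixels plus iRadius extra pixels on each side in local memory.
    // rgbaFloat4ToUint() is a helper that is not shown in this post.
    __kernel void GaussRowProcess(__read_only image2d_t image, __global unsigned int* uiDest, __local float4* localData,
                                  unsigned int uiWidth, unsigned int uiHeight, int iRadius,
                                  __global const float* pCo)
    {
        const int y = (int)get_group_id(1);   // local size in y is 1, so this is the row index
        const int x = (int)get_local_id(0);   // slot in localData, 0 .. NUMPIXELPERGROUP+2*iRadius-1
        const int globalPosX = (int)get_group_id(0) * NUMPIXELPERGROUP + x - iRadius;
        const bool inImage = (globalPosX >= 0) && (globalPosX < (int)uiWidth);
        __global unsigned int* pDst = uiDest + mul24((unsigned int)y, uiWidth);

        // Every work-item writes its slot. Slots outside the image are set to zero so that
        // pixels near the left and right borders never read uninitialized local memory.
        if (inImage)
        {
            localData[x] = read_imagef(image, RowSampler, (int2)(globalPosX, y)) * 255.0f;
        }
        else
        {
            localData[x] = (float4)(0.0f);
        }

        barrier(CLK_LOCAL_MEM_FENCE);

        // Only the middle NUMPIXELPERGROUP work-items write an output pixel.
        if (inImage && (x >= iRadius) && (x < iRadius + NUMPIXELPERGROUP))
        {
            float4 result = (float4)(0.0f);
            float nTotal = 0.0f;
            for (int i = -iRadius; i <= iRadius; ++i)
            {
                const float w = pCo[iRadius + i];
                result += w * localData[x + i];
                nTotal += w;
            }
            pDst[globalPosX] = rgbaFloat4ToUint(result / nTotal);
        }
    }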

    Code :
    	// Kernel arguments, in the order of the GaussRowProcess parameter list.
    	int	n(0);
    	clSetKernelArg(clKernel, n++, sizeof(clSrc), (void*)&clSrc);	// source image
    	clSetKernelArg(clKernel, n++, sizeof(dstMem), (void*)&dstMem);	// destination buffer
    	// Local memory: NUMPIXELPERGROUP output pixels plus g_templateLen extra pixels on each side.
    	clSetKernelArg(clKernel, n++, sizeof(cl_float4)*(NUMPIXELPERGROUP+g_templateLen*2), NULL);
    	clSetKernelArg(clKernel, n++, sizeof(cl_uint), (void*)&g_ImageWidth);
    	clSetKernelArg(clKernel, n++, sizeof(cl_uint), (void*)&g_ImageHeight);
    	clSetKernelArg(clKernel, n++, sizeof(cl_uint), (void*)&g_templateLen);	// blur radius (iRadius)
    	clSetKernelArg(clKernel, n++, sizeof(coMem), (void*)&coMem);	// blur coefficients (pCo)
     
    	// x: one work-group per NUMPIXELPERGROUP-pixel tile (rounded up); y: one row per work-group.
    	size_t	global_size[2]	={0, g_ImageHeight};
    	size_t	local_size[2]	={0, 1};
    	local_size[0] = (size_t)(NUMPIXELPERGROUP+g_templateLen*2); 
    	global_size[0] = local_size[0] * (g_ImageWidth/NUMPIXELPERGROUP+(g_ImageWidth%NUMPIXELPERGROUP==0?0:1));
     
    	clEnqueueNDRangeKernel(g_Command_queue, clKernel, 2, NULL, global_size, local_size, 0, NULL, NULL);
    	clFinish(g_Command_queue);	// block until the kernel has finished
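
The launch-size arithmetic in the host code above can be checked on its own, without a GPU. The following is a minimal sketch, not code from the thread: the names `RowLaunch` and `rowLaunchSize` are made up for illustration, and the radius used in the usage note is an assumed value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: launch sizes along x for one row-blur dispatch.
struct RowLaunch {
    std::size_t groups;  // number of work-groups along x
    std::size_t local;   // work-items per work-group along x
    std::size_t global;  // total work-items along x
};

// Each work-group writes pixelsPerGroup output pixels and also loads
// radius extra pixels on each side, so it needs pixelsPerGroup + 2*radius
// work-items. The number of groups is width / pixelsPerGroup, rounded up.
RowLaunch rowLaunchSize(std::size_t width, std::size_t pixelsPerGroup, std::size_t radius)
{
    assert(pixelsPerGroup > 0);
    RowLaunch r;
    r.groups = (width + pixelsPerGroup - 1) / pixelsPerGroup;
    r.local  = pixelsPerGroup + 2 * radius;
    r.global = r.local * r.groups;
    return r;
}
```

For a 1000-pixel-wide image with 256 pixels per group and an assumed radius of 8, `rowLaunchSize(1000, 256, 8)` gives 4 groups of 272 work-items, 1088 work-items in x. The value of `local` has to stay within the device's maximum work-group size, which can be queried with `clGetDeviceInfo` and `CL_DEVICE_MAX_WORK_GROUP_SIZE`.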

  7. #7

    Re: performance comparison between OpenCL & DirectCompute

    "g_templateLen*2+1" is the template length of the Gaussian blur. For every pixel I use "g_templateLen*2+1" pixels to calculate its RGBA value: g_templateLen pixels before it, the pixel itself, and g_templateLen pixels after it.
    "__global const float *pCo" holds the coefficients that are multiplied with these "g_templateLen*2+1" pixels.
    I use "__local float4* localData" to speed things up, but it does not help.
    As you can see, I only blur horizontally.
    What should I do to optimize this code?
    Using local (shared) memory may speed up memory access, but the code is still much slower than the DirectCompute version.
    So could it have something to do with Nvidia? My card is an Nvidia 8600 GT.
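
When comparing two GPU versions of the same filter, a plain CPU reference helps to confirm that both actually compute the same thing before their timings are compared. The following is a minimal sketch of the horizontal blur described above, for one row of single-channel values. The function name `blurRowReference` is made up, and the border handling (skipping taps that fall outside the row and normalizing by the weights actually used) is an assumption, so it should only be expected to match the kernel for pixels at least `radius` away from the left and right edges.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a horizontal blur of one row.
// coeff holds 2*radius+1 weights; coeff[radius] is the weight of the center pixel.
std::vector<float> blurRowReference(const std::vector<float>& row,
                                    const std::vector<float>& coeff,
                                    int radius)
{
    assert(radius >= 0);
    assert(coeff.size() == static_cast<std::size_t>(2 * radius + 1));

    const int width = static_cast<int>(row.size());
    std::vector<float> out(row.size(), 0.0f);

    for (int x = 0; x < width; ++x) {
        float sum = 0.0f;    // weighted sum of the neighbouring pixels
        float total = 0.0f;  // sum of the weights that were used
        for (int i = -radius; i <= radius; ++i) {
            const int sx = x + i;
            if (sx < 0 || sx >= width) {
                continue;  // assumed border rule: ignore taps outside the row
            }
            const float w = coeff[radius + i];
            sum += w * row[sx];
            total += w;
        }
        out[x] = (total > 0.0f) ? sum / total : 0.0f;
    }
    return out;
}
```

Running each channel of a test row through this function and comparing the result with the output of the OpenCL and DirectCompute versions (allowing for rounding in the conversion to 8-bit) shows whether the two GPU versions do the same work.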

  8. #8
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: performance comparison between OpenCL & DirectCompute

    Can you show us how you create the buffers and load data into them? Also, how do you read back the result?

  9. #9
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: performance comparison between OpenCL & DirectCompute

    Also, have you done any profiling of the individual steps? The very first thing to do when you want to improve the performance of some code is to measure where the time is spent.

    Look in the spec for "CL_QUEUE_PROFILING_ENABLE", "CL_DEVICE_PROFILING_TIMER_RESOLUTION", "CL_PROFILING_COMMAND_QUEUED", "CL_PROFILING_COMMAND_START" and "CL_PROFILING_COMMAND_END".
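
As a rough sketch of how those profiling values are typically used: create the command queue with `CL_QUEUE_PROFILING_ENABLE`, pass a `cl_event` as the last argument of `clEnqueueNDRangeKernel`, wait for it, and read the three counters with `clGetEventProfilingInfo`. The counters are in nanoseconds. The helper below only does the arithmetic on those counters, so it runs without OpenCL; the names `KernelTimes` and `toKernelTimes` are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: where the time of one enqueued command went.
struct KernelTimes {
    double queueDelayMs;  // from enqueue on the host until execution starts
    double executionMs;   // time the command spent executing on the device
};

// The arguments are the raw nanosecond counters that
// clGetEventProfilingInfo returns for CL_PROFILING_COMMAND_QUEUED,
// CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.
KernelTimes toKernelTimes(std::uint64_t queuedNs, std::uint64_t startNs, std::uint64_t endNs)
{
    assert(queuedNs <= startNs && startNs <= endNs);
    KernelTimes t;
    t.queueDelayMs = static_cast<double>(startNs - queuedNs) / 1.0e6;
    t.executionMs  = static_cast<double>(endNs - startNs) / 1.0e6;
    return t;
}
```

A large `queueDelayMs` with a small `executionMs` points at launch or driver overhead. The opposite points at the kernel itself. Timing the buffer and image transfers the same way shows whether the copies dominate.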

  10. #10

    Re: performance comparison between OpenCL & DirectCompute

    Thank you, David. The same program runs as fast as the DirectCompute version on a GeForce 240, so I think the 8600 GT may not support OpenCL well.
