OpenCL tradeoffs with driver



pelangi15
06-16-2011, 12:45 AM
Dear all,

I would like to share some findings from my investigation over the past few days.
In short, I find it tough to achieve a speedup once the tradeoffs are taken into account.

So far, both OpenCL implementations are only functionally correct: each produces the same values as the original code for the 4 output arrays, each 53760 elements in size.

Here is the profiling information I obtained before porting to OpenCL.
http://www.the-passer.com/wp-content/uploads/2011/06/WithoutOpenCL.png

Here is the profiling information after I modified 1 function to use OpenCL with only 1 kernel function.
http://www.the-passer.com/wp-content/uploads/2011/06/WithOpenCL_1_KernelFct.png

Finally, here is the profiling information after I modified the same function to use 2 kernel functions.
http://www.the-passer.com/wp-content/uploads/2011/06/WithOpenCL_2_KernelFct.png

I know there is a good chance that I have coded something wrongly in OpenCL, but on a closer look you will see that the driver plays a part as well, e.g. clIcdGetPlatformIDsKHR (from amdocl.dll) and calddiGetVersion (from aticaldd.dll).

I have also found that clGetPlatformIDs and clBuildProgram (called twice when running 2 kernel functions) make poor use of CPU time.

This means the OpenCL code has to be fast enough to make up for the time lost in the driver.

If anyone can offer a glimmer of hope, please do...

david.garcia
06-16-2011, 03:41 AM
> I have also found that clGetPlatformIDs and clBuildProgram (called twice when running 2 kernel functions) make poor use of CPU time.

Does that mean you are calling clGetPlatformIDs more than once in your application?

pelangi15
06-16-2011, 07:26 PM
Hi David,

I believe the answer is yes.
I have embedded the OpenCL implementation into the application, so the implementation would be called each time an image frame is being processed.

Is it better to only make the call once?

And for clBuildProgram, is there a difference between using a precompiled binary and compiling at runtime?

Thanks!

david.garcia
06-16-2011, 07:53 PM
> I have embedded the OpenCL implementation into the application, so the implementation would be called each time an image frame is being processed.
>
> Is it better to only make the call once?


It would be a lot better if you did the setup calls only once, then kept the objects around for the next image. By setup calls I mean APIs like these:

clGetPlatformIDs
clGetDeviceIDs
clCreateContext
clCreateCommandQueue
clCreateBuffer
clCreateImage2D
clCreateProgramWithSource
clCreateProgramWithBinary
clBuildProgram
clCreateKernel


The only calls that you should be doing from one image to the next would be clSetKernelArg, clEnqueueXXX, and the like.
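
The split between setup and per-frame work can be sketched as follows. This is a minimal illustration, not code from the application discussed here: it deliberately uses no OpenCL headers so that it compiles anywhere, and a plain heap buffer stands in for the OpenCL objects. The names FramePipeline, pipeline_setup, pipeline_process and pipeline_release are hypothetical, and the "kernel" (doubling each value) is a placeholder. The comments mark where the real OpenCL calls would go.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for everything created during setup. In a real port it
 * would keep the cl_context, cl_command_queue, cl_program, cl_kernel and
 * cl_mem handles; here a heap buffer stands in for all of them. */
typedef struct {
    int    ready;        /* non-zero once setup has run */
    int    setup_calls;  /* how many times the expensive setup ran */
    size_t n;            /* elements per frame */
    float *device_buf;   /* stand-in for a cl_mem buffer */
} FramePipeline;

/* One-time setup. A real version would call clGetPlatformIDs,
 * clGetDeviceIDs, clCreateContext, clCreateCommandQueue,
 * clCreateProgramWithSource, clBuildProgram, clCreateKernel and
 * clCreateBuffer here, and nowhere else. */
int pipeline_setup(FramePipeline *p, size_t n)
{
    assert(p != NULL);
    p->device_buf = malloc(n * sizeof *p->device_buf);
    if (p->device_buf == NULL)
        return -1;
    p->n = n;
    p->setup_calls++;
    p->ready = 1;
    return 0;
}

/* Per-frame work. Setup runs lazily on the first frame only; later frames
 * reuse the stored objects. Returns 0 on success, -1 on failure. */
int pipeline_process(FramePipeline *p, const float *in, float *out, size_t n)
{
    assert(p != NULL);
    if (!p->ready && pipeline_setup(p, n) != 0)
        return -1;
    if (n != p->n)
        return -1;  /* this sketch assumes a fixed frame size */

    memcpy(p->device_buf, in, n * sizeof *in);   /* ~ clEnqueueWriteBuffer */
    for (size_t i = 0; i < n; i++)               /* ~ clSetKernelArg +     */
        p->device_buf[i] *= 2.0f;                /*   clEnqueueNDRangeKernel */
    memcpy(out, p->device_buf, n * sizeof *out); /* ~ clEnqueueReadBuffer  */
    return 0;
}

/* Teardown, called once at shutdown. A real version would call the matching
 * clRelease* functions for every stored handle. */
void pipeline_release(FramePipeline *p)
{
    assert(p != NULL);
    free(p->device_buf);
    memset(p, 0, sizeof *p);
}
```

The caller would keep one zero-initialised FramePipeline alive for the whole run (for example one or two levels up the call chain from the per-frame function), pass it to pipeline_process for every frame, and call pipeline_release once at exit.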



> And for clBuildProgram, is there a difference between using a precompiled binary and compiling at runtime?

Loading precompiled binaries saves some time. However, first look into avoiding repeated setup calls (see above) before thinking about binary programs.

pelangi15
06-17-2011, 01:19 AM
Thanks for the advice.

I will try to modify the code based on that, probably by moving the setup 1 or 2 levels up the function call chain.

I will post an update if I get better profiling info.