I would like to share what I have found while investigating this over the past few days.
In short, I am finding it tough to achieve any speedup once the tradeoffs are taken into account.
So far, both OpenCL implementations are only functionally correct: each one produces the same values for the 4 output arrays, each 53760 in size.
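For anyone who wants to repeat that check, here is a minimal sketch of how one output array could be compared against the reference result. It assumes the arrays hold `float` values (the element type is not stated above), and `first_mismatch` and `OUTPUT_LEN` are hypothetical names of my own, not part of any OpenCL API.

```c
#include <stddef.h>

/* Elements per output array, as stated in the post. */
#define OUTPUT_LEN 53760

/* Hypothetical helper: returns the index of the first element where the
 * reference (CPU) array and the OpenCL array differ by more than tol,
 * or -1 if every element matches within the tolerance.
 * Assumes float data; a NaN in either array counts as a mismatch. */
long first_mismatch(const float *ref, const float *test, size_t n, float tol)
{
    for (size_t i = 0; i < n; ++i) {
        float d = ref[i] - test[i];
        if (d < 0.0f)
            d = -d;                 /* absolute difference */
        if (!(d <= tol))            /* also true when d is NaN */
            return (long)i;
    }
    return -1;
}
```

Calling `first_mismatch(cpu_out, ocl_out, OUTPUT_LEN, tol)` once per output array (4 calls in total) would confirm the match. A small non-zero tolerance may be needed, since GPU floating-point results are not always bit-identical to CPU results.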
Here is the profiling information I obtained before porting to OpenCL.
Here is the profiling information after I modified 1 function to use OpenCL with only 1 kernel function.
Finally, here is the profiling information after I modified the same function to use 2 kernel functions.
I know there is a good chance that I have coded the OpenCL parts poorly, but if you take a closer look you will see that the driver plays a part as well, e.g. clIcdGetPlatformIDsKHR (from amdocl.dll) and calddiGetVersion (from aticaldd.dll).
I have also found that clGetPlatformIDs and clBuildProgram (doubled when running 2 kernel functions) take up CPU time while utilizing it poorly.
This means the OpenCL code has to be fast enough to recover the time I lose in the driver.
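To put that tradeoff in numbers, here is a rough break-even sketch. It assumes the driver and build cost is paid once per run and that the per-call times are constant. `break_even_calls` and its parameters are hypothetical names, and any values passed in are placeholders rather than figures from my profiles.

```c
/* Hypothetical helper: smallest number of calls n for which
 *     setup_us + n * ocl_call_us <= n * cpu_call_us
 * i.e. the point where the OpenCL version has paid back its one-time
 * setup cost (platform query, program build, ...).
 * All times are in microseconds. Returns -1 if the OpenCL call is not
 * faster than the CPU call, because then the cost is never recovered. */
long break_even_calls(long setup_us, long cpu_call_us, long ocl_call_us)
{
    long saving = cpu_call_us - ocl_call_us;   /* time saved per call */
    if (saving <= 0)
        return -1;
    return (setup_us + saving - 1) / saving;   /* ceil(setup / saving) */
}
```

For example, with a made-up 500000 us of setup, 1000 us per CPU call and 200 us per OpenCL call, `break_even_calls(500000, 1000, 200)` gives 625 calls. If the function only runs a handful of times per execution, the setup cost dominates no matter how fast the kernel is.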
If anyone can offer a glimmer of hope, please kindly do so...