I have a question about how to get better performance out of my OpenCL application. The amount of computation is quite big: something like 10 million computations are needed.
I'm not sure I'm using the OpenCL API correctly, because my GPU version is no faster than the CPU one. Of course there is no rule that a GPU version will be 100x faster than a CPU one, but please check my current approach to the problem:
The problem needs a lot of computations, i.e. a lot of work items: something like 10 million.
I set global_work_size to 640,
local_work_size to 320.
After every call to clEnqueueNDRangeKernel() I read the results back with clEnqueueReadBuffer() (blocking_read set to CL_TRUE) to check whether my problem is already solved.
The final performance is still very poor. I haven't done any measurements, but I can see it's just not fast enough. If I've left out some basic information, just tell me. If code is needed for the analysis, tell me which part.
PS. I'm computing on an NVIDIA Quadro NVS 140M (laptop).