I have an algorithm that can be implemented with 34 work items executing the same kernel (launched with clEnqueueNDRangeKernel), i.e. in the data-parallel (SIMD) style in OpenCL. Because only 34 work items are used, GPU utilization is very low.
To measure the maximum throughput on the GPU, I want to push as many instances of this algorithm to the GPU as possible so that all of its compute elements are used, i.e. I want task parallelism at the same time as the data parallelism. Can anyone tell me how to do that? My understanding is that a command queue in OpenCL is like a single-server queue: two clEnqueueNDRangeKernel commands can't be executed on the GPU at the same time, even when resources are available. How can I make the device execute multiple instances of the algorithm concurrently while keeping the data parallelism inside each instance?
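To make the question concrete, here is a minimal sketch of the kind of batching I have in mind. It assumes the instances are fully independent and that each one needs exactly 34 work items. The helper names (`batched_global_size`, `instance_of`, `lane_of`) are hypothetical and not part of OpenCL. The code only simulates the index arithmetic on the host in plain C, so it runs without an OpenCL runtime.

```c
#include <stddef.h>

/* Assumption taken from the question: one algorithm instance = 34 work items. */
#define ITEMS_PER_INSTANCE 34

/* Hypothetical helper: global work size for one NDRange that covers
 * num_instances independent instances laid out back to back. */
size_t batched_global_size(size_t num_instances)
{
    return num_instances * ITEMS_PER_INSTANCE;
}

/* Hypothetical helper: which instance a flat global id belongs to. */
size_t instance_of(size_t global_id)
{
    return global_id / ITEMS_PER_INSTANCE;
}

/* Hypothetical helper: position of the work item inside its instance. */
size_t lane_of(size_t global_id)
{
    return global_id % ITEMS_PER_INSTANCE;
}

/* Inside an OpenCL C kernel the same mapping would be written as:
 *
 *   size_t gid      = get_global_id(0);
 *   size_t instance = gid / 34;   // which copy of the algorithm
 *   size_t lane     = gid % 34;   // which of its 34 work items
 *
 * If the kernel is enqueued with a local work size of 34, then
 * get_group_id(0) and get_local_id(0) give the same two values.
 */
```

With this layout, a single clEnqueueNDRangeKernel call with a one-dimensional global work size of `batched_global_size(N)` would cover N instances, with each work item using `instance` to select its own slice of the input and output buffers. Whether this actually fills the device depends on the hardware and on how much local memory and how many registers each instance needs.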