
CL_DEVICE_MAX_WORK_GROUP_SIZE



ilektrik
06-05-2011, 01:36 PM
Hello

I'm getting great support here on the forum, but just a moment ago I ran into a confusing situation, so I've started another thread.

Theoretically, I should be able to run my kernel with 512 work items per work-group. But it seems that when I specify local_work_size=512 for clEnqueueNDRangeKernel, no work is actually done and nearly 10 000 runs of my kernel finish surprisingly fast. When I decrease local_work_size to 256, the kernel runs normally and performs the computations.

This is part of device query for OpenCL:

---------------------------------
Device Quadro NVS 140M
---------------------------------
CL_DEVICE_NAME: Quadro NVS 140M
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 260.99
CL_DEVICE_VERSION: OpenCL 1.0 CUDA
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
CL_DEVICE_MAX_CLOCK_FREQUENCY: 800 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 128 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 113 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 16 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma

Is it possible that the driver/OpenCL implementation is simply giving me wrong information here about the maximum allowed number of work items per work-group?
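
For what it's worth, this is roughly how I launch the kernel (queue, kernel and the sizes below are placeholders for my actual handles, and the usual <CL/cl.h> and <stdio.h> includes are assumed); I suppose checking the return code would show why nothing runs when local_work_size is 512:

size_t global_work_size = 10240;  /* placeholder: must be a multiple of local_work_size */
size_t local_work_size  = 512;    /* the value that makes the kernel "do nothing" */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_work_size, &local_work_size,
                                    0, NULL, NULL);
if (err != CL_SUCCESS)
    printf("clEnqueueNDRangeKernel failed with error %d\n", err);  /* e.g. CL_INVALID_WORK_GROUP_SIZE */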

david.garcia
06-05-2011, 01:43 PM
Theoretically, I should be able to run my kernel with 512 work items per work-group.

I believe you are confusing the maximum work-group size supported by the device with the maximum work-group size that can be used to run a particular kernel. For a very complex kernel, that maximum will be smaller than what the device can support for a very simple kernel.

That's why in OpenCL there are two different queries. One of them is clGetDeviceInfo(..., CL_DEVICE_MAX_WORK_GROUP_SIZE, ...) -- this is the maximum for the device. The other one is clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...) -- this one is the maximum value you can pass to clEnqueueNDRangeKernel() for this kernel.
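
A minimal sketch of the two queries (assuming you already have a cl_device_id named device and a cl_kernel named kernel; error handling omitted):

size_t device_max = 0, kernel_max = 0;

/* Upper bound for any kernel on this device */
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(device_max), &device_max, NULL);

/* Upper bound for this particular kernel on this device */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernel_max), &kernel_max, NULL);

The second value, kernel_max, is the one to respect when choosing local_work_size for clEnqueueNDRangeKernel().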

sean.settle
06-05-2011, 08:51 PM
In OpenCL there are two different queries. One of them is clGetDeviceInfo(..., CL_DEVICE_MAX_WORK_GROUP_SIZE, ...) -- this is the maximum for the device. The other one is clGetKernelWorkGroupInfo(..., CL_KERNEL_WORK_GROUP_SIZE, ...) -- this one is the maximum value you can pass to clEnqueueNDRangeKernel() for this kernel.

Is CL_KERNEL_WORK_GROUP_SIZE guaranteed to be less than or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE, or would one have to explicitly take the minimum of the two?

david.garcia
06-06-2011, 03:31 AM
Is CL_KERNEL_WORK_GROUP_SIZE guaranteed to be less than or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE, or would one have to explicitly take the minimum of the two?

Technically speaking, I don't think the spec guarantees that CL_KERNEL_WORK_GROUP_SIZE must be less than or equal to CL_DEVICE_MAX_WORK_GROUP_SIZE.

However, the spec describes CL_KERNEL_WORK_GROUP_SIZE as the "maximum work-group size that can be used to execute a kernel on a specific device given by <device>", and that's all the application cares about.

CL_DEVICE_MAX_WORK_GROUP_SIZE is pretty much irrelevant.
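
Put differently, a sketch of what the application can do (placeholder names, no error checking): query the per-kernel limit and clamp its preferred local size to it.

size_t kernel_max = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(kernel_max), &kernel_max, NULL);

size_t local = preferred_local;   /* whatever local size the application would like */
if (local > kernel_max)
    local = kernel_max;           /* the per-kernel limit is the one that matters */
/* global_work_size still has to be a multiple of local */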

ilektrik
06-06-2011, 05:11 AM
Following the documentation here http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetKernelWorkGroupInfo.html, I wrote a query and executed it:

size_t result;   /* CL_KERNEL_WORK_GROUP_SIZE is returned as a size_t, not a uint */
size_t size_ret;
clGetKernelWorkGroupInfo(OpenCL, NULL, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), (void*)&result, &size_ret);

result is 320.

Running my kernel with global_work_size=640 & local_work_size=320 produces the correct results. The computations are still not faster than on the CPU, but that is a subject for a separate thread...

Thanks