Info about device Query
I am using the NVIDIA OpenCL SDK, and I have two graphics cards in my system: a GTX 580 and a GT 240...
I ran a device query program, essentially:
clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(cBuffer), cBuffer, NULL); /* cBuffer is a char array, so pass it directly rather than &cBuffer */
printf(" Device %s\n", cBuffer);
This shows me the following output (only one device shown):
CL_DEVICE_NAME: GeForce GTX 580
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DEVICE_VERSION: OpenCL 1.1 CUDA
CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.1
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1544 MHz
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 383 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1535 MByte
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_SINGLE_FP_CONFIG: denorms INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma
NUMBER OF MULTIPROCESSORS: 16
NUMBER OF CUDA CORES: 512
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
Now I am confused about how the number of cores and the number of multiprocessors, and the local and global sizes, map onto the usual OpenCL terminology: the number of work-groups on a device, the number of processing elements in each work-group, and the total local memory available to each work-group.
It would be really great if anyone could explain the complete output in terms of processing elements, work-groups, compute units, the local work size of each work-group, and the complete global size.
I am a newbie and get confused matching the technical terms to the real output.
Here's some of my understanding of these concepts. I'm also new to OpenCL and not familiar with NVIDIA GPUs, so please correct me if I am wrong.
1. Every work-item executes your kernel.
2. A work-group is composed of a bunch of such work-items (the local work size determines how many work-items there are in one work-group).
3. max_work_group_size, 1024 in your case, means that there can be at most 1024 work-items in one work-group.
4. max_work_item_sizes, (1024, 1024, 64), means that your work-group's size can be (1024, 1, 1), (1, 1024, 1), or (256, 2, 2), etc. But it cannot be (1, 1, 1024), since the third dimension is limited to 64.
5. Work-items in one work-group share that group's local memory.
6. The above are all abstract concepts, while a compute unit is a real hardware component of your GPU. One work-group can only execute on one compute unit, but a compute unit can handle several work-groups.
I'm not sure whether these are the answers you are looking for. Good luck.
Thanks for the prompt reply... it helped me some.
Still have a few doubts, though.
Thanks again, leoamuro.
This is quite a complex matter.
An NVIDIA multiprocessor is the hardware unit which corresponds to an OpenCL compute unit. Each multiprocessor can independently run concurrent threads.
In NVIDIA hardware, threads are grouped into warps. A warp contains 32 threads which are executed simultaneously (as long as there is no branching in fact, but that's another story).
A multiprocessor can keep up to 48 warps running at the same time (for a 2.0 compute capability device). So with 16 multiprocessors in a GTX 580, this gives a theoretical maximum 16*48*32 = 24,576 threads running concurrently.
However, a thread performs computations and memory accesses, and all this severely limits the number of threads really working at a time. A CUDA core is more or less an ALU that performs integer and floating-point operations. The GTX 580 has 512/16 = 32 CUDA cores per multiprocessor. So even if a multiprocessor can run 48*32 = 1,536 concurrent threads, it can only perform 32 multiplications per clock cycle.
A logical OpenCL work-group is split into a (hardware) block of warps. For instance, if you have a work-group of size 16*16=256, it will be split into a block of 256/32=8 warps.
When you execute a kernel, the hardware tries to run the maximum number of blocks for your global work-size. A multiprocessor can handle at most 8 blocks simultaneously. As a result, if a work-group is smaller than 1536/8 = 192, the 8 resident blocks cannot supply enough threads to fill the multiprocessor, and the GPU occupancy will be lower than 100%. For instance, a work-group of size 128 with the maximum of 8 blocks will run only 128*8 = 1024 threads, for an occupancy of 1024/1536 = 67%.
Each multiprocessor also has its own local memory (48 KB for the GTX 580). This local memory is shared among the blocks running on the multiprocessor. As a result, this can also limit the number of blocks running at a time if a block uses a lot of local memory.
Thanks, that cleared up the whole thing...