View Full Version : How do OpenCL kernels execute?



phoebe0105
05-26-2010, 12:00 AM
I don't understand how an OpenCL kernel actually executes.
I want to compute something in an OpenCL kernel using 256 work-items and get a final result.

Does a single clEnqueueNDRangeKernel() call run all 256 work-items?
Or do I need a for loop? (If I use a for loop, do I have to repeat it 16*16 times?)

ibbles
05-26-2010, 04:57 AM
If you configure your call to launch 256 work items, using the localSize and globalSize arguments, then a single clEnqueueNDRangeKernel() call will launch all those 256 threads. No need for a loop.

phoebe0105
05-26-2010, 09:06 AM
If you configure your call to launch 256 work items, using the localSize and globalSize arguments, then a single clEnqueueNDRangeKernel() call will launch all those 256 threads. No need for a loop.

Thanks for your advice.

I have one more question. If I want 256 work-items, what should localSize and globalSize be?
Right now in my source, globalSize is 16*16 (= 256) and localSize is 16.
Is that right or wrong?

ibbles
05-26-2010, 11:00 AM
That is correct. The global size is the total number of threads you want, 256 in this case. Setting localSize to 16 will split those 256 threads into 256/16 = 16 groups.

Note that both of these sizes may consist of several values, one for each dimension of the NDRange. So setting the global size to 256 and work_dim to 1 gives you a consecutive range of thread ids from 0 up to, but not including, 256. Your question included a 16*16, which hints at a two-dimensional problem. If that is the case, then you may set the global size to [16,16] and work_dim to 2, which will spawn 256 threads in a 16-by-16 grid.

phoebe0105
05-26-2010, 12:07 PM
That is correct. The global size is the total number of threads you want, 256 in this case. Setting localSize to 16 will split those 256 threads into 256/16 = 16 groups.

Note that both of these sizes may consist of several values, one for each dimension of the NDRange. So setting the global size to 256 and work_dim to 1 gives you a consecutive range of thread ids from 0 up to, but not including, 256. Your question included a 16*16, which hints at a two-dimensional problem. If that is the case, then you may set the global size to [16,16] and work_dim to 2, which will spawn 256 threads in a 16-by-16 grid.

I set the global size to [16,16] and work_dim to 2.
However, a problem occurred: the call fails with CL_INVALID_WORK_GROUP_SIZE.

I changed the source code.

1. First change

szGlobalWorkSize[0] = iWidth; // iWidth = 16
szGlobalWorkSize[1] = iHeight; // iHeight = 16
szLocalWorkSize[0]= NUM_THREADS;

err = clEnqueueNDRangeKernel(oclHandles.queue,
m_clHexEncodeKernel,
2, // work_dim value
NULL,
szGlobalWorkSize,
szLocalWorkSize,
0, NULL, NULL);


2. Second change


szLocalWorkSize[0] = iWidth;
szLocalWorkSize[1] = iHeight;
szGlobalWorkSize[0] = shrRoundUp((int)szLocalWorkSize[0], iWidth);


What am I doing wrong? :?:

ibbles
05-27-2010, 07:11 AM
In both cases, you only set one value in one of the size arrays. In 1), szLocalWorkSize[1] is still undefined, and in 2) the same is true for szGlobalWorkSize[1].

It would probably help if you added prints right before clEnqueue... that did something like



printf("Local: %zu, %zu\n", <fill in here>);
printf("Global: %zu, %zu\n", <fill in here>);
printf("Dim: %u\n", work_dim);

phoebe0105
05-27-2010, 10:48 AM
In both cases, you only set one value in one of the size. In 1), szLocalWorkSize[1] is still undefined, and in 2) the same is true for szGlobalWorkSize.

It would probably help if you added prints right before clEnqueue... that did something like



printf("Local: %zu, %zu\n", <fill in here>);
printf("Global: %zu, %zu\n", <fill in here>);
printf("Dim: %u\n", work_dim);


OK. I set szLocalWorkSize[1] as well, and that fixed the problem.

I have another question.
Can an OpenCL kernel be executed sequentially?

ibbles
05-28-2010, 11:24 AM
Typically, no. The main idea behind OpenCL is data parallelism, which implies parallel threads. You could of course launch only one thread at a time, but that would be horribly inefficient.

Different kernels (or the same kernel multiple times) can be run sequentially, if that is what you were asking.