Thread: Best practice choosing right work group size

  1. #1
    Newbie
    Join Date
    Jul 2014
    Posts
    2

    After reading several books and googling, I still don't fully understand the right way to choose the work group dimensions.

    What I understood is that we can either let OpenCL choose them automatically or set them "manually" ourselves.
    Take a 1024*1024 image as an example:

    size_t globalThreads[] = { 1024, 1024 };
    • Automatically
      Code :
      status = clEnqueueNDRangeKernel(
      		commandQueue,
      		kernl,
      		2,
      		NULL,
      		globalThreads,
      		NULL,
      		0,
      		NULL,
      		NULL);
      Here the work group size (the local_work_size argument) is set to NULL.
    • Manually
      The second way is to take the maximum work group size from the device info and fill each work group with as many work items as possible. As I understand it, parallelism is not guaranteed among work groups, whereas among the work items of one group it is. So the main goal is to have as few work groups as possible and, on the other hand, to maximize the number of work items in each work group.

      Code :
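      /* Dimension 0: largest divisor of 1024 that fits in one work group. */
      for (i = (int)deviceInfo.maxWorkGroupSize; i > 0; i--) {
      	if (1024 % i == 0) {
      		res2 = i;
      		break;
      	}
      }

      /* Dimension 1: largest divisor of 1024 that keeps res2 * res1
         within the maximum work group size. */
      for (i = ((int)deviceInfo.maxWorkGroupSize) / res2; i > 0; i--) {
      	if (1024 % i == 0) {
      		res1 = i;
      		break;
      	}
      }

      size_t globalThreads[] = { 1024, 1024 };
      size_t localThreads[] = { res2, res1 };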
       
      	cl_event ndrEvt;
      	status = clEnqueueNDRangeKernel(
      		commandQueue,
      		kernl,
      		2,
      		NULL,
      		globalThreads,
      		localThreads,
      		0,
      		NULL,
      		&ndrEvt);

      Another way is to use CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, though I'm not sure how it really works. I've implemented it like this, but I'm still not sure about it:


      Code :
       
      clGetKernelWorkGroupInfo(kernl,
      		devices[0],
      		CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
      		sizeof(size_t),
      		&preferredGroupSize,
      		NULL);
      /* maxItems is set elsewhere (not shown in this snippet). */
      size_t globalThreads[] = { 1024, 1024 };
      size_t localThreads[] = { maxItems / preferredGroupSize, preferredGroupSize };
       
      	cl_event ndrEvt;
      	status = clEnqueueNDRangeKernel(
      		commandQueue,
      		kernl,
      		2,
      		NULL,
      		globalThreads,
      		localThreads,
      		0,
      		NULL,
      		&ndrEvt);
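
    The divisor search from the "Manually" approach above can be pulled into a small helper. This is only a sketch: the names largest_divisor_at_most and pick_local_size_2d are made up for illustration, and it assumes the only constraints are that each local size divides the matching global size and that their product stays within the maximum work group size. It does not check the per-dimension limits reported by CL_DEVICE_MAX_WORK_ITEM_SIZES.

```c
#include <assert.h>
#include <stddef.h>

/* Largest d in [1, limit] that divides n evenly (1 always qualifies). */
size_t largest_divisor_at_most(size_t n, size_t limit)
{
    assert(n > 0 && limit > 0);
    size_t d = limit < n ? limit : n;
    while (n % d != 0)
        --d;
    return d;
}

/* Hypothetical helper: fill local[0..1] so that local[i] divides global[i]
   and local[0] * local[1] <= max_wg. Dimension 0 is filled first, so the
   result is greedy rather than "optimal" in any performance sense. */
void pick_local_size_2d(const size_t global[2], size_t max_wg, size_t local[2])
{
    local[0] = largest_divisor_at_most(global[0], max_wg);
    local[1] = largest_divisor_at_most(global[1], max_wg / local[0]);
}
```

    For a 1024*1024 global size and a maximum work group size of 256 this gives { 256, 1 }, the same values the two loops above produce.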



    https://software.intel.com/sites/pro...ns_Summary.htm

  2. #2
    Senior Member
    Join Date
    Dec 2011
    Posts
    170
    I'd love to be proven wrong, but in my opinion and based on my experience, it's a black art.

    It varies by hardware vendor, and I've even seen cases where non-multiples of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE are faster.

    The only consistent way I've seen is to try them all and benchmark.

    These days, using NULL is usually not a ton worse than the optimal size (a few years ago it could be much worse).

    So you might consider only hardcoding the work group size for kernels that are designed for a particular size (e.g., ones that use shared local memory for caching).
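
    A rough sketch of the "try them all and benchmark" idea for a 2D range, assuming the candidates are every pair of local sizes that divides the global size and fits within the maximum work group size. LocalSize, enumerate_local_sizes, pick_fastest and the time_kernel callback are hypothetical names; in a real program the callback would enqueue the kernel with the given local size and return the measured time (for example from event profiling).

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t x, y; } LocalSize;

/* Hypothetical callback: runs the kernel with local size ls, returns a time. */
typedef double (*TimeKernelFn)(LocalSize ls, void *ctx);

/* Write every (x, y) with gx % x == 0, gy % y == 0 and x * y <= max_wg
   into out, up to cap entries. Returns the number of entries written. */
size_t enumerate_local_sizes(size_t gx, size_t gy, size_t max_wg,
                             LocalSize *out, size_t cap)
{
    size_t count = 0;
    for (size_t x = 1; x <= gx && x <= max_wg; ++x) {
        if (gx % x != 0)
            continue;
        for (size_t y = 1; y <= gy && x * y <= max_wg; ++y) {
            if (gy % y != 0)
                continue;
            if (count == cap)
                return count;
            out[count].x = x;
            out[count].y = y;
            ++count;
        }
    }
    return count;
}

/* Time each candidate once and return the one with the smallest time. */
LocalSize pick_fastest(const LocalSize *cands, size_t n,
                       TimeKernelFn time_kernel, void *ctx)
{
    assert(n > 0);
    size_t best = 0;
    double best_t = time_kernel(cands[0], ctx);
    for (size_t i = 1; i < n; ++i) {
        double t = time_kernel(cands[i], ctx);
        if (t < best_t) {
            best_t = t;
            best = i;
        }
    }
    return cands[best];
}
```

    A single timing per candidate is noisy, so a real benchmark would probably repeat each run a few times before comparing.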

  3. #3
    Newbie
    Join Date
    Jul 2014
    Posts
    2
    My intention is to assign work items in a hardware-independent way... Could using NULL be the best solution?

  4. #4
    Senior Member
    Join Date
    Dec 2011
    Posts
    170
    For many applications, yes. You can certainly try to write a function that calculates an optimal work group size, but it will be a challenge. Alternatively, you can benchmark all sizes on the user's machine and remember the fastest one (but run the test again if the hardware or driver changes).
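
    One way to "remember" a benchmarked size and notice when it has gone stale is to store it next to a key built from the device and driver. This is a sketch under assumptions: TunedSize, make_tuning_key and tuned_size_is_current are made-up names, and the two strings are assumed to come from clGetDeviceInfo with CL_DEVICE_NAME and CL_DRIVER_VERSION.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* A saved benchmark result plus the device/driver it was measured on. */
typedef struct {
    char   key[256];
    size_t local[2];
} TunedSize;

/* Build "device|driver" into key (truncated if it does not fit). */
void make_tuning_key(const char *device_name, const char *driver_version,
                     char *key, size_t key_len)
{
    assert(key != NULL && key_len > 0);
    snprintf(key, key_len, "%s|%s", device_name, driver_version);
}

/* Nonzero if the saved result was measured on this device and driver;
   zero means the benchmark should be run again. */
int tuned_size_is_current(const TunedSize *saved, const char *device_name,
                          const char *driver_version)
{
    char key[sizeof saved->key];
    make_tuning_key(device_name, driver_version, key, sizeof key);
    return strcmp(saved->key, key) == 0;
}
```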
