
Thread: Subdividing global Workgroup Size

  1. #1
    Senior Member
    Join Date
    Oct 2012
    Posts
    165

    Subdividing global Workgroup Size

    Hi,

    I am trying to automatically subdivide my global work size (GWS) into smaller pieces using the global work offset.
    Here is an example:
    Code :
    size_t szGWS[3] = {1024, 1024, 1};
    size_t szLWS[3] = {256, 1, 1};
    size_t szGWO[3] = {0, 0, 0};
    if (1024 * 1024 * uiWIComplexity > device.AvailableFlops) // test whether the problem needs subdividing
    {
      int sub = 3;
      for (int i = 0; i < sub; i++)
      {
         szGWS[1] = 1024 / sub;     // rows per sub-kernel
         szGWO[1] = 1024 * i / sub; // first row of sub-kernel i
         clEnqKernel(..., szGWO, szGWS, szLWS,...); // one enqueue per slice
      }
    }

    I think the indexing inside my kernel works properly, but synchronization fails.
    I have a synchronized queue, which means all enqueued kernels should synchronize by themselves, correct?

    But if I do the following:
    (1) copy values from buffer A to B in multiple sub-kernels
    (2) edit values of A in multiple sub-kernels
    (3) edit values of B in multiple sub-kernels

    my data seems corrupted.
    Does OpenCL wait for the whole of task (1) to complete before starting (2) and (3), or does it start the first part of (2) or (3) as soon as the first part of (1) is done?

    Thanks in advance,
    clint3112

  2. #2
    Senior Member
    Join Date
    Oct 2012
    Posts
    105
    As stated in the OpenCL 1.2 specification, section 5.11:

    If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order.

    For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed.

    If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A.
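    Note also that in your code 1024 is not divisible by 3: each slice gets 1024/3 = 341 rows, so the three slices cover only 1023 of the 1024 rows. If you split along a dimension whose local work size is greater than 1, each slice must also be a multiple of the local work size, otherwise clEnqueueNDRangeKernel fails with CL_INVALID_WORK_GROUP_SIZE.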
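
    A minimal sketch of one way to compute the slices so that the whole range is covered and every slice stays a multiple of the local work size. `Slice` and `split_range` are hypothetical names, not part of OpenCL, and the sketch assumes the total size along the split dimension is already a multiple of the local size. It only does the arithmetic; each resulting offset/size pair would then go into its own enqueue call.

```c
#include <assert.h>
#include <stddef.h>

/* One slice of the global range along a single dimension (hypothetical type). */
typedef struct {
    size_t offset; /* global work offset for this slice */
    size_t size;   /* global work size for this slice   */
} Slice;

/*
 * Split `total` work-items into at most `max_slices` slices whose sizes are
 * multiples of `local`. Assumes `total` is itself a multiple of `local`.
 * `out` must have room for `max_slices` entries.
 * Returns the number of slices written to `out`.
 */
static size_t split_range(size_t total, size_t local, size_t max_slices, Slice *out)
{
    assert(local > 0 && max_slices > 0 && total % local == 0);

    size_t groups = total / local;                              /* work-groups to distribute */
    size_t per_slice = (groups + max_slices - 1) / max_slices;  /* ceiling division */
    size_t count = 0;

    for (size_t start = 0; start < groups; start += per_slice) {
        size_t remaining = groups - start;
        size_t n = remaining < per_slice ? remaining : per_slice; /* last slice may be shorter */
        out[count].offset = start * local;
        out[count].size = n * local;
        count++;
    }
    return count;
}
```

    With total = 1024, local = 1 and max_slices = 3 this yields slices of 342, 342 and 340 rows, which together cover all 1024 rows.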

  3. #3
    Senior Member
    Join Date
    Oct 2012
    Posts
    165
    Thanks for your reply. That was just a quick example from memory. In my actual code I check that each slice divides evenly into local work-group sizes.
    My synchronization problem has been solved: I had missed an iteration in the for loop.

  4. #4
    Senior Member
    Join Date
    Dec 2011
    Posts
    154
    Why do you need to subdivide your work? The runtime does that automatically. You can submit any global work size and the runtime will run it in sections if needed.

  5. #5
    Senior Member
    Join Date
    Oct 2012
    Posts
    165
    I use this to automatically launch multiple kernels when the problem size is larger than the number of FLOPs the GPU can achieve in 2 seconds. This makes sure the Windows watchdog never gets triggered. It works fine, but there is a small problem with my interface that I have to deal with: with automatic subdivision you have to wait for multiple kernels, and I am passing just one event reference through the interface. But I don't think this will be a problem in the future.
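
    As a rough sketch of the budget check described above, the number of sub-kernels could be derived like this. `sub_kernel_count` and its parameters are hypothetical; the per-work-item cost and the FLOP budget (what the GPU can do in the allowed time) are estimates the application would have to supply itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of sub-kernels needed so that each launch stays within a budget.
 * work_items:     total work-items in the full problem
 * flops_per_item: estimated cost of one work-item (assumption)
 * flops_budget:   estimated FLOPs the device can do per launch (assumption)
 */
static uint64_t sub_kernel_count(uint64_t work_items,
                                 uint64_t flops_per_item,
                                 uint64_t flops_budget)
{
    assert(flops_per_item > 0 && flops_budget >= flops_per_item);

    /* How many work-items fit into one launch; at least 1 by the assert. */
    uint64_t items_per_launch = flops_budget / flops_per_item;

    /* Ceiling division: a partially filled last launch still counts. */
    return (work_items + items_per_launch - 1) / items_per_launch;
}
```

    For example, 1024*1024 work-items at an estimated 100 FLOPs each with a budget of 50,000,000 FLOPs per launch gives 3 sub-kernels. Each sub-kernel returns its own event, but on an in-order queue the event of the last sub-kernel completes only after all earlier ones have finished, so that single event could be the one passed through the interface.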
