Page 1 of 2 12 LastLast
Results 1 to 10 of 13

Thread: Problems with multiple commandQueues

  1. #1
    Junior Member
    Join Date
    Nov 2010
    Posts
    14

    Problems with multiple commandQueues

    Hello together,

    I'm struggling with multiple commandQueues using OpenCL C++ Bindings.

    In a multi-GPU environments, I create a CommandQueue for each device. Each CommandQueue "contains" the following commands:
    1. Non-blocking enqueueWriteBuffer for input data 1
    2. Non-blocking enqueueWriteBuffer for input data 2
    3. Non-blocking enqueueWriteBuffer for input data 3
    4. Kernel execution
    5. Non-blocking enqueueReadBuffer for output data

    Input data 1-3 are written to the same buffer. Each device gets different data, but the same kernel is executed.

    When I run my code on two devices, I have the following ordering of OpenCL-methods, according to NVIDIA's Visual Profiler:

    1. Non-blocking enqueueWriteBuffer for input data 1 to device 0 (CmdQueue 0)
    2. Non-blocking enqueueWriteBuffer for input data 2 to device 0 (CmdQueue 0)
    3. Non-blocking enqueueWriteBuffer for input data 3 to device 0 (CmdQueue 0)
    4. Kernel execution on device 0 (CmdQueue 0)
    5. Non-blocking enqueueReadBuffer for output data from device 0 (CmdQueue 0)
    6. Non-blocking enqueueWriteBuffer for input data 4 to device 1 (CmdQueue 1)
    7. Non-blocking enqueueWriteBuffer for input data 5 to device 1 (CmdQueue 1)
    8. Non-blocking enqueueWriteBuffer for input data 6 to device 1 (CmdQueue 1)
    9. Kernel execution on device 1 (CmdQueue 1)
    10. Non-blocking enqueueReadBuffer for output data from device 1 (CmdQueue 1)

    However, this increases the total time of my application on 2 Devices compared to executing it on 1 Device.

    I would like the ordering to be as follows:
    1. Non-blocking enqueueWriteBuffer for input data 1 to device 0 (CmdQueue 0)
    2. Non-blocking enqueueWriteBuffer for input data 2 to device 0 (CmdQueue 0)
    3. Non-blocking enqueueWriteBuffer for input data 3 to device 0 (CmdQueue 0)
    4. Non-blocking enqueueWriteBuffer for input data 4 to device 1 (CmdQueue 1)
    5. Non-blocking enqueueWriteBuffer for input data 5 to device 1 (CmdQueue 1)
    6. Non-blocking enqueueWriteBuffer for input data 6 to device 1 (CmdQueue 1)
    7. Execute kernel on device 0 as soon as all data is available (CmdQueue 0)
    8. Execute kernel on device 1 as soon as all data is available (CmdQueue 1)
    9. Non-blocking enqueueReadBuffer for output data from device 0 as soon as data is calculated (CmdQueue 0)
    10. Non-blocking enqueueReadBuffer for output data from device 1 as soon as data is calculated(CmdQueue 1)

    I tried to synchronize the commandQueues with the help of cl::Event as follows:

    Code :
    std::vector<cl::Event> events;
    events.push_back(f()); //f is the kernel functor

    In the method, where I download the data:
    Code :
    cl::WaitForEvents(events);
    //enqueueReadBuffer...

    This does not seem to work, since I still get the above mentioned unwanted ordering.

    How can I get the ordering I would like to have?

    Any help is greatly appreciated!

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Problems with multiple commandQueues

    Could there be something else at play? It should not be necessary to try to force the ordering of commands to be different: since you have two command queues they should run independently of each other and progress more or less at the same time.

    Are you absolutely certain that the two command queues are associated with different devices? Can you double and triple check? clGetCommandQueueInfo() with CL_QUEUE_DEVICE should give us the answer.

    The other thing I can think of would be some blocking call like WaitForEvents(), a blocking read, clFinish() or similar after step 4 or 5. Can you verify through your profiler that commands 1-10 are CL_QUEUED one right after another without having to wait for any of the previous commands to be CL_COMPLETE?

    Also, I'm a bit confused by descriptions like "enqueueWriteBuffer for input data 1". Do you mean that "data 1" is one buffer object and "data 2" is a different buffer object?
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3
    Junior Member
    Join Date
    Nov 2010
    Posts
    14

    Re: Problems with multiple commandQueues

    Hi david.garcia,

    thanks for the quick reply!

    Are you absolutely certain that the two command queues are associated with different devices? Can you double and triple check? clGetCommandQueueInfo() with CL_QUEUE_DEVICE should give us the answer.
    I checked this with clGetCommandQueueInfo, and I am certain, that each command queue is associated with a different device.

    The other thing I can think of would be some blocking call like WaitForEvents(), a blocking read, clFinish() or similar after step 4 or 5. Can you verify through your profiler that commands 1-10 are CL_QUEUED one right after another without having to wait for any of the previous commands to be CL_COMPLETE?
    After each download (enqueueReadBuffer) there is a clFinish for each command queue. This clFinish is necessary to be able to merge the data (when using multiple devices, the input data is split up and after calculation has to be set together again). Do you think the problem could be here?

    I could not verify that
    commands 1-10 are CL_QUEUED one right after another without having to wait for any of the previous commands to be CL_COMPLETE
    , because I could not find such a small grained summary in NVIDIA Visual Profiler.

    Also, I'm a bit confused by descriptions like "enqueueWriteBuffer for input data 1". Do you mean that "data 1" is one buffer object and "data 2" is a different buffer object?
    data 1 to 3 are data (simple arrays) created on the host, which are copied to the same buffer on the device. I hope this made it more clear.

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Problems with multiple commandQueues

    After each download (enqueueReadBuffer) there is a clFinish for each command queue. This clFinish is necessary to be able to merge the data (when using multiple devices, the input data is split up and after calculation has to be set together again). Do you think the problem could be here?
    If I understood correctly, the code looks roughly like this?

    Code :
    for(int i = 0; i < 2; ++i)
    {
        clEnqueueWriteBuffer(queue[i], input1[i], ...);
        clEnqueueWriteBuffer(queue[i], input2[i], ...);
        clEnqueueWriteBuffer(queue[i], input3[i], ...);
     
        // calls to clSetKernelArg() here
     
        clEnqueueNDRangeKernel(queue[i], kernel, ...);
        clEnqueueReadBuffer(queue[i], output, ...);
        clFinish(queue[i]);
    }

    If that's the case, the call to clFinish() in the first iteration of the loop is forcing all commands in queue[0] to be complete before any of the commands for queue[1] are even enqueued.

    Sorry if I misunderstood something. Please correct me if I'm wrong.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  5. #5
    Junior Member
    Join Date
    Nov 2010
    Posts
    14

    Re: Problems with multiple commandQueues

    Your posted code is not exactly correct.

    It is more like this:

    Code :
    for(int i = 0; i < 2; ++i)
    {
        clEnqueueWriteBuffer(queue[i], input1[i], ...);
        clEnqueueWriteBuffer(queue[i], input2[i], ...);
        clEnqueueWriteBuffer(queue[i], input3[i], ...);
    }
     
    for(int i = 0; i < 2; ++i)
    {
    // calls to clSetKernelArg() here
        clEnqueueNDRangeKernel(queue[i], kernel, ...);
    }
     
    for(int i = 0; i<2; ++i)
    {
        clEnqueueReadBuffer(queue[i], output, ...);
        clFinish(queue[i]);
    }

    In your version, it is clear to me, that first the 0th Queue is completed before any command of the 1st queue is enqueued. That's what makes me wonder that much, that my above code behaves the same way...

  6. #6
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Problems with multiple commandQueues

    Thanks! Your app looks great from the OpenCL standard point of view. There must be something else going on inside the nVidia driver that is deciding to serialize the execution.

    I would contact them with the version of the drivers you are using, the exact model of the GPUs you are using, whether they are connected in SLI and the OS in your system.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  7. #7
    Junior Member
    Join Date
    Nov 2010
    Posts
    14

    Re: Problems with multiple commandQueues

    I just looked over this thread again, and saw a small mistake made by me.

    In my version of the "rough" code the last for-loop is not exactly the same how it is in my programm. In my programm this for-loop is split up into two loops.

    So, the complete correct rough code:

    Code :
    for(int i = 0; i < 2; ++i)
    {
        clEnqueueWriteBuffer(queue[i], input1[i], ...);
        clEnqueueWriteBuffer(queue[i], input2[i], ...);
        clEnqueueWriteBuffer(queue[i], input3[i], ...);
    }
     
    for(int i = 0; i < 2; ++i)
    {
    // calls to clSetKernelArg() here
        clEnqueueNDRangeKernel(queue[i], kernel, ...);
    }
     
    for(int i = 0; i<2; ++i)
    {
        clEnqueueReadBuffer(queue[i], output, ...);
    }
     
    for(int i = 0; i<2; ++i)
    {
        clFinish(queue[i]);
    }

    But since clEnqueueReadBuffer is non-blocking in my programm, there should be no difference from the OpenCL-spec view, I think!?!

  8. #8
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Problems with multiple commandQueues

    But since clEnqueueReadBuffer is non-blocking in my programm, there should be no difference from the OpenCL-spec view, I think!?!
    Even if clEnqueueReadBuffer() was blocking there would be no difference: the code is correct either way.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  9. #9
    Junior Member
    Join Date
    May 2011
    Posts
    14

    Re: Problems with multiple commandQueues

    If the queue execution is lazy then there might be nothing until the finish to trigger anything to be executed at all. By doing a read and then a blocking flush there is nothing to trigger the second queue's execution until the first has completed.

    Does tweaking it to the following help?

    for(int i = 0; i<2; ++i)
    {
    clEnqueueReadBuffer(queue[i], output, ...);
    clFlush(queue[i]);
    }

    for(int i = 0; i<2; ++i)
    {
    clFinish(queue[i]);
    }

    The spec doesn't say that anything has to leave the queue until a flush.

  10. #10
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Problems with multiple commandQueues

    I feel dumb now LeeHowes is right.

    Alternatively you could do this:

    Code :
    cl_event final_event[2];
     
    for(int i = 0; i < 2; ++i)
    {
        clEnqueueWriteBuffer(queue[i], input1[i], ...);
        clEnqueueWriteBuffer(queue[i], input2[i], ...);
        clEnqueueWriteBuffer(queue[i], input3[i], ...);
     
        // calls to clSetKernelArg() here
     
        clEnqueueNDRangeKernel(queue[i], kernel, ...);
        clEnqueueReadBuffer(queue[i], output, ..., &final_event[i]);
    }
    clWaitForEvents(2, final_event);

    clWaitForEvents() performs an implicit flush.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

Page 1 of 2 12 LastLast

Similar Threads

  1. regarding some problems
    By chiraga213 in forum OpenCL
    Replies: 2
    Last Post: 01-29-2013, 11:49 PM
  2. How to do Multiple Meshes? Multiple Programs?
    By amadeustesla in forum OpenGL ES 2X - for programmable 3D graphics pipelines
    Replies: 1
    Last Post: 11-04-2010, 11:45 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •