
Thread: No speed up from using 2 GPUs

1. #1 (Junior Member)

    No speed up from using 2 GPUs

    I just got 2 new GPUs yesterday, both NVIDIA C2070s. I wrote a simple program to compare the runtime of using 1 GPU versus 2 GPUs. Surprisingly, 2 GPUs don't give me any speedup. Basically, I have 2 kernels, each with its own independent inputs and outputs. I ran several variations of the number of contexts and command queues; the command queues are always in-order. These are the results:

    1 command queue on 1 device
    total time: 558,866 microseconds

    2 command queues on 1 context on 1 device
    (run kernel A on command queue A; run kernel B on command queue B)
    total time: 717,828 microseconds

    2 command queues on 1 context on 2 devices
    total time: 826,846 microseconds

    2 command queues on 2 contexts on 2 devices
    (run kernel A on command queue A which is on context A that include only device A; run kernel B on command queue B which is on context B that include only device B)
    total time: 519,748 microseconds

    Running 1 kernel by itself takes 198,018 microseconds (measured from when the kernel starts running on the GPU until it finishes; the CPU side is not involved in this number).

    Can anyone explain what's going on? I expected some speedup from using 2 GPUs, but apparently there is none.

2. #2 (david.garcia, Senior Member)

    Re: No speed up from using 2 GPUs

    How are you measuring the time? Are you executing the same amount of work on the 1-device case and the 2-device case? I.e. if you are running 100 work-items for the 1-device example, are you then running 50 work-items per device in the two-device example? Is it possible that your execution time is bandwidth bound rather than ALU bound?
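    For example, splitting one kernel's 100 work-items 50/50 across two devices could look roughly like this (an illustrative, untested sketch; queueA and queueB are placeholder queues on the two devices, and global work offsets require OpenCL 1.1 or later):
    Code :
        size_t total = 100, half = total / 2, offsetB = half;
        // first half of the NDRange on device A, second half on device B
        clEnqueueNDRangeKernel(queueA, kernel, 1, NULL, &half, NULL, 0, NULL, NULL); // items 0..49
        clEnqueueNDRangeKernel(queueB, kernel, 1, &offsetB, &half, NULL, 0, NULL, NULL); // items 50..99
        clFlush(queueA); // submit to both devices so they can run at the same time
        clFlush(queueB);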

3. #3 (Junior Member, original poster)

    Re: No speed up from using 2 GPUs

    Quote Originally Posted by david.garcia
    How are you measuring the time?
    For overall time, I use gettimeofday(&time, NULL). I make one call before creating the command queue and another after the output-buffer reads are done. For the kernel runtime on the GPU, I use:
    Code :
    cl_ulong gpu1, gpu2; // profiling timestamps are in nanoseconds
    clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, NULL, 0, NULL, &event);
    clWaitForEvents(1, &event); // profiling info is valid only after the command completes
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(gpu1), &gpu1, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(gpu2), &gpu2, NULL);
    time = gpu2 - gpu1; // nanoseconds

    Quote Originally Posted by david.garcia
    Are you executing the same amount of work on the 1-device case and the 2-device case?
    I execute 2 kernels, each with 100 work-items. When I use 1 device, I run both kernels on that device (100 + 100 work-items). When I use 2 devices, I run one kernel on each device (100 work-items on each).

    Quote Originally Posted by david.garcia
    Is it possible that your execution time is bandwidth bound rather than ALU bound?
    I expect the buffer reads and writes to be bandwidth bound, but the kernel execution shouldn't be. These are the runtimes of the buffer write, kernel execution, and buffer read from clGetEventProfilingInfo:
    write: 1 microsecond
    run: 198,026 microseconds
    read: 80 microseconds

    These numbers are for kernel A; kernel B takes about the same time, and the pattern holds across all variations. You can see that almost all of the time is spent running the kernel. In the last variation, with 2 command queues on 2 different contexts associated with 2 different devices, the 2 kernels should run concurrently, since I issue the two clEnqueueNDRangeKernel calls one right after the other and clFlush both queues. So I expect the two 198,026-microsecond kernel runtimes to overlap.

    It is also weird that the buffer write takes only 1 microsecond.

4. #4 (david.garcia, Senior Member)

    Re: No speed up from using 2 GPUs

    These are the runtime of read buffer, run kernel, and write buffer using clGetEventProfilingInfo:
    write: 1 microseconds
    run: 198026 microseconds
    read: 80 microseconds

    You can see that I spend most of the time on running kernel.
    That's not so clear to me. You are apparently measuring the time it takes to execute clEnqueueReadBuffer()/clEnqueueWriteBuffer(), which is not the same as the time it takes to actually read or write a buffer. If I may use an analogy, it's the difference between the time it takes to order a pizza and the time it takes to actually bake the pizza.

    clGetEventProfilingInfo() is the right way to do all time measurements.
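    For example, measuring how long a read actually takes on the device could look roughly like this (a sketch with placeholder names; the queue must have been created with CL_QUEUE_PROFILING_ENABLE):
    Code :
        cl_event ev;
        cl_ulong t0, t1; // profiling timestamps are reported in nanoseconds
        clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0, nbytes, host_ptr, 0, NULL, &ev);
        clWaitForEvents(1, &ev); // profiling info is valid only once the command has completed
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
        printf("actual read: %llu microseconds\n", (unsigned long long)((t1 - t0) / 1000));
        clReleaseEvent(ev);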

    In the last variation that I have 2 command queues on 2 different contexts associated to 2 different devices, the 2 kernels should run concurrently since I put clEnqueueNDRangeKernel one right after another with clFlush for both of them
    It would be great if you could show us the whole source code to understand what's going on.

5. #5 (Junior Member, original poster)

    Re: No speed up from using 2 GPUs

    Quote Originally Posted by david.garcia
    That's not so clear to me. You are apparently measuring the time it takes to execute clEnqueueReadBuffer()/clEnqueueWriteBuffer(), which is not the same as the time it takes to actually read or write a buffer. If I may use an analogy, it's the difference between the time it takes to order a pizza and the time it takes to actually bake the pizza.

    clGetEventProfilingInfo() is the right way to do all time measurements.
    I did use clGetEventProfilingInfo() to measure those times. It was only for the total time that I used gettimeofday().

    Here is my code.
    1 command queue on 1 device:
    Code :
        ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
        cxGPUContext = clCreateContext(0, 1, &device_id, NULL, NULL, &ciErrNum);
     
        gettimeofday(&start, NULL);
     
    commandQueue = clCreateCommandQueue(cxGPUContext, device_id, CL_QUEUE_PROFILING_ENABLE, &ciErrNum); // profiling flag needed for clGetEventProfilingInfo
        cl_program cpProgram = clCreateProgramWithSource(cxGPUContext,1, (const char **) &ProgramSource, NULL, &ciErrNum);
        ciErrNum = clBuildProgram(cpProgram, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
        kernel = clCreateKernel(cpProgram, "hello", &ciErrNum);
        input = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, DATA_SIZE * sizeof(double), NULL,NULL);
        output = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE,  DATA_SIZE * sizeof(double), NULL,NULL);
        clEnqueueWriteBuffer(commandQueue, input, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
        global=DATA_SIZE;
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(commandQueue, output, CL_FALSE, 0, DATA_SIZE * sizeof(double), results, 0, NULL, &event3);
     
        cl_program cpProgram2 = clCreateProgramWithSource(cxGPUContext,1, (const char **) &ProgramSource2, NULL, &ciErrNum);
        ciErrNum = clBuildProgram(cpProgram2, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
        kernel2 = clCreateKernel(cpProgram2, "hello", &ciErrNum);
        input2 = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, DATA_SIZE * sizeof(double), NULL,NULL);
        output2 = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE,  DATA_SIZE * sizeof(double), NULL,NULL);
        clEnqueueWriteBuffer(commandQueue, input2, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, NULL);
        clSetKernelArg(kernel2, 0, sizeof(cl_mem), &input2);
        clSetKernelArg(kernel2, 1, sizeof(cl_mem), &output2);
        clEnqueueNDRangeKernel(commandQueue, kernel2, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(commandQueue, output2, CL_TRUE, 0, DATA_SIZE * sizeof(double), results2, 0, NULL, &event6);
        gettimeofday(&end, NULL);
    2 command queues on 1 context on 2 devices:
    Code :
        ciErrNum = clGetDeviceIDs( cpPlatform, CL_DEVICE_TYPE_GPU, 0, NULL, &device_count );
        cl_device_id* device_ids = new cl_device_id[ device_count ];
        ciErrNum = clGetDeviceIDs( cpPlatform, CL_DEVICE_TYPE_GPU, device_count, device_ids, &device_count );
        cxGPUContext = clCreateContext(0, device_count, device_ids, NULL, NULL, &ciErrNum); // 1 context on all devices
     
        gettimeofday(&start, NULL);
     
        // queue1 for kernel1
        commandQueue = clCreateCommandQueue(cxGPUContext, device_ids[0], CL_QUEUE_PROFILING_ENABLE, &ciErrNum); // queue1 on device1
        cl_program cpProgram = clCreateProgramWithSource(cxGPUContext,1, (const char **) &ProgramSource, NULL, &ciErrNum);
        ciErrNum = clBuildProgram(cpProgram, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
        kernel = clCreateKernel(cpProgram, "hello", &ciErrNum);
        input = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, DATA_SIZE * sizeof(double), NULL,NULL);
        output = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY,  DATA_SIZE * sizeof(double), NULL,NULL);
        clEnqueueWriteBuffer(commandQueue, input, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, &event5);
        clFlush(commandQueue);
     
        // queue2 for kernel2
        commandQueue2 = clCreateCommandQueue(cxGPUContext, device_ids[1], CL_QUEUE_PROFILING_ENABLE, &ciErrNum); // queue2 on device2
        cl_program cpProgram2 = clCreateProgramWithSource(cxGPUContext,1, (const char **) &ProgramSource2, NULL, &ciErrNum);
        ciErrNum = clBuildProgram(cpProgram2, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
        kernel2 = clCreateKernel(cpProgram2, "hello", &ciErrNum);
        input2 = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, DATA_SIZE * sizeof(double), NULL,NULL);
        output2 = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY,  DATA_SIZE * sizeof(double), NULL,NULL);
        clEnqueueWriteBuffer(commandQueue2, input2, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, &event6);
        clFlush(commandQueue2);
     
        // run kernel1
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, NULL, 0, NULL, &event3);
        clFlush(commandQueue);
     
        // run kernel2
        clSetKernelArg(kernel2, 0, sizeof(cl_mem), &input2);
        clSetKernelArg(kernel2, 1, sizeof(cl_mem), &output2);
        clEnqueueNDRangeKernel(commandQueue2, kernel2, 1, NULL, &global, NULL, 0, NULL, &event4);
        clFlush(commandQueue2);
     
        // read buffer from kernel1
        clEnqueueReadBuffer(commandQueue, output, CL_FALSE, 0, DATA_SIZE * sizeof(double), results, 0, NULL, &event1);
        clFlush(commandQueue);
     
        // read buffer from kernel2
        clEnqueueReadBuffer(commandQueue2, output2, CL_FALSE, 0, DATA_SIZE * sizeof(double), results2, 0, NULL, &event2);
        clFlush(commandQueue2);
     
        clFinish(commandQueue);
        clFinish(commandQueue2);
    2 command queues on 2 contexts on 2 devices:
    Code :
        ciErrNum = clGetDeviceIDs( cpPlatform, CL_DEVICE_TYPE_GPU, 0, NULL, &device_count );
        cl_device_id* device_ids = new cl_device_id[ device_count ];
        ciErrNum = clGetDeviceIDs( cpPlatform, CL_DEVICE_TYPE_GPU, device_count, device_ids, &device_count );
        cxGPUContext = clCreateContext(0, 1, &device_ids[0], NULL, NULL, &ciErrNum); // context1 on device1
        cxGPUContext2 = clCreateContext(0, 1, &device_ids[1], NULL, NULL, &ciErrNum); // context2 on device2
     
        gettimeofday(&start, NULL);
     
        // queue1 for kernel1
        commandQueue = clCreateCommandQueue(cxGPUContext, device_ids[0], CL_QUEUE_PROFILING_ENABLE, &ciErrNum); // queue1 on context1 on device1
        cl_program cpProgram = clCreateProgramWithSource(cxGPUContext,1, (const char **) &ProgramSource, NULL, &ciErrNum);
        ciErrNum = clBuildProgram(cpProgram, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
        kernel = clCreateKernel(cpProgram, "hello", &ciErrNum);
        input = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, DATA_SIZE * sizeof(double), NULL,NULL);
        output = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY,  DATA_SIZE * sizeof(double), NULL,NULL);
        clEnqueueWriteBuffer(commandQueue, input, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, &event5);
        clFlush(commandQueue);
     
        // queue2 for kernel2
        commandQueue2 = clCreateCommandQueue(cxGPUContext2, device_ids[1], CL_QUEUE_PROFILING_ENABLE, &ciErrNum); // queue2 on context2 on device2
    cl_program cpProgram2 = clCreateProgramWithSource(cxGPUContext2, 1, (const char **) &ProgramSource2, NULL, &ciErrNum); // must be cxGPUContext2: kernel2 runs on commandQueue2, which belongs to context2
        ciErrNum = clBuildProgram(cpProgram2, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
        kernel2 = clCreateKernel(cpProgram2, "hello", &ciErrNum);
    input2 = clCreateBuffer(cxGPUContext2, CL_MEM_READ_ONLY, DATA_SIZE * sizeof(double), NULL, NULL); // buffers for kernel2 must also be created in context2
    output2 = clCreateBuffer(cxGPUContext2, CL_MEM_WRITE_ONLY, DATA_SIZE * sizeof(double), NULL, NULL);
        clEnqueueWriteBuffer(commandQueue2, input2, CL_FALSE, 0, sizeof(double) * DATA_SIZE, inputData, 0, NULL, &event6);
        clFlush(commandQueue2);
     
        // run kernel1
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, NULL, 0, NULL, &event3);
        clFlush(commandQueue);
     
        // run kernel2
        clSetKernelArg(kernel2, 0, sizeof(cl_mem), &input2);
        clSetKernelArg(kernel2, 1, sizeof(cl_mem), &output2);
        clEnqueueNDRangeKernel(commandQueue2, kernel2, 1, NULL, &global, NULL, 0, NULL, &event4);
        clFlush(commandQueue2);
     
        // read buffer from kernel1
        clEnqueueReadBuffer(commandQueue, output, CL_FALSE, 0, DATA_SIZE * sizeof(double), results, 0, NULL, &event1);
        clFlush(commandQueue);
     
        // read buffer from kernel2
        clEnqueueReadBuffer(commandQueue2, output2, CL_FALSE, 0, DATA_SIZE * sizeof(double), results2, 0, NULL, &event2);
        clFlush(commandQueue2);
     
        clFinish(commandQueue);
        clFinish(commandQueue2);

    For measuring the read-buffer, run-kernel, and write-buffer times, I do something like this:
    Code :
    cl_ulong time1, time2; // nanoseconds
    clWaitForEvents(1, &eventX); // make sure the command has completed first
    clGetEventProfilingInfo(eventX, CL_PROFILING_COMMAND_START, sizeof(time1), &time1, NULL);
    clGetEventProfilingInfo(eventX, CL_PROFILING_COMMAND_END, sizeof(time2), &time2, NULL);
    printf("eventX: %llu microseconds\n", (unsigned long long)((time2 - time1) / 1000));
    Please help! Let me know if you need more information.

6. #6 (david.garcia, Senior Member)

    Re: No speed up from using 2 GPUs

    Can I suggest doing all the clCreateXxx() calls as well as clBuildProgram() at the beginning of the code, then doing the actual clEnqueueXxx() calls? clBuildProgram() in particular is notoriously expensive and executing it between your first and your second call to clEnqueueNDRangeKernel() may be eliminating all possibility of concurrency between the two devices.
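    Roughly this structure (an untested sketch with placeholder names; the queues, kernels, and buffers are assumed to have been created in the setup phase):
    Code :
        // setup phase, done once before timing: clGetDeviceIDs, clCreateContext,
        // clCreateCommandQueue (with CL_QUEUE_PROFILING_ENABLE), clCreateProgramWithSource,
        // clBuildProgram, clCreateKernel, clCreateBuffer, clSetKernelArg for both devices

        // timed phase: nothing but enqueues
        gettimeofday(&start, NULL);
        clEnqueueWriteBuffer(queue1, input1, CL_FALSE, 0, nbytes, inputData, 0, NULL, NULL);
        clEnqueueWriteBuffer(queue2, input2, CL_FALSE, 0, nbytes, inputData, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue1, kernel1, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue2, kernel2, 1, NULL, &global, NULL, 0, NULL, NULL);
        clFlush(queue1); // push the work to both devices so they run concurrently
        clFlush(queue2);
        clEnqueueReadBuffer(queue1, output1, CL_FALSE, 0, nbytes, results1, 0, NULL, NULL);
        clEnqueueReadBuffer(queue2, output2, CL_FALSE, 0, nbytes, results2, 0, NULL, NULL);
        clFinish(queue1); // wait for both devices to drain
        clFinish(queue2);
        gettimeofday(&end, NULL);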

7. #7 (Junior Member, original poster)

    Re: No speed up from using 2 GPUs

    Okay. I'll try that, and I'll let you know the result.

8. #8 (Junior Member, original poster)

    Re: No speed up from using 2 GPUs

    Quote Originally Posted by david.garcia
    Can I suggest doing all the clCreateXxx() calls as well as clBuildProgram() at the beginning of the code, then doing the actual clEnqueueXxx() calls?
    That works! Now it runs faster on 2 GPUs. Thank you so much.

