
Thread: How to reduce the read buffer time

  1. #1

    How to reduce the read buffer time

    Hello,

    I am working on convolution of an image with a mask of size 3x3.

    I am facing a problem with reading the data back from the GPU to the CPU. The kernel itself executes in just 30 ms, but reading the data back takes more than a second for a 6000 x 6000 image.

    I am using clEnqueueReadBuffer() to get the data back. I have also tried pinned memory, but I didn't see any improvement.

    I tried reading in blocking mode (CL_TRUE); it takes a long time. I tried non-blocking mode (CL_FALSE); it is very fast, but I do not get the full image back. If I call clFinish(cqCommandQueue) after the non-blocking read, it takes the same time as the blocking read, but then I do get the full image.

    I cannot get the GPU to beat the CPU: the convolution on the CPU takes half the time of the convolution on the GPU. The problem is only in reading the data back; everything else is fine.

    Please help me if you know how to reduce it.

    Thanks in advance.

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: How to reduce the read buffer time

    The kernel itself executes in just 30 ms, but reading the data back takes more than a second for a 6000 x 6000 image.
    How are you measuring this time? Are you using the command queue's profiling mode?

    I tried non-blocking mode (CL_FALSE); it is very fast, but I do not get the full image back.
    Did you wait for the enqueue event that is returned by clEnqueueMapImage() to complete, or did you use the pointer right away without waiting? By waiting I mean calling clWaitForEvents().
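
    For example, the non-blocking pattern looks roughly like this (just a sketch; the queue, buffer, size and pointer names are placeholders, not from your code):

    Code :
    cl_event readEvent;
    // Non-blocking read: the call returns immediately, so the data in
    // outimage is NOT valid yet at this point.
    cl_int err = clEnqueueReadBuffer(queue, dstBuffer, CL_FALSE, 0,
                                     imageBytes, outimage, 0, NULL, &readEvent);

    // ... other independent work can be enqueued here ...

    // Only after waiting on the event is outimage safe to use.
    err = clWaitForEvents(1, &readEvent);
    clReleaseEvent(readEvent);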

    Generally speaking, using OpenCL on a GPU efficiently requires that the computation be long enough to compensate for the data transfers to and from the GPU. A 3x3 convolution is not very computationally intensive.

    Also, what is the pixel format of the input and output images? If you are using float4, it is worth considering something smaller, such as char4.
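
    To give a rough idea of how much the pixel format matters for transfer time, here is a quick back-of-the-envelope sketch (illustrative names only; a single-channel 6000 x 6000 image is assumed):

    Code :
    size_t pixels      = 6000u * 6000u;
    size_t bytesFloat4 = pixels * sizeof(cl_float4); // 16 B/pixel -> ~576 MB
    size_t bytesFloat  = pixels * sizeof(cl_float);  //  4 B/pixel -> ~144 MB
    size_t bytesChar4  = pixels * sizeof(cl_char4);  //  4 B/pixel -> ~144 MB
    size_t bytesChar   = pixels * sizeof(cl_uchar);  //  1 B/pixel ->  ~36 MB
    // Transferring 4x or 16x less data reduces the read-back time roughly proportionally.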
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3

    Re: How to reduce the read buffer time

    Thanks Mr. david.garcia

    I am measuring the kernel execution time using the event profiling method.

    The execution time is reported in nanoseconds. I also used a CPU timer, GetTickCount(), in Visual Studio, and got the same result.

    I did not use clEnqueueMapImage().

    Even if I use clWaitForEvents(), the read-buffer time is still high in non-blocking mode (CL_FALSE).

    As you said, I am using the float4 data type. I will change it. Can I use int instead of char?

    What filter size would you suggest? Should I use 5x5 or 7x7?

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: How to reduce the read buffer time

    If the image is 6000 x 6000 x float4, its size is at least 550 MB. One second to transfer 550 MB sounds like a lot, but the OpenCL driver may be doing more than simply copying the data.

    Have you measured how long it takes to transfer the image from the CPU to the GPU? Can you show us the source code of your program? I would like to see the API calls used to copy memory and to measure time.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  5. #5

    Re: How to reduce the read buffer time

    #include <stdio.h>
    #include <conio.h>
    #include <windows.h>
    #include <assert.h>
    #include <sys/stat.h>
    #include <stdlib.h>
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <cv.h>
    #include <cxcore.h>
    #include <highgui.h>
    #include <oclUtils.h>
    #include <string.h>
    #include <math.h>
    #include <time.h>
    #include <sys/timeb.h>

    #define MAX_SOURCE_SIZE (0x100000)

    float *srcA, *srcB; // Host buffers for OpenCL test
    float *CPUoutput, *GPUoutput; // Host buffer for GPU and CPU processing

    // OpenCL Vars
    cl_platform_id cpPlatform; // OpenCL platform
    cl_device_id cdDevice = NULL,CPU = NULL; // OpenCL device
    cl_context cxGPUContext; // OpenCL context
    cl_command_queue cqCommandQueue; // OpenCL command queue
    cl_program cpProgram; // OpenCL program
    cl_kernel ckKernel; // OpenCL kernel
    cl_mem cmDevSrcA; // OpenCL device source buffer A
    cl_mem cmDevSrcB; // OpenCL device source buffer B
    cl_mem cmDevDst; // OpenCL device destination buffer
    cl_uint ret_num_platforms;
    size_t szGlobalWorkSize[2]; // Total # of work items in the 2D range
    size_t szLocalWorkSize[2]; // # of work items in the 2D work group
    cl_event GPUevent,transferevent; // OpenCL event
    cl_ulong start; // To store start clock time
    cl_ulong end; // To store end clock time
    //cl_event ev;
    //size_t szParmDataBytes; // Byte size of context information
    //size_t szKernelLength; // Byte size of kernel code
    cl_int ciErrNum = 0; // Error check variable
    char* cPathAndName = NULL; // var for full paths to data, src, etc.
    char* cSourceCL = NULL; // Buffer to hold source for compilation
    size_t source_size;
    char *source_str;
    size_t buffer_size;
    float *img,*img1, *mask;

    // demo config variables
    int iNumElements = 0; // Length of entire array to be created (Image size)
    int iFilterWidth = 3; // Specifies the width of the Filter for convolution
    int iWidth = 0; // Image Width Holding variable
    int iHeight = 0; // Image Height Holding variable
    shrBOOL bNoPrompt = shrFALSE;
    IplImage *srcimg, *tempFrame; //Image storing elements

    //Elements helps in creating the buffers
    size_t mem_sizeImage;
    unsigned int sizeKernel;
    size_t mem_sizeKernel;


    // Forward Declarations
    // *************************************************************************************
    float* CPUConvolution(float* pfData1, float* pfData2);
    void GPUDevicePreparation();
    void GPUConvolution(float *inimage, float *mask,float *outimage);
    void Cleanup (int iExitCode);
    double getclock();
    void (*pCleanup)(int) = &Cleanup;
    void readsourceimage();

    //Function Definition

    //=====================================================================================
    // Function to read an image and convert it into a float array
    //=====================================================================================
    void readsourceimage()
    {
    srcimg = cvLoadImage("road_6000_800.jpg",0);
    tempFrame = cvLoadImage("road_6000_800.jpg",0);

    iWidth = srcimg->width;
    iHeight = srcimg->height;
    iNumElements = iWidth * iHeight;

    IplImage *dstimg = cvCreateImageHeader(cvSize(iWidth,iHeight),IPL_DEPTH_32F,1);

    cvNamedWindow ("Inputimage",1);
    cvShowImage ("Inputimage",srcimg);
    cvWaitKey();

    BYTE *inimg = (BYTE *) srcimg->imageData;


    // ----------------------- Dynamic Mem in C++ -----------------------
    img = new float [iNumElements];
    img1 = new float [iNumElements];

    for (int i=0; i< iNumElements; i++)
    {
    img[i] = (float) inimg[i];
    }

    for (int i=0; i< iNumElements; i++)
    {
    img1[i] = (float) inimg[i];
    }

    }

    //$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
    // Reading the kernel source file to access it
    //$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
    void loadprogramsource(const char *filename)
    {
    FILE *fp;
    fp = fopen(filename,"r");

    if(!fp)
    {
    fprintf(stderr,"Failed to load kernel\n");
    exit(1);
    }

    source_str = (char *)malloc(MAX_SOURCE_SIZE);
    source_size = fread(source_str,1,MAX_SOURCE_SIZE,fp);

    fclose(fp);
    }

    //-------------------------------------------------------------------------------------------
    // GPU device preparation: loading all the necessary variables and building the program
    //-------------------------------------------------------------------------------------------

    void GPUDevicePreparation()
    {
    shrLog("GPU Device Preparation for building and execution........\n");

    // Get the NVIDIA platform
    shrLog("Get the NVIDIA platform...\n");
    ciErrNum = oclGetPlatformID(&cpPlatform);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);

    //Find GPU CL device, which is really needed for processing
    shrLog("Getting the GPU device...\n");
    ciErrNum = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &cdDevice, NULL);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);

    // Create the context
    shrLog("clCreateContext...\n");
    cxGPUContext = clCreateContext(0, 1, &cdDevice, NULL, NULL, &ciErrNum);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);

    // Create a command-queue
    shrLog("clCreateCommandQueue...\n");
    cqCommandQueue = clCreateCommandQueue(cxGPUContext, cdDevice, CL_QUEUE_PROFILING_ENABLE, &ciErrNum);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);

    // Load the OpenCL kernel source from file
    const char fileName[] = "./Convolution.cl";
    loadprogramsource(fileName);

    // Create the program
    shrLog("clCreateProgramWithSource...\n");
    cpProgram = clCreateProgramWithSource(cxGPUContext, 1, (const char **)&source_str, (const size_t*)&source_size, &ciErrNum);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);

    // Build the program with 'mad' Optimization option
    //#ifdef MAC
    // char* flags = "-cl-fast-relaxed-math -DMAC";
    //#else
    char* flags = "-cl-fast-relaxed-math";
    //#endif

    shrLog("clBuildProgram...\n");
    ciErrNum = clBuildProgram(cpProgram, 1, &cdDevice, flags, NULL, NULL);

    if (ciErrNum != CL_SUCCESS)
    {
    // write out standard error, Build Log and PTX, then cleanup and exit
    shrLogEx(LOGBOTH | ERRORMSG, ciErrNum, STDERROR);
    oclLogBuildInfo(cpProgram, oclGetFirstDev(cxGPUContext));
    oclLogPtx(cpProgram, oclGetFirstDev(cxGPUContext), "oclConvolution.ptx");
    Cleanup(EXIT_FAILURE);
    }

    // Create the kernel
    shrLog("clCreateKernel (Convolution)...\n");
    ckKernel = clCreateKernel(cpProgram, "Convolution", &ciErrNum);


    }

    //
    // GPU Convolution starts in this function
    //
    void GPUConvolution(float *inimage, float *mask, float *outimage)
    {
    long ts,te,tst,tet;
    mem_sizeImage = sizeof(float) * iNumElements;
    sizeKernel = iFilterWidth * iFilterWidth;
    mem_sizeKernel = sizeof(float) * sizeKernel;
    unsigned char * out1;


    //ts = GetTickCount();
    // Allocate the OpenCL buffer memory objects for source and result on the device GMEM
    //shrLog("clCreateBuffer (SrcA, SrcB and GPUoutput in Device GMEM)...\n");
    cmDevSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, mem_sizeImage, NULL, &ciErrNum);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    cmDevSrcB = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, mem_sizeKernel, NULL, &ciErrNum);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    cmDevDst = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, mem_sizeImage, NULL, &ciErrNum);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    //te = GetTickCount();
    //printf("Create Buffer: %d [Milli sec] \n",(te - ts));

    //ts = GetTickCount();
    // Asynchronous write of data to GPU device
    //shrLog("clEnqueueWriteBuffer (SrcA and SrcB)...\n");
    ciErrNum = clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0, mem_sizeImage,(void*) inimage, 0, NULL, &GPUevent);
    ciErrNum |= clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcB, CL_FALSE, 0, mem_sizeKernel,(void*) mask, 0, NULL, &GPUevent);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    clFinish(cqCommandQueue);
    //te = GetTickCount();
    //printf("Write Buffer: %d [Milli sec] \n",(te - ts));

    //ts = GetTickCount();
    // Set the Argument values
    //shrLog("clSetKernelArg 0 - 4...\n\n");
    ciErrNum = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), &cmDevSrcA);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    ciErrNum = clSetKernelArg(ckKernel, 1, sizeof(cl_mem), &cmDevSrcB);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    ciErrNum = clSetKernelArg(ckKernel, 2, sizeof(cl_mem), &cmDevDst);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    ciErrNum = clSetKernelArg(ckKernel, 3, sizeof(cl_int), &iWidth);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    ciErrNum = clSetKernelArg(ckKernel, 4, sizeof(cl_int), &iFilterWidth);
    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    //te = GetTickCount();
    //printf("Set Kernel Arguments: %d [Milli sec] \n",(te - ts));

    //Setting the local and global parameters
    szGlobalWorkSize[0] = iWidth;
    szGlobalWorkSize[1] = iHeight;
    szLocalWorkSize[0] = 1;
    szLocalWorkSize[1] = 1;

    //ts = GetTickCount();
    //clGetEventProfilingInfo(GPUevent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong),&start,NULL);

    // Launch kernel
    //shrLog("clEnqueueNDRangeKernel (Convolution)...\n");
    //ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, szLocalWorkSize, 0, NULL, &GPUevent);
    ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, szLocalWorkSize, 0, NULL, NULL);
    oclCheckErrorEX(ciErrNum,CL_SUCCESS, pCleanup); //CL_INVALID_KERNEL,CL_SUCCESS
    //tst = GetTickCount();
    //clWaitForEvents(1,&GPUevent);
    //tet = GetTickCount();
    //clFinish(cqCommandQueue);

    //clGetEventProfilingInfo(GPUevent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong),&end,NULL);
    //te = GetTickCount();
    //printf("Launch Kernel: %d [Milli sec] \n",(te - ts));
    //printf("Event Wait Time for Kernel Execution: %d [Milli sec] \n",(tet - tst));
    //printf("Kernel Execution time: %ld [nano sec]\n",(end-start));

    //ts = GetTickCount();
    //clGetEventProfilingInfo(transferevent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong),&start,NULL);
    // Read back results and check accumulated errors
    //shrLog("clEnqueueReadBuffer (GPU Output)...\n\n");
    //ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmDevDst, CL_FALSE, 0, mem_sizeImage, outimage, 0, NULL, &transferevent);
    ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmDevDst, CL_TRUE, 0, mem_sizeImage, outimage, 0, NULL,NULL);

    //Sleep(50);

    //oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    //clWaitForEvents(1,&transferevent);

    //clGetEventProfilingInfo(transferevent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong),&end,NULL);
    //te = GetTickCount();
    //printf("Read Buffer: %d [Milli sec] \n",(te - ts));
    //printf("Data copying time back from GPU: %ld [nano sec]\n",(end-start));


    oclCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);
    //clFinish(cqCommandQueue);
    }

    //***********************************************************************
    // "CPU" Host processing Convolution function for comparison purposes
    //***********************************************************************
    float* CPUConvolution(float *inimage, float *mask)
    {
    //printf("\nEntered into the CPU Convolution Function\n");
    int k = 0;
    float *pfResult;


    // ----------------------- Dynamic Mem in C++ -----------------------
    pfResult = new float [iNumElements];

    int idx = 0;
    int jdx = 0;
    for (int i = 0; i < iHeight; i++)
    {
    for (int j = 0; j < iWidth; j++)
    {
    float sum = 0;
    for (int m = 0; m < iFilterWidth; m++)
    {
    for (int n = 0; n < iFilterWidth; n++)
    {
    idx = i - m;
    jdx = j - n;
    if (idx >= 0 && jdx >= 0)
    sum += img[(idx * iWidth) + jdx] * mask[(m * iFilterWidth) + n]; // use the bounds-checked indices idx/jdx
    }
    }
    if (sum > 255)
    sum = 255;
    if (sum < 0)
    sum = 0;
    pfResult[k] = sum;
    k++;

    }
    }

    return(pfResult);
    }

    // ***********************************************************************************************
    // Main function
    // ***********************************************************************************************
    int main(int argc, char **argv)
    {
    printf("The program execution has started(Entered into main).......\n");
    //float kernelconv[] = {1,1,1,1,1,1,0,1,0,1,1,1,-4,1,1,1,0,1,0,1,1,1,1,1,1};
    float kernelconv[] = {0,1,0,1,-4,1,0,1,0,};
    long ts,te;
    mask = kernelconv; //Creating the mask to execute the kernel

    // Calling the function to load image for processing
    readsourceimage();

    // Creating a memory location to store the output of the GPU
    GPUoutput = (float *)malloc(sizeof(cl_float) * iNumElements);

    // Calling of GPU Device Preparation function
    GPUDevicePreparation();

    ts = GetTickCount();
    // Calling of GPU convolution
    //for(int i = 0; i < 10; i++)
    //{
    GPUConvolution(img,mask,GPUoutput);
    //}
    te = GetTickCount();
    printf("\nExecution time on GPU: %d [Milli sec] \n",(te - ts));

    ts = GetTickCount();
    //Calling of CPU convolution
    //for(int i = 0; i < 10; i++)
    //{
    CPUoutput = CPUConvolution(img,mask);
    //}
    te = GetTickCount();
    printf("\nExecution time on CPU: %d [Milli sec] \n",(te - ts));

    byte *tmp1, *tmp2;

    tmp1 = new byte[iNumElements];
    tmp2 = new byte[iNumElements];

    //Retrieving the results from the GPU
    for(int i = 0;i < iNumElements; i++)
    {
    // printf("%f\t",GPUoutput[i]);

    tmp1[i] = (byte) floor(fabs(GPUoutput[i]));
    }

    srcimg->imageData = (char *) tmp1;

    srcimg->imageDataOrigin = srcimg->imageData;

    //Retrieving the results from the CPU

    for(int i = 0;i < iNumElements; i++)
    {
    //printf("%f\t",CPUoutput[i]);

    tmp2[i] = (byte) floor(fabs(CPUoutput[i]));
    }

    tempFrame->imageData = (char *) tmp2;

    tempFrame->imageDataOrigin = tempFrame->imageData;


    cvNamedWindow("GPU_output",1);
    cvNamedWindow("CPU_output",1);
    cvShowImage("GPU_output",srcimg);
    cvShowImage("CPU_output",tempFrame);

    clFinish(cqCommandQueue);
    shrLog("\nPress any Key to EXIT\n");
    cvWaitKey();
    //cvReleaseImage(&srcimg);
    //cvReleaseImage(&tempFrame);

    //shrLog(LOGFILE, 0, "%f\t", GPUoutput[i]);

    } // End of MAIN

    // Cleanup and exit code
    // *********************************************************************
    void Cleanup(int iExitCode)
    {

    // Cleanup allocated objects
    shrLog("Starting Cleanup...\n\n");
    getchar();
    if(cPathAndName)free(cPathAndName);
    if(cSourceCL)free(cSourceCL);
    if(ckKernel)clReleaseKernel(ckKernel);
    if(cpProgram)clReleaseProgram(cpProgram);
    if(cqCommandQueue)clReleaseCommandQueue(cqCommandQueue);
    if(cxGPUContext)clReleaseContext(cxGPUContext);
    if (cmDevSrcA)clReleaseMemObject(cmDevSrcA);
    if (cmDevSrcB)clReleaseMemObject(cmDevSrcB);
    if (cmDevDst)clReleaseMemObject(cmDevDst);


    // Free host memory
    free(srcA);
    free(srcB);
    free(GPUoutput);
    delete [] CPUoutput; // allocated with new[] in CPUConvolution, so use delete[]

    // finalize logs and leave
    if (bNoPrompt)
    {
    shrLogEx(LOGBOTH | CLOSELOG, 0, "oclConvolution.exe Exiting...\n");
    }
    else
    {
    shrLogEx(LOGBOTH | CLOSELOG, 0, "oclConvolution.exe Exiting...\nPress <Enter> to Quit\n");
    getchar();
    }
    //cvWaitKey();
    getchar();
    exit (iExitCode);
    }

  6. #6

    Re: How to reduce the read buffer time

    Here is the code I have used to run the convolution on the image.

  7. #7
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: How to reduce the read buffer time

    Okay. I have a few comments about the code.

    First, calling GetTickCount() before and after doing some OpenCL operations is going to give you incorrect times in some cases. You will find, for example, that the time it takes to execute clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0, mem_sizeImage,(void*) inimage, 0, NULL, &GPUevent) is very short. In fact, most clEnqueueXXX functions will be very fast. Why? Because they are asynchronous, which means that these functions return before the work has finished. The best way to measure how long it really takes to execute a kernel or to do any of these clEnqueueXXX operations is to use the profiling API that you already know how to use.
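
    To illustrate the difference, here is a sketch only, reusing the variable names from your code above:

    Code :
    // This only measures how long the *enqueue call* takes, not the copy itself,
    // because the command is still executing when clEnqueueWriteBuffer returns.
    ts = GetTickCount();
    clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0, mem_sizeImage, (void*) inimage, 0, NULL, NULL);
    te = GetTickCount();   // te - ts will be tiny

    // If you really want a wall-clock number, drain the queue before you start
    // timing and wait for the command to complete before you stop timing:
    clFinish(cqCommandQueue);
    ts = GetTickCount();
    clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0, mem_sizeImage, (void*) inimage, 0, NULL, NULL);
    clFinish(cqCommandQueue); // now the transfer has really finished
    te = GetTickCount();

    That said, the event profiling shown below is still the more precise way to measure it.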

    Second, if you are not sure of what to do with the "local_size" argument to clEnqueueNDRangeKernel, simply pass NULL. Forcing the local_size to be 1 will reduce the performance of your kernel a lot. So try this instead:

    ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, NULL, 0, NULL, &NDRangeEvent);

    Third, you are not checking the error code returned by clGetEventProfilingInfo. If the information you are requesting is not ready yet, clGetEventProfilingInfo will return CL_PROFILING_INFO_NOT_AVAILABLE. My recommendation is this: first, use a different cl_event variable for each function call you want to measure; second, don't call clGetEventProfilingInfo in the middle of the program. Instead, make all of your OpenCL calls, then call clFinish() or clWaitForEvents() on the last call, and only after all of this is done, query the profiling information with clGetEventProfilingInfo. So try something like this:

    Code :
    ciErrNum = clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_TRUE, 0, mem_sizeImage,(void*) inimage, 0, NULL, &writeSrcABufferEvent);
    [...]
    ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 2, NULL, szGlobalWorkSize, NULL, 0, NULL, &NDRangeEvent);
    ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmDevDst, CL_TRUE, 0, mem_sizeImage, outimage, 0, NULL, &readDstEvent);
    // This clFinish here is to make sure that our calls to clGetEventProfilingInfo will not return an error.
    // OpenCL experts: yes, I know this is not actually necessary in this case; bear with me :)
    clFinish(cqCommandQueue);
     
    // After the clFinish we can finally call clGetEventProfilingInfo
    ciErrNum  = clGetEventProfilingInfo(writeSrcABufferEvent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong),&start, NULL);
    oclCheckErrorEX(ciErrNum,CL_SUCCESS, pCleanup);
     
    ciErrNum  = clGetEventProfilingInfo(writeSrcABufferEvent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong),&end, NULL);
    oclCheckErrorEX(ciErrNum,CL_SUCCESS, pCleanup);
     
    // Print "end - start" here if you want.
     
    // Use clGetEventProfilingInfo on the events NDRangeEvent and readDstEvent here if you want.

    Try this and let us know what results you get.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  8. #8

    Re: How to reduce the read buffer time

    Thanks for your support.
    I will try this and let you know the results.

  9. #9

    Re: How to reduce the read buffer time

    Dear david garcia,

    As you said, I have used event profiling and measured the execution times for different image sizes:

    Image -> GPU => Writing
    Kernel -> GPU => Writing
    GPU -> Image => Reading

    Observation 1:

    Image Size : 375 x 565
    Image -> GPU : 2596800 nano sec
    Kernel -> GPU : 4352 nano sec
    Execution of Kernel : 6994400 nano sec
    GPU -> Image : 1764992 nano sec
    Execution on CPU : 32 milli sec (Using GetTickCount())

    GPU in total takes 11.37 milli sec

    Observation 2:

    Image Size : 2560 x 1024
    Image -> GPU : 10595680 nano sec
    Kernel -> GPU : 5088 nano sec
    Execution of Kernel : 58760128 nano sec
    GPU -> Image : 14339776 nano sec
    Execution on CPU : 359 milli sec (Using GetTickCount())

    GPU in total takes 83.85 milli sec

    Observation 3:

    Image Size : 640 x 480
    Image -> GPU : 3803072 nano sec
    Kernel -> GPU : 5056 nano sec
    Execution of Kernel : 6881120 nano sec
    GPU -> Image : 1447264 nano sec
    Execution on CPU : 47 milli sec (Using GetTickCount())

    GPU in total takes 12.3 milli sec

    Observation 4:

    Image Size : 8192 x 8192
    Image -> GPU : 249863520 nano sec
    Kernel -> GPU : 5120 nano sec
    Execution of Kernel : 1508864704 nano sec
    GPU -> Image : 288091776 nano sec
    Execution on CPU : 8034 milli sec (Using GetTickCount())

    GPU in total takes 2046 milli sec

    Observation 5:

    Image Size : 5000 x 5000
    Image -> GPU : 96716384 nano sec
    Kernel -> GPU : 5152 nano sec
    Execution of Kernel : 792512800 nano sec
    GPU -> Image : 151397632 nano sec
    Execution on CPU : 2980 milli sec (Using GetTickCount())

    GPU in total takes 1039 milli sec

    Observation 6:

    Image Size : 6000 x 6000
    Image -> GPU : 138521568 nano sec
    Kernel -> GPU : 5152 nano sec
    Execution of Kernel : 1141955072 nano sec
    GPU -> Image : 154220928 nano sec
    Execution on CPU : 4181 milli sec (Using GetTickCount())

    GPU in total takes 1434 milli sec

  10. #10

    Re: How to reduce the read buffer time

    Now I have some questions:

    1. Is there a fair way to compare execution times between the GPU and the CPU?
    (We are using event profiling for the GPU and GetTickCount() for the CPU.)

    2. clBuildProgram() is taking a lot of time. Why does that happen?
