Results 1 to 9 of 9

Thread: Low performance of OpenCL application

  1. #1
    Junior Member
    Join Date
    Mar 2011
    Posts
    28

    Low performance of OpenCL application

    Hi

    I have a question about how to get better performance of my OpenCL application. The size of computations is quiet big - something like 10 millions of computations needed.

    I'm not sure if I'm using OpenCL API right, because my GPU application is not any faster than CPU. Of course it's not a rule that GPU version will be 100x faster than CPU one, but just check my current approach to the problem:

    Problem need to run a lot of computations, a lot of work items - something like 10 mln.
    I set global_work_size to 640,
    local_work_size to 320.

    After every run of clEnqueueNDRangeKernel() I'm reading results to check if my problem is already solved with clEnqueueReadBuffer (blocking set to CL_TRUE).

    The final performance is still very poor. I haven't done any measurements but I see it's just not fast enough. If I missed some basic information just tell. If code is required to analyze - tell which one.

    PS. I'm computing on NVIDIA Quadro 140M NVS (laptop)

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Low performance of OpenCL application

    Whenever you want to improve the performance of a piece of code the first thing you need to do is to measure where is the time being spent. There's no performace tuning without performance measurement. Your hardware vendor surely has a nice profiling tool that you can use to find out how long each operation in your program is taking.

    Generally speaking you want to minimize the number of synchronization points. By this I mean places where the GPU is waiting for the CPU to do something or vice-versa. Blocking memory reads are an example of synchronization points.

    Currently from what you describe you are running a tiny amount of work (640 work-items), then reading back data, then run some test on the data ("check if my problem is already solved"), then run a little amount of work again.

    Why not execute bigger NDRanges? Also, why check if the solution was found so many times? It would probably be cheaper to do the check with a kernel instead of using the CPU.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3
    Junior Member
    Join Date
    Mar 2011
    Posts
    28

    Re: Low performance of OpenCL application

    Why not execute bigger NDRanges?
    Every workitem to compute it's own result need to have among other things three arrays; two of them of size like ~1500bytes (size is not fixed) and one of size 256bytes (this array is always 256bytes long). So to provide space for "global_work_size" number of work items I'm allocating these arrays in advance by using clCreateBuffer() and pass pointers as kernel arguments. Every array is "global_work_size" times bigger than it should be for one computation, because "at the same time" "global_work_size" number of computations is performed.

    Inside kernel I'm computing offset for every work item so it has independent space (like I said 3 arrays) where temporary results (needed to compute final result) can be stored.

    In kernel there is also little piece of code which check if problem was solved. If it was - some values are copied to small array which also was created by clCreateBuffer() and passed as a kernel argument. After running clEnqueueNDRangeKernel() I read this small array to check if something is there (if this array is filled with data - this is my final result I'm looking for).

    Of course I can provide also simplified code so it will be easier to see any bigger code mistakes and technical issues with memory management.

    It could be great if there is any way to improve my approach because now it performs not so bad and probably I'm not taking any advantage of running my code on GPU.

  4. #4
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Low performance of OpenCL application

    I insist that the first thing you need is to run a performance profiler. Anything I say will just be guessing based on what you've described.

    Every workitem to compute it's own result need to have among other things three arrays; two of them of size like ~1500bytes (size is not fixed) and one of size 256bytes (this array is always 256bytes long).
    So each work-item needs less than 2KB of private data. Let's say that we want to continue this simple approach. Let's also say that you allocate a 256MB buffer for this purpose. This means you could run NDRanges with up to 128x1024 work-items in them.

    After running clEnqueueNDRangeKernel() I read this small array to check if something is there (if this array is filled with data - this is my final result I'm looking for).
    Why not read that small array inside another kernel? What you want to avoid is this:

    1. Enqueue small NDRange.
    2. Blocking read.
    3. If solution not found, go to step 1.

    Instead, you can do something like this:

    1. Initialize an integer in global memory with the value zero. Let's call this the "found solution" variable. When your kernel has found a solution, it will set that variable to one.
    2. Enqueue large NDRange that will run the main kernel. First of all this kernel checks if the "found solution" variable is zero or not. If it's not zero, all work items return immediately.
    3. Repeat step 2 a bunch of times. You could do step 2 only once but we have to limit the size of the NDRange due to the memory requirements mentioned above.
    4. Read the small buffer with the "found variable" and stop if it's not zero.

    I'm just guessing here how your algorithm works. Hopefully this will be close enough to give you some ideas.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  5. #5
    Junior Member
    Join Date
    Mar 2011
    Posts
    28

    Re: Low performance of OpenCL application

    I changed my kernel and now arrays are in private memory. Complexity of kernel decreased and value given by clGetKernelWorkGroupInfo() jumped to 448 (from 320) - so it's good I think.

    Once again info about my GPU:
    Code :
    OpenCL SW Info:
     
     CL_PLATFORM_NAME:      NVIDIA CUDA
     CL_PLATFORM_VERSION:   OpenCL 1.0 CUDA 4.0.1
     OpenCL SDK Revision:   7027912
     
     
    OpenCL Device Info:
     
     1 devices found supporting OpenCL:
     
     ---------------------------------
     Device Quadro NVS 140M
     ---------------------------------
      CL_DEVICE_NAME:                       Quadro NVS 140M
      CL_DEVICE_VENDOR:                     NVIDIA Corporation
      CL_DRIVER_VERSION:                    275.33
      CL_DEVICE_VERSION:                    OpenCL 1.0 CUDA
      CL_DEVICE_TYPE:                       CL_DEVICE_TYPE_GPU
      CL_DEVICE_MAX_COMPUTE_UNITS:          2
      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:   3
      CL_DEVICE_MAX_WORK_ITEM_SIZES:        512 / 512 / 64
      CL_DEVICE_MAX_WORK_GROUP_SIZE:        512
      CL_DEVICE_MAX_CLOCK_FREQUENCY:        800 MHz
      CL_DEVICE_ADDRESS_BITS:               32
      CL_DEVICE_MAX_MEM_ALLOC_SIZE:         128 MByte
      CL_DEVICE_GLOBAL_MEM_SIZE:            82 MByte
      CL_DEVICE_ERROR_CORRECTION_SUPPORT:   no
      CL_DEVICE_LOCAL_MEM_TYPE:             local
      CL_DEVICE_LOCAL_MEM_SIZE:             16 KByte
      CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:   64 KByte
      CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
      CL_DEVICE_QUEUE_PROPERTIES:           CL_QUEUE_PROFILING_ENABLE
      CL_DEVICE_IMAGE_SUPPORT:              1
      CL_DEVICE_MAX_READ_IMAGE_ARGS:        128
      CL_DEVICE_MAX_WRITE_IMAGE_ARGS:       8
      CL_DEVICE_SINGLE_FP_CONFIG:           INF-quietNaNs round-to-nearest round-to-zero round-to-inf fma
     
      CL_DEVICE_IMAGE <dim>                 2D_MAX_WIDTH     4096
                                            2D_MAX_HEIGHT    32768
                                            3D_MAX_WIDTH     2048
                                            3D_MAX_HEIGHT    2048
                                            3D_MAX_DEPTH     2048
     
      CL_DEVICE_EXTENSIONS:                 cl_khr_byte_addressable_store
                                            cl_khr_icd
                                            cl_khr_gl_sharing
                                            cl_nv_d3d9_sharing
                                            cl_nv_d3d10_sharing
                                            cl_khr_d3d10_sharing
                                            cl_nv_d3d11_sharing
                                            cl_nv_compiler_options
                                            cl_nv_device_attribute_query
                                            cl_nv_pragma_unroll
                                            cl_khr_global_int32_base_atomics
                                            cl_khr_global_int32_extended_atomics
     
     
      CL_DEVICE_COMPUTE_CAPABILITY_NV:      1.1
      NUMBER OF MULTIPROCESSORS:            2
      NUMBER OF CUDA CORES:                 16
      CL_DEVICE_REGISTERS_PER_BLOCK_NV:     8192
      CL_DEVICE_WARP_SIZE_NV:               32
      CL_DEVICE_GPU_OVERLAP_NV:             CL_TRUE
      CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV:     CL_TRUE
      CL_DEVICE_INTEGRATED_MEMORY_NV:       CL_FALSE
      CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t>  CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 0

    And what I need in my kernel:
    S[256]
    keystream[1512]
    plaintext[1512] (value 1512 can be a little bit different - it depends on WiFi packet length)

    Here is my kernel code:
    Code :
    __kernel void crash(int messageLength,  
                        int rootKeySize,
                        __global uchar *iv,
                        __global uchar *cipher,
                        __global uchar *passwordfound,
    		    __global uint *solved)
    {
     
    if(solved[0]==1) return;
     
    int length = 5;
    char secretKey[5];
    for(int k = 0; k < length; k++) {
    	secretKey[k] = ch[0];
    }
     
    int tmpi = get_global_id(0);
    int rest;
    for(int j = 0; j < length; j++) {
    	rest = tmpi % 25;
    	tmpi = tmpi / 25;	
    	secretKey[j] = secretKey[j] + rest;
    }
     
    // RC4-KSA
    uchar S[256];
    int i, j;
    for (i = 0; i < N; i++) {
    	S[i] = i;
    }
    j = 0;
    for (i = 0; i < N; i++) {
    	j = (j + S[i] + ((i%rootKeySize<IV_SIZE) ? iv[i % rootKeySize] : secretKey[i%rootKeySize-IV_SIZE])) % N;
           SwapElements(S, i, j);
    }
     
    // RC4-PRGA
    uchar keystream[1512];
    i = 0;
    j = 0;
    for (int repetition = 0; repetition < messageLength; repetition++) {
    	i = (i + 1) % N;
    	j = (j + S[i]) % N;
    	SwapElements(S, i, j);
    	keystream[repetition] = S[((S[i] + S[j]) % N)];
    }
     
    // compute plaintext from cipher 
    uchar plaintext[1512];
    for (i = 0; i < messageLength; i++) {
    	plaintext[i] = (int)(cipher[i] ^ keystream[i]);
    }
     
    bool keyValid = true;
    uchar hash[4];
    Crc32(plaintext, 0, messageLength - CRC_SIZE, hash);
    for (int i = 0; i < CRC_SIZE; i++) {
            keyValid &= hash[i] == plaintext[messageLength - 1 - i];
    }
    if(keyValid) {
       solved[0]=1;
       for(int z=0; z<length; z++) passwordfound[z]=secretKey[z];
    }
    }

    And now could you please explain how can I manage to do my computations without blocking read? This part is still not clear for me.

  6. #6
    Junior Member
    Join Date
    Mar 2011
    Posts
    28

    Re: Low performance of OpenCL application

    And now host code.

    Buffers:
    Code :
    // cipherBody and iv arrays were somewhere initialized and filled with data.  
    CipherBuffer = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(uchar)*(size - HEADER_SIZE), cipherBody, &errcode);
    assert(errcode==CL_SUCCESS);
     
    IVBuffer = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(uchar)*IV_SIZE, iv, &errcode);
    assert(errcode==CL_SUCCESS);
     
    // Allocate output memory on GPU (password is a final result)
    PassBuffer = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(uchar)*PASSLENGTH, NULL, &errcode);
    assert(errcode==CL_SUCCESS);
     
    // Solved - indicator you suggested
    uint *solved = new uint[1];
    solved[0]=0;
     
    cl_mem solvedBuffer = clCreateBuffer(GPUContext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(uint), solved, &errcode);
    assert(errcode==CL_SUCCESS);

    Setting arguments for kernel:
    Code :
    clSetKernelArg(OpenCLVectorAdd, 0, sizeof(int), &siz); 
    clSetKernelArg(OpenCLVectorAdd, 1, sizeof(int), &ROOTKEYLENGTH); 
    clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&IVBuffer); 
    clSetKernelArg(OpenCLVectorAdd, 3, sizeof(cl_mem), (void*)&CipherBuffer); 
    clSetKernelArg(OpenCLVectorAdd, 4, sizeof(cl_mem), (void*)&PassBuffer);  
    clSetKernelArg(OpenCLVectorAdd, 5, sizeof(cl_mem), (void*)&solvedBuffer);

    Running kernel with avoiding blocking read - how? WrokSize parameter should be a size_t value of something really big now. 1512+1512+256+(some other small arrays and variables something like 30 bytes )= approx. 3.5 KB lets say. How much memory can I allocate ? 82 MB ? (CL_DEVICE_GLOBAL_MEM_SIZE).

  7. #7
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Low performance of OpenCL application

    I thought you couldn't put that data in private memory and that was why you were using global memory. I must have been thinking of somebody else.

    Running kernel with avoiding blocking read - how? WrokSize parameter should be a size_t value of something really big now. 1512+1512+256+(some other small arrays and variables something like 30 bytes )= approx. 3.5 KB lets say. How much memory can I allocate ? 82 MB ? (CL_DEVICE_GLOBAL_MEM_SIZE).
    Sorry, I don't understand. You seem to be saying several different things at once. Since you are now using private memory for all those temporary variables, you should be able to use very large global work sizes, independently of how much global memory you have -- because private memory is managed internally by the OpenCL driver.

    BTW, I hope I'm not breaking any local laws by helping you to write a wifi password cracker
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  8. #8
    Junior Member
    Join Date
    Mar 2011
    Posts
    28

    Re: Low performance of OpenCL application

    No you don't break laws ;p Do I ? ;p Purpose of code is to learn OpenCL

    I successfully run kernel with global_work_size=20000. To run all my computations I executed clEnqueueNDRange() many times in loop and after loop I placed clEnqueueReadBuffer() with blocking set to CL_TRUE like you suggested on some other thread. Result was calculated faster than on CPU.

    I do not know how big can improvement be in comparison to CPU version, but running my application on laptop with better graphic card than mine give results even better so it's good indicator.

  9. #9
    Junior Member
    Join Date
    Mar 2011
    Posts
    28

    Re: Low performance of OpenCL application

    I forgot to add that setting bigger global_work_size (like 30k or 40k) causes my GPU reset or screens (both laptop screen and external screen) get dark and I have to restart my computer.

Similar Threads

  1. opencl performance
    By opencl_beginner in forum OpenCL
    Replies: 2
    Last Post: 11-09-2010, 10:44 PM
  2. Replies: 2
    Last Post: 01-04-2010, 12:58 AM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •