
Thread: Writing var to output array is very slow.

  1. #1
    Junior Member
    Join Date
    Sep 2010
    Posts
    4

    Writing var to output array is very slow.

    Hello,

    I have a performance problem when writing the results of my kernel function to the output array. The total kernel time increases from 3 ms to about 80 ms because of a single write operation.
    Surprisingly, the performance does not decrease if I write a constant value into the output array.

    The Code of the kernel is the following:
    Code :
    __kernel void linearMatching(                                           
       __global float* list,                                               
       __global float* list2,                                              
       __global float* resultList,  
       __global float2* tempList,                                      
     
       const unsigned int count,                                            
       const unsigned int count2,                                           
       const unsigned int nDim,
       const unsigned int nBufferSize
       )                                         
    {                     
        const int nWorkgrps = get_num_groups(0);
        const int global_id = get_group_id(0);                                            
        const int local_id = get_local_id(0);
        const int LOCAL_WORKSIZE = min(get_local_size(0), ARRAY_SIZE);
        //work groups needed for list2
        int nWGs_L2;
        if(count2 % LOCAL_WORKSIZE == 0)
            nWGs_L2 = (count2 / LOCAL_WORKSIZE);
        else
            nWGs_L2 = (count2 / LOCAL_WORKSIZE) + 1;
     
        //current list1 index
        int L1_index = global_id / nWGs_L2;
        //current list2 index
        int L2_index = (global_id % nWGs_L2) * LOCAL_WORKSIZE + local_id;
        //l2 items left to check
        int leftL2 = min(LOCAL_WORKSIZE, count2 - (global_id % nWGs_L2) * LOCAL_WORKSIZE);
        if(L1_index >= count || L2_index >= count2)  
            return;
        if(local_id >= ARRAY_SIZE)
            return;
     
     
        //builds sums for LOCAL_WORKSIZE items of l2 with one item of l1
        float diff;    
        __local float2  sums[ARRAY_SIZE];
     
        sums[local_id].x = 0;
        sums[local_id].y = (float)L2_index;
        for(int k = 0; k < nDim; k++)  {                                
            diff = (list[L1_index*nDim + k] - list2[L2_index*nDim + k]);                       
            sums[local_id].x += diff * diff;                         
        }    
     
        //get minimum
        const int nSearchThreads = (8 < LOCAL_WORKSIZE)? 8 : LOCAL_WORKSIZE;
        float cur_min_error = sums[0].x;
     
        int best_index = sums[0].y;
        if(local_id % (nDim/nSearchThreads) == 0){
            const int nItemstosearch = nDim/nSearchThreads;
            const int offset = (local_id / nSearchThreads) * nItemstosearch;
            const int end = min(nItemstosearch + offset, count2%LOCAL_WORKSIZE);
            for(int k = offset; k < end; k++){
                if(sums[k].x < cur_min_error){
                    cur_min_error = sums[k].x;
                    best_index = sums[k].y;
                }
            }
     
     
     
            sums[(local_id / nSearchThreads)].x = cur_min_error;
            sums[(local_id / nSearchThreads)].y = best_index;
        }
     
        if(local_id == 0){
            for(int k = 0; k < nSearchThreads; k++)
                if(sums[k].x < cur_min_error){
                    cur_min_error = sums[k].x;
                    best_index = sums[k].y;
     
                }    
     
            int index = (L1_index*nWGs_L2 + L2_index/nWGs_L2) * 2;
            resultList[index] = 11.3f; //<---- with this line, no performance decrease
            resultList[index+1] = best_index; //<- with this line it is like 30x slower
        }
     
     
     
    }

    The last lines with "resultList" are the lines I'm talking about. With the first one, it is fast. With the second one, it is slow.

    The work-group size is 256, with a total of 256*2048 work-items.
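
    The group-to-index mapping at the top of the kernel can be checked on the host. The following is a minimal sketch in plain C, not OpenCL C. The struct and function names are hypothetical, and the value count2 = 1000 used in the usage note is an assumption chosen only for illustration.

```c
#include <assert.h>

/* Round-up integer division: number of chunks of size b needed to cover a. */
static unsigned ceil_div(unsigned a, unsigned b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Host-side mirror of the kernel's index arithmetic (hypothetical type). */
struct match_index {
    unsigned l1_index;  /* which entry of list this work-group handles */
    unsigned l2_index;  /* which entry of list2 this work-item handles */
};

static struct match_index map_work_item(unsigned group_id, unsigned local_id,
                                        unsigned count2, unsigned local_size)
{
    /* Same value as nWGs_L2 in the kernel, without the if/else. */
    unsigned groups_per_l1 = ceil_div(count2, local_size);
    struct match_index m;
    m.l1_index = group_id / groups_per_l1;
    m.l2_index = (group_id % groups_per_l1) * local_size + local_id;
    return m;
}
```

    With count2 = 1000 and a work-group size of 256, four work-groups cover one entry of list, so work-group 5 handles list entry 1, and its work-item 3 handles list2 entry 259.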

    Does anybody have an idea why it is so slow and how to fix it?

    Another question on the side: is it in any way faster if I split the work-items across two dimensions instead of one?

    kind regards,
    veio

  2. #2
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Writing var to output array is very slow.

    When you write a constant value the compiler is going to optimize away the best_index variable and some of the computations that go into producing it.

    In particular, any reasonable compiler will remove the code between the two comments:
    Code :
    if(local_id == 0){
            // Start of code that will be removed
            for(int k = 0; k < nSearchThreads; k++)
                if(sums[k].x < cur_min_error){
                    cur_min_error = sums[k].x;
                    best_index = sums[k].y;
     
                }    
            // End of code that will be removed
            int index = (L1_index*nWGs_L2 + L2_index/nWGs_L2) * 2;
            resultList[index] = 11.3f; //<---- with this line, no performance decrease
            resultList[index+1] = best_index; //<- with this line it is like 30x slower
        }
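
    A minimal sketch of this effect in plain C rather than OpenCL C. The helper find_best_index is a hypothetical stand-in for the search loops, and whether a particular compiler actually removes the dead code depends on the compiler and its optimization settings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the final search over the partial results. */
static int find_best_index(const float *err, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t k = 1; k < n; k++)
        if (err[k] < err[best])
            best = k;
    return (int)best;
}

/* Variant A: the search result never reaches memory, so an optimizing
   compiler is free to delete the call and the loop behind it. */
static void write_constant(const float *err, size_t n, float *out)
{
    int best = find_best_index(err, n);
    (void)best;                 /* value is computed but never stored */
    out[0] = 11.3f;
}

/* Variant B: the result is stored, so the search has to be executed. */
static void write_result(const float *err, size_t n, float *out)
{
    out[0] = 11.3f;
    out[1] = (float)find_best_index(err, n);
}
```

    Timing variant A therefore says little about the cost of the search. To measure it, the computed value has to be written to the output, as in variant B.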

    Also notice that the code inside the if(local_id == 0) block is only executed by one work-item in each work-group. That in turn slows down the execution of the whole work-group, since the work-group can't finish until that last work-item has also finished. This means that the hardware will be heavily under-utilized.

    I hope this explanation makes sense to you.
    Disclaimer: Employee of Qualcomm Canada. Any opinions expressed here are personal and do not necessarily reflect the views of my employer. LinkedIn profile.

  3. #3
    Junior Member
    Join Date
    Sep 2010
    Posts
    4

    Re: Writing var to output array is very slow.

    Thanks for the answer.
    It seems that you are right.
    If I delete the whole function and just assign the best_index variable to the output, it is faster.

    "Also notice that the code inside the if(local_id == 0) block is only executed by one work-item in each work-group. That in turn slows down the execution of the whole work-group, since the work-group can't finish until that last work-item has also finished. This means that the hardware will be heavily under-utilized."
    Well, that loop is only repeated 8 times, which shouldn't take long.
    The one before it runs 16 times, and the first one 128 times.
    The whole function is only about 500 operations per work-item. That shouldn't take 80 ms (copying the data from host to device and back takes about 3 ms).

    And since I don't use a barrier or any other synchronization, all 256*2048 threads should be independent, shouldn't they? How many threads really run in parallel? Only one work-group? All of them?


    My CPU needs about 300 ms to do this sequentially. And this part:
    Code :
            diff = (list[L1_index*nDim + k] - list2[L2_index*nDim + k]);                       
            sums[local_id].x += diff * diff;
    is repeated 67 million times.
    So I don't understand why it takes so long to execute so few lines.
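
    The 67 million figure matches the numbers given in this thread, assuming nDim = 128 (the loop described above as running 128 times). A small sketch in plain C with hypothetical function names. The byte count is only an upper bound and assumes that no load is served from a cache.

```c
#include <assert.h>
#include <stdint.h>

/* Total inner-loop iterations: one per (work-item, dimension) pair. */
static uint64_t inner_loop_iterations(uint64_t local_size, uint64_t num_groups,
                                      uint64_t n_dim)
{
    assert(local_size > 0 && num_groups > 0 && n_dim > 0);
    return local_size * num_groups * n_dim;
}

/* Bytes loaded from global memory if every read of list and list2
   goes to memory: two float loads per iteration. */
static uint64_t global_bytes_read(uint64_t iterations)
{
    return iterations * 2u * sizeof(float);
}
```

    For 256 work-items per group, 2048 groups and 128 dimensions this gives 67,108,864 iterations and, at 4 bytes per float, up to 536,870,912 bytes (512 MiB) of global-memory reads.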

    My hardware is an ATI Radeon 5850 and an AMD Phenom II X4 965 (3.4 GHz).

    kind regards,
    veio

  4. #4
    Junior Member
    Join Date
    Sep 2010
    Posts
    4

    Re: Writing var to output array is very slow.

    Another thing: is a blocking clEnqueueWriteBuffer call really complete when the function returns, or does the compiler "optimize away" things there as well?

  5. #5
    Senior Member
    Join Date
    May 2010
    Location
    Toronto, Canada
    Posts
    845

    Re: Writing var to output array is very slow.

    "How many threads really run in parallel? Only one work-group? All of them?"
    That is going to depend completely on the device where you are running the app. I suggest reading the vendor's developer guide.

    "Another thing: is a blocking clEnqueueWriteBuffer call really complete when the function returns, or does the compiler 'optimize away' things there as well?"
    clEnqueueWriteBuffer with blocking enabled is really done before the call returns. This is mandated by the spec.

  6. #6
    Junior Member
    Join Date
    Sep 2010
    Posts
    4

    Re: Writing var to output array is very slow.

    thank you.

    I just found out about the image2d_t type and used it to store my data, and now it is 38x faster than the CPU.
    That's a result I can live with, but I still don't get why the other way is so slow.
    However, image2d_t with large images is not supported by many devices, if I understood the literature correctly.
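
    One possible contributor, which nobody in this thread confirmed, is the access pattern on list2. Neighbouring work-items read elements that are nDim floats apart, and GPUs generally handle such strided global reads less efficiently than contiguous ones. The sketch below, in plain C with hypothetical function names, compares the kernel's offset with a transposed layout. The transposed layout is an illustration and was not tried in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Element offset into list2 as the kernel computes it:
   one row of n_dim floats per list2 entry. */
static size_t offset_row_major(size_t l2_index, size_t k, size_t n_dim)
{
    assert(k < n_dim);
    return l2_index * n_dim + k;
}

/* Hypothetical transposed layout: one row per dimension, so work-items
   with neighbouring l2_index read neighbouring elements. */
static size_t offset_transposed(size_t l2_index, size_t k, size_t count2)
{
    assert(l2_index < count2);
    return k * count2 + l2_index;
}
```

    With nDim = 128, two adjacent work-items are 128 elements (512 bytes) apart in the original layout and 1 element apart in the transposed one.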
