
Thread: A little optimization help anyone?

  1. #1
    Junior Member
    Join Date
    Oct 2011
    Posts
    26

    Question A little optimization help anyone?

    Here is the kernel code:

    Code :
    __kernel void assembleMatrix(const int R, const int r0, const int c0, __global const REAL_TYPE *glo_A, __global const REAL_TYPE *glo_B, __global REAL_TYPE *glo_M)
    {
        int row = get_global_id(0);
        int col = get_global_id(1);
     
        if(row < NUM_CELL_VAR && col < NUM_CELL_VAR)
        {
            __global const REAL_TYPE *gA = &glo_A[row*NUM_SAMPLES];
            __global const REAL_TYPE *gB = &glo_B[col*NUM_SAMPLES];
     
            REAL_TYPE dM = 0.0;
     
            for(int s = 0; s < NUM_SAMPLES; s++)
                dM += gA[s]*gB[s];
     
            glo_M[(r0 + row)*R + c0 + col] += dM;
        }
    }

    The identifiers in caps are defined in a separate source (generated before compilation) as
    Code :
    #define REAL_TYPE double
    #define NUM_SAMPLES 4096  // may be defined as anything from 4096 to 65536
    #define NUM_CELL_VAR 64    // may be defined as 64, 216 or 512

    It seems to me that reading both global memory arrays inside the loop may be a bottleneck, so I've been experimenting with copying parts of the global data into local and private memory first. However, the code above still runs the fastest. Perhaps that's because sizeof(REAL_TYPE)*NUM_SAMPLES is too large to fit into local memory all at once on my device: 8 bytes x 4096 samples is already 32 KB per row, which on its own fills the 32 KB of local memory a work-group typically gets.

    As you may see, this is part of a numerical integration, and double precision is necessary. Any clever optimization tips would be greatly appreciated.

    Peccable

  2. #2
    Junior Member
    Join Date
    Jun 2014
    Posts
    10
    To me this looks essentially like an ordinary matrix multiplication, no? I hate to not give you any explicit help, but googling (or searching these forums) for "OpenCL local memory matrix multiplication" turns up a number of results; unfortunately I'm not sure which are best. You could also take a look at Nvidia's CUDA documentation, since the approach there will be very similar.

  3. #3
    Junior Member
    Join Date
    Oct 2011
    Posts
    26
    Thanks, yes, it's a sum of outer products, essentially equivalent to multiplying two rectangular matrices. I've found a few examples (maybe my google-fu is no good), but they all seem to suggest copying a buffer into local or private memory, which in my case causes a significant slowdown.

    I've also tried casting gA and gB to double4 pointers and summing up dot(.., ..) products, which gave a slight improvement.
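
    For reference, that vectorized loop looked roughly like this (a sketch from memory; it assumes NUM_SAMPLES is a multiple of 4, which it always is here):

    Code :
    // reinterpret the rows as packed double4 vectors
    __global const double4 *gA4 = (__global const double4 *)gA;
    __global const double4 *gB4 = (__global const double4 *)gB;

    REAL_TYPE dM = 0.0;

    // dot() does four multiplies and three adds per iteration
    for(int s = 0; s < NUM_SAMPLES/4; s++)
        dM += dot(gA4[s], gB4[s]);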

    The only thing I've found that makes a significant difference is precomputing the starting addresses in the source data (the pointers gA = &glo_A[row*NUM_SAMPLES] and gB = &glo_B[col*NUM_SAMPLES]). That gave about a 15% improvement.

    Currently this kernel achieves about 51 GFLOPS. Compared to my device's rated 152 GFLOPS for double precision, that seems a bit low, no?

  4. #4
    Junior Member
    Join Date
    Jun 2014
    Posts
    10
    I haven't had a chance to try it out yet, but you might try transposing gA and gB so that the reads from global memory are coalesced. That is to say, if NUM_SAMPLES were 8 and you had 8 work-items, the memory layout would be something like:

    [A00][A08][A16]...[A56]
    [A01][A09][A17]...[A57]
    ...
    [A07][A15][A23]...[A63]

    As it is now, adjacent work-items read from A[0], A[NUM_SAMPLES], ... , A[(get_global_size(0) - 1) * NUM_SAMPLES], and so on, all at the same time, which I doubt is very efficient in terms of memory bandwidth. You could probably also restructure the kernel so that it processes one line from A/B at a time instead of transposing the data, but either should work. I forgot to mention: if you have a profiler for your device, make sure you use it, since I am only guessing at the bottleneck here.
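
    In kernel form, the transposed version might look something like this (an untested sketch that keeps your original signature; it assumes the host fills glo_A and glo_B in sample-major order):

    Code :
    // Transposed layout: element s of row r is stored at index s*NUM_CELL_VAR + r,
    // so work-items with consecutive row/col indices read consecutive addresses.
    __kernel void assembleMatrixT(const int R, const int r0, const int c0, __global const REAL_TYPE *glo_A, __global const REAL_TYPE *glo_B, __global REAL_TYPE *glo_M)
    {
        int row = get_global_id(0);
        int col = get_global_id(1);

        if(row < NUM_CELL_VAR && col < NUM_CELL_VAR)
        {
            REAL_TYPE dM = 0.0;

            for(int s = 0; s < NUM_SAMPLES; s++)
                dM += glo_A[s*NUM_CELL_VAR + row]*glo_B[s*NUM_CELL_VAR + col];

            glo_M[(r0 + row)*R + c0 + col] += dM;
        }
    }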

  5. #5
    Junior Member
    Join Date
    Oct 2011
    Posts
    26
    Thanks a lot for your input. I've just tried the CodeXL profiler from AMD. Here is what it has to say about the kernel

    Code :
    Method               assembleMatrix__k3_Pitcairn1
    ExecutionOrder       543
    ThreadID             11548
    CallIndex            2949
    GlobalWorkSize       {216 216 1}
    WorkGroupSize        {16  16  1}
    Time                 13.14148
    LocalMemSize         0
    VGPRs                48
    SGPRs                26
    ScratchRegs          0
    FCStacks             NA
    KernelOccupancy      50
    Wavefronts           729
    VALUInsts            27676
    SALUInsts            1742
    VFetchInsts          6913
    SFetchInsts          8
    VWriteInsts          1
    LDSInsts             0
    VALUUtilization (%)  100
    VALUBusy (%)         56.12
    SALUBusy (%)         1.09
    FetchSize            440963.38
    CacheHit (%)         39.30
    MemUnitBusy (%)      91.60
    MemUnitStalled (%)   0.03
    WriteUnitStalled (%) 0
    LDSBankConflict (%)  0
    GDSInsts             0
    WriteSize            389.38

    I haven't profiled a GPU kernel before, so I'm reading up a bit to see what to make of these numbers. But right away I notice the VALUBusy at 56.12% (fraction of time spent on vector instructions) and SALUBusy at 1.09% (fraction of time spent on scalar instructions), which are supposedly bad. But I guess SALUBusy being low is just because most of the instructions are vector type.

  6. #6
    Junior Member
    Join Date
    Jun 2014
    Posts
    10
    Quote Originally Posted by Peccable View Post
    Thanks a lot for your input. I've just tried the CodeXL profiler from AMD. Here is what it has to say about the kernel

    I haven't profiled a GPU kernel before, so I'm reading up a bit to see what to make of these numbers. But right away I notice the VALUBusy at 56.12% (fraction of time spent on vector instructions) and SALUBusy at 1.09% (fraction of time spent on scalar instructions), which are supposedly bad. But I guess SALUBusy being low is just because most of the instructions are vector type.
    I'm tempted to say that 50% isn't terrible, all things considered, but theoretically your code could be up to twice as fast (if the goal is to be compute-bound). I tried running a test version of your kernel with similar parameters here:
    Code c++:
    #define REAL_TYPE double
    #define NUM_SAMPLES 6144
    #define NUM_CELL_VAR 216
     
    __kernel void assembleMatrix(__global const REAL_TYPE *glo_A, __global const REAL_TYPE *glo_B, __global REAL_TYPE *glo_M)
    {
        int row = get_global_id(0),
            col = get_global_id(1);
     
        if(row < NUM_CELL_VAR && col < NUM_CELL_VAR)
        {
            __global const REAL_TYPE *gA = &glo_A[row*NUM_SAMPLES];
            __global const REAL_TYPE *gB = &glo_B[col*NUM_SAMPLES];
     
            REAL_TYPE dM = 0.0;
     
            for(int s = 0; s < NUM_SAMPLES; ++s)
                dM += gA[s] * gB[s];
     
    //        // mad version
    //        for(int s = 0; s < NUM_SAMPLES; ++s)
    //            dM = mad(gA[s], gB[s], dM);
     
            glo_M[row * NUM_CELL_VAR + col] = dM;
        }
    }

    For the non-mad() version I got the following (the labels are abbreviated, but they're the same counters you already found):
    Code :
    Method           assembleMatrix__k1_Tahiti1
    ExecutionOrder   11
    ThreadID         9907
    CallIndex        61
    GlobalWorkSize   {216 216 1}
    WorkGroupSize    {16  16  1}
    Time             45.23496
    LocalMemSize     0
    VGPRs            34
    SGPRs            16
    ScratchRegs      0
    FCStacks         NA
    Wavefronts       729
    VALUInsts        12686
    SALUInsts        205
    VFetchInsts      6144
    SFetchInsts      8
    VWriteInsts      1
    LDSInsts         0
    GDSInsts         0
    VALUUtilization  100
    VALUBusy         1.6
    SALUBusy         0.02
    FetchSize        295928.5
    WriteSize        401.06
    CacheHit         54.29
    MemUnitBusy      98.77
    MemUnitStalled   0.05
    WriteUnitStalled 0
    LDSBankConflict  0

    Interestingly, the mad() version performs slightly worse, seemingly because of more cache misses? Honestly, I'm not really sure what's going on there. The only values that differed from the profile above are:
    Code :
    Time             48.70119
    VALUInsts        6542
    VALUBusy         1.12
    CacheHit         47

    Anyway, that's all beside the point. The fact that MemUnitBusy is 90%+ while VALUBusy is small (and on my device, extremely so) means that, as far as I can tell, your program is memory bound. We can tell the reads are not as good as they could be by looking at the FetchSize:
    Quote Originally Posted by AMD
    FetchSize: The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
    For my parameters glo_A and glo_B each have a size of 216 * 6144 * (8 bytes), which comes out to about 21.23 MB total, but as you can see from my profile results the total fetch size is 295.93 MB, over 10x more. Luckily the cache does help some, so it's not as big as it could be, but I think it could be improved by coalescing the global reads and/or explicitly using local memory to hold reused data (something like the sketch below). Obviously we won't get it down to just 21.23 MB, but I think it can be cut substantially, which should raise VALUBusy.
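
    Just to illustrate the local-memory idea, a tiled version might look roughly like this (an untested sketch; it assumes a TILE x TILE work-group and that NUM_SAMPLES and the global size are multiples of TILE, so your 216 case would need extra edge handling):

    Code :
    #define TILE 16   // must equal the work-group size in each dimension

    __kernel void assembleMatrixLDS(const int R, const int r0, const int c0, __global const REAL_TYPE *glo_A, __global const REAL_TYPE *glo_B, __global REAL_TYPE *glo_M)
    {
        int row = get_global_id(0);
        int col = get_global_id(1);
        int lr  = get_local_id(0);
        int lc  = get_local_id(1);

        // each work-group caches a TILE-sample slice of its TILE A-rows and TILE B-rows
        __local REAL_TYPE locA[TILE][TILE];
        __local REAL_TYPE locB[TILE][TILE];

        REAL_TYPE dM = 0.0;

        for(int t = 0; t < NUM_SAMPLES; t += TILE)
        {
            // cooperative load: each work-item fetches one element of each slice
            locA[lr][lc] = glo_A[row*NUM_SAMPLES + t + lc];
            locB[lc][lr] = glo_B[col*NUM_SAMPLES + t + lr];
            barrier(CLK_LOCAL_MEM_FENCE);

            for(int s = 0; s < TILE; s++)
                dM += locA[lr][s]*locB[lc][s];

            barrier(CLK_LOCAL_MEM_FENCE);   // don't overwrite the tile while others still read it
        }

        glo_M[(r0 + row)*R + c0 + col] += dM;
    }

    Whether this actually beats the plain global-memory loop is another question; on your hardware it may well lose, as you already observed.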

  7. #7
    Senior Member
    Join Date
    Dec 2011
    Posts
    161
    Just an observation: The GlobalWorkSize is not an integer multiple of the WorkGroupSize (216 is not evenly divisible by 16). In OpenCL 1.x, if you specify the work group size then the global size must be a multiple of it. Then you can pass the real size as a parameter and your kernel can check to see if the global_id is within the valid size before doing work.
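
    On the host side the rounding looks roughly like this (just a fragment, not a complete program; queue and kernel are whatever you already have, and NUM_CELL_VAR here stands for the host-side copy of that value):

    Code :
    // Round the global size up to the next multiple of the work-group size;
    // the kernel's "row < NUM_CELL_VAR && col < NUM_CELL_VAR" check masks off the padding.
    size_t local_size[2]  = {16, 16};
    size_t global_size[2] = {
        ((NUM_CELL_VAR + local_size[0] - 1) / local_size[0]) * local_size[0],   // e.g. 216 -> 224
        ((NUM_CELL_VAR + local_size[1] - 1) / local_size[1]) * local_size[1]
    };
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, local_size, 0, NULL, NULL);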

  8. #8
    Junior Member
    Join Date
    Oct 2011
    Posts
    26
    So I transposed the global memory buffers, and this gave some improvement. VALUBusy is now typically 70-80% (SALUBusy is about 10%), and overall the kernel achieves about 60 GFLOPS.

    FetchSize still ranges from 100 MB to 200 MB. The fact that it varies a lot from one run to another is a bit strange. Maybe that's because the GPU is also driving a screen and rendering for other applications?

    Quote Originally Posted by Dithermaster View Post
    Just an observation: The GlobalWorkSize is not an integer multiple of the WorkGroupSize (216 is not evenly divisible by 16). In OpenCL 1.x, if you specify the work group size then the global size must be a multiple of it. Then you can pass the real size as a parameter and your kernel can check to see if the global_id is within the valid size before doing work.
    Hmm, you're right. It hasn't caused a problem, but I've changed the work-group size to 8x8 (216 is divisible by 8) to be safe.

    BTW, I've been using dot(A,B) instead of A*B. Even when A and B are scalars rather than vectors, dot() seems slightly faster than *. I guess this is highly system-dependent, though.
    Last edited by Peccable; 06-25-2014 at 09:26 AM.
