Thread: Help required understanding some results(of matrix mult.)

  1. #1
    Junior Member
    Join Date
    May 2012
    Posts
    2

    Help required understanding some results(of matrix mult.)

    Hi,
    I have recently started learning OpenCL. I started with a simple matrix multiplication example to see how much a GPU can reduce computation time, and to learn how to optimize data movement. I tried the following variants:
    Matrices A, B, C are all 1024x1024
    GPU: C(i, j) per work-item, all global memory [1D Work Space] (1024*1024 work-items)
    GPU: C(i, j) per work-item, all global memory [2D Work Space]
    GPU: C row per work-item, all global memory [1D Work Space]
    GPU: C row per work-item, A private, B in global memory [1D Work Space]
    GPU: C row per work-item, A private, B in local memory [1D Work Space]

    The run times, in the order listed above, are 0.4308 s, 3.9784 s, 2.3082 s, 1.6315 s and 1.6561 s.
    I have already checked that all five variants give the correct results.
    I am using OpenCL 1.2 with the Catalyst 12.4 drivers on an AMD 3400M APU. The following is the kernel for the first variant.
    "__kernel \n"\
    "void matrixmultiply(__global float *A, \n"\
    " __global float *B, \n"\
    " __global float *C,int WidthA,int WidthB) \n"\
    "{ \n"\
    " \n"\
    " // Get the work-item's unique ID \n"\
    " int idx = get_global_id(0); \n"\
    " float sum=0; \n"\
    " int row; \n"\
    " int column; \n"\
    " row=idx/WidthB; \n"\
    " column=idx%WidthB; \n"\
    " // Accumulate the dot product of a row of \n"\
    " // 'A' and a column of 'B', and store the result in 'C'. \n"\
    " for(int i=0;i<WidthA;i++) \n"\
    " { \n"\
    " sum+= A[row*WidthA+i]*B[i*WidthB+column]; \n"\
    " } \n"\
    " C[idx]=sum; \n"\
    "} \n"\
    The kernel for the second variant:
    "__kernel \n"\
    "void matrixmultiply(__global float *A, \n"\
    " __global float *B, \n"\
    " __global float *C,int WidthA,int WidthB) \n"\
    "{ \n"\
    " \n"\
    " // Get the work-item's unique ID \n"\
    " float sum=0; \n"\
    " int row = get_global_id(0); \n"\
    " int column = get_global_id(1); \n"\
    " // Accumulate the dot product of a row of \n"\
    " // 'A' and a column of 'B', and store the result in 'C'. \n"\
    " for(int i=0;i<WidthA;i++) \n"\
    " { \n"\
    " sum+= A[row*WidthA+i]*B[i*WidthB+column]; \n"\
    " } \n"\
    " C[row*WidthB+column]=sum; \n"\
    "} \n"\

    My question is: why is the first variant the fastest when it accesses all of its data from global memory?

  2. #2
    Senior Member
    Join Date
    Aug 2011
    Posts
    271

    Re: Help required understanding some results(of matrix mult.

    It would be a lot easier to read if you de-C-stringed it.

    It's down to the fact that work-items in a 2D range are assigned in dimension order starting from dimension 0 (get_global_id(0) varies fastest between neighbouring work-items), i.e. you're using different addressing in the two cases.

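    The first case:
    - reads the same element of A in adjacent work-items (they share a row), so the A reads are not coalesced
    But:
    - reads B coalesced (adjacent work-items read adjacent values)
    - writes the result coalesced.

    In the second case adjacent work-items sit in different rows, so the reads of A and the writes to C are an almost worst-case access pattern: a largish power-of-2 stride (WidthA or WidthB elements), which causes bank conflicts, and nothing is coalesced.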
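
    As a rough check of the access patterns described above, here is a minimal C sketch (plain host-side arithmetic, not OpenCL) that computes how far apart two neighbouring work-items land in A, B and C for the two posted kernels. It assumes neighbours differ by 1 in get_global_id(0); the names Access, access_1d, access_2d and stride are made up for this illustration and do not come from the thread.

```c
#include <assert.h>

/* Element indices touched by one work-item at loop iteration i. */
typedef struct {
    long a; /* index into A */
    long b; /* index into B */
    long c; /* index into C (the final store) */
} Access;

/* Same index arithmetic as the kernels in the first post. */
Access access_at(int row, int column, int i, int widthA, int widthB)
{
    Access acc;
    acc.a = (long)row * widthA + i;
    acc.b = (long)i * widthB + column;
    acc.c = (long)row * widthB + column;
    return acc;
}

/* Case 1: 1D range, row and column derived from idx = get_global_id(0). */
Access access_1d(int idx, int i, int widthA, int widthB)
{
    assert(widthB > 0); /* guards the division below */
    return access_at(idx / widthB, idx % widthB, i, widthA, widthB);
}

/* Case 2: 2D range with row = get_global_id(0), column = get_global_id(1). */
Access access_2d(int gid0, int gid1, int i, int widthA, int widthB)
{
    return access_at(gid0, gid1, i, widthA, widthB);
}

/* Distance, in elements, between the accesses of two work-items. */
Access stride(Access lo, Access hi)
{
    Access d;
    d.a = hi.a - lo.a;
    d.b = hi.b - lo.b;
    d.c = hi.c - lo.c;
    return d;
}
```

    For 1024x1024 matrices, stride(access_1d(0, 0, 1024, 1024), access_1d(1, 0, 1024, 1024)) gives 0 for A and 1 for B and C, while stride(access_2d(0, 0, 0, 1024, 1024), access_2d(1, 0, 0, 1024, 1024)) gives 1024 for A and C and 0 for B.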

    So the code is different. If you set row=get_global_id(1), col=get_global_id(0) instead, the performance should be the same.
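
    To make the suggested swap concrete, here is a small CPU emulation in C of the 2D kernel body with the IDs exchanged. It is only a sketch: gid0 and gid1 stand in for get_global_id(0) and get_global_id(1), and matrixmultiply_item and run_range are hypothetical helper names, with run_range playing the role of the 2D range (dimension 0 as the inner loop, on the assumption that it varies fastest).

```c
#include <assert.h>

/* Body of the 2D kernel with the IDs swapped:
   get_global_id(0) -> column, get_global_id(1) -> row. */
void matrixmultiply_item(const float *A, const float *B, float *C,
                         int WidthA, int WidthB, int gid0, int gid1)
{
    int row = gid1;
    int column = gid0;
    float sum = 0.0f;
    for (int i = 0; i < WidthA; i++) {
        sum += A[row * WidthA + i] * B[i * WidthB + column];
    }
    /* Neighbours in dimension 0 now write adjacent elements of C. */
    C[row * WidthB + column] = sum;
}

/* Stand-in for the 2D range: one call per work-item.
   Dimension 0 spans the columns of C, dimension 1 spans its rows. */
void run_range(const float *A, const float *B, float *C,
               int HeightA, int WidthA, int WidthB)
{
    assert(HeightA > 0 && WidthA > 0 && WidthB > 0);
    for (int gid1 = 0; gid1 < HeightA; gid1++) {
        for (int gid0 = 0; gid0 < WidthB; gid0++) {
            matrixmultiply_item(A, B, C, WidthA, WidthB, gid0, gid1);
        }
    }
}
```

    With the swap, dimension 0 of the global size has to cover the columns of C and dimension 1 its rows. For the square 1024x1024 case in this thread both sizes are the same, so only the two get_global_id calls change.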

  3. #3
    Junior Member
    Join Date
    May 2012
    Posts
    2

    Re: Help required understanding some results(of matrix mult.

    Thank you for the quick reply. I tried the changes you mentioned and the times are now nearly equal. So the primary reason here is how the data is read and written.
