Help required understanding some results (of matrix mult.)

Hi,

I have recently started learning OpenCL. I started with a simple matrix multiplication example to see how much a GPU can reduce computation time, and to learn how to optimize data movement. I tried the following variants.

Matrices A, B and C are all 1024x1024.

1. GPU: C(i, j) per work-item, all global memory [1D work space] (1024*1024 work-items)

2. GPU: C(i, j) per work-item, all global memory [2D work space]

3. GPU: C row per work-item, all global memory [1D work space]

4. GPU: C row per work-item, A private, B in global memory [1D work space]

5. GPU: C row per work-item, A private, B in local memory [1D work space]

The results, in the same order, are 0.4308 s, 3.9784 s, 2.3082 s, 1.6315 s and 1.6561 s.

I have already checked, and all of them give the correct results.
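The post does not say how the results were checked. One common approach, sketched here purely as an assumption, is to compute the product on the CPU and compare it with the buffer read back from the GPU using a small tolerance, since the two may round differently. The names matmul_reference and count_mismatches are hypothetical and not from the original code.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for C = A * B, all matrices row-major.
   A is heightA x widthA, B is widthA x widthB, C is heightA x widthB. */
void matmul_reference(const float *A, const float *B, float *C,
                      int heightA, int widthA, int widthB)
{
    assert(A && B && C);
    for (int row = 0; row < heightA; row++) {
        for (int column = 0; column < widthB; column++) {
            float sum = 0.0f;
            for (int i = 0; i < widthA; i++)
                sum += A[row * widthA + i] * B[i * widthB + column];
            C[row * widthB + column] = sum;
        }
    }
}

/* Counts elements of got that differ from expected by more than rel_tol,
   measured relative to max(1, |expected|). */
size_t count_mismatches(const float *expected, const float *got,
                        size_t n, float rel_tol)
{
    size_t bad = 0;
    for (size_t k = 0; k < n; k++) {
        float diff  = expected[k] - got[k];
        float scale = expected[k] < 0.0f ? -expected[k] : expected[k];
        if (diff < 0.0f)  diff = -diff;
        if (scale < 1.0f) scale = 1.0f;
        if (diff > rel_tol * scale)
            bad++;
    }
    return bad;
}
```

A kernel would count as correct when count_mismatches returns 0 for the buffer read back from the device; the tolerance to use depends on the input data and is left to the caller.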

I am using OpenCL 1.2 with the Catalyst 12.4 drivers on an AMD 3400M APU. The kernel for the first variant is:

"__kernel \n"\
"void matrixmultiply(__global float *A, \n"\
"                    __global float *B, \n"\
"                    __global float *C, int WidthA, int WidthB) \n"\
"{ \n"\
"    // Get the work-item's unique ID \n"\
"    int idx = get_global_id(0); \n"\
"    float sum = 0; \n"\
"    // Split the 1D ID into a row and a column of C \n"\
"    int row = idx / WidthB; \n"\
"    int column = idx % WidthB; \n"\
"    // Dot product of one row of 'A' with one column of 'B', \n"\
"    // stored in 'C'. \n"\
"    for (int i = 0; i < WidthA; i++) \n"\
"    { \n"\
"        sum += A[row*WidthA+i] * B[i*WidthB+column]; \n"\
"    } \n"\
"    C[idx] = sum; \n"\
"} \n"\

The kernel for the second variant is:

"__kernel \n"\
"void matrixmultiply(__global float *A, \n"\
"                    __global float *B, \n"\
"                    __global float *C, int WidthA, int WidthB) \n"\
"{ \n"\
"    float sum = 0; \n"\
"    // Get the work-item's 2D ID \n"\
"    int row = get_global_id(0); \n"\
"    int column = get_global_id(1); \n"\
"    // Dot product of one row of 'A' with one column of 'B', \n"\
"    // stored in 'C'. \n"\
"    for (int i = 0; i < WidthA; i++) \n"\
"    { \n"\
"        sum += A[row*WidthA+i] * B[i*WidthB+column]; \n"\
"    } \n"\
"    C[row*WidthB+column] = sum; \n"\
"} \n"\

My question is: why is the first variant the fastest when it accesses all of its data from global memory?

Re: Help required understanding some results(of matrix mult.

It would be a lot easier to read if you de-C-stringed it.

It's down to the fact that 2D work-items are assigned in dimension order starting from 0, i.e. dimension 0 varies fastest between neighbouring work-items, so you're using different addressing in the two cases.

The first case:

- reads the same element of A in adjacent work-items (they share a row), so there is no stride between neighbours

And:

- reads B coalesced (adjacent work-items read adjacent values)

- writes the result coalesced.

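In the second case the reads of A and the writes to C are close to the worst-case access pattern: adjacent work-items are a full row (1024 floats) apart, a largish power-of-2 stride which causes bank conflicts, and none of it is coalesced. Only the reads of B hit the same element in adjacent work-items.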

So the two kernels are not equivalent. If you set row = get_global_id(1) and column = get_global_id(0) instead, the performance should be the same.
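To make the difference concrete, here is a small host-side sketch in plain C (not OpenCL) that evaluates the kernels' index expressions for two neighbouring work-items and reports how far apart their accesses are. It assumes, as described above, that neighbouring work-items differ by 1 in dimension 0. The struct and function names are hypothetical and only mirror the expressions in the posted kernels.

```c
#include <assert.h>
#include <stdlib.h>

/* Element offsets into A, B and C touched by one work-item at iteration i. */
struct access { long a, b, c; };

struct access addresses(int row, int column, int i, int widthA, int widthB)
{
    struct access x;
    x.a = (long)row * widthA + i;       /* A[row*WidthA+i]      */
    x.b = (long)i * widthB + column;    /* B[i*WidthB+column]   */
    x.c = (long)row * widthB + column;  /* C[row*WidthB+column] */
    return x;
}

/* Distance, in floats, between what two neighbouring work-items touch. */
struct access stride_between(struct access p, struct access q)
{
    struct access s = { labs(q.a - p.a), labs(q.b - p.b), labs(q.c - p.c) };
    return s;
}

/* First kernel: work-items idx and idx + 1 of the 1D index space.
   The result differs only where idx + 1 wraps to the next row. */
struct access stride_1d(int idx, int widthA, int widthB)
{
    assert(widthB > 0);
    struct access p = addresses(idx / widthB, idx % widthB, 0, widthA, widthB);
    struct access q = addresses((idx + 1) / widthB, (idx + 1) % widthB, 0,
                                widthA, widthB);
    return stride_between(p, q);
}

/* Second kernel as posted: row = get_global_id(0), column = get_global_id(1).
   Neighbours in dimension 0 are therefore in adjacent rows. */
struct access stride_2d_posted(int widthA, int widthB)
{
    return stride_between(addresses(0, 0, 0, widthA, widthB),
                          addresses(1, 0, 0, widthA, widthB));
}

/* Suggested change: row = get_global_id(1), column = get_global_id(0).
   Neighbours in dimension 0 are now in adjacent columns. */
struct access stride_2d_swapped(int widthA, int widthB)
{
    return stride_between(addresses(0, 0, 0, widthA, widthB),
                          addresses(0, 1, 0, widthA, widthB));
}
```

With WidthA = WidthB = 1024, stride_1d(0, 1024, 1024) and stride_2d_swapped(1024, 1024) both give (a, b, c) = (0, 1, 1), while stride_2d_posted(1024, 1024) gives (1024, 0, 1024). In other words, the swapped 2D kernel addresses memory the same way as the 1D kernel, and the posted 2D kernel places neighbouring accesses to A and C a full row apart.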

Re: Help required understanding some results(of matrix mult.

Thank you for the quick reply. I tried the changes you mentioned and the times are now nearly equal. So the primary reason here is how the data is read and written.