Hi,

I recently started learning OpenCL. I began with a simple matrix multiplication example to see how much a GPU can reduce computation time, and to learn how to optimize data movement. I tried the following variants.

Matrices A, B and C are all 1024x1024.

1. GPU: one C(i, j) per work-item, all global memory [1D work space, 1024*1024 work-items]
2. GPU: one C(i, j) per work-item, all global memory [2D work space]
3. GPU: one C row per work-item, all global memory [1D work space]
4. GPU: one C row per work-item, A private, B in global memory [1D work space]
5. GPU: one C row per work-item, A private, B in local memory [1D work space]

The execution times, in the same order as the variants listed above, are 0.4308 s, 3.9784 s, 2.3082 s, 1.6315 s and 1.6561 s.

I have already checked that all five versions give the correct results.
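For anyone who wants to reproduce the check, here is a minimal sketch of a host-side comparison in plain C. It assumes row-major storage, as the kernels in this post do. The names `matmul_ref` and `first_mismatch` and the relative tolerance are illustrative assumptions, not the code actually used for the timings above.

```c
#include <assert.h>
#include <stddef.h>

/* Reference product on the CPU: C = A * B, all matrices row-major.
   A is heightA x widthA, B is widthA x widthB, C is heightA x widthB. */
void matmul_ref(const float *A, const float *B, float *C,
                int heightA, int widthA, int widthB)
{
    assert(heightA > 0 && widthA > 0 && widthB > 0);
    for (int row = 0; row < heightA; row++) {
        for (int col = 0; col < widthB; col++) {
            float sum = 0.0f;
            for (int i = 0; i < widthA; i++)
                sum += A[row * widthA + i] * B[i * widthB + col];
            C[row * widthB + col] = sum;
        }
    }
}

/* Returns the index of the first element where |gpu - ref| exceeds
   tol * max(1, |ref|), or -1 if all n elements agree.
   The tolerance scheme is an assumption chosen for illustration. */
long first_mismatch(const float *gpu, const float *ref, size_t n, float tol)
{
    for (size_t k = 0; k < n; k++) {
        float diff = gpu[k] - ref[k];
        if (diff < 0.0f) diff = -diff;
        float scale = ref[k] < 0.0f ? -ref[k] : ref[k];
        if (scale < 1.0f) scale = 1.0f;
        if (diff > tol * scale)
            return (long)k;
    }
    return -1;
}
```

A relative tolerance is used rather than exact equality because the GPU may accumulate the sums in a different order or with fused multiply-add, so small rounding differences are expected.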

I am using OpenCL 1.2 with Catalyst 12.4 drivers on an AMD 3400M APU. The kernel for the first version is:

"__kernel \n"\
"void matrixmultiply(__global float *A, \n"\
"                    __global float *B, \n"\
"                    __global float *C, int WidthA, int WidthB) \n"\
"{ \n"\
"    // Get the work-item's unique ID \n"\
"    int idx = get_global_id(0); \n"\
"    float sum = 0; \n"\
"    // Map the flat ID to a (row, column) position in 'C' \n"\
"    int row = idx / WidthB; \n"\
"    int column = idx % WidthB; \n"\
"    // Dot product of row 'row' of 'A' and column 'column' of 'B', \n"\
"    // stored in 'C' \n"\
"    for (int i = 0; i < WidthA; i++) \n"\
"    { \n"\
"        sum += A[row * WidthA + i] * B[i * WidthB + column]; \n"\
"    } \n"\
"    C[idx] = sum; \n"\
"} \n"\

The kernel for the second version (2D work space) is:

"__kernel \n"\
"void matrixmultiply(__global float *A, \n"\
"                    __global float *B, \n"\
"                    __global float *C, int WidthA, int WidthB) \n"\
"{ \n"\
"    float sum = 0; \n"\
"    // Get the work-item's 2D ID: dimension 0 is the row, dimension 1 the column \n"\
"    int row = get_global_id(0); \n"\
"    int column = get_global_id(1); \n"\
"    // Dot product of row 'row' of 'A' and column 'column' of 'B', \n"\
"    // stored in 'C' \n"\
"    for (int i = 0; i < WidthA; i++) \n"\
"    { \n"\
"        sum += A[row * WidthA + i] * B[i * WidthB + column]; \n"\
"    } \n"\
"    C[row * WidthB + column] = sum; \n"\
"} \n"\

My question is: why is the first version the fastest, even though it reads all of its data from global memory?
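To make the comparison concrete, the index arithmetic of the two kernels above can be mirrored in plain C to see which elements two neighbouring work-items in dimension 0 touch. This is only a sketch: `struct access`, `access_1d` and `access_2d` are hypothetical helper names, and whether the stride difference it shows actually accounts for the timings is an assumption that has not been measured here.

```c
#include <assert.h>

/* Flat indices into A, B and C touched by one work-item at loop step i. */
struct access { int a; int b; int c; };

/* First kernel: 1D range, idx stands in for get_global_id(0). */
struct access access_1d(int idx, int i, int widthA, int widthB)
{
    assert(widthA > 0 && widthB > 0);
    int row = idx / widthB;
    int column = idx % widthB;
    struct access r = { row * widthA + i, i * widthB + column, idx };
    return r;
}

/* Second kernel: 2D range, gid0 and gid1 stand in for
   get_global_id(0) and get_global_id(1). */
struct access access_2d(int gid0, int gid1, int i, int widthA, int widthB)
{
    assert(widthA > 0 && widthB > 0);
    int row = gid0;
    int column = gid1;
    struct access r = { row * widthA + i, i * widthB + column,
                        row * widthB + column };
    return r;
}
```

With 1024x1024 matrices, stepping `idx` from 0 to 1 in the first kernel moves the B and C indices by 1 and leaves the A index unchanged. Stepping `get_global_id(0)` from 0 to 1 in the second kernel moves the A index by 1024 and the C index by 1024, and leaves the B index unchanged. So in the second kernel, work-items that are adjacent in dimension 0 read and write addresses a full row apart.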