Hi,
I have recently started learning OpenCL. I started with a simple matrix multiplication example to see how much a GPU can reduce computation time, and also to learn how to optimize data movement. The matrices A, B and C are all 1024x1024. I tried the following variants:
1. GPU: C(i, j) per work-item, all global memory [1D work space, 1024*1024 work-items]
2. GPU: C(i, j) per work-item, all global memory [2D work space]
3. GPU: one row of C per work-item, all global memory [1D work space]
4. GPU: one row of C per work-item, A in private memory, B in global memory [1D work space]
5. GPU: one row of C per work-item, A in private memory, B in local memory [1D work space]

The results, in the order listed above, are: 0.4308 s, 3.9784 s, 2.3082 s, 1.6315 s, 1.6561 s.
I have already checked, and all of them give the correct results.
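For reference, a check like this can be sketched in plain C on the host. This is only a minimal sketch: the function names `matmul_ref` and `count_mismatches` and the relative tolerance are assumptions made for illustration, not the exact code used for the timings above.

```c
/* Hypothetical helper: absolute value without needing <math.h>. */
static float absf(float x)
{
    return x < 0.0f ? -x : x;
}

/* CPU reference: C = A * B, all row-major.
   A is heightA x widthA, B is widthA x widthB, C is heightA x widthB. */
static void matmul_ref(const float *A, const float *B, float *C,
                       int heightA, int widthA, int widthB)
{
    for (int row = 0; row < heightA; row++) {
        for (int column = 0; column < widthB; column++) {
            float sum = 0.0f;
            for (int i = 0; i < widthA; i++)
                sum += A[row * widthA + i] * B[i * widthB + column];
            C[row * widthB + column] = sum;
        }
    }
}

/* Count the elements of 'got' that differ from 'ref' by more than
   tol, relative to max(1, |ref|). Returns 0 when the buffers agree. */
static int count_mismatches(const float *ref, const float *got,
                            int n, float tol)
{
    int bad = 0;
    for (int k = 0; k < n; k++) {
        float scale = absf(ref[k]) < 1.0f ? 1.0f : absf(ref[k]);
        if (absf(ref[k] - got[k]) > tol * scale)
            bad++;
    }
    return bad;
}
```

After reading the GPU result back into a host buffer, calling `count_mismatches` with the reference buffer, the GPU buffer and the element count (1024*1024 here) should return 0 for each variant. A tolerance is used rather than exact equality because the GPU may round or order the floating-point sums differently from the CPU.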
I am using OpenCL 1.2 with the Catalyst 12.4 drivers on an AMD 3400M APU. The following is the kernel for the first variant:
"__kernel \n"\
"void matrixmultiply(__global float *A, \n"\
" __global float *B, \n"\
" __global float *C,int WidthA,int WidthB) \n"\
"{ \n"\
" \n"\
" // Get the work-item's unique ID \n"\
" int idx = get_global_id(0); \n"\
" float sum=0; \n"\
" int row; \n"\
" int column; \n"\
" row=idx/WidthB; \n"\
" column=idx%WidthB; \n"\
" // Dot product of one row of 'A' with one \n"\
" // column of 'B'; the result is stored in 'C'. \n"\
" for(int i=0;i<WidthA;i++) \n"\
" { \n"\
" sum+= A[row*WidthA+i]*B[i*WidthB+column]; \n"\
" } \n"\
" C[idx]=sum; \n"\
"} \n"\
The kernel for the second variant:
"__kernel \n"\
"void matrixmultiply(__global float *A, \n"\
" __global float *B, \n"\
" __global float *C,int WidthA,int WidthB) \n"\
"{ \n"\
" \n"\
" float sum=0; \n"\
" // Get the work-item's 2D ID (row, column) \n"\
" int row = get_global_id(0); \n"\
" int column = get_global_id(1); \n"\
" // Dot product of one row of 'A' with one \n"\
" // column of 'B'; the result is stored in 'C'. \n"\
" for(int i=0;i<WidthA;i++) \n"\
" { \n"\
" sum+= A[row*WidthA+i]*B[i*WidthB+column]; \n"\
" } \n"\
" C[row*WidthB+column]=sum; \n"\
"} \n"\

My question is: why is the first variant the fastest when it accesses all of its data from global memory?
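To make the two index mappings easier to compare, here is a small plain C sketch (host only, no OpenCL needed). It computes how far apart, in elements, the accesses of two work-items that are neighbours in dimension 0 are. The names `Offsets`, `offsets_1d`, `offsets_2d` and the stride helpers are made up for illustration. The sketch also assumes that work-items adjacent in dimension 0 are the ones that run together on the hardware, which I have not verified on this GPU, so whether this explains the timings is an open question.

```c
#include <stdio.h>

/* Element offsets touched by one work-item at loop iteration i. */
typedef struct {
    int a; /* offset into A */
    int b; /* offset into B */
    int c; /* offset into C (where the result is written) */
} Offsets;

/* First kernel: 1D work space, idx = get_global_id(0). */
static Offsets offsets_1d(int idx, int i, int WidthA, int WidthB)
{
    int row = idx / WidthB;
    int column = idx % WidthB;
    Offsets o = { row * WidthA + i, i * WidthB + column, idx };
    return o;
}

/* Second kernel: 2D work space,
   row = get_global_id(0), column = get_global_id(1). */
static Offsets offsets_2d(int gid0, int gid1, int i, int WidthA, int WidthB)
{
    Offsets o = { gid0 * WidthA + i, i * WidthB + gid1, gid0 * WidthB + gid1 };
    return o;
}

/* Distance between work-items idx = 0 and idx = 1 (same row of C). */
static Offsets neighbour_stride_1d(int WidthA, int WidthB)
{
    Offsets p = offsets_1d(0, 0, WidthA, WidthB);
    Offsets q = offsets_1d(1, 0, WidthA, WidthB);
    Offsets d = { q.a - p.a, q.b - p.b, q.c - p.c };
    return d;
}

/* Distance between work-items (0, 0) and (1, 0), neighbours in dim 0. */
static Offsets neighbour_stride_2d(int WidthA, int WidthB)
{
    Offsets p = offsets_2d(0, 0, 0, WidthA, WidthB);
    Offsets q = offsets_2d(1, 0, 0, WidthA, WidthB);
    Offsets d = { q.a - p.a, q.b - p.b, q.c - p.c };
    return d;
}

/* Print both sets of strides for the given matrix widths. */
static void print_neighbour_strides(int WidthA, int WidthB)
{
    Offsets d1 = neighbour_stride_1d(WidthA, WidthB);
    Offsets d2 = neighbour_stride_2d(WidthA, WidthB);
    printf("1D kernel: A %d, B %d, C %d\n", d1.a, d1.b, d1.c);
    printf("2D kernel: A %d, B %d, C %d\n", d2.a, d2.b, d2.c);
}
```

Calling `print_neighbour_strides(1024, 1024)` shows that in the first kernel neighbouring work-items read the same element of A and adjacent elements of B and C. In the second kernel they read the same element of B, but their accesses to A and C are 1024 elements apart. Swapping the roles of `get_global_id(0)` and `get_global_id(1)` in the second kernel would give it the same pattern as the first, which might be worth timing.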