Hello,

The following kernel multiplies a matrix by a vector. It is taken from the book "OpenCL in Action":

__kernel void matvec_mult(__global float4* matrix,
                          __global float4* vector,
                          __global float* result) {

    int i = get_global_id(0);
    result[i] = dot(matrix[i], vector[0]);
}

How does the GPU know that the 'i' returned from get_global_id(0) refers to a row and not a column?
How does it know that each row has 4 elements?
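
To make the question concrete, here is how I currently read the kernel, written as a serial sketch in plain C. The names float4_t, dot4, matvec_mult_item and matvec_mult_serial are my own stand-ins and are not part of OpenCL. I am also assuming that a float4 is simply 4 consecutive floats, so that matrix[i] is the i-th group of 4 floats in the buffer.

```c
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4: four consecutive floats. */
typedef struct { float x, y, z, w; } float4_t;

/* Same arithmetic as OpenCL's dot() applied to two float4 values. */
static float dot4(float4_t a, float4_t b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Body of matvec_mult for a single work-item.
   gid plays the role of get_global_id(0). */
static void matvec_mult_item(size_t gid, const float4_t *matrix,
                             const float4_t *vector, float *result) {
    result[gid] = dot4(matrix[gid], vector[0]);
}

/* Serial emulation of a 1-D NDRange: one call per work-item.
   On a real device the iterations may run in parallel. */
static void matvec_mult_serial(size_t global_size, const float4_t *matrix,
                               const float4_t *vector, float *result) {
    for (size_t gid = 0; gid < global_size; ++gid)
        matvec_mult_item(gid, matrix, vector, result);
}
```

If this reading is right, the device never sees "rows" at all. The float4 type fixes the group size at 4, and the global work size fixes how many groups are processed. Calling matvec_mult_serial(4, m, &v, r) would then correspond to the launch with a global size of 4.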


The host calls:

/* Enqueue the command queue to the device */
work_units_per_kernel = 4; /* 4 work-units per kernel */
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_units_per_kernel,
                             NULL, 0, NULL, NULL);

Does this mean that 4 cores (4 work units) are used to compute the result?

mat_buff = clCreateBuffer(context, CL_MEM_READ_ONLY |
                          CL_MEM_COPY_HOST_PTR, sizeof(float)*16, mat, &err);

This is how the host creates the input matrix buffer. The matrix has 16 elements in total, but what tells the GPU the number of rows and columns?
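
My guess is that the buffer carries no shape information, and that the order in which the host fills the 16 floats decides what each group of 4 means. The sketch below, in plain C, tries to illustrate that guess. The helper names row_dot, pack_row_major and pack_col_major are hypothetical, and row_dot only imitates what I assume dot(matrix[i], vector[0]) does on a packed buffer.

```c
#include <stddef.h>

enum { N = 4 };

/* Dot product of the 4 consecutive floats starting at flat[N * i] with vec.
   This imitates the kernel's matrix[i] access on a flat buffer. */
static float row_dot(const float *flat, size_t i, const float *vec) {
    float sum = 0.0f;
    for (size_t k = 0; k < N; ++k)
        sum += flat[N * i + k] * vec[k];
    return sum;
}

/* Host fills the flat array row by row: group i holds row i of a. */
static void pack_row_major(const float a[N][N], float *flat) {
    for (size_t r = 0; r < N; ++r)
        for (size_t c = 0; c < N; ++c)
            flat[N * r + c] = a[r][c];
}

/* Host fills the flat array column by column: group i holds column i of a. */
static void pack_col_major(const float a[N][N], float *flat) {
    for (size_t r = 0; r < N; ++r)
        for (size_t c = 0; c < N; ++c)
            flat[N * c + r] = a[r][c];
}
```

With the row-major fill, row_dot over i = 0..3 gives the product of a with the vector. With the column-major fill, the same code gives the product of the transpose of a with the vector. So, if I understand correctly, the kernel just takes groups of 4 floats, and it is the host's layout that makes them rows.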

Thanks,
Zvika