This is not a real-world project; it is a simple starter question about the OpenCL memory model.

(I know the basics of running a simple kernel with OpenCL.)

I want to multiply two really big matrices:

(30000 x 30000) x (30000 x 30000)
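To see why this size matters, here is a quick back-of-the-envelope footprint calculation (assuming `float` elements; with `double` everything doubles):

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store an n x n matrix of float, row-major.
std::uint64_t matrixBytes(std::uint64_t n) {
    return n * n * sizeof(float);
}

// One 30000 x 30000 float matrix: 9e8 elements * 4 bytes = 3.6 GB.
// A, B and the result C together need ~10.8 GB, which is more than
// the global memory of most GPUs, so the data must be streamed in pieces.
```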

SINGLE-THREADED CPU:

The matrices don't fit in physical RAM, but the multiplication is still transparent to the C++ code because of a big swap file: the OS pages the data in and out behind the scenes. The speed is of course very low.
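For reference, this is the naive single-threaded version I mean, shown at an arbitrary size `n` (nothing OpenCL-specific; when `n` is huge the vectors simply spill into swap and every walk down a column of B thrashes the page cache):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive O(n^3) multiply: C = A * B, all matrices n x n, row-major.
// Virtual memory makes this "just work" even when the data exceeds RAM;
// the OS pages it transparently, at a huge cost in speed.
std::vector<float> multiply(const std::vector<float>& A,
                            const std::vector<float>& B, int n) {
    std::vector<float> C(static_cast<std::size_t>(n) * n, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    return C;
}
```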

OPENCL:

What approach is used here? Is it transparent, or must I slice the matrices myself?

My idea is something like this:

- Move the first row-vector of matrix A to the GPU.

- Move N column-vectors of matrix B to the GPU.

- Compute the first N elements of the first row of matrix C.

- Move the next N column-vectors of matrix B to the GPU.

- Compute the next N elements of the first row of matrix C.

- .......

- Move the next row-vector of matrix A to the GPU.

- .......

Are all of these steps needed, or am I missing some transparent mechanism?

The problem with the scheme above: you don't know in advance which device the kernel will execute on, so you don't know whether that device has enough available memory.
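One common answer to that problem is to size the block from the device's reported memory instead of hard-coding it. In real OpenCL code the budget would come from `clGetDeviceInfo` with `CL_DEVICE_GLOBAL_MEM_SIZE` (and `CL_DEVICE_MAX_MEM_ALLOC_SIZE` for a single buffer); the helper below is hypothetical and just does the arithmetic once you have that number:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: given a device memory budget in bytes (in real
// code, queried via clGetDeviceInfo / CL_DEVICE_GLOBAL_MEM_SIZE), compute
// how many columns of B fit alongside one row of A and the matching
// strip of C. One B column costs n floats; each column also produces
// one C element; the A row is charged once up front.
int columnsThatFit(std::uint64_t budgetBytes, std::uint64_t n) {
    const std::uint64_t rowBytes = n * sizeof(float);          // one row of A
    const std::uint64_t perColumn = n * sizeof(float)          // one column of B
                                  + sizeof(float);             // one element of C
    if (budgetBytes <= rowBytes) return 0;
    std::uint64_t cols = (budgetBytes - rowBytes) / perColumn;
    return static_cast<int>(std::min<std::uint64_t>(cols, n)); // never exceed n
}
```

With this, each block transfer is sized to what the chosen device can actually hold, whatever device the runtime picks.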

I am thoroughly confused!