This is not a real-world problem; it is a simple starter question about the OpenCL memory model.
(I know the basics well enough to run a simple kernel with OpenCL.)
I want to multiply two really big matrices:
30000x30000 x 30000x30000
ONE-THREAD CPU:
The matrices don't fit in physical RAM, but the multiplication can still be transparent to the C++ code thanks to a big swap file. The speed is very low, of course.
What approach is used on the GPU? Is it transparent, or must I slice the matrices myself?
My idea is this:
- Move the first row vector of matrix A to the GPU.
- Move N column vectors of matrix B to the GPU.
- Compute the first N elements of the first row of matrix C.
- Move the next N column vectors of matrix B to the GPU.
- Compute the next N elements of the first row of matrix C.
- ...and when the row is done, move the next row vector of matrix A to the GPU and repeat.
Are all of these steps needed, or am I missing some transparent mechanism?
The problem with the approach above: you don't know in advance on which device the kernel will execute, so you don't know whether that device has enough available memory.
I am in a mess!