Ok, so lately I have been writing a simple rendering engine for iOS using OpenGL es 2.0 spec. To do anything moderately exciting requires adding back in the model, view and projection matricies.

One of the most common operations a graphics engine performs is 4x4 matrix multiplication, which in itself performed many times in a given frame is quite computationally expensive:

Code :
matrix[0]  = m1[0]*m2[0]  +  m1[1]*m2[4]  + m1[2]*m2[8]   + m1[3]*m2[12];
matrix[1]  = m1[0]*m2[1]  +  m1[1]*m2[5]  + m1[2]*m2[9]   + m1[3]*m2[13];
matrix[2]  = m1[0]*m2[2]  +  m1[1]*m2[6]  + m1[2]*m2[10]  + m1[3]*m2[14];

matrix[15] = m1[12]*m2[3] +  m1[13]*m2[7] + m1[14]*m2[11] + m1[15]*m2[15];

Now for the sake of optimisation on the CPU side, I could setup a matrix multiplication daemon with threading to split the load.

However, the shader language gives the inbuilt matrix primitive and operations.

Code :
uniform mat4 m_model;
uniform mat4 m_view;
uniform mat4 m_projection;
gl_Position = m_projection * m_view * m_model * v_position;

Now my question is, is the matrix multiplication as expressed in the shader language (and presumable executed on the GPU) optimised? Does the matrix multiplication happen serially or in parallel? Is it better to send a precomputed on the CPU model view projection matrix to the vertex shader or is what I am doing here ok?