OpenGL ES 2.0, shader language, matrices and optimisation question

BunnyKillBot · 22 Apr 2011 at 10:02

OK, so lately I have been writing a simple rendering engine for iOS using OpenGL ES. As of ES 2.0, the fixed-function calls (the matrix stack functions such as glMatrixMode and glLoadIdentity among them) have been removed in favour of the programmable shader approach, so doing anything moderately exciting means adding the model, view and projection matrices back in yourself.

Now, one of the most common operations a graphics engine performs is matrix multiplication, which gets quite computationally expensive when it is performed many times per frame:

Code:

matrix[0]  = m1[0]*m2[0]  +  m1[1]*m2[4]  + m1[2]*m2[8]   + m1[3]*m2[12];
matrix[1]  = m1[0]*m2[1]  +  m1[1]*m2[5]  + m1[2]*m2[9]   + m1[3]*m2[13];
matrix[2]  = m1[0]*m2[2]  +  m1[1]*m2[6]  + m1[2]*m2[10]  + m1[3]*m2[14];
…

matrix[15] = m1[12]*m2[3] +  m1[13]*m2[7] + m1[14]*m2[11] + m1[15]*m2[15];
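
For reference, the unrolled lines above follow the usual pattern for flat, row-major 4x4 arrays, so the same product can be sketched as a triple loop. This is only an illustration: the name mat4_mul and the row-major assumption are inferred from the index pattern shown, not taken from any OpenGL ES API.

```c
#include <stddef.h>

/* Hypothetical helper: out = a * b for 4x4 matrices stored as flat,
   row-major arrays of 16 floats. `out` must not alias `a` or `b`. */
void mat4_mul(float out[16], const float a[16], const float b[16])
{
    for (size_t row = 0; row < 4; ++row) {
        for (size_t col = 0; col < 4; ++col) {
            float sum = 0.0f;
            for (size_t k = 0; k < 4; ++k) {
                /* a[row][k] * b[k][col] in flat indexing */
                sum += a[row * 4 + k] * b[k * 4 + col];
            }
            out[row * 4 + col] = sum;
        }
    }
}
```

Each of the 16 output elements costs 4 multiplies, so one product is 64 multiplies in total.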

Now, for the sake of optimisation on the CPU side, I could set up a matrix multiplication daemon with threading to split the load.

However, the OpenGL ES 2.0 shading language has built-in matrix types and operations.

Code:

uniform mat4 m_model;
uniform mat4 m_view;
uniform mat4 m_projection;
attribute vec4 v_position; // per-vertex input position

void main() {
    gl_Position = m_projection * m_view * m_model * v_position;
}

Now my question is: is the matrix multiplication as expressed in the shader language (and presumably executed on the GPU) optimised? Does the matrix multiplication happen serially or in parallel? Is it better to precompute the model-view-projection matrix on the CPU and send it to the vertex shader, or is what I am doing here OK?

Alex74 · 22 Apr 2011 at 16:04

I've never used OpenGL shaders, but if it's the same principle as DirectX with HLSL, then I would imagine the operations within a single shader invocation are done serially; anything else wouldn't make much sense when the SIMD hardware already runs the shader on many vertices in parallel. I would do as much of the common processing as possible on the CPU, so the shaders don't repeat the same work for every vertex, and then let the shaders do the vertex-specific calculations to finalise the output.
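
A minimal C sketch of the "do the common work on the CPU" idea, assuming the flat row-major layout from the first post. The helpers mat4_mul, mat4_mul_vec4 and build_mvp are hypothetical names for illustration, not OpenGL ES functions. The point is that projection * view * model is the same for every vertex of an object, so it can be formed once per draw call. If the shader compiler does not fold the uniform-only products itself, the expression in the original shader would cost two matrix-matrix products plus one matrix-vector product per vertex (64 + 64 + 16 multiplies), against 16 with a precomputed matrix.

```c
#include <stddef.h>

/* out = a * b for row-major 4x4 matrices; out must not alias a or b. */
void mat4_mul(float out[16], const float a[16], const float b[16])
{
    for (size_t row = 0; row < 4; ++row) {
        for (size_t col = 0; col < 4; ++col) {
            float sum = 0.0f;
            for (size_t k = 0; k < 4; ++k) {
                sum += a[row * 4 + k] * b[k * 4 + col];
            }
            out[row * 4 + col] = sum;
        }
    }
}

/* out = m * v, treating v as a column vector; out must not alias v. */
void mat4_mul_vec4(float out[4], const float m[16], const float v[4])
{
    for (size_t row = 0; row < 4; ++row) {
        out[row] = m[row * 4 + 0] * v[0] + m[row * 4 + 1] * v[1]
                 + m[row * 4 + 2] * v[2] + m[row * 4 + 3] * v[3];
    }
}

/* Build mvp = projection * view * model once per object, on the CPU. */
void build_mvp(float mvp[16], const float projection[16],
               const float view[16], const float model[16])
{
    float view_model[16];
    mat4_mul(view_model, view, model);
    mat4_mul(mvp, projection, view_model);
}
```

The result would then be uploaded as a single mat4 uniform with glUniformMatrix4fv. Note that OpenGL ES 2.0 expects column-major data and requires the transpose argument to be GL_FALSE, so a row-major array like the one assumed here would need transposing before upload.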

With regard to your matrix multiply: if you're dealing with 16*16 matrices, then you may want to consider a matrix multiplication algorithm more efficient than O(n^3). Built-in or library functions for matrix multiplication may offer better performance; I'm not sure what iOS or OpenGL ES 2.0 offers on that front.

Hope this helps a bit, but I'm no expert on OpenGL, so feel free to completely ignore this post if it's no help.

BunnyKillBot · 22 Apr 2011 at 16:36

Nah, that's great, thanks.

Can you point me towards any better algorithms? The Strassen algorithm is only marginally better at O(n^2.807). Coppersmith-Winograd gets to O(n^2.376), but I'm having a tough time tracking down sources about it that aren't unreadably mathematical.
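
For a feel of where the O(n^2.807) figure comes from, here is a small C sketch of Strassen's base case: a 2x2 product using 7 multiplications instead of the usual 8, at the price of 18 additions and subtractions. Applying the same trick recursively to 2x2 blocks gives O(n^log2(7)). The function name strassen_2x2 and the row-major layout are assumptions for illustration; at 4x4 the extra additions probably outweigh the saved multiplies, so treat this as a curiosity rather than a recommendation.

```c
/* Hypothetical helper: c = a * b for 2x2 matrices using Strassen's
   seven products. Layout is row-major: {m11, m12, m21, m22}. */
void strassen_2x2(float c[4], const float a[4], const float b[4])
{
    const float m1 = (a[0] + a[3]) * (b[0] + b[3]);
    const float m2 = (a[2] + a[3]) * b[0];
    const float m3 = a[0] * (b[1] - b[3]);
    const float m4 = a[3] * (b[2] - b[0]);
    const float m5 = (a[0] + a[1]) * b[3];
    const float m6 = (a[2] - a[0]) * (b[0] + b[1]);
    const float m7 = (a[1] - a[3]) * (b[2] + b[3]);

    c[0] = m1 + m4 - m5 + m7; /* a11*b11 + a12*b21 */
    c[1] = m3 + m5;           /* a11*b12 + a12*b22 */
    c[2] = m2 + m4;           /* a21*b11 + a22*b21 */
    c[3] = m1 - m2 + m3 + m6; /* a21*b12 + a22*b22 */
}
```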

But aye, that's the thing: 4x4 matrix multiplication is so common in a 3D graphics engine that I can't see why it wouldn't be done in parallel, calculating each component on a different unit. It's one of those 'would benefit massively from parallelisation' things, executing all 16 calculations in one step.

Alex74 · 22 Apr 2011 at 23:19

It'll be tough to find algorithms that aren't knee-deep in formulae, I think. I had quite a nice one in a book just a couple of weeks ago but had to give it back to the lecturer -.-. I'll see if I can track the book down again. To be honest though, I was a bit stupid skim-reading the code you posted and thinking it was 16*16 and not just the 16 elements of a 4*4 matrix. You'll probably be fine with what you have, to be honest. Does OpenGL ES not have built-in matrix objects and functions that you can use?

As far as I see it, the parallelisation of matrix multiplication in graphics comes in when we're applying a matrix to each vertex's position vector. If we had 10,000 vertices, it makes more sense to do each full matrix multiplication on a single processing unit and parallelise across the vertices, rather than worry about splitting up the work within the shader function itself.
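
The per-vertex view of the work can be pictured on the CPU as a plain loop, sketched below in C. The name transform_vertices and the row-major layout are assumptions for illustration only. Each iteration reads one vertex and writes one result without touching any other, which is the property that lets hardware run many iterations at once while each individual matrix-vector product stays serial.

```c
#include <stddef.h>

/* Hypothetical helper: apply one row-major 4x4 matrix to `count`
   vertices of 4 floats each (x, y, z, w). `out` must not alias `in`.
   Iterations are independent, so they could run in any order. */
void transform_vertices(float *out, const float *in, size_t count,
                        const float mvp[16])
{
    for (size_t i = 0; i < count; ++i) {
        const float *v = in + i * 4;
        float *r = out + i * 4;
        for (size_t row = 0; row < 4; ++row) {
            r[row] = mvp[row * 4 + 0] * v[0] + mvp[row * 4 + 1] * v[1]
                   + mvp[row * 4 + 2] * v[2] + mvp[row * 4 + 3] * v[3];
        }
    }
}
```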

I agree though: on its own, it certainly lends itself to thread-based optimisation for normal CPU processing.