A lot of conventions as to how we do things are arbitrary. For example, driving on the left side of the road is not inherently better or worse than driving on the right side. But if users of the same road don’t all agree to drive on the same side, then the result is grief.
The same is true of computer graphics. For example, coordinate systems can be right-handed or left-handed. Imagine placing your eye at the (0, 0, 0) point and looking in turn in the direction of the positive-X, positive-Y and positive-Z axes: if your gaze describes a clockwise rotation, then the coordinate system is right-handed, while anticlockwise means it is left-handed. One way to remember which is which is to hold out your thumb, index finger and middle finger at right angles to each other, with the thumb pointing along positive-X, the index finger along positive-Y, and the middle finger along positive-Z. If you do this with your right hand, then the coordinate system is (naturally) right-handed, while your left hand defines a left-handed coordinate system.
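The handedness of a set of axes can also be checked numerically. The following is a minimal sketch, not taken from any particular library: the function names are made up for illustration, and it assumes the three axis vectors are themselves written down in a reference frame that is already known to be right-handed. The sign of the scalar triple product (X × Y) · Z then tells the two cases apart.

```python
def cross(a, b):
    """Cross product of two 3-vectors (standard right-hand-rule formula)."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two vectors of equal length."""
    return sum(p * q for p, q in zip(a, b))


def handedness(x_axis, y_axis, z_axis):
    """Classify three axes given in a right-handed reference frame (assumption)."""
    triple = dot(cross(x_axis, y_axis), z_axis)
    if triple > 0:
        return "right-handed"
    if triple < 0:
        return "left-handed"
    return "degenerate"  # the axes lie in one plane, so they define no volume


print(handedness((1, 0, 0), (0, 1, 0), (0, 0, 1)))   # right-handed
print(handedness((1, 0, 0), (0, 1, 0), (0, 0, -1)))  # left-handed
```

Flipping any one axis changes the sign of the triple product, and therefore the handedness.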
The recommended convention (used in most 3D software) is to define your model/scene in a right-handed coordinate system.
It is quite common in computer graphics to be working in a number of different coordinate systems. For example, a model of a car is defined in terms of its own model coordinate system. Placing the car in a scene, perhaps moving along a road, involves transforming those model coordinates to the world coordinates of the scene. The placement of the “camera” (actually the representation of the eye position of the person viewing the scene) then involves transforming these world coordinates (via the viewing transformation) into eye coordinates, which are defined relative to the viewer such that the X axis is horizontal and increasing to the right, Y is vertical and increasing upwards, and Z is horizontal and increasing directly towards the viewer. These are then finally transformed into normalized device coordinates and mapped to pixels on the user’s display.
And just to add to the fun, the car model itself may have multiple coordinate systems. For example, each wheel may be defined in its own child coordinate system, in which it rotates relative to its parent, namely the body of the car. Because the wheel is transformed relative to the car, it automatically gets the car transformation as well. Moving the car through the scene is therefore sufficient to bring the wheels along; they don’t need to be separately repositioned.
So the transformation pipeline, as far as eye coordinates, looks like this (the steps in square brackets are transformations, while the others represent coordinate values):
model coords → [parent xform 1] → ... → [parent xform n] → [viewing xform] → eye coords
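As a rough illustration of this chain, here is a small sketch using the car-and-wheel example. All the numbers and names (wheel_in_car, car_in_world, viewing) are made up, only translations are used to keep the arithmetic easy to follow, and points are treated as column vectors multiplied on the right of the matrix.

```python
def mat_mul(a, b):
    """Product of two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def translation(tx, ty, tz):
    """4x4 matrix that moves points by (tx, ty, tz)."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


# Hypothetical placements, chosen only for illustration.
wheel_in_car = translation(1, 0, -1)   # wheel hub relative to the car body
car_in_world = translation(10, 0, 5)   # car relative to the scene
viewing = translation(0, 0, -20)       # scene relative to the eye

# The transformation nearest the model is applied first, so it sits rightmost.
model_to_eye = mat_mul(viewing, mat_mul(car_in_world, wheel_in_car))

hub = [0, 0, 0, 1]                     # origin of the wheel's own coordinates
print(transform(model_to_eye, hub))    # [11, 0, -16, 1]

# Moving the car alone carries the wheel along with it.
car_in_world = translation(12, 0, 5)
moved = mat_mul(viewing, mat_mul(car_in_world, wheel_in_car))
print(transform(moved, hub))           # [13, 0, -16, 1]
```

Only car_in_world changed between the two results; the wheel's own transformation was left untouched.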
Then the eye coordinates are further transformed as follows:
eye coords → [projection xform] → clip coords → [÷ w] → normalized dev coords → [viewport xform] → window coords
I said above that it is recommended to use right-handed coordinate systems. This is true up to the eye-coordinates stage. It is normal for the projection transformation to then flip the Z-values around to make the coordinate system left-handed, so that Z-values are now increasing away from the viewer, instead of towards them. This is so that the Z-values can be converted and stored in the depth buffer, with increasing values representing increasing depth.
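To see the flip in numbers, here is a sketch that builds a perspective matrix with the same layout as the one produced by the classic glFrustum call, and pushes two eye-space points through it. The helper names and the frustum bounds (near = 1, far = 3) are arbitrary choices for illustration.

```python
def frustum(left, right, bottom, top, near, far):
    """Perspective projection matrix laid out like the one glFrustum builds."""
    return [
        [2 * near / (right - left), 0, (right + left) / (right - left), 0],
        [0, 2 * near / (top - bottom), (top + bottom) / (top - bottom), 0],
        [0, 0, -(far + near) / (far - near), -2 * far * near / (far - near)],
        [0, 0, -1, 0],
    ]


def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def to_ndc(clip):
    """Divide clip coordinates by w to get normalized device coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


projection = frustum(-1, 1, -1, 1, 1, 3)

# In eye coordinates the viewer looks down negative Z, so z = -1 is nearer
# than z = -3.
near_point = to_ndc(transform(projection, [0, 0, -1, 1]))
far_point = to_ndc(transform(projection, [0, 0, -3, 1]))

print(near_point[2])  # -1.0
print(far_point[2])   # 1.0
```

The nearer point has the larger eye-space Z but the smaller Z after projection, which is the ordering the depth buffer expects.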
The standard range for normalized device coordinates is the interval [-1 .. +1] along each axis. Of course you can define your models and scenes using any numbers your computer’s floating-point representation can handle, and you can call the units whatever you like: metres, feet, microns, light-years, whatever. All you have to do is make sure that the combination of all the model, viewing and projection transformations brings the part of the scene you want to see down within that [-1 .. +1] range; anything that ends up outside it is clipped away.
In normalized device coordinates, (x, y) = (-1, -1) is at the lower-left corner of the viewing area (window, full-screen, whatever), while (x, y) = (+1, +1) is at the top-right (the z-value, of course, ends up in the depth buffer). The final viewport transformation remaps these into units of actual pixels, corresponding to the actual size of your viewing area.
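The viewport step is simple enough to write out by hand. The sketch below assumes a 640×480 viewing area with its origin at the lower-left corner and the default depth range of [0 .. 1]; the function name and parameters are illustrative, not an OpenGL API.

```python
def viewport_transform(ndc, x0, y0, width, height):
    """Map normalized device coordinates in [-1, +1] to window coordinates."""
    x, y, z = ndc
    win_x = x0 + (x + 1) / 2 * width    # -1 maps to the left edge, +1 to the right
    win_y = y0 + (y + 1) / 2 * height   # -1 maps to the bottom edge, +1 to the top
    depth = (z + 1) / 2                 # assumes the default depth range [0, 1]
    return (win_x, win_y, depth)


print(viewport_transform((-1, -1, -1), 0, 0, 640, 480))  # (0.0, 0.0, 0.0)
print(viewport_transform((0, 0, 0), 0, 0, 640, 480))     # (320.0, 240.0, 0.5)
print(viewport_transform((1, 1, 1), 0, 0, 640, 480))     # (640.0, 480.0, 1.0)
```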
The usual kind of transformation applied in computer graphics (and the only kind supported by OpenGL) is loosely called a linear transformation, because lines that are straight before the transformation are still straight afterwards. Such a transformation can be handily represented by a matrix. A 4×4 matrix can represent any such transformation in 3 dimensions, including translation and perspective projection, which a 3×3 matrix cannot.
Multiple transformations can be combined (“concatenated”) by multiplying their respective matrices. Unlike multiplication of ordinary scalar numbers, the order of the multiplication matters: a scaling followed by a translation (repositioning) does not give the same result as the same translation followed by the same scaling, because in the second case the translation offset gets scaled as well.
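A quick numerical check makes the point. This sketch uses a uniform scale by 2 and a translation by 3 along X, both picked arbitrarily, and treats points as column vectors, so the transformation applied first is the rightmost factor in the product.

```python
def mat_mul(a, b):
    """Product of two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


scale = [[2, 0, 0, 0],
         [0, 2, 0, 0],
         [0, 0, 2, 0],
         [0, 0, 0, 1]]

translate = [[1, 0, 0, 3],
             [0, 1, 0, 0],
             [0, 0, 1, 0],
             [0, 0, 0, 1]]

point = [1, 0, 0, 1]

# Scale first, then translate: x = 1 * 2 + 3
scale_then_translate = mat_mul(translate, scale)
print(transform(scale_then_translate, point))  # [5, 0, 0, 1]

# Translate first, then scale: x = (1 + 3) * 2
translate_then_scale = mat_mul(scale, translate)
print(transform(translate_then_scale, point))  # [8, 0, 0, 1]
```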
You will quite commonly see a 3-dimensional position vector written, not as (x, y, z), but as (x, y, z, w). And the transformation matrices are correspondingly 4×4, rather than 3×3. Why is this?
It’s so that all linear transformations can be represented uniformly in terms of matrix multiplication. Otherwise translation would have to be separated out as an addition step:
(x', y', z') = [scaling+rotation+shear] × (x, y, z) + [translation]
With homogeneous coordinates, everything can be rolled into one matrix:
(x', y', z', w') = [scaling+rotation+shear+translation] × (x, y, z, w)
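The two formulations can be compared directly. In this sketch (helper names and numbers are made up for illustration), the 3×3 part and the translation offset are packed into a single 4×4 matrix whose bottom row is (0, 0, 0, 1), and both routes give the same x, y and z.

```python
def apply_separately(linear, offset, p):
    """3x3 matrix times (x, y, z), followed by a separate addition step."""
    return [sum(linear[i][k] * p[k] for k in range(3)) + offset[i]
            for i in range(3)]


def to_homogeneous(linear, offset):
    """Pack a 3x3 matrix and a translation offset into one 4x4 matrix."""
    m = [list(linear[i]) + [offset[i]] for i in range(3)]
    m.append([0, 0, 0, 1])  # bottom row leaves w unchanged
    return m


def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


linear = [[2, 0, 0],   # scale X by 2, leave Y and Z alone
          [0, 1, 0],
          [0, 0, 1]]
offset = [3, 4, 5]

print(apply_separately(linear, offset, [1, 1, 1]))               # [5, 5, 6]
print(transform(to_homogeneous(linear, offset), [1, 1, 1, 1]))   # [5, 5, 6, 1]
```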
It is normal to set w to 1.0, at least to start with. And most transformations will produce vectors with w = 1.0 if they were given them to start with. But beware of transformations (particularly perspective ones) that produce vectors where w ≠ 1.0. You must then normalize the x, y and z values (divide them by w) if you want to do any calculations with them individually, otherwise you are liable to get wrong results!
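The following sketch shows the pitfall with two made-up vectors that look different component by component but describe the same point once they are divided by w. The function name is illustrative only.

```python
def normalize(v):
    """Divide a homogeneous vector through by w so that w becomes 1.0."""
    x, y, z, w = v
    if w == 0:
        raise ValueError("w = 0 represents a direction, not a position")
    return (x / w, y / w, z / w, 1.0)


a = (2.0, 4.0, 6.0, 2.0)
b = (1.0, 2.0, 3.0, 1.0)

# Comparing the raw x, y, z values suggests the points differ...
print(a[:3] == b[:3])                 # False

# ...but after normalization they turn out to be the same position.
print(normalize(a))                   # (1.0, 2.0, 3.0, 1.0)
print(normalize(a) == normalize(b))   # True
```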
And here we have another case where you need to pick a convention and stick to it. Transforming a vector V by a transformation M to a vector V' can be written as a premultiplication of column vectors:
V' = M × V
or as a postmultiplication of row vectors:
V' = V × M
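The two conventions describe the same transformation, provided the matrix used with row vectors is the transpose of the one used with column vectors. Here is a small sketch with an arbitrary translation matrix and made-up helper names; it also shows what goes wrong when a matrix written for one convention is used with the other.

```python
def mat_vec(m, v):
    """Premultiplication: matrix on the left, column vector on the right."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


def vec_mat(v, m):
    """Postmultiplication: row vector on the left, matrix on the right."""
    return [sum(v[k] * m[k][j] for k in range(4)) for j in range(4)]


def transpose(m):
    """Swap the rows and columns of a 4x4 matrix."""
    return [[m[j][i] for j in range(4)] for i in range(4)]


# Translation by (3, 4, 5), written for the column-vector convention:
# the offset sits in the last column.
M = [[1, 0, 0, 3],
     [0, 1, 0, 4],
     [0, 0, 1, 5],
     [0, 0, 0, 1]]
V = [1, 2, 3, 1]

print(mat_vec(M, V))             # [4, 6, 8, 1]
print(vec_mat(V, transpose(M)))  # [4, 6, 8, 1]

# Mixing the conventions (row vector with the untransposed matrix) does not
# translate the point at all; the offset leaks into w instead.
print(vec_mat(V, M))             # [1, 2, 3, 27]
```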
It is recommended that you stick to the premultiplication convention, as this is more consistent with how OpenGL operates. For example, in the (pre-OpenGL-3.0) fixed-function pipeline, the ordering of calls (in pseudocode) would be:
apply matrix M; draw vector V
which corresponds to the left-to-right ordering in the premultiplication formula. And if M is decomposed into separate components, for example M1 × M2, then the premultiplication formula becomes
V' = M1 × M2 × V
and the corresponding fixed-function calls are:
apply matrix M1; concatenate matrix M2; draw vector V
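The pseudocode above can be mimicked with a tiny stand-in for the fixed-function "current matrix". This is only a sketch: the class and its method names are invented for illustration, loosely modelled on the behaviour of glLoadMatrix, glMultMatrix and glVertex, and the two matrices are arbitrary.

```python
def mat_mul(a, b):
    """Product of two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform(m, v):
    """Apply a 4x4 matrix to a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]


class CurrentMatrix:
    """Hypothetical stand-in for the fixed-function current matrix."""

    def __init__(self):
        self.current = [[1 if i == j else 0 for j in range(4)]
                        for i in range(4)]

    def load(self, m):
        """'apply matrix': replace the current matrix."""
        self.current = [row[:] for row in m]

    def mult(self, m):
        """'concatenate matrix': multiply onto the right of the current one."""
        self.current = mat_mul(self.current, m)

    def vertex(self, v):
        """'draw vector': transform v by the current matrix."""
        return transform(self.current, v)


M1 = [[1, 0, 0, 3], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # translate X by 3
M2 = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 1]]  # scale by 2
V = [1, 0, 0, 1]

state = CurrentMatrix()
state.load(M1)
state.mult(M2)
print(state.vertex(V))  # [5, 0, 0, 1], the same as M1 × M2 × V
```

The calls appear in the same left-to-right order as the factors in the formula, even though M2, the factor nearest V, is the one that acts on the vector first.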