Small Float Formats
Small float formats are floating-point values that use fewer than the standard 32 bits of precision. One example is the 16-bit half-float. This article details how these formats are encoded and used.
IEEE float review
We start with a quick review of how 32-bit floating-point numbers are encoded; detailed information can be found on Wikipedia.
The IEEE 754 specification defines a floating-point encoding format that breaks a floating-point number into 3 parts: a sign bit, a mantissa, and an exponent.
The mantissa is an unsigned binary number (the sign of the number is in the sign bit) with some particular bitdepth. For 32-bit floats, this depth is 23 bits. The absolute value of the mantissa, when converting it into an actual number, is the mantissa divided by 2^bitdepth.
The exponent has special handling. This is again an unsigned binary number with a particular bitdepth. A bias value is derived from that bitdepth: it is (2^(bitdepth - 1)) - 1. For example, 32-bit floats have an exponent bitdepth of 8, so they have a bias of 127 (2^7 - 1). How the bias is applied depends on the stored exponent's value.
The actual interpretation of the number depends on the value of the exponent. The exponent can be one of the following:
- 0: The resulting number is the mantissa's absolute value directly multiplied by 2^(1 - bias), which is the same as 2^-(bias - 1). These are the denormalized values. If the mantissa is also zero, you get 0.0. In floating-point numbers, you can have positive and negative 0, thanks to the sign bit.
- (0, 2^bitdepth - 1): The resulting number is: (1.0 + mantissa) * 2^(exponent - bias). Note the addition of 1.0 to the mantissa in this case. This allows the mantissa to effectively gain one bit of extra precision.
- 2^bitdepth - 1: This will either be infinity or NaN based on the mantissa. A zero mantissa gives infinity, while any other number gives NaN. OpenGL's internal processes will produce undefined results when given Inf or NaN, but they will not crash.
The result of the above is negated if the sign bit is set.
This process works for floating-point numbers of any bitdepth.
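The decoding rules above can be sketched in Python as follows. This is only an illustration of the steps, not code from any OpenGL implementation; the function name `decode_small_float` and its parameters are made up for this example.

```python
def decode_small_float(bits, sign_bits, exponent_bits, mantissa_bits):
    """Decode an IEEE-style float of arbitrary bitdepth from its raw bits.

    Hypothetical helper: `bits` is the encoded value as an unsigned integer,
    laid out as [sign][exponent][mantissa] from high bit to low bit.
    """
    mantissa = bits & ((1 << mantissa_bits) - 1)
    exponent = (bits >> mantissa_bits) & ((1 << exponent_bits) - 1)
    sign = (bits >> (mantissa_bits + exponent_bits)) & 1 if sign_bits else 0

    bias = (1 << (exponent_bits - 1)) - 1
    max_exponent = (1 << exponent_bits) - 1
    fraction = mantissa / (1 << mantissa_bits)  # mantissa / 2^bitdepth

    if exponent == 0:
        # Zero or denormalized value: no implied leading 1.
        value = fraction * 2.0 ** (1 - bias)
    elif exponent == max_exponent:
        # All exponent bits set: infinity or NaN.
        value = float("inf") if mantissa == 0 else float("nan")
    else:
        # Normal value: implied leading 1 gives one extra bit of precision.
        value = (1.0 + fraction) * 2.0 ** (exponent - bias)

    return -value if sign else value
```

For example, `decode_small_float(0x3C00, 1, 5, 10)` decodes the 16-bit pattern 0x3C00 to 1.0, and the same function handles 32-bit floats when called with bitdepths of 1, 8, and 23.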
OpenGL supports a number of low-bitdepth floating-point formats. These are:
|Overall bitdepth||Sign bitdepth||Mantissa bitdepth||Exponent bitdepth|
|16||1||10||5|
|14**||0*||9||5|
|11||0*||6||5|
|10||0*||5||5|
* No sign bit means that the value is always positive.
** Used only in RGB9_E5 textures.
32-bit floats are often called "single-precision" floats, and 64-bit floats are often called "double-precision" floats. 16-bit floats therefore are called "half-precision" floats, or just "half floats".
The 11-bit and 10-bit floats are used exclusively for the GL_R11F_G11F_B10F image format. They have no sign bit, as they're generally used to represent image data in floating-point format, and negative colors are (usually) not a concept that makes sense.
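As an illustration, here is a rough Python sketch that unpacks one GL_R11F_G11F_B10F texel. It assumes the pixel arrives as a single 32-bit integer in the GL_UNSIGNED_INT_10F_11F_11F_REV layout (red in the low 11 bits, green in the next 11, blue in the top 10); the helper names are hypothetical.

```python
def decode_unsigned_float(bits, mantissa_bits, exponent_bits=5):
    """Decode an unsigned small float (no sign bit) from its raw bits."""
    mantissa = bits & ((1 << mantissa_bits) - 1)
    exponent = (bits >> mantissa_bits) & ((1 << exponent_bits) - 1)
    bias = (1 << (exponent_bits - 1)) - 1
    fraction = mantissa / (1 << mantissa_bits)

    if exponent == 0:
        return fraction * 2.0 ** (1 - bias)  # zero or denormalized
    if exponent == (1 << exponent_bits) - 1:
        return float("inf") if mantissa == 0 else float("nan")
    return (1.0 + fraction) * 2.0 ** (exponent - bias)


def unpack_r11f_g11f_b10f(packed):
    """Split a 32-bit texel into (red, green, blue) floats.

    Assumed layout: bits 0-10 red, bits 11-21 green, bits 22-31 blue.
    """
    red = decode_unsigned_float(packed & 0x7FF, mantissa_bits=6)
    green = decode_unsigned_float((packed >> 11) & 0x7FF, mantissa_bits=6)
    blue = decode_unsigned_float((packed >> 22) & 0x3FF, mantissa_bits=5)
    return red, green, blue
```

Note that red and green get 6 mantissa bits while blue gets only 5, so blue is stored with slightly less precision.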
The 14-bit float format is used exclusively in the GL_RGB9_E5 image format. It has no sign bit, as it is generally used to represent image data. Each of the 3 color values has its own 9-bit mantissa, but all 3 share a single 5-bit exponent.
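A shared-exponent texel can be decoded along these lines. This sketch assumes the GL_UNSIGNED_INT_5_9_9_9_REV layout (red in the low 9 bits, then green, then blue, with the exponent in the top 5 bits) and the decoding rule from the shared-exponent texture specification, in which the mantissas carry no implied leading 1 and every exponent value is an ordinary exponent. The function name is hypothetical.

```python
def unpack_rgb9_e5(packed):
    """Split a 32-bit RGB9_E5 texel into (red, green, blue) floats.

    Assumed layout: bits 0-8 red, 9-17 green, 18-26 blue, 27-31 exponent.
    """
    red_mantissa = packed & 0x1FF
    green_mantissa = (packed >> 9) & 0x1FF
    blue_mantissa = (packed >> 18) & 0x1FF
    exponent = (packed >> 27) & 0x1F

    # Bias of 15 for the 5-bit exponent, plus 9 to turn the integer
    # mantissa into a fraction (mantissa / 2^9).
    scale = 2.0 ** (exponent - 15 - 9)
    return red_mantissa * scale, green_mantissa * scale, blue_mantissa * scale
```

Because all three channels are scaled by the same exponent, a texel with one very bright channel leaves few useful mantissa bits for the dimmer ones.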
Numeric limits and precision
Floating-point values have limited range and precision. It is important to keep the numeric precision of small float values in mind:
|Floating Point Bitdepth||Largest value||Smallest value*||Decimal digits of precision|
|16-bit Float||6.55 × 10^4||6.10 × 10^-5||3.31|
|14-bit Float||6.55 × 10^4||6.10 × 10^-5||3.16|
|11-bit Float||6.50 × 10^4||6.10 × 10^-5||2.5|
|10-bit Float||6.45 × 10^4||6.10 × 10^-5||2.32|
* Smallest in this case means the positive value nearest zero, not counting denormalized values.
Take note of the number of decimal digits of precision each format offers. All of these formats use the same exponent size (5 bits), so they all cover approximately the same range of values. What matters is how many digits of precision you get.
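The range limits follow directly from the exponent and mantissa bitdepths. The sketch below, using a hypothetical `float_limits` helper, computes the largest finite value and the smallest normalized value for an IEEE-style format; it does not apply to the shared-exponent 14-bit format.

```python
def float_limits(exponent_bits, mantissa_bits):
    """Return (largest finite value, smallest normalized value)."""
    bias = (1 << (exponent_bits - 1)) - 1
    # The all-ones exponent is reserved for Inf/NaN, so the largest
    # usable exponent is one below it.
    max_normal_exponent = (1 << exponent_bits) - 2
    largest = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** (max_normal_exponent - bias)
    smallest_normal = 2.0 ** (1 - bias)
    return largest, smallest_normal


half_largest, half_smallest = float_limits(exponent_bits=5, mantissa_bits=10)
r11_largest, r11_smallest = float_limits(exponent_bits=5, mantissa_bits=6)
```

Here the 16-bit format tops out at 65504 and the 11-bit format at 65024, while both share the same smallest normalized value of 2^-14, since that depends only on the exponent bitdepth.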
When you add floats together, the result has the precision of the operand with the greater exponent. If the smaller number is too small to be represented at the larger number's exponent, then the addition does nothing. It is therefore possible for X+Y == X, depending on the relative sizes of X and Y.
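This effect is easy to reproduce with 16-bit floats. The sketch below uses the half-precision `"e"` format code of Python's standard `struct` module to round values to 16 bits; the `to_half` helper is a name invented for this example.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest 16-bit half-float value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


x = to_half(4096.0)
y = to_half(1.0)

# Between 4096 and 8192, adjacent half-floats are 4 apart,
# so adding 1 cannot change the stored value.
total = to_half(x + y)
absorbed = (total == x)

# Adding a multiple of the spacing does register.
bigger = to_half(x + 4.0)
```

With these inputs, `absorbed` is true: 4096 + 1 rounds back to 4096, whereas 4096 + 4 is stored exactly as 4100.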
Note that in a shader, you always get the full 32 bits of floating-point precision; these formats are typically only used for image storage.