# Understanding & Tracing Numerical Errors in C++

Notes on a talk by Gino van den Bergen (@dtecta)

• Fixed point numbers snap to a regular grid; in floating point numbers, the rounding error increases the farther you get from zero
• IEEE 754 single-precision (32-bit) float is:
• 1 bit for sign
• 8 bits for the exponent
• 23 bits for the fraction
• (-1)^sign * 1.fraction * 2^(exponent-127)
• For non-negative numbers, you can actually use an integer comparison to compare relative magnitudes
• Can’t represent zero; 2^-127 is the smallest value; thus, zero is a special case
• Exponent and fraction portions are both zero… but this leads to +0 and -0 due to the sign bit
• Behave identically except when dividing by zero: divide by positive zero gives +Inf, divide by negative zero gives -Inf
• Filling the gap between zero and the smallest number:
• Subnormal numbers: exponent is zero, but fraction is not
• Vector processing units don’t support this
• Not very useful; often “flush” these to zero to avoid subnormal numbers
• If your units were meters, the gap between the smallest “normal” float and zero is smaller than the Planck length
• Thus, underflow is not a real problem!
• A story about a train sim
• World coordinates are single-precision floats; entire game world has a 300 km radius
• The little blue engine moves forward via Euler
• When it gets far from the world origin, it grinds to a halt
• Velocity multiplied by the time stamp is small enough that when adding back to the current position, it rounds to adding zero
• Better to use fixed-point for world coordinates: guarantees that at any point in the world, you see the same precision
• Optionally keep a float for storing the remainder after rounding to fixed-point
• Also prefer fixed-point for absolute time—prevents timestamp from getting “jumpy”
• Relative error
• Absolute rounding error doubles by 2 for each power of 2
• For single precision, the machine error (epsilon) is 2^-24
• A single rounding operation results in a relative error no greater than the machine epsilon
• Problem: errors accumulate with each operation
• Notably, additions and subtractions really tend to hurt us
• Cancellation: Given two numbers that are nearly equal and which are already contaminated by rounding, if you take the difference of the two, the best approximation of the result can be way off from the actual value
• E.g., the 20 least significant bits are garbage
• This loss of significant bits is called cancellation and is the main source of numerical issues
• Multiplication, division, exponentiation don’t really affect this… just addition and subtraction
• E.g., compute the normal of a triangle by taking the cross product of two of its edges
• If the length of two edges is almost the same (a sliver triangle), the relative error in the cross product will be high
• Note that the choice of edges is arbitrary when computing the normal
• So, you should pick the two shortest edges, which will have the least amount of rounding error in the result (or, put another way, the two edges that form an angle closest to 90 degrees)
• Order of operations
• Can have a great effect on error—especially if the result of adding/subtracting might cancel out the addition/subtraction entirely
• If you have a choice between doing a(b+c) or ab+ac, prefer the former because it’s more robust!
• Factorize! Try to perform additions and subtractions before multiplications whenever possible.
• Automatic error tracing in C++
• Make floating-point types abstract (ignorant of the actual floating point type you’re using)
• This lets you quickly tell a numerical issue from a bug by substituting double (or higher!) precision
• Maintain a bound for the relative error by substituting the ErrorTracer proxy class (which maintains the relative error for the values you’re computing)
• Never use float or double explicitly… instead, use a type name, like:
using Scalar=float;
• Can’t use float literals, C-style casts, or static_cast for initialization or conversion
• E.g., use Scalar(2) rather than 2.0f or (Scalar)2 or static_cast<Scalar>(2)
• Use a traits class for numerical limits
• Use the overloaded C++ math functions from <cmath> rather than the C-style math functions from <math.h>
• ErrorTracer<T> (found in Motion Toolkit Github project)
• keeps two private T values: the value and the max relative error
• Overload all the operators that compute both the value and the error
• Transparently replaces built-in types like float
• You can query both the relative error or the number of contaminated/untrustworthy bits at any point
• Note that FPUs may use higher precision for intermediate results (e.g., Intel uses 80 bits for intermediate values)
• Thus, the relative error values may be a bit pessimistic
• Still useful for checking where precision is lost
• Summary
• Choose a formulation that keeps your values as small as possible
• Factorize to do additions and subtractions first
• Abstract away from raw types
• See Math for Game Programmers talks