Floating Point Basics

Mathematics | . Edited . 8 min read (2011 words).

What are floating-point numbers? When is it a bad idea to compare them? How precise are they? Let’s explore this.


A floating-point mystery

It’s common knowledge in programming that comparing floating-point numbers for equality is rarely a good idea. My favourite example is:

$$1.1 \times 1.1 = 1.21\text{.}$$

This is mathematically true. However, in almost all programming languages, it happens to be false! You can try this in Python, JavaScript, Java, Haskell, C++, and many more languages. Most languages on most platforms will give the same result. Some languages (such as C++) leave details unspecified so that it can vary by platform, but almost all hardware platforms handle floating-point numbers the same or similarly anyway.

Let’s try Python:

>>> 1.1 * 1.1 == 1.21

Another simple example is:

>>> 0.1 + 0.2 == 0.3

Why does this happen? It’s not a bug. Let’s dive deeper to find out.

What is a floating-point number?

In mathematics, there are the real numbers. Unfortunately, the set of all real numbers is infinite (uncountably infinite even!) and it’s therefore impossible to represent them all on physical computers with limited memory. The same occurs when writing down real numbers on a finite amount of paper. The typical solution for writing (some) real numbers is to use the decimal system.

The decimal system uses the base ten to represent numbers through digits (from 0 to 9) that are arranged after each other with an optional decimal separator for fractions. This is well known. For example:

$$31.416 = 3 \times 10^1 + 1 \times 10^0 + 4 \times 10^{-1} + 1 \times 10^{-2} + 6 \times 10^{-3}\text{.}$$

This system has its limitations. An infinitely long decimal number is needed to represent some rational numbers and all irrational numbers. For example:

$$\frac{1}{3} = 0.333333 \dots$$

Both on paper and in computer hardware, only a finite number of digits can be represented. Therefore, some real numbers can’t be accurately described.

Floating-point numbers are very similar to this concept and usually follow the IEEE 754 standard, at least partially, nowadays. IEEE 754 defines floating-point formats using base 10, but these are unusual. It’s more efficient to use base 2, which is what most hardware supports natively.

Each floating-point format can only represent a finite set of real numbers accurately. They have limited precision and range.

One way to represent floating-point numbers mathematically is:

$$s \times \beta^{e}\quad s,\beta,e \in \Z.$$

Here, $s$ is the significand (including sign), $\beta$ is the base (usually 2), and $e$ is the exponent. This leaves some redundancy that IEEE 754 employs extra tricks to get around and to be able to represent certain special numbers (infinity and not-a-number / NaN). Such extensions will be ignored here.

Depending on the exact floating-point format, both $s$ and $e$ are limited to some finite integer intervals. The IEEE 754 formats all have a fixed size, such that the number of bits allocated for each part put fixed restrictions on how large these intervals can be. The valid interval of $s$ determines the precision, whereas the valid exponents determine the range.

IEEE 754 floating-point numbers are not represented exactly like this, but the simplified representation conveys the basic concept while it avoids having to deal with non-essential special cases.

Representation errors

As already seen above, some real numbers can’t be represented accurately as finite decimal numbers. $1/3$ is one such number.

For rational numbers of the form $p/q$, where $p$ and $q$ are integers with no common non-trivial divisor (i.e. other than $\pm 1$), the number can be accurately represented as a finite decimal number if and only if it’s possible to write

$$m q = 10^n,$$

for some $m, n \in \Z$ (i.e. any integers). We can then rewrite the number:

$$\frac{p}{q} = \frac{m p}{m q} = m p \times 10^{-n},$$

which is easy to write as a finite decimal number. By referring to the unique-prime-factorization theorem, it can be seen that this is possible only if $q$ contains no other prime factors than those of the base, which are 2 and 5 for base 10.

As such, $1/2$, $1/5$, and $1/800$ are all possible to write as decimal numbers with finitely many digits. The opposite goes for $1/3$, $1/6$, $1/7$, and similar. Irrational numbers such as $\pi$ and $\sqrt{2}$ are infinitely long in all integer1 bases.

For base 2, the difference is that only fractions that are powers of two (i.e. $1/2^n$, $n \in \Z^+$) can be represented accurately without infinite digits. Since this is not the case for $1/10$, many numbers that can be represented in base 10 using finite digits require infinite digits to be represented in base 2.

The same reasoning applies directly to floating-point numbers. Another potential reason that a real number can’t be represented as a floating-point number is that it’s out of range. This is much less interesting, however, and not unique to floating-point numbers, so it will be ignored.

Let’s decide on an example floating-point format for some examples. Assume that $|s| \leq 999$ and $\beta = 10$. The closest floating-point number to $1/3$ is then:

$$\frac{1}{3} \approx 333 \times 10^{-3} = 0.333.$$

By changing to $\beta = 2$, a close approximation becomes:

$$\frac{1}{3} \approx 683 \times 2^{-11} \approx 0.3335.$$

This is due to limited precision. If the valid interval of $s$ was extended, we could get better approximations. With infinite precision (i.e. a never-ending continuation of digits), an exact representation would be theoretically possible.

The same happens when approximating 0.1, 0.2, 0.3, and 1.1 using binary IEEE 754 floating-point numbers. This can be seen directly in Python:

>>> '{:.20}'.format(0.1)

Using decimal numbers on computers

There are both base-10 floating-point numbers defined by IEEE 754 and there are libraries in most programming languages for representing decimal numbers without conversion to base 2.

By switching to base 10 in Python, our original problem disappears:

>>> from decimal import Decimal
>>> Decimal('1.1') * Decimal('1.1') == Decimal('1.21')

This can, of course, be convenient since decimal numbers are more intuitive to most people. As have already been demonstrated above, the fundamental limitations are still there, however, and decimal numbers usually bring a considerable performance decrease with them.

Other alternatives are fixed-point arithmetic, where the exponent is fixed. This used to be much faster than floating-point arithmetic on old hardware without hardware support for floating-point arithmetic. Nowadays, floating-point arithmetic is very fast due to built-in floating-point units in processors.

Rounding and machine epsilon

To approximate numbers that can’t be represented exactly, rounding is used. This is used when converting decimal numbers to binary floating-point numbers, as seen above. Rounding is also necessary when performing operations on the floating-point numbers, i.e. floating-point arithmetic.

IEEE 754 requires correct rounding for floating-point arithmetic, which means that the errors introduced are only cased by rounding. More precisely, this requires that the result of an operation is the same as if it was carried out exactly and the result was then rounded. IEEE 754 defines multiple rounding modes and the default is the familiar round to nearest. IEEE 754 hardware uses extra guard digits during computations to achieve correct rounding. This means that we only need to analyse rounding errors.

The absolute size of rounding errors will vary depending on the exponent for floating-point numbers. For error analysis, it’s therefore more useful to study the relative error. Let $b = round(a)$ be the value after rounding the number $a$. The relative rounding error is then:

$$\eta = \left |\frac{a-b}{a} \right |, \quad a \neq 0.$$

It’s possible to determine an upper bound of the relative error $\eta$ for each floating-point format, which is called machine epsilon. This is a useful measure of the format’s precision.

Definition. Machine epsilon ($\epsilon$) for a specific format:

$$\epsilon \geq \left |\frac{a-round(a)}{a} \right |,$$

for all $a \in \mathbb{R \backslash \{0\}}$ that are within the valid range of the floating-point format.

The relative rounding error for large out-of-range numbers is clearly close to 1 and not interesting. Note the $\geq$ here. It’s useful as long as it’s an upper bound and relatively tight. It can be chosen as the supremum (least upper bound), but simplicity is also important. Machine epsilons like this are presented in the table further down for the most common IEEE 754 formats.

Unfortunately, machine epsilon is sometimes defined as the distance between $1.0$ and the smallest larger number. This is about twice the (smallest) machine epsilon according to the definition above for binary formats (so it can be converted easily and is also an upper bound). Many programming languages and libraries use this definition instead. Depending on the analysis performed a constant factor of 2 might give a tight-enough bound regardless.

Once we know the machine epsilon according to the definition above, we can calculate a bound of the absolute error when performing floating-point arithmetic. Let $x$ and $y$ be two floating-point numbers of the same format with machine epsilon $\epsilon$. The absolute rounding error of their product is then:

$$|x \times y - round(x \times y)| \leq \epsilon |x \times y| \approx \epsilon |round(x \times y)|.$$

It’s worth pointing out that the rounding error can be zero, if the number can be exactly represented. Therefore, the lower bound of rounding errors is always zero for all floating-point formats, since they can all represent some numbers.

There’s more to say about rounding errors and how they accumulate, of course, but that’s a topic for another time.

Common floating-point number formats

The most common IEEE 754 types with approximate properties:

Single precision3227$2^{-24}$$10^{38}$$10^{-38}$
Double precision64216$2^{-53}$$10^{308}$$10^{-308}$
Quadruple precision128234$2^{-113}$$10^{4932}$$10^{-4932}$

These correspond to the well-known types ‘float’ and ‘double’ in C and related languages, in addition to the less common ‘_float128’ C type.5

As can be seen, rounding errors can be reduced significantly by using double precision during calculations if that’s feasible. On CPUs, the difference in performance compared to single precision is typically non-existant. On GPUs, it usually has a bigger impact. The storage size and its effect on caches and memory bandwidth makes a difference too.

If even larger precision is required, there are libraries that offer arbitrary-precision floating-point arithmetic, such as mpmath for Python and GNU MPFR for C. These libraries make it possible to achieve any precision as long as the numbers fit in memory (and you’re willing to pay the computational cost).


I hope it’s clear by now why rounding errors show up in the use of floating-point numbers on computers, but also that these rounding errors are small and have upper bounds that are easy to reason about.

Floating-point arithmetic is the base for the entire mathematical area of numerical analysis and also modern machine learning, computer graphics, and almost everything else done using GPUs. It’s necessary to be aware of rounding errors, but binary floating-point arithmetic is also one of the areas that computers are best at. After all, they beat us humans in speed any day.

Floating-point arithmetic is very useful to know enough about to use effectively and safely. There has been deadly floating-point arithmetic bugs, but I believe the success stories outnumber the harmful ones. This text is a mix of aspects that I find interesting to explore.

For more detailed analysis of floating-point numbers, read up on the various IEEE 754 formats. Python also has a long tutorial on floating point arithmetic. Finally, there’s a famous paper called What every computer scientist should know about floating-point arithmetic that is well worth reading! I’ve based this blog post on a combination of all these sources (and previous knowledge).

  1. Non-integer bases are possible. ↩︎

  2. Approximate precision in number of decimal digits. ↩︎

  3. Machine epsilon, defined as upper bound of relative rounding error. ↩︎

  4. Smallest positive number (approximately; ignoring subnormal numbers). ↩︎

  5. A detail is that C and C++ don’t require IEEE 754 compatibility. ↩︎