Floating Point Numbers

Why floating-point numbers are needed

Since computer memory is limited, you cannot store numbers with infinite precision, no matter whether you use binary fractions or decimal ones: at some point you have to cut off. But how much accuracy is needed? And where is it needed? How many integer digits and how many fraction digits?

To satisfy the engineer and the chip designer, a number format has to provide accuracy for numbers at very different magnitudes. However, only relative accuracy is needed. To satisfy the physicist, it must be possible to do calculations that involve numbers with different magnitudes.

Basically, having a fixed number of integer and fractional digits is not useful - and the solution is a format with a floating point.

How floating-point numbers work

The idea is to compose a number of two main parts:

- A significand that contains the number's digits. Negative significands represent negative numbers.
- An exponent that says where the decimal (or binary) point is placed relative to the beginning of the significand. Negative exponents represent numbers that are very small (i.e. close to zero).

Such a format satisfies all the requirements:

- It can represent numbers at wildly different magnitudes (limited by the length of the exponent).
- It provides the same relative accuracy at all magnitudes (limited by the length of the significand).
- It allows calculations across magnitudes: multiplying a very large and a very small number preserves the accuracy of both in the result.

Decimal floating-point numbers usually take the form of scientific notation with an explicit point always between the 1st and 2nd digits. The exponent is either written explicitly including the base, or an e is used to separate it from the significand.

Significand   Exponent   Scientific notation   Fixed-point value
1.5           4          1.5 ⋅ 10^4            15000
-2.001        2          -2.001 ⋅ 10^2         -200.1
5             -3         5 ⋅ 10^-3             0.005
6.667         -11        6.667e-11             0.00000000006667
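The same decomposition can be computed mechanically. A minimal sketch in Python (chosen here only for brevity; the article itself is language-neutral): the e presentation type of format strings yields exactly the significand/exponent split shown in the table above.

```python
# Split each value into significand and decimal exponent using %e-style
# formatting, which always places the point after the first digit.
for value in [15000, -200.1, 0.005, 6.667e-11]:
    significand, _, exponent = f"{value:e}".partition("e")
    print(f"{value} = {float(significand)} * 10^{int(exponent)}")
# prints:
# 15000 = 1.5 * 10^4
# -200.1 = -2.001 * 10^2
# 0.005 = 5.0 * 10^-3
# 6.667e-11 = 6.667 * 10^-11
```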

The standard

Nearly all hardware and programming languages use floating-point numbers in the same binary formats, which are defined in the IEEE 754 standard. The usual formats are 32 or 64 bits in total length:

Format             Total bits   Significand bits   Exponent bits   Smallest number     Largest number
Single precision   32           23 + 1 sign        8               ca. 1.2 ⋅ 10^-38    ca. 3.4 ⋅ 10^38
Double precision   64           52 + 1 sign        11              ca. 5.0 ⋅ 10^-324   ca. 1.8 ⋅ 10^308
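The double-precision layout in the table can be inspected directly. A sketch in Python (an illustrative choice, not part of the original text), using the standard struct module; the field widths and the exponent bias of 1023 are those defined by IEEE 754 for the 64-bit format:

```python
import struct

def fields(x):
    """Return the (sign, biased exponent, stored significand) bit fields
    of a Python float, which is an IEEE 754 double: 1 + 11 + 52 bits."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # biased by 1023
    significand = bits & ((1 << 52) - 1)   # the implicit leading 1 is not stored
    return sign, exponent, significand

# 1.5 = (1 + 0.5) * 2^0: biased exponent 1023, only the top stored
# significand bit set (the 0.5 part), sign bit 0.
sign, exponent, significand = fields(1.5)
```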

Note that there are some peculiarities:

- The actual bit sequence is the sign bit first, followed by the exponent and finally the significand bits.
- The exponent does not have a sign; instead an exponent bias is subtracted from it (127 for single precision and 1023 for double precision). This, together with the bit order, allows floating-point numbers to be compared and sorted correctly even when interpreting them as integers.
- The significand's most significant digit is omitted and assumed to be 1, except for subnormal numbers, which are marked by an all-0 exponent and allow a number range beyond the smallest numbers given in the table above, at the cost of precision.
- There are separate positive and negative zero values, differing in the sign bit, where all other bits are 0. These must be considered equal even though their bit patterns are different.
- There are special not-a-number (NaN) values where the exponent is all 1-bits and the significand is not all 0-bits. These represent the result of various undefined calculations (like multiplying 0 and infinity, or any calculation involving a NaN value). Even bit-identical NaN values must not be considered equal.
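Several of these peculiarities (signed zeros, NaN behavior, overflow to infinity) can be observed directly. A minimal Python sketch:

```python
import math

# Two zeros: +0.0 and -0.0 differ only in the sign bit but compare equal.
assert -0.0 == 0.0
assert math.copysign(1.0, -0.0) == -1.0   # the sign bit is still there

# NaN results from undefined operations and never compares equal to
# anything, including itself.
nan = float("inf") - float("inf")
assert math.isnan(nan)
assert nan != nan

# Calculations that exceed the largest representable number overflow
# to infinity.
assert 1e308 * 10 == math.inf
```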

© Published at floating-point-gui.de under the Creative Commons Attribution License (BY)
