Floating point: the early years

The first mechanical computer that Konrad Zuse designed, in 1936, supported binary floating point operations much like modern IEEE-754, but with less precision (only 22 bits). It even had special values for infinity. Its successor, the Z3, was based on relays and was completed in 1941. It used the same type of floating point numbers.

Electronic computers started to get floating point in the late 1950s. Every computer architecture had its own floating point format and its own quirks. In those years, floating point got a bad reputation for being intrinsically inexact. When I studied at the university in the late 1980s, that was still the common wisdom: never rely on any exact value, because not even 2.0 × 2.0 can be trusted to be exactly equal to 4.0.

Some early machines, like the IBM 7094, had a 36-bit word size and proper binary floating point. Single precision was 36 bits and double precision was 72 bits. The IBM System 360, introduced in 1965, had a word size of 32 bits instead of 36 bits. Worse still, the floating point format was base-16 instead of base-2. For normal numbers, the mantissa could have a value between 2²⁰ and 2²⁴ – 1, or 0x100000 to 0xFFFFFF. Incrementing the exponent by 1 would increase the value of the number by a factor of 16. This way we had only 21 significant bits for some values and 24 for other values. Even worse, the computations were not as accurate as they could have been within the constraints of this format, because no guard bits were used. Needless to say, some users of the IBM System 360 were not happy with its much inferior floating point accuracy.
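To make the varying precision concrete, here is a minimal Python sketch of a base-16 mantissa as described above. It is a simplified model, not a faithful System 360 encoder: the helper names `to_hex_float` and `significant_bits` are made up for illustration, the sign and the stored exponent encoding are ignored, and the mantissa is simply truncated.

```python
import math

def to_hex_float(x):
    """Split a positive x into (mantissa, e16) with
    x ~= mantissa * 16**(e16 - 6) and mantissa in 0x100000..0xFFFFFF.
    Hypothetical helper: sign and stored exponent format are ignored."""
    if x <= 0:
        raise ValueError("this sketch only handles positive numbers")
    m, e2 = math.frexp(x)            # x == m * 2**e2 with 0.5 <= m < 1
    e16 = -(-e2 // 4)                # ceil(e2 / 4): exponent in base 16
    frac = math.ldexp(m, e2 - 4 * e16)   # fraction in [1/16, 1)
    mantissa = int(frac * (1 << 24))     # truncate to 24 bits
    return mantissa, e16

def significant_bits(x):
    """Number of mantissa bits that actually carry information."""
    mantissa, _ = to_hex_float(x)
    return mantissa.bit_length()

for value in (1.0, 2.0, 8.0, 15.0, 16.0):
    mantissa, e16 = to_hex_float(value)
    print(f"{value:5}: mantissa=0x{mantissa:06X} e16={e16} "
          f"bits={significant_bits(value)}")
```

The value 1.0 lands on mantissa 0x100000, where the top three bits of the leading hex digit are zero and only 21 bits are significant, while 8.0 and 15.0 use all 24 bits.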

Numerical programming was emerging as a significant discipline in computer science. Using clever tricks, numerical programmers could get much more accurate results from floating point operations than they were entitled to. The problem was that different computer architectures required different clever tricks. Floating point hardware was primarily designed to be fast, even if that meant loss of precision in some corner cases.

IEEE-754

Something had to be done to clear up this mess. Some good ideas:

Use a pure binary floating point format: 32 bits for single precision and 64 bits for double precision. Pure binary formats allow for an implicit leading one, so even though only 23 mantissa bits are stored, a 24-bit number in the range 2²³ to 2²⁴ – 1 can be represented. This format was used by Digital Equipment Corporation (DEC) in the PDP-11 and VAX machines. This turned out to be much better than IBM’s base-16 format.
Specify the exact value that must be returned for each operation. Do this with four different rounding modes that each implementation must support. No more stupid shortcuts to compromise accuracy.
Specify distinct values for +Infinity and –Infinity and a set of NaN (Not a Number) values for completely invalid results.
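The implicit leading one and the special values are easy to inspect from Python. The following is a small sketch that uses only the standard `struct` module; the function names `decode_float32` and `significand` are made up for this example. It splits a single precision number into its three stored fields and restores the hidden bit for normal numbers.

```python
import struct

def decode_float32(x):
    """Return (sign, exponent, fraction) as stored in an IEEE-754 single."""
    bits, = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 stored exponent bits
    fraction = bits & 0x7FFFFF       # 23 stored mantissa bits
    return sign, exponent, fraction

def significand(x):
    """24-bit significand of a normal number, implicit leading one restored."""
    _, exponent, fraction = decode_float32(x)
    if exponent in (0, 255):
        raise ValueError("zero, subnormal, infinity or NaN has no implicit one")
    return (1 << 23) | fraction

print(decode_float32(1.0))             # stored fraction is 0
print(hex(significand(1.0)))           # the hidden bit makes it 2**23
print(decode_float32(float("inf")))    # highest exponent, fraction 0
print(decode_float32(float("-inf")))   # same, with the sign bit set
print(decode_float32(float("nan")))    # highest exponent, nonzero fraction
```

The highest exponent value (255) is reserved: a zero fraction means infinity, and any nonzero fraction means NaN, which is why there is a whole set of NaN values rather than just one.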

One thing that was annoying on every floating point system before IEEE-754: there was a large gap between zero and the smallest positive floating point number. This gap was much larger than the gaps between numbers with the lowest exponent. For example, in a format like single precision there are millions of numbers between 2^-126 and 2^-125, but there would be no numbers between 0 and 2^-126. Therefore, A == B was not equivalent to A – B == 0. For example, 1.5×2^-126 – 2^-126 would evaluate to zero, even though the two numbers are different.

A solution to this problem would be to reserve the lowest possible exponent value for the value 0 and any numbers between 0 and 2^-126. These numbers were to be called subnormal numbers. This feature was costly to implement in hardware, so many hardware vendors (most notably DEC) were opposed to it, but the software guys won this one.
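The effect of subnormal numbers can be checked with the example values used above. This is a rough sketch: `to_float32` rounds a Python float (a double) to single precision by a round trip through `struct`, and `flush_to_zero` is a made-up stand-in for a system without subnormals, not a model of any specific historical machine.

```python
import struct

SMALLEST_NORMAL = 2.0 ** -126   # smallest normal single precision number

def to_float32(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def flush_to_zero(x):
    """Hypothetical system without subnormals: tiny results become zero."""
    return 0.0 if abs(x) < SMALLEST_NORMAL else x

a = to_float32(1.5 * SMALLEST_NORMAL)
b = to_float32(SMALLEST_NORMAL)
diff = to_float32(a - b)        # 2**-127, a subnormal in single precision

print(a == b)                   # the numbers are different
print(diff == 0.0)              # with subnormals, so is their difference from zero
print(flush_to_zero(diff) == 0.0)   # without subnormals, the difference vanishes
```

With subnormals, A == B and A – B == 0 agree again: the difference 2^-127 survives as a subnormal number instead of collapsing to zero.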

Work on IEEE-754 started in the 1970s, and although the standard was not released until 1985, Intel had already announced the 8087 floating point unit in 1980. This processor implemented the standard before it was published.
