Short roundup of IEEE-754 floating point numbers (32-bit) off the top of my head:
- 1 bit sign (0 means positive number, 1 means negative number)
- 8 bit exponent (stored with a bias of 127, i.e. stored value = actual exponent + 127, not important here)
- 23 bits “mantissa”
- With exceptions for the exponent values 0 and 255, you can calculate the value as:
(sign ? -1 : +1) * 2^exponent * (1.0 + mantissa)
- The mantissa bits represent binary digits after the decimal separator, e.g.
1001 0000 0000 0000 0000 000 = 2^-1 + 2^-4 = .5 + .0625 = .5625
and the value in front of the decimal separator is not stored but implicitly assumed as 1 (if the exponent is 0, it is assumed as 0, but that’s not important here), so for an exponent of 0, for instance, this mantissa example represents the value 1.5625
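You can check this decomposition yourself. Here is a small sketch using Python's standard `struct` module (the helper `float_bits` is just an illustrative name, not part of any library):

```python
import struct

def float_bits(x):
    """Unpack a number into the sign, exponent and mantissa fields of its 32-bit float representation."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 sign bit
    exponent = (bits >> 23) & 0xFF  # 8 exponent bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF      # 23 mantissa bits
    return sign, exponent, mantissa

sign, exponent, mantissa = float_bits(1.5625)
print(sign, exponent - 127, f"{mantissa:023b}")
# sign = 0, unbiased exponent = 0, mantissa bits = 1001 followed by 19 zeros
```

Reconstructing the value gives `(-1)**sign * 2**(exponent - 127) * (1.0 + mantissa / 2**23)`, which is exactly the formula above.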
Now to your example:
16777216 is exactly 2^24, and would be represented as a 32-bit float like so:
- sign = 0 (positive number)
- exponent = 24 (stored as 24+127 = 151 = 10010111)
- mantissa = .0
- As 32 bits floating-point representation:
0 10010111 00000000000000000000000
- Therefore: Value =
(+1) * 2^24 * (1.0 + .0) = 2^24 = 16777216
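You can confirm that bit pattern directly, again with the standard `struct` module:

```python
import struct

# Reinterpret the 32-bit float 16777216.0 as a raw unsigned integer
(bits,) = struct.unpack(">I", struct.pack(">f", 16777216.0))
print(f"{bits:032b}")
# 01001011100000000000000000000000
# = sign 0, exponent 10010111, mantissa all zeros
```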
Now let’s look at the number 16777217, or exactly 2^24 + 1:
- sign and exponent are the same
- mantissa would have to be exactly 2^-24 so that
(+1) * 2^24 * (1.0 + 2^-24) = 2^24 + 1 = 16777217
- And here’s the problem. The mantissa cannot have the value 2^-24 because it only has 23 bits (the smallest nonzero fraction it can hold is 2^-23), so the number 16777217 just cannot be represented with the accuracy of 32-bit floating-point numbers!
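You can see the effect by rounding through a 32-bit float (a sketch; `to_f32` is just an illustrative helper name):

```python
import struct

def to_f32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(to_f32(16777216.0))  # 16777216.0 -- exactly representable
print(to_f32(16777217.0))  # 16777216.0 -- the +1 is lost to rounding
print(to_f32(16777218.0))  # 16777218.0 -- the next representable value
```

Above 2^24, consecutive 32-bit floats are 2 apart, so only every other integer survives the round trip.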