[Interactive visualizer: a floating-point number, switchable between 32-bit and 64-bit, broken into its sign, exponent, and significand bit fields, with the resulting base-2 digits printed underneath. Shown here as a 32-bit float set to 1.25: sign bit 0 (+), exponent bits 01111111 giving 2^(127 - 127) = 2^0, significand bits 01000000000000000000000 giving 1 + 2097152 × 2^-23 = 1.25. The value is (1 + 2097152 × 2^-23) × 2^(127 - 127) = 10485760 × 2^-23 = 1.25, displayed in base 2 as ...0001.0100...]

What is going on here?


Floating-point numbers can seem intimidating at first, so here's a way to visualize what they actually represent:

Let's say we want to be able to use huge numbers like one billion trillion (1000000000000000000000)
and tiny numbers like one billionth of a trillionth (0.000000000000000000001).

If we were to use a fixed-point format, it would need to store hundreds or even thousands of bits for a single number.

Instead of wasting all that memory, we can store just the most significant digits and remember where those digits sit relative to the dot.

And that is exactly what floating-point numbers do.
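
If you'd like to see that decomposition in code, here is a minimal Python sketch (the function name is my own, not from the visualizer) that pulls a 32-bit float apart exactly the way the widget above does, using only the standard struct module:

    import struct

    def decompose(x):
        # Reinterpret the float's 32 bits as an unsigned integer.
        bits = struct.unpack(">I", struct.pack(">f", x))[0]

        sign = bits >> 31                # 1 bit
        exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
        fraction = bits & 0x7FFFFF       # 23 low bits of the significand

        # For normal numbers:
        # value = (-1)^sign * (1 + fraction * 2^-23) * 2^(exponent - 127)
        value = (-1) ** sign * (1 + fraction * 2 ** -23) * 2 ** (exponent - 127)
        print(f"sign={sign} exponent={exponent - 127:+d} fraction={fraction} value={value}")

    decompose(1.25)  # sign=0 exponent=+0 fraction=2097152 value=1.25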


Try it!


There's no better way to get familiar with something than to mess with it yourself.

If you try incrementing or decrementing the exponent, you will see that this simply shifts our digits along the resulting base-2 number.

And if you change the significand bits, you will see those changes mirrored directly in the base-2 result.
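
You can poke at the bits in code too. Here's a small sketch (the from_bits helper is hypothetical, not part of any library) that assembles a 32-bit float from raw field values and shows both effects:

    import struct

    def from_bits(sign, exponent, fraction):
        # Pack 1 sign bit, 8 exponent bits (already biased), 23 fraction bits.
        bits = (sign << 31) | (exponent << 23) | fraction
        return struct.unpack(">f", struct.pack(">I", bits))[0]

    # Incrementing the stored exponent moves the dot: the digits stay
    # the same, only their position changes.
    for e in (126, 127, 128, 129):
        print(from_bits(0, e, 0b01000000000000000000000))  # 0.625, 1.25, 2.5, 5.0

    # Flipping significand bits changes the base-2 digits directly.
    print(from_bits(0, 127, 0b10000000000000000000000))  # 1.5  (binary 1.1)
    print(from_bits(0, 127, 0b11000000000000000000000))  # 1.75 (binary 1.11)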


How useful is it?


This way we can use a 32-bit float to fake having a 279-bit number.
Or we can use a 64-bit float to fake having a 2100-bit number.
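
To make that concrete, here's a quick sketch round-tripping both of our earlier extremes through just 4 bytes each:

    import struct

    # One billion trillion and one billionth of a trillionth, each
    # squeezed into 4 bytes (at the cost of some rounding).
    for x in (1e21, 1e-21):
        packed = struct.pack(">f", x)
        print(x, "->", struct.unpack(">f", packed)[0])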

We do lose a lot of precision, which is why you sometimes simply can't use floats.
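
A couple of classic examples of that lost precision, this time with Python's own 64-bit floats:

    # 0.1 and 0.2 have no exact base-2 representation, so the sum is
    # only approximately 0.3.
    print(0.1 + 0.2)              # 0.30000000000000004
    print(0.1 + 0.2 == 0.3)      # False

    # A 64-bit float has a 53-bit significand, so past 2**53 it can no
    # longer tell consecutive integers apart.
    print(2.0 ** 53 + 1 == 2.0 ** 53)  # True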