[Interactive visualizer: a floating-point number, switchable between 32-bit and 64-bit, broken into its sign, exponent, and significand bit fields, with the resulting base-2 digits printed underneath. Shown here as a 32-bit float set to 1.25: sign bit 0 (+), exponent bits 01111111 giving 2^(127 - 127) = 2^0, significand bits 01000000000000000000000 giving 1 + 2097152 × 2^-23 = 1.25. The value is (1 + 2097152 × 2^-23) × 2^(127 - 127) = 10485760 × 2^-23 = 1.25, displayed in base 2 as ...0001.0100...]

What is going on here?


Floating-point numbers can seem intimidating at first, so here's a way to visualize what they actually represent:

Let's say we want to be able to use huge numbers like one billion trillion (1000000000000000000000)
and tiny numbers like one billionth of a trillionth (0.000000000000000000001).

If we were to use a fixed-point format, it would need to store hundreds or even thousands of bits for a single number.

Instead of wasting all that memory, we can store just the most significant digits and remember where those digits sit relative to the dot.

And that is exactly what floating-point numbers do.
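
If you'd like to see that decomposition in code, here is a minimal Python sketch (the function name is my own, not from the visualizer) that pulls a 32-bit float apart exactly the way the widget above does, using only the standard struct module:

    import struct

    def decompose(x):
        # Reinterpret the float's 32 bits as an unsigned integer.
        bits = struct.unpack(">I", struct.pack(">f", x))[0]

        sign = bits >> 31                # 1 bit
        exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
        fraction = bits & 0x7FFFFF       # 23 low bits of the significand

        # For normal numbers:
        # value = (-1)^sign * (1 + fraction * 2^-23) * 2^(exponent - 127)
        value = (-1) ** sign * (1 + fraction * 2 ** -23) * 2 ** (exponent - 127)
        print(f"sign={sign} exponent={exponent - 127:+d} fraction={fraction} value={value}")

    decompose(1.25)  # sign=0 exponent=+0 fraction=2097152 value=1.25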


Try it!


There's no better way to get familiar with something than to mess with it yourself.

If you try incrementing or decrementing the exponent, you will see that this simply shifts our digits along the resulting base-2 number.

And if you change the significand bits, you will see those changes mirrored directly in the base-2 result.
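
You can poke at the bits in code too. Here's a small sketch (the from_bits helper is hypothetical, not part of any library) that assembles a 32-bit float from raw field values and shows both effects:

    import struct

    def from_bits(sign, exponent, fraction):
        # Pack 1 sign bit, 8 exponent bits (already biased), 23 fraction bits.
        bits = (sign << 31) | (exponent << 23) | fraction
        return struct.unpack(">f", struct.pack(">I", bits))[0]

    # Incrementing the stored exponent moves the dot: the digits stay
    # the same, only their position changes.
    for e in (126, 127, 128, 129):
        print(from_bits(0, e, 0b01000000000000000000000))  # 0.625, 1.25, 2.5, 5.0

    # Flipping significand bits changes the base-2 digits directly.
    print(from_bits(0, 127, 0b10000000000000000000000))  # 1.5  (binary 1.1)
    print(from_bits(0, 127, 0b11000000000000000000000))  # 1.75 (binary 1.11)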


How useful is it?


This way we can use a 32-bit float to fake having a 279-bit number.
Or we can use a 64-bit float to fake having a 2100-bit number.
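
To make that concrete, here's a quick sketch round-tripping both of our earlier extremes through just 4 bytes each:

    import struct

    # One billion trillion and one billionth of a trillionth, each
    # squeezed into 4 bytes (at the cost of some rounding).
    for x in (1e21, 1e-21):
        packed = struct.pack(">f", x)
        print(x, "->", struct.unpack(">f", packed)[0])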

We do lose a lot of precision, which is why you sometimes simply can't use floats.
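
A couple of classic examples of that lost precision, this time with Python's own 64-bit floats:

    # 0.1 and 0.2 have no exact base-2 representation, so the sum is
    # only approximately 0.3.
    print(0.1 + 0.2)              # 0.30000000000000004
    print(0.1 + 0.2 == 0.3)      # False

    # A 64-bit float has a 53-bit significand, so past 2**53 it can no
    # longer tell consecutive integers apart.
    print(2.0 ** 53 + 1 == 2.0 ** 53)  # True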