Floating Point
Basic structure
Floating-point numbers are represented in 3 parts: a sign bit, exponent bits, and fractional bits
(referred to here as the mantissa, though the two are slightly different, as explained later). A 32-bit float
has 1 sign bit, followed by 8 exponent bits, followed by 23 fractional bits.
However, floats need not follow this exact format. The following instructions work for exponent/fractional
parts of any size. This would include 64-bit floats (1 sign bit, 11 exponent bits, and 52 fractional bits), or
even imaginary formats (like a made-up format that has 1 sign bit, 4 exponent bits, and 5 fractional bits).
One way to think of this structure is as scientific notation, but for binary - the
fractional bits hold the significant digits, and the exponent bits give the bit shift
used to scale the number by a power of 2.
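Normalized Example: 0 01111111 11000000000000000000000 (sign bit, exponent bits, fractional bits)
1. Find what the exponent bits equal if interpreted as an unsigned int. We will call this e. In the above
example, e = 01111111 = 127.
2. Find the bias, 2^(number of exponent bits − 1) − 1. In the example, bias = 2^(8−1) − 1 = 127.
3. Find the actual exponent (the exponent bits with the bias removed). We will call this E. E = e − bias. In the example, E = 127 − 127 = 0.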
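4. Find the value of the fractional bits, f. The first fractional bit is worth 2^(−1), the second 2^(−2), and so on down to 2^(−23):
bits:    1        1        0        0        ...  0
weights: 2^(−1)   2^(−2)   2^(−3)   2^(−4)   ...  2^(−23)
f = 2^(−1) + 2^(−2) = 0.75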
5. Find the mantissa/significand M. For normalized values, this is M = f + 1 (the added 1 is known as the implicit
1). In our example, M = 0.75 + 1 = 1.75.
6. Find the sign of the number, as denoted by the sign bit. If it is 0, then the number is positive. If it
is 1, then the number is negative. This is expressed by taking (−1)^s, where s is the sign bit.
7. Finally, we put all of this together with the equation (−1)^s ∗ 2^E ∗ M. In our example:
(−1)^s ∗ 2^E ∗ M
= (−1)^0 ∗ 2^0 ∗ 1.75
= 1 ∗ 1 ∗ 1.75
= 1.75
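The steps above can be sketched in Python. This is a minimal sketch rather than a definitive implementation: the function name decode_normalized and the choice to pass each field as a string of 0s and 1s are assumptions made for illustration, and it only handles normalized values.

```python
def decode_normalized(sign_bit: str, exponent_bits: str, fraction_bits: str) -> float:
    """Decode a normalized float from its three bit fields (strings of 0s and 1s)."""
    e = int(exponent_bits, 2)                    # exponent bits as an unsigned int
    bias = 2 ** (len(exponent_bits) - 1) - 1     # 127 when there are 8 exponent bits
    E = e - bias                                 # actual exponent
    # Fractional bit number i (counting from 1) is worth 2^(-i).
    f = sum(int(bit) * 2.0 ** -i for i, bit in enumerate(fraction_bits, start=1))
    M = 1 + f                                    # implicit leading 1
    s = int(sign_bit)
    return (-1) ** s * 2.0 ** E * M


# The example from the text: 0 01111111 1100...0
print(decode_normalized("0", "01111111", "11" + "0" * 21))  # 1.75
```

Because the bias is computed from the length of the exponent field, the same function works for other field sizes, as long as the value is normalized.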
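When the exponent bits are all zeroes, the number is denormalized.
Denormalized Example: 0 00000000 11000000000000000000000
1. The bias is found the same way as before, 2^(number of exponent bits − 1) − 1. In this example, bias =
2^(8−1) − 1 = 127.
2. For denormalized numbers, E is fixed at 1 − bias. In this example, E = 1 − 127 = −126.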
3. Find the value of the fractional bits, same as with normalized. The fractional bits in this example
are the same as the normalized example, so we’ll skip calculation. f = 0.75.
4. However, the mantissa for denormalized numbers does not have an implicit 1. Therefore M = f =
0.75.
5. Finally, we put all of this together with the same equation as before - (−1)^s ∗ 2^E ∗ M. In this example:
(−1)^s ∗ 2^E ∗ M
= (−1)^0 ∗ 2^(−126) ∗ 0.75
= 1 ∗ 2^(−126) ∗ 0.75
≈ 8.816 ∗ 10^(−39), a really, really small number.
Denormalized numbers are used to represent really really small numbers, as you can see here.
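As a cross-check on the denormalized example, Python's standard struct module can reinterpret the 32 bits as a float. This is only a sketch; the variable names are illustrative, and the bit string assumes the example's sign bit is 0, its exponent bits are all zeroes, and its fractional bits are 1100...0.

```python
import struct

# sign (1 bit) + exponent (8 bits, all zero) + fraction (23 bits)
bits = "0" + "00000000" + "11" + "0" * 21
raw = int(bits, 2).to_bytes(4, byteorder="big")
(value,) = struct.unpack(">f", raw)       # ">f" = big-endian 32-bit float

print(value)                               # roughly 8.816e-39
print(value == 0.75 * 2.0 ** -126)         # True: matches 2^E * M with E = -126, M = 0.75
```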
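When the exponent bits are all ones, the value is neither normalized nor denormalized.
If the fractional bits are all zeroes, then the number is infinity. The sign bit determines whether it is positive or
negative infinity, in the same way it does with ordinary numbers.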
Example: 0 11111111 00000000000000000000000 = Infinity
Example: 1 11111111 00000000000000000000000 = -Infinity
If there is anything at all in the fractional part, then it’s NaN (not a number).
Example: 0 11111111 00000000000000000000001 = NaN
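The three examples above can be checked the same way. This is a small sketch under the assumption that the bit strings are 32-bit floats written sign, exponent, fraction; bits_to_float32 is a helper name made up for this illustration.

```python
import math
import struct


def bits_to_float32(bits: str) -> float:
    """Interpret a string of 32 zeroes and ones (spaces allowed) as a 32-bit float."""
    bits = bits.replace(" ", "")
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]


pos_inf = bits_to_float32("0 11111111 00000000000000000000000")
neg_inf = bits_to_float32("1 11111111 00000000000000000000000")
nan = bits_to_float32("0 11111111 00000000000000000000001")

print(pos_inf == math.inf)    # True
print(neg_inf == -math.inf)   # True
print(math.isnan(nan))        # True
```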
How to turn a decimal number into one of these
Normalized Example: Represent 5.375 in binary with the IEEE floating-point standard for a 32-bit float.
1. If the number is negative, remove the negative before proceeding and remember to set the sign bit
to 1. Our number isn’t negative, so the sign bit s will be 0.
2. Separate the number into two parts: the whole number part and the decimal part. In our example,
the whole number part is 5 and the decimal part is 0.375.
3. Represent the whole number part in binary. Do this however you normally do. In this example,
that’s 101.
4. Represent the decimal part in binary. One way to do this is to multiply the decimal part by 2
repeatedly. Whenever the result is greater than or equal to 1, mark down a 1 and subtract 1 before continuing
to multiply. Otherwise, mark down a 0. End when nothing is left over (the result is exactly 1). This is more easily illustrated
with an example:
0.375 * 2 = 0.75 → mark down 0
0.75 * 2 = 1.5 → mark down 1, subtract 1 to get 0.5
0.5 * 2 = 1 → mark down 1, nothing is left, so stop
Thus 0.375 in binary is 0.011 (the bits after the point are 011).
5. Put the two parts back together again, this time in binary decimal point form: 101.011.
6. Move the decimal point to just after the leftmost 1. Record how many places you moved it -
that’s the exponent of 2 you multiplied by.
In our example, after moving our decimal point to the correct position we would have 1.01011. We
moved the decimal point left by 2, so we multiply by 2^2. Therefore we have that 101.011 = 1.01011 ∗ 2^2.
This is basically scientific form, but for binary. It works in the other direction, too: 0.0000101 =
1.01 ∗ 2^(−5) (nothing to do with the example, just a demonstration).
7. The power of 2 you multiplied by is the exponent, E, from back when we were turning binary into
decimal. The number in front is the mantissa, M. We can now understand the original number as
(−1)^0 ∗ 2^2 ∗ 1.01011 = (−1)^s ∗ 2^E ∗ M. So E = 2 and M = 1.01011.
8. Now what remains is to do what we used to do to achieve E and M , but in reverse in order to get e
and f .
a. We know that E = e − bias.
We know that bias = 2^(# exp bits − 1) − 1 = 2^(8−1) − 1 = 127.
Therefore 2 = e − 127
Therefore e = 2 + 127
e = 129
In binary, that’s 10000001.
b. Now that we have found e, we can check to see if the number is normalized. If e > 0, then it’s
normalized. If not, then it’s denormalized. In our case it’s normalized.
We know that for normalized values, M = 1 + f .
Therefore, 1.01011 = 1 + f
Therefore, f = 1.01011 − 1
f = 0.01011
f is already in binary form, so no conversion is needed. The fractional part will begin with
01011.
9. Now that we have the sign bit, the exponent bits (e), and the fractional bits (f padded with zeroes to 23 bits), we can
put it all together:
5.375 = 0 10000001 01011000000000000000000
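The whole procedure for 5.375 can be sketched in Python and compared against the bits Python itself produces. This is a rough sketch, not a general encoder: fraction_to_bits and the variable names are made up for this illustration, and the code assumes a positive, normalized number whose whole part is at least 1.

```python
import struct


def fraction_to_bits(fraction: float, max_bits: int = 23) -> str:
    """Repeatedly double the decimal part; mark 1 whenever the result reaches 1."""
    bits = ""
    while fraction != 0 and len(bits) < max_bits:
        fraction *= 2
        if fraction >= 1:
            bits += "1"
            fraction -= 1
        else:
            bits += "0"
    return bits


whole_bits = bin(5)[2:]                  # step 3: "101"
frac_bits = fraction_to_bits(0.375)      # step 4: "011"
digits = whole_bits + frac_bits          # step 5: digits of 101.011, point removed
E = len(whole_bits) - 1                  # step 6: point moves left by 2
e = E + 127                              # step 8a: add the bias back
f = digits[1:].ljust(23, "0")            # step 8b: drop the implicit 1, pad to 23 bits
encoded = "0" + format(e, "08b") + f     # step 9: sign, exponent, fraction

# Cross-check against the bytes struct packs for a 32-bit float.
expected = format(int.from_bytes(struct.pack(">f", 5.375), "big"), "032b")
print(encoded)
print(encoded == expected)               # True
```

Note that fraction_to_bits stops after max_bits bits, since many decimal fractions (0.1, for instance) never reach a remainder of exactly 0.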
Non-Standard Example: Represent -1.625 in binary with an imaginary floating-point format based off the
IEEE Standard that has 1 sign bit, 4 exponent bits, and 3 fractional bits.
As we said at the start, all steps apply no matter the number of exponent/fractional bits. Remember
that the bias changes, though, based on the number of exponent bits!
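1. Find sign bit: Our number is negative, so the sign bit will be 1.
2. Separate whole and decimal parts: The whole number part is 1 and the decimal part is 0.625.
3. Whole number part in binary: 1.
4. Decimal part in binary: 0.625 = 2^(−1) + 2^(−3), so the bits are 101.
5. Put the parts together: 1.101.
6. Move decimal point and convert to scientific form: It’s already in the correct form, so 1.101 ∗ 2^0.
7. Find E and M: 2^0 ∗ 1.101 = 2^E ∗ M. Thus E = 0 and M = 1.101.
8. Find e and f:
a. E = e − bias
bias = 2^(# exp bits − 1) − 1 = 2^3 − 1 = 7
Thus 0 = e − 7
Thus e = 7
In binary, that’s 0111
b. e is greater than 0, so the number’s normalized.
For normalized floats: M = 1 + f
Thus 1.101 = 1 + f
Thus f = 0.101
The fractional part will begin with 101
9. Put it all together:
−1.625 = 1 0111 101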
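To check a result in a made-up format, the decoding rules from earlier can be written for any field sizes. This is a minimal sketch; the function name decode and its parameters exp_bits and frac_bits are assumptions for illustration. Here it is applied to the fields found above (sign 1, exponent 0111, fraction 101).

```python
def decode(bits: str, exp_bits: int, frac_bits: int) -> float:
    """Decode 'sign exponent fraction' bits for a format with the given field sizes."""
    bits = bits.replace(" ", "")
    sign = int(bits[0])
    e = int(bits[1:1 + exp_bits], 2)
    f = int(bits[1 + exp_bits:1 + exp_bits + frac_bits], 2) / 2 ** frac_bits
    bias = 2 ** (exp_bits - 1) - 1

    if e == 2 ** exp_bits - 1:           # exponent all ones: infinity or NaN
        return float("nan") if f else (-1) ** sign * float("inf")
    if e == 0:                           # exponent all zeroes: denormalized
        E, M = 1 - bias, f
    else:                                # normalized: implicit leading 1
        E, M = e - bias, 1 + f
    return (-1) ** sign * 2.0 ** E * M


print(decode("1 0111 101", 4, 3))        # -1.625
```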
Now we look at representing a denormalized number. The reason we use a non-standard format is that
32-bit denormalized numbers are REALLY small and hard to work with without clogging absolutely everything
up. Our example is 0.0029296875, in an imaginary format with 1 sign bit, 4 exponent bits, and 4 fractional
bits. We won’t easily know the number’s denormalized until we find e, so everything we do up to
that point will be the same as normal.
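1. Find sign bit: Our number is positive, so the sign bit will be 0.
2. Separate whole and decimal parts: The whole number part is 0 and the decimal part is 0.0029296875.
3. Whole number part in binary: 0.
4. Decimal part in binary: 0.0029296875 = 2^(−9) + 2^(−10), so the bits are 0000000011.
5. Put the parts together: 0.0000000011.
6. Move decimal point and convert to scientific form: We move the point right by 9, so 0.0000000011 = 1.1 ∗ 2^(−9).
7. Find E and M: 2^(−9) ∗ 1.1 = 2^E ∗ M (ignoring sign bit for brevity). Thus E = −9 and M = 1.1.
8. Find e and f: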
a. E = e − bias
bias = 2^(4−1) − 1 = 7
Thus −9 = e − 7
Thus e = −2
But now we’ve reached a problem: The exponent bits are unsigned and thus can’t be negative.
What do we do? This is why denormalized exists.
When e ≤ 0 (as it is now), the number is denormalized. Under this condition, E now becomes
1 − bias.
Therefore, E = 1 − 7 = −6
The exponent bits will all be set to 0 (signifies denormalized).
b. Since E now equals −6 instead of the required −9, we must redo the fractional part to accommodate.
Think of it this way - before we had 0.0000000011 = 1.1 ∗ 2^(−9). Now we must get it in the
format x ∗ 2^(−6).
Since 0.0000000011 = 1.1 ∗ 2^(−9):
We can shift the decimal point in the mantissa left by 3 (think of it like "applying" −3 of the
powers of 2): 0.0000000011 = 0.0011 ∗ 2^(−6)
Therefore M = 0.0011.
For a denormalized float: M = f (no implicit 1)
Thus f = 0.0011. The fractional bits will begin with 0011.
9. Finally, let’s put it all together:
0.0029296875 = 0 0000 0011
Tada! 0.0029296875 represented with our imaginary floating-point format is 000000011
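The encoding steps, including the switch to denormalized when e ≤ 0, can be sketched for any field sizes. This is a hedged sketch, not a full IEEE encoder: the function name encode and its parameters are assumptions for illustration, it uses exact fractions instead of the by-hand doubling method, and it does no rounding (it raises an error when the value does not fit exactly).

```python
from fractions import Fraction


def encode(value: float, exp_bits: int, frac_bits: int) -> str:
    """Encode a finite, exactly representable value as 'sign exponent fraction'."""
    sign = "1" if value < 0 else "0"
    x = Fraction(abs(value))
    bias = 2 ** (exp_bits - 1) - 1

    # Scientific form: find E such that 1 <= x / 2^E < 2.
    E = 0
    while x != 0 and x / Fraction(2) ** E >= 2:
        E += 1
    while x != 0 and x / Fraction(2) ** E < 1:
        E -= 1

    e = E + bias
    if x == 0 or e <= 0:
        # Denormalized: exponent bits all zero, E pinned at 1 - bias, M = f.
        e = 0
        f = x / Fraction(2) ** (1 - bias)
    else:
        # Normalized: M = 1 + f.
        f = x / Fraction(2) ** E - 1

    scaled = f * 2 ** frac_bits          # f written as an integer of frac_bits bits
    if scaled.denominator != 1 or e >= 2 ** exp_bits - 1:
        raise ValueError("value is not exactly representable in this format")
    return f"{sign} {e:0{exp_bits}b} {int(scaled):0{frac_bits}b}"


print(encode(0.0029296875, 4, 4))        # 0 0000 0011
print(encode(-1.625, 4, 3))              # 1 0111 101
```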