
Floating Point

This document explains how floating point numbers are represented and how to convert between decimal and binary floating point formats. Floating point numbers use a sign bit, exponent bits, and fraction bits. The exponent indicates the power of two to scale the fraction, which represents a binary number less than one. Normalized values have an implied leading one, while denormalized values do not. Special cases like infinity and NaN are also covered.

Uploaded by

Sir Bob
Copyright
© © All Rights Reserved

How To Floating Point

Basic structure

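Floating-point numbers are represented in 3 parts: a sign bit, exponent bits, and fractional bits
(denoted here as the mantissa, although the two are slightly different, as explained later). A 32-bit float
has 1 sign bit, 8 exponent bits, and 23 fractional bits, in that order:

sign (1 bit) | exponent (8 bits) | fraction (23 bits)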

However, floats need not follow this exact format. The following instructions work for exponent/fractional
parts of any size. This would include 64-bit floats (1 sign bit, 11 exponent bits, and 52 fractional bits), or
even imaginary formats (like a made-up format that has 1 sign bit, 4 exponent bits, and 5 fractional bits).
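
To see the three parts of a real 32-bit float, here is a minimal Python sketch. The helper name `float32_fields` is made up for illustration; it relies only on the standard `struct` module and assumes the usual 1/8/23 split.

```python
import struct

def float32_fields(x):
    """Return the (sign, exponent, fraction) bit strings of x stored as a 32-bit float."""
    # '>f' packs x as a big-endian IEEE-754 single; '>I' re-reads the same 4 bytes as an unsigned int.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(raw, "032b")          # 32 characters, most significant bit first
    return bits[0], bits[1:9], bits[9:]  # 1 sign bit, 8 exponent bits, 23 fractional bits

print(float32_fields(1.75))  # ('0', '01111111', '11000000000000000000000')
```

The same slicing idea carries over to other field widths; only the slice boundaries change.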

Why like this??

One way to think of this structure is as scientific notation, but for binary: the fractional bits hold
the significant digits, and the exponent bits say how far to shift them, which scales the number by the
desired power of two.

How to turn one of these into a decimal number

There are 3 scenarios, determined by the contents of the exponent bits.

If exponent section has mix of zeroes and ones (normalized)

Example: 0 01111111 11000000000000000000000

1. Find what the exponent bits equal if interpreted as an unsigned int. We will call this e. In the above
example, e = 01111111 = 127.

2. Find the bias: $\text{bias} = 2^{(\text{number of exponent bits}) - 1} - 1$. In the above example, $\text{bias} = 2^{8-1} - 1 = 127$.

3. Find the unbiased exponent. We will call this E. E = e − bias. In the example, E = 127 − 127 = 0.

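4. Find the value of the fractional bits. We will call this f. The i-th fractional bit, counting from the
left and starting at 1, has place value $2^{-i}$:

bit:          1          1          0          0          ...   0
place value:  $2^{-1}$   $2^{-2}$   $2^{-3}$   $2^{-4}$   ...   $2^{-23}$

$f = 2^{-1} + 2^{-2} = 0.75$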

5. Find the mantissa/significand M . For normalized values, this is M = f + 1 (known as the implicit
1). In our example, M = 0.75 + 1 = 1.75.

6. Find the sign of the number, as denoted by the sign bit. If it is 0, then the number is positive. If it
is 1, then the number is negative. This is captured by the factor $(-1)^s$, where s is the sign bit.

7. Finally, we put all of this together with the equation $(-1)^s \cdot 2^E \cdot M$. In our example:
$(-1)^s \cdot 2^E \cdot M$
$= (-1)^0 \cdot 2^0 \cdot 1.75$
$= 1 \cdot 1 \cdot 1.75$
$= 1.75$

Tada! The binary floating-point number 00111111111000000000000000000000 equals 1.75.
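
The seven steps above can be sketched in Python roughly as follows. The function name `decode_normalized` and its three-string interface are assumptions made for this illustration, and it assumes the caller has already checked that the exponent bits are a mix of zeroes and ones.

```python
def decode_normalized(sign_bit, exp_bits, frac_bits):
    """Decode a normalized value from its three bit strings (any field widths)."""
    e = int(exp_bits, 2)                         # step 1: exponent bits as an unsigned int
    bias = 2 ** (len(exp_bits) - 1) - 1          # step 2
    E = e - bias                                 # step 3
    # step 4: the i-th fractional bit (starting at 1) is worth 2^-i
    f = sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(frac_bits))
    M = 1 + f                                    # step 5: implicit leading 1
    s = int(sign_bit)                            # step 6
    return (-1) ** s * 2.0 ** E * M              # step 7

print(decode_normalized("0", "01111111", "11" + "0" * 21))  # 1.75
```

Because the bias is computed from the length of the exponent string, the same sketch also works for made-up formats such as 1 sign bit, 4 exponent bits, and 3 fractional bits.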

If exponent section is all 0 (denormalized)

Example: 0 00000000 11000000000000000000000

1. The bias is found the same way as before: $2^{(\text{number of exponent bits}) - 1} - 1$. In this example, $\text{bias} = 2^{8-1} - 1 = 127$.

2. The exponent E is now fixed at 1 − bias. In this example, E = 1 − 127 = −126.

3. Find the value of the fractional bits, same as with normalized. The fractional bits in this example
are the same as the normalized example, so we’ll skip calculation. f = 0.75.

4. However, the mantissa for denormalized numbers does not have an implicit 1. Therefore M = f =
0.75.

5. Finally, we put all of this together with the same equation as before, $(-1)^s \cdot 2^E \cdot M$. In this example:
$(-1)^s \cdot 2^E \cdot M$
$= (-1)^0 \cdot 2^{-126} \cdot 0.75$
$= 1 \cdot 2^{-126} \cdot 0.75$
= A really, really small number (roughly $8.8 \times 10^{-39}$) whose exact decimal expansion is too long to put here.

Denormalized numbers are used to represent really really small numbers, as you can see here.
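
A small Python sketch of the denormalized steps, cross-checked against the machine's own 32-bit float via the standard `struct` module. The name `decode_denormalized` is hypothetical, and the function assumes the exponent bits are all zero.

```python
import struct

def decode_denormalized(sign_bit, exp_bits, frac_bits):
    """Decode a denormalized value (exponent bits all 0) from its bit strings."""
    bias = 2 ** (len(exp_bits) - 1) - 1
    E = 1 - bias                                  # fixed exponent for denormalized values
    # No implicit 1 here: the mantissa is just the fractional bits.
    M = sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(frac_bits))
    return (-1) ** int(sign_bit) * 2.0 ** E * M

value = decode_denormalized("0", "00000000", "11" + "0" * 21)

# Cross-check: reinterpret the same 32 bits as an IEEE-754 single.
raw = int("0" + "00000000" + "11" + "0" * 21, 2)
(reference,) = struct.unpack(">f", struct.pack(">I", raw))
print(value == reference)  # True
```

Printing `value` itself shows the number in scientific notation, which is about the only readable way to look at it.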

If exponent section is all 1 (special)

• If the fractional bits are all zeroes, then the number is infinity. The sign bit determines whether it is
positive or negative infinity, in the same way it does with ordinary numbers.
Example: 0 11111111 00000000000000000000000 = Infinity
Example: 1 11111111 00000000000000000000000 = -Infinity

• If there is anything at all in the fractional part, then it's NaN (not a number).
Example: 0 11111111 00000000000000000000001 = NaN
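
As a quick check of the special cases, the sketch below feeds the three example bit patterns to Python's standard `struct` and `math` modules. The helper `bits_to_float32` is a name invented for this illustration.

```python
import math
import struct

def bits_to_float32(bit_string):
    """Reinterpret a 32-character bit string (spaces allowed) as an IEEE-754 single."""
    raw = int(bit_string.replace(" ", ""), 2)
    (value,) = struct.unpack(">f", struct.pack(">I", raw))
    return value

pos_inf = bits_to_float32("0 11111111 00000000000000000000000")
neg_inf = bits_to_float32("1 11111111 00000000000000000000000")
nan = bits_to_float32("0 11111111 00000000000000000000001")

print(math.isinf(pos_inf), math.isinf(neg_inf), math.isnan(nan))  # True True True
```

Note that NaN never compares equal to anything, including itself, which is why the check uses `math.isnan` rather than `==`.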

How to turn a decimal number into one of these

Normalized Example: Represent 5.375 in binary with the IEEE floating-point standard for a 32-bit float.

1. If the number is negative, drop the negative sign before proceeding and remember to set the sign bit
to 1. Our number isn't negative, so the sign bit s will be 0.
2. Separate the number into two parts: the whole number part and the decimal part. In our example,
the whole number part is 5 and the decimal part is 0.375.
3. Represent the whole number part in binary. Do this however you normally do. In this example,
that’s 101.
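4. Represent the decimal part in binary. One way to do this is to multiply the decimal part by 2
repeatedly. Whenever the result is greater than or equal to 1, mark down a 1 and subtract 1 before
continuing to multiply. Otherwise, mark down a 0. End when the result is exactly 1, so that nothing is
left over. (Some decimals, such as 0.1, never get there; their binary expansion repeats forever and has
to be cut off.) This is more easily illustrated with an example:
0.375 * 2 = 0.75 (mark 0)
0.75 * 2 = 1.5 (mark 1, continue with 0.5)
0.5 * 2 = 1 (mark 1, done)
Thus 0.375 in binary is 0.011.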
5. Put the two parts back together again, this time as a binary number with a point: 101.011.
6. Move the point to just after the leftmost 1. Record how many places you moved it;
that's the exponent of 2 you multiplied by.
In our example, after moving the point to the correct position we would have 1.01011. We
moved the point left by 2, so we multiply by $2^2$. Therefore $101.011 = 1.01011 \cdot 2^2$.
This is basically scientific notation, but for binary. It works in the other direction, too:
$0.0000101 = 1.01 \cdot 2^{-5}$ (nothing to do with the example, just a demonstration).
7. The power of 2 you multiplied by is the exponent, E, from back when we were turning binary into
decimal. The number in front is the mantissa, M. We can now understand the original number as
$(-1)^0 \cdot 2^2 \cdot 1.01011 = (-1)^s \cdot 2^E \cdot M$. So E = 2 and M = 1.01011.
8. Now what remains is to do what we used to do to achieve E and M , but in reverse in order to get e
and f .
a. We know that E = e − bias.
We know that $\text{bias} = 2^{(\text{\# exp bits}) - 1} - 1 = 2^{8-1} - 1 = 127$.
Therefore 2 = e − 127
Therefore e = 2 + 127
e = 129
In binary, that's 10000001.
b. Now that we have found e, we can check to see if the number is normalized. If e > 0, then it’s
normalized. If not, then it’s denormalized. In our case it’s normalized.
We know that for normalized values, M = 1 + f .
Therefore, 1.01011 = 1 + f
Therefore, f = 1.01011 − 1
f = 0.01011
f is already in binary form, so no conversion is needed. The fractional part will begin with
01011.

9. Now that we have the sign bit, the exponent bits e, and the fractional bits, we can
put it all together:
5.375 = 0 10000001 01011000000000000000000

Tada! 5.375 represented in IEEE-754 floating-point standard is 01000000101011000000000000000000.
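
The multiply-by-2 method from step 4 can be sketched in Python as below. The function name `fraction_to_binary` and the `max_bits` cutoff are assumptions added for this illustration; the cutoff exists because some decimal fractions never terminate in binary.

```python
def fraction_to_binary(frac, max_bits=32):
    """Binary digits of frac (0 <= frac < 1) after the point, by repeated doubling."""
    bits = ""
    while frac != 0 and len(bits) < max_bits:
        frac *= 2
        if frac >= 1:        # crossed 1: mark a 1 and drop the whole part
            bits += "1"
            frac -= 1
        else:                # still below 1: mark a 0
            bits += "0"
    return bits

whole, frac = 5, 0.375
binary = format(whole, "b") + "." + fraction_to_binary(frac)
print(binary)  # 101.011
```

For a value like 0.1 the loop only stops because of `max_bits`, so the returned digits are a truncated approximation rather than an exact expansion.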

Non-Standard Example: Represent -1.625 in binary with an imaginary floating-point format based off the
IEEE Standard that has 1 sign bit, 4 exponent bits, and 3 fractional bits.

As we said at the start, all steps apply no matter the number of exponent/fractional bits. Remember
that the bias changes, though, based on the number of exponent bits!

1. Find sign bit: Our number is negative, so the sign bit will be 1.

2. Separate: Whole part is 1, and the decimal part is 0.625.

3. Represent whole part in binary: 1 in binary is just 1.

4. Decimal part in binary:


0.625 * 2 = 1.25 (mark 1, continue with 0.25)
0.25 * 2 = 0.5 (mark 0)
0.5 * 2 = 1 (mark 1, done)
Thus 0.625 in binary is 0.101.

5. Put the parts back together: 1.101.

6. Move decimal point and convert to scientific form: It's already in the correct form, so $1.101 \cdot 2^0$.

7. Find E and M: $(-1)^1 \cdot 2^0 \cdot 1.101 = (-1)^s \cdot 2^E \cdot M$. Thus E = 0 and M = 1.101.

8. Reverse E and M to get e and f:

a. E = e − bias
$\text{bias} = 2^{(\text{\# exp bits}) - 1} - 1 = 2^3 - 1 = 7$
Thus 0 = e − 7
Thus e = 7
In binary, that's 0111
b. e is greater than 0, so the number’s normalized.
For normalized floats: M = 1 + f
Thus 1.101 = 1 + f
Thus f = 0.101
The fractional part will begin with 101

9. Putting it all together:


-1.625 = 1 0111 101

Tada! -1.625 represented with our imaginary floating-point format is 10111101.
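
Here is a rough Python sketch of the normalized encoding steps for any field widths. The function `encode_normalized` is hypothetical, and it makes simplifying assumptions: the input is nonzero, it falls in the normalized range, and any fractional bits that do not fit are simply cut off rather than rounded the way real hardware would.

```python
def encode_normalized(x, exp_bits, frac_bits):
    """Encode x as 'sign exponent fraction' in a 1/exp_bits/frac_bits format."""
    if x == 0:
        raise ValueError("zero is not a normalized value")
    s = "1" if x < 0 else "0"
    M, E = abs(x), 0
    # Slide the point until M is in [1, 2), tracking the power of two.
    while M >= 2:
        M /= 2
        E += 1
    while M < 1:
        M *= 2
        E -= 1
    bias = 2 ** (exp_bits - 1) - 1
    e = E + bias                                  # reverse of E = e - bias
    if not 0 < e < 2 ** exp_bits - 1:
        raise ValueError("outside the normalized range of this format")
    f = M - 1                                     # reverse of M = 1 + f
    frac = ""
    for _ in range(frac_bits):                    # multiply-by-2 method, truncating
        f *= 2
        if f >= 1:
            frac += "1"
            f -= 1
        else:
            frac += "0"
    return s + " " + format(e, "b").zfill(exp_bits) + " " + frac

print(encode_normalized(-1.625, 4, 3))  # 1 0111 101
```

Calling it with `exp_bits=8, frac_bits=23` reproduces the 5.375 example from the previous section.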

Denormalized Example (Non-Standard): Represent 0.0029296875 with an imaginary floating-point format
based off the IEEE standard that has 1 sign bit, 4 exponent bits, and 4 fractional bits.

Now we look at representing a denormalized number. The reason we use a non-standard format is because
32-bit denormalized numbers are REALLY small and hard to work with without clogging absolutely
everything up. We won't easily know the number's denormalized until we find e, so everything we do up to
that point will be the same as normal.

1. Find sign bit: Our number is positive, so the sign bit will be 0.

2. Separate: Whole part is 0, decimal part is 0.0029296875.

3. Whole part in binary: 0 in binary is 0.

4. Decimal part in binary:


0.0029296875 * 2 = 0.005859375
0.005859375 * 2 = 0.01171875
0.01171875 * 2 = 0.0234375
0.0234375 * 2 = 0.046875
0.046875 * 2 = 0.09375
0.09375 * 2 = 0.1875
0.1875 * 2 = 0.375
0.375 * 2 = 0.75
0.75 * 2 = 1.5
0.5 * 2 = 1
Thus 0.0029296875 in binary is 0.0000000011.

5. Put the parts back together: 0.0000000011.

6. Move decimal point and convert to scientific: $0.0000000011 = 1.1 \cdot 2^{-9}$.

7. Find E and M: $2^{-9} \cdot 1.1 = 2^E \cdot M$ (ignoring sign bit for brevity). Thus E = -9 and M = 1.1.

8. Reverse E and M to get e and f:

a. E = e − bias
$\text{bias} = 2^{4-1} - 1 = 7$
Thus −9 = e − 7
Thus e = −2

But now we've reached a problem: The exponent bits are unsigned and thus can't be negative.
What do we do? This is why denormalized numbers exist.

When e ≤ 0 (as it is now), the number is denormalized. Under this condition, E now becomes
1 − bias.
Therefore, E = 1 − 7 = −6
The exponent bits will all be set to 0 (signifies denormalized).
b. Since E now equals -6 instead of the required -9, we must redo the fractional part to accommodate.
Think of it this way: before we had $0.0000000011 = 1.1 \cdot 2^{-9}$. Now we must get it in the
format $x \cdot 2^{-6}$.
Since $0.0000000011 = 1.1 \cdot 2^{-9}$:
We can shift the point in the mantissa left by 3 (think of it like "applying" -3 of the
powers of 2): $0.0000000011 = 0.0011 \cdot 2^{-6}$
Therefore M = 0.0011.
For a denormalized float: M = f (no implicit 1)
Thus f = 0.0011. The fractional bits will begin with 0011.

9. Finally, let’s put it all together:
0.0029296875 = 0 0000 0011
Tada! 0.0029296875 represented with our imaginary floating-point format is 000000011
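
The denormalized case can be sketched in Python along the same lines. The function `encode_denormalized` is a made-up name, and it assumes the input really is too small to be normalized in the chosen format; bits that do not fit are truncated, not rounded.

```python
def encode_denormalized(x, exp_bits, frac_bits):
    """Encode a value too small to normalize, as 'sign exponent fraction'."""
    bias = 2 ** (exp_bits - 1) - 1
    E = 1 - bias                       # fixed exponent for denormalized values
    s = "1" if x < 0 else "0"
    M = abs(x) / 2.0 ** E              # rewrite |x| as M * 2^E; M has no leading 1
    if M >= 1:
        raise ValueError("value is large enough to be normalized")
    frac = ""
    for _ in range(frac_bits):         # multiply-by-2 method on M, since f = M here
        M *= 2
        if M >= 1:
            frac += "1"
            M -= 1
        else:
            frac += "0"
    return s + " " + "0" * exp_bits + " " + frac

print(encode_denormalized(0.0029296875, 4, 4))  # 0 0000 0011
```

Dividing by $2^E$ up front plays the role of shifting the point by 3 places in step 8b.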

All Of The Above Put Into A Garbage Compactor

Converting Binary to Float:

Normalized (exponent bits neither all 0 nor all 1)

1. e = exponent bits interpreted as unsigned int
2. $\text{bias} = 2^{(\text{number of exponent bits}) - 1} - 1$
3. E = e - bias
4. f = fractional bits interpreted: $b_1 \cdot 2^{-1} + b_2 \cdot 2^{-2} + \dots$
5. M = f + 1
6. s = sign bit
7. Number = $(-1)^s \cdot 2^E \cdot M$

Denormalized (exponent bits all 0)

1. $\text{bias} = 2^{(\text{number of exponent bits}) - 1} - 1$
2. E = 1 - bias
3. M = fractional bits interpreted: $b_1 \cdot 2^{-1} + b_2 \cdot 2^{-2} + \dots$
4. s = sign bit
5. Number = $(-1)^s \cdot 2^E \cdot M$

Special (exponent bits all 1)

• Fractional bits all 0: Number = $(-1)^s \cdot \infty$

• Fractional bits not all 0: Number = NaN
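
The three cases above can be folded into one Python function, sketched here under the assumption that the input is a bit string whose first bit is the sign and whose next `exp_bits` bits are the exponent. The name `decode` and its signature are invented for this summary.

```python
import math

def decode(bits, exp_bits):
    """Decode a bit string (spaces allowed) with 1 sign bit and exp_bits exponent bits."""
    bits = bits.replace(" ", "")
    sign = -1.0 if bits[0] == "1" else 1.0
    exp_field = bits[1:1 + exp_bits]
    frac_field = bits[1 + exp_bits:]
    e = int(exp_field, 2)
    bias = 2 ** (exp_bits - 1) - 1
    f = int(frac_field, 2) / 2 ** len(frac_field)   # same as b1*2^-1 + b2*2^-2 + ...
    if e == 2 ** exp_bits - 1:                      # special: exponent bits all 1
        return sign * math.inf if f == 0 else math.nan
    if e == 0:                                      # denormalized: exponent bits all 0
        return sign * 2.0 ** (1 - bias) * f
    return sign * 2.0 ** (e - bias) * (1 + f)       # normalized

print(decode("1 0111 101", 4))  # -1.625
```

Checking the special case first keeps the normalized formula from being applied to infinity or NaN patterns.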

Converting Float to Binary:

Normalized

1.

Denormalized

1.
