
Chapter 2, Section 2.5

The document discusses floating-point number representation in computers. It begins by explaining that signed integer formats are not suitable for scientific and business applications involving real numbers. It then introduces floating-point representation as a solution. The key points made include:
- Floating-point numbers are represented as three fields: a sign bit, an exponent, and a significand (mantissa).
- The IEEE 754 standard defines common floating-point formats, including 32-bit single precision and 64-bit double precision.
- Special exponent values are used to represent infinity and NaN (not a number) values.
- Zero can be represented with both positive and negative sign bits, so testing for equality to zero is problematic.


Chapter 2

Floating Point
Numbers
2.5 Floating-Point Representation

The signed magnitude, one's complement,
and two's complement representations that we
have just discussed deal with
signed integer values only.
Without modification, these formats are not
useful in scientific or business applications
that deal with real number values.
Floating-point representation solves this
problem.

2
2.5 Floating-Point Representation

If we are clever programmers, we can perform
floating-point calculations using any integer format.
This is called floating-point emulation, because
floating-point values aren't stored as such; we just
create programs that make it seem as if floating-
point values are being used.
Most of today's computers are equipped with
specialized hardware that performs floating-point
arithmetic with no special programming required,
other than using the instruction set provided by your
CPU architecture.
3
2.5 Floating-Point Representation

Floating-point numbers allow an arbitrary
number of decimal places to the right of the
decimal point.
For example: 0.5 × 0.25 = 0.125
They are often expressed in scientific notation.
For example:
0.125 = 1.25 × 10⁻¹
5,000,000 = 5.0 × 10⁶

4
2.5 Floating-Point Representation
Computers use a form of scientific notation for
floating-point representation.
Numbers written in scientific notation have three
components: a sign, an exponent, and a significand (mantissa).

5
2.5 Floating-Point Representation
Computer representation of a floating-point
number consists of three fixed-size fields: the
sign, the exponent, and the significand (or,
less correctly, the mantissa).
This is the standard arrangement of these fields:
the sign bit first, then the exponent, then the significand.

Note: Although "significand" and "mantissa" do not technically mean the same
thing, many people use these terms interchangeably. We use the term "significand"
to refer to the fractional part of a floating-point number.

6
2.5 Floating-Point Representation

The one-bit sign field is the sign of the stored value.


The size of the exponent field determines the range
of values that can be represented.
The size of the significand determines the
precision of the representation.

7
2.5 Floating-Point Representation

We introduce a hypothetical model to explain the
concepts, after which we will discuss the IEEE-754
standard.
In this model:
A floating-point number is 14 bits in length
The exponent field is 5 bits
The significand field is 8 bits

8
2.5 Floating-Point Representation

The significand is always preceded by an implied
binary point.
Thus, the significand always contains a fractional
binary value.
The exponent indicates the power of 2 by which the
significand is multiplied.
9
2.5 Floating-Point Representation
Example:
Express 32₁₀ in the simplified 14-bit floating-point
model (1-bit sign, 5-bit exponent, 8-bit significand).
We know that 32 is 2⁵. So in (binary) scientific
notation, 32 = 100000₂ = 0.1 × 2⁶.
Using this information, we put 00110 (= 6₁₀) in the
exponent field and 1 (followed by zeros) in the
significand field, giving 0 00110 10000000.
10
2.5 Floating-Point Representation
The illustrations shown at
the right are all equivalent
representations for 32
using our simplified
model.

Not only do these
synonymous
representations waste
space, but they can also
cause confusion.

For example, 0.1₂ × 2⁶ and 0.01₂ × 2⁷ represent the same value.
11
2.5 Floating-Point Representation
To resolve the problem of synonymous forms,
we establish a rule that the first digit of the
significand must be 1, with no ones to the left of
the radix point.
This process, called normalization, results in a
unique pattern for each floating-point number.
In our simple model, all significands must have the
form 0.1xxxxxxx
For example, 4.5 = 100.1₂ × 2⁰ = 1.001 × 2² = 0.1001 ×
2³. The last expression is correctly normalized.

In our simple instructional model, we use no implied bits.

12
2.5 Floating-Point Representation

Another problem with our system is that we have
made no allowances for negative exponents. We
have no way to express 0.25! (Notice that there is
no sign in the exponent field.)

All of these problems can be fixed with no
changes to our basic model.

13
2.5 Floating-Point Representation

To provide for negative exponents, we will use a
biased exponent.
In our case, we have a 5-bit exponent, so the bias is
2⁵⁻¹ − 1 = 2⁴ − 1 = 15.
Thus we will use 15 for our bias: our exponent will use
excess-15 representation.
In our model, exponent field values less than 15
correspond to negative exponents, representing
purely fractional numbers.

14
2.5 Floating-Point Representation
Example:
Express 32₁₀ in the revised 14-bit floating-point model.
We know that 32 = 1.0 × 2⁵ = 0.1 × 2⁶.
To use our excess-15 biased exponent, we add 15 to
6, giving 21₁₀ (= 10101₂).
So we have: 0 10101 10000000.
15
2.5 Floating-Point Representation

Example:
Express 0.0625₁₀ in the revised 14-bit floating-point
model.
We know that 0.0625 is 2⁻⁴. So in (binary) scientific
notation, 0.0625 = 0.0001₂ = 1.0 × 2⁻⁴ = 0.1 × 2⁻³.
To use our excess-15 biased exponent, we add 15 to
−3, giving 12₁₀ (= 01100₂).
So we have: 0 01100 10000000.
16
2.5 Floating-Point Representation
Example:
Express −26.625₁₀ in the revised 14-bit floating-point
model.
We find 26.625₁₀ = 11010.101₂. Normalizing, we
have: 26.625₁₀ = 0.11010101 × 2⁵.
To use our excess-15 biased exponent, we add 15 to
5, giving 20₁₀ (= 10100₂). We also need a 1 in the sign
bit, giving 1 10100 11010101.

17
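The three worked examples above (32, 0.0625, and −26.625) can be checked with a short Python sketch. This is our own illustration, not code from the slides: encode14 is a hypothetical helper, math.frexp is simply a convenient way to normalize to the model's 0.1xxxxxxx form, and extra significand bits are truncated.

```python
import math

def encode14(x):
    """Encode x in the hypothetical 14-bit model:
    1 sign bit, 5-bit excess-15 exponent, 8-bit significand (0.1xxxxxxx)."""
    sign = 0 if x >= 0 else 1
    m, e = math.frexp(abs(x))      # abs(x) = m * 2**e with 0.5 <= m < 1,
                                   # so m is already normalized as 0.1xxx...
    exponent = e + 15              # excess-15: add the bias of 15
    significand = int(m * 2 ** 8)  # keep the first 8 fraction bits (truncate)
    return f"{sign:01b} {exponent:05b} {significand:08b}"

print(encode14(32.0))     # 0 10101 10000000
print(encode14(0.0625))   # 0 01100 10000000
print(encode14(-26.625))  # 1 10100 11010101
```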
2.5 Floating-Point Representation

The IEEE has established a standard for
floating-point numbers.
The IEEE-754 single precision floating point
standard uses an 8-bit exponent (with a bias of
127, i.e., excess-127) and a 23-bit significand.
The IEEE-754 double precision standard uses
an 11-bit exponent (with a bias of 1023) and a
52-bit significand.
18
2.5 Floating-Point Representation

In both the IEEE single-precision and double-
precision floating-point standards, the significand has
an implied 1 to the LEFT of the radix point.
The format for a significand using the IEEE format is:
1.xxx…
For example, 4.5 = 0.1001 × 2³ in our simple model; in
IEEE format, 4.5 = 1.001 × 2². The 1 is implied, which
means it does not need to be stored in the significand
(the significand field would contain only 001, followed
by zeros).
19
2.5 Floating-Point Representation
Example: Express −3.75 as a floating point number
using IEEE single precision.
First, let's normalize according to IEEE rules:
−3.75 = −11.11₂ = −1.111 × 2¹
The bias is 127, so we add 127 + 1 = 128 (this is our
exponent).
The first 1 in the significand is implied, so we have:
1 10000000 11100000000000000000000

Since we have an implied 1 in the significand, this equates
to
−(1).111₂ × 2^(128 − 127) = −1.111₂ × 2¹ = −11.11₂ = −3.75.

20
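We can confirm this bit pattern with Python's standard library, which stores floats in IEEE-754 form. A quick check, not part of the slides:

```python
import struct

# Pack -3.75 as an IEEE-754 single-precision value, then read back the raw bits.
bits = struct.unpack(">I", struct.pack(">f", -3.75))[0]

sign        = bits >> 31            # 1: the number is negative
exponent    = (bits >> 23) & 0xFF   # 128 = 1 + the bias of 127
significand = bits & 0x7FFFFF       # 111 then twenty zeros (leading 1 implied)

print(f"{sign:01b} {exponent:08b} {significand:023b}")
# 1 10000000 11100000000000000000000
```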
2.5 Floating-Point Representation
Using the IEEE-754 single precision floating point
standard:
An exponent of 255 (all 1s, after adding the bias) indicates
a special value.
• If the significand is zero, the value is infinity.
• If the significand is nonzero, the value is NaN, "not a
number," often used to flag an error condition.
Using the double precision standard:
The special exponent value for a double precision number
is 2047, instead of the 255 used by the single precision
standard.

21
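Both special cases can be seen directly in the stored bits. A small illustration of ours (the f32_bits helper is hypothetical, built on Python's struct module):

```python
import math
import struct

def f32_bits(x):
    """Raw 32-bit pattern of x stored as IEEE-754 single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

inf_bits = f32_bits(math.inf)
nan_bits = f32_bits(math.nan)

# Infinity: exponent field all 1s (255), significand zero.
print((inf_bits >> 23) & 0xFF, inf_bits & 0x7FFFFF)   # 255 0

# NaN: exponent field all 1s (255), significand nonzero.
print((nan_bits >> 23) & 0xFF, nan_bits & 0x7FFFFF != 0)   # 255 True
```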
2.5 Floating-Point Representation
Both the 14-bit model that we have presented
and the IEEE-754 floating point standard allow
two representations for zero.
Zero is indicated by all zeros in the exponent and the
significand, but the sign bit can be either 0 or 1.
This is why programmers should avoid testing a
floating-point value for equality to zero:
the bit patterns for negative zero and positive zero
are not the same (even though IEEE-754 arithmetic
comparison treats −0 and +0 as equal).
22
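Python's standard library makes both points visible. A small check of ours; the observation that comparison still succeeds is IEEE-754 semantics, not from the slides:

```python
import struct

pos_zero = struct.pack(">f", 0.0).hex()    # '00000000': all bits zero
neg_zero = struct.pack(">f", -0.0).hex()   # '80000000': only the sign bit set

print(pos_zero != neg_zero)   # True: the stored bit patterns differ
print(0.0 == -0.0)            # True: IEEE comparison still treats them as equal
```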
2.5 Floating-Point Representation

IEEE Floating-point addition and subtraction


are done using methods analogous to how we
perform calculations using pencil and paper.
The first thing that we do is express both
operands in the same exponential power, then
add the numbers, preserving the exponent in the
sum.
If the exponent requires adjustment, we do so at
the end of the calculation.

23
2.5 Floating-Point Representation
Example:
Find the sum of 12₁₀ and 1.25₁₀ using the 14-bit simple
floating-point model.
We find 12₁₀ = 0.1100 × 2⁴, and 1.25₁₀ = 0.101 × 2¹ =
0.000101 × 2⁴.
Thus, our sum is
0.110101 × 2⁴ (= 13.25₁₀).
If the addition produces a carry out of the significand,
the carry bit is added to the exponent at the end.

24
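The two steps of this example (align the exponents, then add the significands) can be sketched in Python. Our own illustration: math.frexp conveniently yields the 0.1xxx-normalized significand and exponent used by the model.

```python
import math

# 12 = 0.1100 x 2^4 and 1.25 = 0.101 x 2^1 in the simple model.
m1, e1 = math.frexp(12.0)    # (0.75, 4):  significand 0.1100, exponent 4
m2, e2 = math.frexp(1.25)    # (0.625, 1): significand 0.101,  exponent 1

# Step 1: shift the smaller operand's significand right until exponents match.
m2_aligned = m2 / 2 ** (e1 - e2)   # 0.000101 in binary = 0.078125

# Step 2: add the significands, keeping the common exponent.
total = (m1 + m2_aligned) * 2 ** e1

print(total)   # 13.25, i.e. 0.110101 x 2^4
```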
2.5 Floating-Point Representation

Floating-point multiplication is also carried out in


a manner akin to how we perform multiplication
using pencil and paper.
We multiply the two operands and add their
exponents.
If the exponent requires adjustment, we do so at
the end of the calculation.

25
2.5 Floating-Point Representation

No matter how many bits we use in a floating-point


representation, our model must be finite.
The real number system is, of course, infinite, so our
models can give nothing more than an approximation
of a real value.
At some point, every model breaks down, introducing
errors into our calculations.
By using a greater number of bits in our model, we
can reduce these errors, but we can never totally
eliminate them.

27
2.5 Floating-Point Representation
• Consider 0.1 in decimal.
• It cannot be perfectly represented in binary:
• 0.1₁₀ = 0.000110011001100110011…₂ (the pattern 0011 repeats forever)
28
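This is easy to observe in Python, whose floats are IEEE-754 doubles (our own demonstration, not part of the slides):

```python
from decimal import Decimal

# Decimal(0.1) shows the exact value of the double nearest to 0.1.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# The rounding error surfaces as soon as we do arithmetic:
print(0.1 + 0.2 == 0.3)   # False
```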
2.5 Floating-Point Representation
Our job becomes one of reducing error, or at least
being aware of the possible magnitude of error in
our calculations.
We must also be aware that errors can compound
through repetitive arithmetic operations.
For example, our 14-bit model cannot exactly
represent the decimal value 128.5. In binary, it is 9
bits wide:
10000000.1₂ = 128.5₁₀

29
2.5 Floating-Point Representation

When we try to express 128.5₁₀ in our 14-bit model,
we lose the low-order bit, giving a relative error of:

(128.5 − 128) / 128.5 ≈ 0.39%

If we had a procedure that repetitively added 0.5 to
128.5, we would have an error of nearly 2% after only
four iterations.

30
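The loss can be simulated with a hypothetical helper, trunc14, that truncates a value to the model's 8 significand bits. This is our own sketch, not code from the text:

```python
import math

def trunc14(x):
    """Truncate x to 8 significand bits, as our 14-bit model must."""
    m, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= m < 1
    return math.floor(m * 2 ** 8) / 2 ** 8 * 2 ** e

stored = trunc14(128.5)                    # 128.0: the low-order bit is lost
rel_err = (128.5 - stored) / 128.5         # ~0.0039, i.e. about 0.39%

# Repeatedly add 0.5: each increment is truncated away again.
value, true = stored, 128.5
for _ in range(4):
    value, true = trunc14(value + 0.5), true + 0.5

final_err = (true - value) / true          # ~0.019: nearly 2% after 4 iterations
```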
2.5 Floating-Point Representation

Floating-point errors can be reduced when we use
operands that are similar in magnitude.
If we were repetitively adding 0.5 to 128.5, it
would have been better to iteratively add 0.5 to
itself and then add 128.5 to this sum.
In this example, the error was caused by loss of
the low-order bit.
Loss of the high-order bit is more problematic.

31
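The same effect appears with ordinary doubles once the magnitudes differ enough. A sketch of ours: 1e16 is chosen because the gap between adjacent doubles there is 2.0, so each added 1.0 is rounded away.

```python
big = 1e16   # at this magnitude, adjacent doubles are 2.0 apart

# Adding the small values one at a time: each 1.0 is rounded away.
naive = big
for _ in range(1000):
    naive += 1.0

# Adding the small values to each other first, then to the large value:
better = big + sum(1.0 for _ in range(1000))

print(naive == big)            # True: all 1000 additions were lost
print(better == big + 1000.0)  # True: the grouped sum survives
```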
2.5 Floating-Point Representation

Floating-point overflow and underflow can cause
programs to crash.
Overflow occurs when there is no room to store
the high-order bits resulting from a calculation.
Underflow occurs when a value is too small to
store, possibly resulting in division by zero.

Experienced programmers know that it's better for a
program to crash than to have it produce incorrect, but
plausible, results.

32
2.5 Floating-Point Representation

When discussing floating-point numbers, it is
important to understand the terms range,
precision, and accuracy.
The range of a numeric format is the
difference between the largest and smallest
values that can be expressed.
Accuracy refers to how closely a numeric
representation approximates a true value.
The precision of a number indicates how much
information we have about a value.

33
2.5 Floating-Point Representation

Most of the time, greater precision leads to better
accuracy, but this is not always true.
For example, 3.1333 is a value of π that is accurate to
two digits but has 5 digits of precision.
There are other problems with floating point
numbers.
Because of truncated bits, you cannot always
assume that a particular floating point operation is
associative or distributive.

34
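The loss of associativity shows up in a single line of Python (a demonstration of ours; the classic 0.1/0.2/0.3 triple makes it visible):

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6

print(left == right)  # False: addition of doubles is not associative
```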
2.5 Floating-Point Representation
This means that we cannot assume:
(a + b) + c = a + (b + c) or
a*(b + c) = ab + ac
Moreover, to test a floating point value for equality to
some other number, it is best to declare a "nearness to x"
epsilon value. For example, instead of checking to see if
floating point x is equal to 2 as follows:
if x = 2 then …
it is better to use:
if (abs(x - 2) < epsilon) then ...
(assuming we have epsilon defined correctly!)

35
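The slide's pseudocode translates directly into Python. Our sketch below uses 3 instead of 2 as the target so the rounding error is visible, and also shows math.isclose, Python's built-in version of the epsilon test:

```python
import math

x = 0.1 * 3 * 10   # mathematically 3, but rounding gives 3.0000000000000004

print(x == 3)                  # False: the naive equality test fails
epsilon = 1e-9
print(abs(x - 3) < epsilon)    # True: the epsilon test from the slide
print(math.isclose(x, 3.0))    # True: the library's near-equality check
```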
