Chapter 2, Section 2.5
Floating-Point Numbers
2.5 Floating-Point Representation
Computers use a form of scientific notation for floating-point representation.
Numbers written in scientific notation have three components: a sign, a mantissa, and an exponent.
Computer representation of a floating-point number consists of three fixed-size fields: the sign, the exponent, and the significand (also called the mantissa).
The illustrations shown at the right are all equivalent representations for 32 using our simplified model.
Example:
Express $32_{10}$ in the revised 14-bit floating-point model.
We know that $32 = 1.0 \times 2^5 = 0.1 \times 2^6$.
To use our excess 15 biased exponent, we add 15 to 6, giving $21_{10}$ ($= 10101_2$).
So we have:
sign 0, exponent 10101, significand 10000000
Example:
Express $0.0625_{10}$ in the revised 14-bit floating-point model.
We know that 0.0625 is $2^{-4}$. So in (binary) scientific notation, $0.0625_{10} = 0.0001_2 = 1.0 \times 2^{-4} = 0.1 \times 2^{-3}$.
To use our excess 15 biased exponent, we add 15 to -3, giving $12_{10}$ ($= 01100_2$).
So we have:
sign 0, exponent 01100, significand 10000000
Example:
Express $-26.625_{10}$ in the revised 14-bit floating-point model.
We find $26.625_{10} = 11010.101_2$. Normalizing, we have: $26.625_{10} = 0.11010101 \times 2^5$.
To use our excess 15 biased exponent, we add 15 to 5, giving $20_{10}$ ($= 10100_2$). We also need a 1 in the sign bit.
So we have:
sign 1, exponent 10100, significand 11010101
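The three worked examples above can be checked with a short Python sketch. It assumes the revised model uses 1 sign bit, a 5-bit excess-15 exponent, and an 8-bit significand normalized to the form 0.1xxxxxxx. The function name `encode14` is hypothetical, and any significand bits beyond the eighth are simply truncated, which is an assumption rather than something the slides specify.

```python
import math

def encode14(value):
    """Encode value as (sign, exponent, significand) bit strings."""
    if value == 0:
        return "0", "00000", "00000000"
    sign = "1" if value < 0 else "0"
    # frexp returns m in [0.5, 1) and e with abs(value) == m * 2**e,
    # which matches the normalized form 0.1xxxxxxx x 2^e.
    m, e = math.frexp(abs(value))
    biased = e + 15                      # excess-15 exponent
    if not 0 <= biased <= 31:
        raise OverflowError("exponent does not fit in 5 bits")
    frac = int(m * 256)                  # keep 8 significand bits, truncating
    return sign, format(biased, "05b"), format(frac, "08b")

for x in (32, 0.0625, -26.625):
    print(x, *encode14(x))
```

The loop prints the sign, exponent, and significand fields for the three values used in the examples.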
The IEEE-754 double precision standard uses an 11-bit exponent and a 52-bit significand, with an implied 1 to the left of the binary point.
Using the IEEE-754 single precision floating point standard:
An exponent of 255 after adding the bias (all 1s) indicates a special value.
• If the significand is zero, the value is infinity (positive or negative, according to the sign bit).
• If the significand is nonzero, the value is NaN, "not a number," often used to flag an error condition.
Using the double precision standard:
The special exponent value for a double precision number is 2047, instead of the 255 used by the single precision standard.
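These special values can be inspected directly. The sketch below uses Python's standard `struct` module to pack a value as an IEEE-754 single or double and pull the fields apart. The helper names `single_fields` and `double_exponent` are made up for this illustration.

```python
import math
import struct

def single_fields(x):
    """Return (sign, biased exponent, significand) of x as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def double_exponent(x):
    """Return the 11-bit biased exponent of x as a 64-bit float."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF

print(single_fields(math.inf))    # exponent 255, significand 0
print(single_fields(-math.inf))   # same fields, with the sign bit set
print(single_fields(math.nan))    # exponent 255, nonzero significand
print(double_exponent(math.inf))  # 2047
```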
Both the 14-bit model that we have presented and the IEEE-754 floating point standard allow two representations for zero.
Zero is indicated by all zeros in the exponent and the significand, but the sign bit can be either 0 or 1.
This is one reason programmers should avoid testing a floating-point value for bit-for-bit equality to zero.
The bit pattern of negative zero does not equal that of positive zero, even though IEEE-754 comparison treats the two values as equal.
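In Python, whose floats are IEEE-754 doubles on common platforms, the two zeros can be told apart by their bit patterns even though the `==` operator reports them as equal. A small sketch, with `bits` as a hypothetical helper name:

```python
import math
import struct

pos, neg = 0.0, -0.0

def bits(x):
    """Return the 64-bit pattern of x as a hex string."""
    return struct.pack(">d", x).hex()

print(bits(pos))                 # 0000000000000000
print(bits(neg))                 # 8000000000000000
print(pos == neg)                # True: the comparison ignores the sign of zero
print(math.copysign(1.0, neg))   # -1.0: the sign bit is still there
```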
Example:
Find the sum of $12_{10}$ and $1.25_{10}$ using the 14-bit simple floating-point model.
We find $12_{10} = 0.1100 \times 2^4$, and $1.25_{10} = 0.101 \times 2^1 = 0.000101 \times 2^4$.
Thus, our sum is $0.110101 \times 2^4$.
If the addition produces a carry out of the significand, the sum is shifted right one place and 1 is added to the exponent.
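The alignment and addition steps can be sketched in Python as follows. This is an illustration only: the function name `add14` is hypothetical, it assumes an 8-bit significand, and it handles only positive operands, dropping any bits shifted out during alignment.

```python
def add14(sig_a, exp_a, sig_b, exp_b, bits=8):
    """Add two positive values given as (significand, exponent) pairs.

    Each significand is an integer holding `bits` fraction bits, so the
    value represented is sig / 2**bits * 2**exp.
    """
    # Align: shift the operand with the smaller exponent to the right.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    sig_b >>= exp_a - exp_b
    total = sig_a + sig_b
    exp = exp_a
    # A carry out of the significand means the sum is >= 1.0:
    # shift right once and add 1 to the exponent.
    if total >= 1 << bits:
        total >>= 1
        exp += 1
    return total, exp

# 12 = 0.11000000 x 2^4 and 1.25 = 0.10100000 x 2^1
sig, exp = add14(0b11000000, 4, 0b10100000, 1)
print(format(sig, "08b"), exp)   # 11010100 4
```

The printed significand 11010100 with exponent 4 corresponds to the sum 13.25.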
• Consider 0.1 (or 0.2) in decimal. Neither has an exact finite binary representation, so the stored value will never be exactly the decimal value.
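A quick way to see this effect is to ask Python for the exact value it stores for 0.1. The sketch below uses only the standard `decimal` and `fractions` modules and assumes the usual IEEE-754 double precision floats.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(float) shows the exact value the double actually stores.
print(Decimal(0.1))
# The stored value is a fraction with a power-of-two denominator, not 1/10.
print(Fraction(0.1) == Fraction(1, 10))   # False

# Adding 0.1 ten times does not land exactly on 1.0.
total = 0.0
for _ in range(10):
    total += 0.1
print(total == 1.0)                        # False
print(0.1 + 0.2 == 0.3)                    # False
```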
Our job becomes one of reducing error, or at least
being aware of the possible magnitude of error in
our calculations.
We must also be aware that errors can compound
through repetitive arithmetic operations.
For example, our 14-bit model cannot exactly
represent the decimal value 128.5. In binary, it is 9
bits wide:
$10000000.1_2 = 128.5_{10}$
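When we try to express $128.5_{10}$ in our 14-bit model, we lose the low-order bit, giving a relative error of:
$\frac{128.5 - 128}{128.5} \approx 0.39\%$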
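The same figure can be reproduced numerically. The sketch below assumes the model keeps 8 significand bits and truncates the rest; `truncate_to_bits` is a hypothetical helper written for this illustration.

```python
import math

def truncate_to_bits(value, bits=8):
    """Keep only the top `bits` significand bits of a positive value."""
    m, e = math.frexp(value)            # value == m * 2**e, 0.5 <= m < 1
    kept = math.floor(m * 2**bits) / 2**bits
    return math.ldexp(kept, e)

exact = 128.5
stored = truncate_to_bits(exact)
rel_error = (exact - stored) / exact
print(stored)                            # 128.0
print(f"{rel_error:.2%}")                # 0.39%
```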
Because floating-point results are rounded, we cannot assume:
(a + b) + c = a + (b + c) or
a*(b + c) = ab + ac
Moreover, to test a floating point value for equality to some other number, it is best to declare a "nearness to x" epsilon value. For example, instead of checking to see if floating point x is equal to 2 as follows:
if x = 2 then …
it is better to use:
if (abs(x - 2) < epsilon) then ...
(assuming we have epsilon defined correctly!)
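Both points can be tried out in Python. In the sketch below, the function `nearly_equal` and the tolerance `1e-9` are illustrative choices, not values taken from the slides. A suitable epsilon depends on the magnitude of the numbers being compared.

```python
import math

# Associativity can fail because each addition is rounded separately.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False

def nearly_equal(x, target, epsilon=1e-9):
    """Return True when x is within epsilon of target."""
    return abs(x - target) < epsilon

x = math.sqrt(2) ** 2               # mathematically 2, but rounded twice
print(x == 2)                       # False
print(nearly_equal(x, 2))           # True
print(math.isclose(x, 2))           # True, using a relative tolerance
```

The standard library function `math.isclose` compares with a relative tolerance by default, which tends to behave better than a fixed epsilon when values vary widely in size.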