Ece552 10 Floating Point
Floating Point
© Prof. Mikko Lipasti
Floating Point
• Want to represent larger range of numbers
– Fixed point (integer): $-2^{n-1} \ldots (2^{n-1} - 1)$
• How? Sacrifice precision for range by
providing exponent to shift relative weight of
each bit position
• Similar to scientific notation:
$3.14159 \times 10^{23}$
• Cannot specify every discrete value in the
range, but can span much larger range
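A minimal sketch of this trade-off, assuming IEEE 754 single precision as exposed through Python's `struct` module. The helper name `to_float32` and the constants are made up for illustration.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# A 32-bit two's complement integer tops out near 2.1e9 ...
INT32_MAX = 2**31 - 1

# ... while a 32-bit float reaches about 3.4e38 with the same number of bits,
FLOAT32_MAX = (2 - 2.0**-23) * 2.0**127

print(INT32_MAX, FLOAT32_MAX)

# but it cannot hold every integer in that range:
# 2**24 + 1 needs 25 significant bits and single precision has only 24.
print(to_float32(2.0**24 + 1) == 2.0**24 + 1)
```

The last line shows that an integer a 32-bit fixed-point format stores exactly is not representable in the 32-bit floating-point format.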
Floating Point
• Still use a fixed number of bits
– Sign bit S, exponent E, significand F
– Value: $(-1)^{S} \times F \times 2^{E}$
• IEEE 754 standard field order (MSB to LSB): S | E | F
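The three fields can be pulled out of a bit pattern with shifts and masks. This sketch assumes the single-precision widths (1 sign bit, 8 exponent bits, 23 significand bits). The function name `fields32` is hypothetical.

```python
import struct

def fields32(x: float):
    """Return the (S, E, F) bit fields of x encoded in single precision."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31               # 1 bit, the MSB
    exponent = (bits >> 23) & 0xFF  # 8 bits, as stored in the word
    fraction = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, fraction

print(fields32(1.0))
print(fields32(-0.75))
```

The exponent printed here is the stored field, not yet the power of two it stands for.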
Floating Point Exponent
• Exponent specified in biased or excess
notation
• Why?
– To simplify sorting
– Sign bit is MSB to ease sorting
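– With a 2’s complement exponent:
• Large numbers have positive exponents (exponent MSB = 0)
• Small numbers have negative exponents (exponent MSB = 1)
– Compared as unsigned bit patterns, negative exponents would look larger than positive ones, so sorting does not follow naturally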
Excess or Biased Exponent
Exponent   2’s Compl    Excess-127
-127       1000 0001    0000 0000
-126       1000 0010    0000 0001
…          …            …
+127       0111 1111    1111 1110
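The sorting argument can be checked directly. The sketch below is illustrative only and the helper names are assumptions. It sorts a few exponents by each unsigned encoding, then sorts some non-negative single-precision values by their raw bit patterns.

```python
import struct

BIAS = 127  # single-precision exponent bias

def twos_complement(e: int) -> int:
    """8-bit two's complement encoding of exponent e."""
    return e & 0xFF

def biased(e: int) -> int:
    """Excess-127 encoding of exponent e."""
    return e + BIAS

def bits32(x: float) -> int:
    """Raw 32-bit pattern of x in IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

exponents = [-127, -126, 0, 127]
# Sorting by the unsigned encoding: the biased form keeps numeric order,
# two's complement places the negative exponents after the positive ones.
by_biased = sorted(exponents, key=biased)
by_twos = sorted(exponents, key=twos_complement)
print(by_biased, by_twos)

# Consequence: non-negative floats sort correctly as plain unsigned integers.
values = [0.5, 1e-10, 3.0, 1e20, 2.0]
print(sorted(values, key=bits32) == sorted(values))
```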
Floating Point Normalization
• S,E,F representation allows more than one
representation for a particular value, e.g.
$1.0 \times 10^{5} = 0.1 \times 10^{6} = 10.0 \times 10^{4}$
– This makes comparison operations difficult
– Prefer to have a single representation
• Hence, normalize by convention:
– Exactly one nonzero digit to the left of the radix point
– In binary, that digit must be a 1
• Since leading ‘1’ is implicit, no need to store it
• Hence, obtain one extra bit of precision for free
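A small sketch of how the implicit leading 1 is restored when decoding, assuming single precision (23 stored significand bits, exponent bias 127). The function name `decode_normalized32` is hypothetical.

```python
import struct

def decode_normalized32(bits: int) -> float:
    """Value of a normalized single-precision pattern (stored exponent 1..254)."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if not 1 <= exponent <= 254:
        raise ValueError("only normalized numbers are handled in this sketch")
    # The leading 1 is not stored; put it back in front of the 23 fraction bits.
    significand = 1 + fraction / 2**23
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

pattern = struct.unpack("<I", struct.pack("<f", 6.5))[0]
print(hex(pattern), decode_normalized32(pattern))
```

With the hidden bit, the 23 stored bits give 24 bits of effective precision.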
FP Overflow/Underflow
• FP Overflow
– Analogous to integer overflow
– Result is too big to represent
– Means exponent is too big
• FP Underflow
– Result is too small to represent
– Means exponent is too small (too negative)
• Both can raise an exception under IEEE 754
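Both conditions are easy to provoke. The sketch below uses Python floats, which are IEEE 754 doubles on common platforms (an assumption about the host). Ordinary arithmetic there returns infinity or zero rather than trapping, while `struct.pack` does raise when a value does not fit in single precision.

```python
import struct
import sys

big = sys.float_info.max       # largest finite double
tiny = sys.float_info.min      # smallest normalized double

overflowed = big * 2.0         # exponent too big: result becomes +infinity
underflowed = tiny / 2.0**60   # exponent too negative: result becomes 0.0

print(overflowed, underflowed)

try:
    struct.pack("<f", 1e39)    # larger than any finite single-precision value
except OverflowError as exc:
    print("overflow reported as an exception:", exc)
```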
IEEE 754 Special Cases
Single Precision           Double Precision
Exponent   Significand     Exponent   Significand     Value
0          0               0          0               0
0          nonzero         0          nonzero         denormalized
1-254      anything        1-2046     anything        fp number
255        0               2047       0               infinity
255        nonzero         2047       nonzero         NaN (Not a Number)
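The single-precision column of the table translates into a short classifier. This is a sketch: the names `classify32` and `bits32` are made up here, and the returned labels follow the table's wording.

```python
import struct

def classify32(bits: int) -> str:
    """Classify a 32-bit single-precision pattern per the special-case table."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "fp number"          # stored exponent 1..254

def bits32(x: float) -> int:
    """Raw 32-bit pattern of x in IEEE 754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

for x in (0.0, 1e-45, 1.0, float("inf"), float("nan")):
    print(x, classify32(bits32(x)))
```

The sign bit is ignored, so both +0 and -0 classify as zero and both infinities as infinity.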
FP Rounding
• Rounding is important
– Small errors accumulate over billions of ops
• FP rounding hardware helps
– Compute extra guard bit beyond 23/52 bits
– Further, compute additional round bit beyond that
• A multiply may produce a leading 0 bit; normalizing then shifts the guard bit into
the product, leaving the round bit for rounding
– Finally, keep sticky bit that is set whenever ‘1’ bits are
“lost” to the right
• Differentiates between 0.5 and 0.500000000001
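One way to see how the three extra bits are used is a sketch of round-to-nearest-even on an integer significand. The rounding mode is an assumption, since the slide does not name one. The function `round_nearest_even` and its arguments are hypothetical. A real unit would also renormalize if the increment carries out of the significand.

```python
def round_nearest_even(significand: int, extra: int) -> int:
    """Drop `extra` low-order bits (extra >= 2) using guard, round and sticky."""
    kept = significand >> extra
    guard = (significand >> (extra - 1)) & 1        # first bit past the kept ones
    round_bit = (significand >> (extra - 2)) & 1    # second bit past the kept ones
    # Sticky is the OR of everything further right.
    sticky = int((significand & ((1 << (extra - 2)) - 1)) != 0)
    # Guard set plus anything below it: more than half an ulp was lost, round up.
    # Guard set and nothing below it: exact tie, round to the even neighbour.
    if guard and (round_bit or sticky or (kept & 1)):
        kept += 1
    return kept

print(round_nearest_even(0b10101000, 4))   # exact tie, kept value already even
print(round_nearest_even(0b10101001, 4))   # sticky bit breaks the tie upward
```

The two calls differ only in a single low-order 1 bit, which is the "0.5 versus 0.500000000001" distinction the sticky bit preserves.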
Floating Point Addition
• Just like grade school
– First, align decimal points
– Then, add significands
– Finally, normalize result
• Example
$9.997 \times 10^{2} = 9.997000 \times 10^{2}$
FP Adder
[Figure: FP adder datapath. Inputs: sign, exponent, and significand of each operand. Blocks: small ALU (compare exponents, exponent difference); control; shift smaller number right; big ALU (add); normalize (shift left or right, increment or decrement exponent).]
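The align, add, normalize sequence can be sketched in a few lines. Several simplifications are assumed here: both operands are positive and normalized, each is an (unbiased exponent, 24-bit integer significand) pair, and bits shifted out during alignment are truncated rather than rounded. The names `fp_add`, `to_value` and `P` are made up for illustration.

```python
P = 24  # significand width in bits, including the leading 1 (single precision)

def to_value(num) -> float:
    """Real value of an (exponent, significand) pair."""
    exp, sig = num
    return sig * 2.0 ** (exp - (P - 1))

def fp_add(a, b):
    """Add two positive normalized numbers given as (exponent, significand)."""
    (ea, sa), (eb, sb) = a, b
    # 1. Compare exponents (small ALU) so that a holds the larger one.
    if ea < eb:
        (ea, sa), (eb, sb) = (eb, sb), (ea, sa)
    # 2. Shift the smaller number right by the exponent difference.
    sb >>= ea - eb            # shifted-out bits are dropped in this sketch
    # 3. Add the aligned significands (big ALU).
    total = sa + sb
    exp = ea
    # 4. Normalize: a carry out means shift right by one, increment the exponent.
    if total >> P:
        total >>= 1
        exp += 1
    return exp, total

one_and_half = (0, 0xC00000)   # 1.5  = 1.1b  x 2^0
quarter = (-2, 0x800000)       # 0.25 = 1.0b  x 2^-2
print(to_value(fp_add(one_and_half, quarter)))
print(fp_add(one_and_half, one_and_half))
```

With positive operands only a right shift is ever needed. Subtraction or mixed signs can cancel leading bits, which is why normalization must also be able to shift left and decrement the exponent.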
FP Multiplication
• Compute sign, exponent, significand
• Normalize
– Shift left or right by 1
• Check for overflow, underflow
• Round
• Normalize again (if necessary)
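A compact sketch of these steps, under stated assumptions: operands are normalized (sign, unbiased exponent, 24-bit integer significand) triples, the exponent limits are those of single precision, and the low product bits are truncated instead of rounded, so the second normalization never triggers. The names `fp_mul`, `P`, `E_MIN` and `E_MAX` are hypothetical.

```python
P = 24                     # significand bits, including the leading 1
E_MIN, E_MAX = -126, 127   # normalized single-precision exponent range

def fp_mul(a, b):
    """Multiply two normalized numbers given as (sign, exponent, significand)."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    sign = sa ^ sb                 # signs combine with XOR
    exp = ea + eb                  # unbiased exponents add
    product = ma * mb              # 2P-bit product, value in [1, 4)
    # Normalize: if the product is 2 or more, take the top P bits one position
    # higher (a right shift by 1) and increment the exponent.
    if product >> (2 * P - 1):
        exp += 1
        sig = product >> P
    else:
        sig = product >> (P - 1)
    # Check for overflow and underflow of the exponent.
    if exp > E_MAX:
        raise OverflowError("exponent too large")
    if exp < E_MIN:
        raise ArithmeticError("underflow: exponent too small")
    return sign, exp, sig

print(fp_mul((0, 0, 0xC00000), (0, 0, 0xC00000)))   # 1.5 x 1.5
print(fp_mul((0, 0, 0xC00000), (1, 0, 0xA00000)))   # 1.5 x -1.25
```

In this convention the product of two significands in [1, 2) lies in [1, 4), so normalization is at most one right shift. A left shift arises when the product is held in a fixed-position format where its leading bit may be 0.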
Summary
• Floating point representation
– Normalization
– Overflow, underflow
– Rounding
• Floating point add
• Floating point multiply