Ece552 10 Floating Point

ECE/CS 552 covers floating point representation and arithmetic. Floating point numbers represent larger ranges than fixed point by using an exponent to shift the relative weight of each bit position. The IEEE 754 standard defines single and double precision floating point formats. Floating point addition works similarly to grade school addition, while multiplication computes the sign, exponent sum, and significand product before normalizing and rounding the result.

ECE/CS 552:

Floating Point
© Prof. Mikko Lipasti

Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith
Basic Arithmetic and the ALU
• Now
– Floating point representation
– Floating point addition, multiplication
• These are not crucial for the project

Floating Point
• Want to represent larger range of numbers
– Fixed point (integer): -2^(n-1) … (2^(n-1) – 1)
• How? Sacrifice precision for range by
providing exponent to shift relative weight of
each bit position
• Similar to scientific notation:
3.14159 x 10^23
• Cannot specify every discrete value in the
range, but can span much larger range
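
The range-for-precision trade-off can be observed directly. A minimal sketch, assuming Python's standard `struct` module is used to round a value to a 32-bit IEEE 754 float; the helper name `to_f32` is hypothetical:

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# A 32-bit integer represents every value in its range exactly.
# A 32-bit float spans a far larger range but skips values inside it.
exact = to_f32(16777216.0)     # 2**24 is representable
skipped = to_f32(16777217.0)   # 2**24 + 1 is not; it rounds to a neighbor
large = to_f32(1e38)           # far beyond the 32-bit integer range
```

Here `skipped` comes back as 16777216.0, the same value as `exact`, while `large` is still a finite number.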

Floating Point
• Still use a fixed number of bits
– Sign bit S, exponent E, significand F
– Value: (-1)^S x F x 2^E
• IEEE 754 standard field layout: S | E | F

                    Size    Exponent    Significand    Range
Single precision    32b     8b          23b            2 x 10^(+/-38)
Double precision    64b     11b         52b            2 x 10^(+/-308)
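
The field widths in the table can be checked by slicing the bit pattern of a real number. A minimal sketch, assuming Python's `struct` module; the helper names `f32_fields` and `f64_fields` are hypothetical, and the exponent is returned as the raw stored field:

```python
import struct

def f32_fields(x):
    """Split a single-precision value into (sign, exponent, significand) fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction

def f64_fields(x):
    """Split a double-precision value into (sign, exponent, significand) fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits
    fraction = bits & ((1 << 52) - 1)     # 52 bits
    return sign, exponent, fraction
```

For example, `f32_fields(-2.5)` yields sign 1, stored exponent 128, and significand field 0x200000.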

Floating Point Exponent
• Exponent specified in biased or excess
notation
• Why?
– To simplify sorting
– Sign bit is MSB to ease sorting
– 2’s complement exponent:
• Large numbers have positive exponent
• Small numbers have negative exponent
– Sorting does not follow naturally
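
The sorting argument can be illustrated with non-negative values: with a biased exponent placed above the significand, comparing the raw bit patterns as unsigned integers gives the same order as comparing the values. A minimal sketch, assuming Python's `struct` module; `f32_bits` is a hypothetical helper, and negative numbers (sign-magnitude) would need extra handling that is not shown:

```python
import struct

def f32_bits(x):
    """Return the 32-bit pattern of a single-precision value as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

values = [1024.0, 0.001, 3.75, 1e30, 0.5, 1.0]

by_value = sorted(values)               # ordinary floating-point comparison
by_bits = sorted(values, key=f32_bits)  # plain unsigned integer comparison
```

Both lists come out in the same order, so an integer comparator is enough for these operands.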

Excess or Biased Exponent
Exponent    2’s Compl    Excess-127
-127        1000 0001    0000 0000
-126        1000 0010    0000 0001
…           …            …
+127        0111 1111    1111 1110

• Value: (-1)^S x F x 2^(E-bias)


– SP: bias is 127
– DP: bias is 1023
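
The two encodings in the table can be generated mechanically. A minimal sketch; the function names `twos_complement8` and `excess127` are hypothetical:

```python
def twos_complement8(e):
    """8-bit 2's complement encoding of exponent e, as a bit string."""
    return format(e & 0xFF, "08b")

def excess127(e):
    """8-bit excess-127 (biased) encoding of exponent e, as a bit string."""
    return format(e + 127, "08b")

exponents = [-127, -126, 0, 127]
table = [(e, twos_complement8(e), excess127(e)) for e in exponents]
```

Sorting the rows by the excess-127 string preserves the numeric order of the exponents, which the 2's complement strings do not.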

Floating Point Normalization
• S,E,F representation allows more than one
representation for a particular value, e.g.
1.0 x 10^5 = 0.1 x 10^6 = 10.0 x 10^4
– This makes comparison operations difficult
– Prefer to have a single representation
• Hence, normalize by convention:
– Only one digit to the left of the radix point
– In binary, that digit must be a 1
• Since leading ‘1’ is implicit, no need to store it
• Hence, obtain one extra bit of precision for free
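
The implicit leading 1 shows up when a value is rebuilt from its stored fields. A minimal sketch for normalized single-precision numbers only, assuming the SP bias of 127; the function name `f32_value` is hypothetical:

```python
def f32_value(sign, exponent, fraction):
    """Value of a normalized single-precision number from its stored fields."""
    # The 23 stored bits are the fraction; the leading 1 is never stored.
    significand = 1 + fraction / 2**23
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)
```

With a stored significand field of 0, `f32_value(0, 127, 0)` is 1.0 rather than 0, because the hidden bit supplies the leading 1.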

FP Overflow/Underflow
• FP Overflow
– Analogous to integer overflow
– Result is too big to represent
– Means exponent is too big
• FP Underflow
– Result is too small to represent
– Means exponent is too small (too negative)
• Both can raise an exception under IEEE754
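
Both conditions are easy to provoke with double-precision values. A minimal sketch in Python, whose `*` and `/` on floats return infinity or zero here instead of raising an exception (an observation about this environment, not a statement about every platform):

```python
import math
import sys

big = sys.float_info.max      # largest finite double, about 1.8 x 10^308
overflowed = big * 2.0        # exponent too big: result becomes infinity

tiny = 5e-324                 # smallest positive (denormalized) double
underflowed = tiny / 2.0      # exponent too small: result becomes 0.0

is_overflow = math.isinf(overflowed)
is_underflow = underflowed == 0.0
```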

IEEE754 Special Cases
Single Precision           Double Precision           Value
Exponent    Significand    Exponent    Significand
0           0              0           0              0
0           nonzero        0           nonzero        denormalized
1-254       anything       1-2046      anything       fp number
255         0              2047        0              infinity
255         nonzero        2047        nonzero        NaN (Not a Number)
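
The single-precision columns of the table translate directly into a classifier over the exponent and significand fields. A minimal sketch; the function name `classify_f32` and its return strings are hypothetical labels taken from the Value column:

```python
def classify_f32(bits):
    """Classify a 32-bit single-precision pattern using the special-case table."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "fp number"   # exponent field 1-254
```

The sign bit is ignored, so `0x80000000` (negative zero) also classifies as zero.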

FP Rounding
• Rounding is important
– Small errors accumulate over billions of ops
• FP rounding hardware helps
– Compute extra guard bit beyond 23/52 bits
– Further, compute additional round bit beyond that
• Multiply may result in leading 0 bit, normalize shifts guard bit into
product, leaving round bit for rounding
– Finally, keep sticky bit that is set whenever ‘1’ bits are
“lost” to the right
• Differentiates between 0.5 and 0.500000000001
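
How the guard, round, and sticky bits drive a rounding decision can be sketched on an integer significand. The slides do not name a rounding mode; this sketch assumes round-to-nearest-even, and the function name `round_nearest_even` is hypothetical:

```python
def round_nearest_even(sig, extra):
    """Drop the `extra` (>= 2) low bits of integer `sig`, rounding to nearest even."""
    kept = sig >> extra                       # bits that survive
    guard = (sig >> (extra - 1)) & 1          # first bit beyond the kept bits
    round_bit = (sig >> (extra - 2)) & 1      # next bit beyond the guard bit
    # Sticky: set if any '1' bit is lost to the right of the round bit.
    sticky = int((sig & ((1 << (extra - 2)) - 1)) != 0)
    # Above one half: round up. Exactly one half: round up only if kept is odd.
    if guard and (round_bit or sticky or (kept & 1)):
        kept += 1
    return kept
```

With three extra bits, `0b1010100` is an exact tie and stays at `0b1010`, while `0b1010101` has the sticky bit set and rounds up to `0b1011`. This is the 0.5 versus 0.500000000001 distinction.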

Floating Point Addition
• Just like grade school
– First, align decimal points
– Then, add significands
– Finally, normalize result
• Example
  9.997 x 10^2          9.997000 x 10^2
+ 4.631 x 10^-1    +    0.004631 x 10^2
                   --------------------
Sum                    10.001631 x 10^2
Normalized             1.0001631 x 10^3
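
The three steps of the example can be written out in decimal. A minimal sketch using Python's `decimal` module so the digits match the slide exactly; the function name `decimal_fp_add` is hypothetical, and rounding to a fixed number of digits is left out:

```python
from decimal import Decimal

def decimal_fp_add(sig_a, exp_a, sig_b, exp_b):
    """Add sig_a x 10^exp_a and sig_b x 10^exp_b; return (significand, exponent)."""
    # Step 1: align. Make operand A the one with the larger exponent,
    # then shift the smaller operand's significand right.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    sig_b = sig_b.scaleb(exp_b - exp_a)
    # Step 2: add significands.
    total = sig_a + sig_b
    exp = exp_a
    # Step 3: normalize to exactly one nonzero digit left of the point.
    while abs(total) >= 10:
        total = total.scaleb(-1)
        exp += 1
    while total != 0 and abs(total) < 1:
        total = total.scaleb(1)
        exp -= 1
    return total, exp

result = decimal_fp_add(Decimal("9.997"), 2, Decimal("4.631"), -1)
```

`result` is the pair (1.0001631, 3), matching the normalized sum in the example.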

[Figure: FP adder datapath. Inputs: sign, exponent, and significand of each operand. A small ALU computes the exponent difference; control logic and muxes select the smaller number and shift its significand right; a big ALU adds the significands; the result is normalized (shift left or right, increment or decrement the exponent) and passed through rounding hardware. Outputs: sign, exponent, significand.]


FP Multiplication
• Sign: P_S = A_S xor B_S
• Exponent: P_E = A_E + B_E
– Due to bias/excess, must subtract bias
(lowercase e: true exponent; uppercase E: biased exponent; DP bias of 1023 shown)
e = e1 + e2
E = e + 1023 = e1 + e2 + 1023
E = (E1 – 1023) + (E2 – 1023) + 1023
E = E1 + E2 – 1023
• Significand: P_F = A_F x B_F
– Standard integer multiply (23b or 52b + g/r/s bits)
– Use Wallace tree of CSAs to sum partial products
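
The bias correction can be checked on real double-precision operands. A minimal sketch, assuming Python's `struct` module; `f64_exponent` is a hypothetical helper. The operands are chosen so that the significand product (1.5 x 1.25 = 1.875) needs no normalization shift, which would otherwise adjust the exponent by one:

```python
import struct

BIAS = 1023  # double-precision bias

def f64_exponent(x):
    """Return the stored (biased) 11-bit exponent field of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return (bits >> 52) & 0x7FF

e1 = f64_exponent(3.0)        # 3.0 = 1.5  x 2^1
e2 = f64_exponent(5.0)        # 5.0 = 1.25 x 2^2
predicted = e1 + e2 - BIAS    # E = E1 + E2 - 1023
actual = f64_exponent(3.0 * 5.0)
```

Simply adding the two stored fields would count the bias twice; subtracting it once gives the stored exponent of the product.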

FP Multiplication
• Compute sign, exponent, significand
• Normalize
– Shift left or right by 1
• Check for overflow, underflow
• Round
• Normalize again (if necessary)
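
The steps above can be sketched end to end for single precision. This is an illustration under stated assumptions, not the hardware algorithm from the slides: it handles normalized operands only (no zero, denormalized, infinity, or NaN inputs), assumes round-to-nearest-even, folds the round bit into the sticky bit, and reports overflow/underflow as Python exceptions. The names `f32_bits` and `fp32_mul` are hypothetical:

```python
import struct

def f32_bits(x):
    """Return the 32-bit pattern of a single-precision value as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def fp32_mul(a, b):
    """Multiply two normalized single-precision bit patterns; return the product bits."""
    # Compute sign, exponent, significand.
    sign = (a >> 31) ^ (b >> 31)
    exp = ((a >> 23) & 0xFF) + ((b >> 23) & 0xFF) - 127
    sig_a = (a & 0x7FFFFF) | (1 << 23)    # restore the implicit leading 1
    sig_b = (b & 0x7FFFFF) | (1 << 23)
    prod = sig_a * sig_b                  # 48-bit product, value in [1, 4)
    # Normalize: a product in [2, 4) needs one extra right shift.
    shift = 23
    if prod >> 47:
        shift = 24
        exp += 1
    kept = prod >> shift                  # 24 bits including the leading 1
    guard = (prod >> (shift - 1)) & 1
    sticky = (prod & ((1 << (shift - 1)) - 1)) != 0
    # Round to nearest even.
    if guard and (sticky or (kept & 1)):
        kept += 1
        if kept >> 24:                    # rounding carried out: normalize again
            kept >>= 1
            exp += 1
    # Check for overflow, underflow.
    if exp >= 255:
        raise OverflowError("exponent too big")
    if exp <= 0:
        raise ArithmeticError("exponent too small (underflow)")
    return (sign << 31) | (exp << 23) | (kept & 0x7FFFFF)

product = fp32_mul(f32_bits(1.5), f32_bits(-2.5))
```

`product` equals `f32_bits(-3.75)`. In this sketch the overflow check runs after rounding, so a carry out of the rounding step is also covered.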

Summary
• Floating point representation
– Normalization
– Overflow, underflow
– Rounding
• Floating point add
• Floating point multiply

