This Unit: Arithmetic and ALU Design (Floating Point Arithmetic)
ECE 152 (2011), Daniel J. Sorin, adapted from Roth and Lebeck
This Unit: Arithmetic and ALU Design
Integer Arithmetic and ALU
Binary number representations
Addition and subtraction
The integer ALU
Shifting and rotating
Multiplication
Division
Floating Point Arithmetic
Binary number representations
FP arithmetic
Accuracy
Floating Point Arithmetic
Formats
Precision and range
IEEE 754 standard
Operations
Addition and subtraction
Multiplication and division
Error analysis
Error and bias
Rounding and truncation
Only scientists care?
Floating Point (FP) Numbers
Floating point numbers: numbers in scientific notation
Two uses
Use #1: real numbers (numbers with non-zero fractions)
3.1415926
2.1878
9.8
6.62 * 10^-34
5.875
Use #2: really big numbers
3.0 * 10^8
6.02 * 10^23
The World Before Floating Point
Early computers were built for scientific calculations
ENIAC: ballistic firing tables
But didn't have primitive floating point data types
Circuits were big
Many accuracy problems
Programmers built scale factors into programs
Large constant multiplier turns all FP numbers to integers
Before program starts, inputs multiplied by scale factor manually
After program finishes, outputs divided by scale factor manually
Yuck!
The Fixed Width Dilemma
Natural arithmetic has infinite width
Infinite number of integers
Infinite number of reals
Infinitely more reals than integers (head spinning)
Hardware arithmetic has finite width N (e.g., 16, 32, 64)
Can represent 2^N numbers
If you could represent 2^N integers, which would they be?
Easy! The 2^(N-1) on either side of 0
If you could represent 2^N reals, which would they be?
2^N reals from 0 to 1, not too useful
2^N powers of two (1, 2, 4, 8, ...), also not too useful
Something in between: yes, but what?
Range and Precision
Range
Distance between largest and smallest representable numbers
Want big range
Precision
Distance between two consecutive representable numbers
Want small precision
In a fixed bit width, can't have unlimited amounts of both
Scientific Notation
Scientific notation: good compromise
Number [S, F, E] = S * F * 2^E
S: sign
F: significand (fraction)
E: exponent
Floating point: binary (decimal) point "floats" to a different magnitude for each number
+ Sliding window of precision using notion of significant digits
Small numbers very precise, many places after decimal point
Big numbers are much less so, not all integers representable
But for those instances you don't really care anyway
Caveat: most representations are just approximations
Sometimes weirdos like 0.9999999 or 1.0000001 come up
+But good enough for most purposes
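A quick way to see these "weirdos" for yourself: a minimal C demo, assuming only a compiler with IEEE 754 doubles.

```c
#include <stdio.h>

int main(void) {
    // 0.1 has no finite binary expansion, so the stored double
    // is only the nearest representable approximation
    printf("%.20f\n", 0.1);        // 0.10000000000000000555...
    // The approximation shows up in arithmetic, too
    printf("%.17f\n", 0.1 + 0.2);  // 0.30000000000000004
    return 0;
}
```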
IEEE 754 Standard Precision/Range
Single precision: float in C
32-bit: 1-bit sign + 8-bit exponent + 23-bit significand
Range: 2.0 * 10^-38 < N < 2.0 * 10^38
Precision: ~7 significant (decimal) digits
Double precision: double in C
64-bit: 1-bit sign + 11-bit exponent + 52-bit significand
Range: 2.0 * 10^-308 < N < 2.0 * 10^308
Precision: ~15 significant (decimal) digits
Numbers > 10^308 don't come up in many calculations
10^80 ~ number of atoms in universe
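In C, these limits are exposed by the standard <float.h> header; a small sketch that prints them (exact values vary slightly by platform, but the macros are standard):

```c
#include <stdio.h>
#include <float.h>

int main(void) {
    // Largest finite value, smallest normalized value, and decimal
    // digits of precision for each IEEE 754 type
    printf("float:  max %e  min normal %e  ~%d decimal digits\n",
           FLT_MAX, FLT_MIN, FLT_DIG);
    printf("double: max %e  min normal %e  ~%d decimal digits\n",
           DBL_MAX, DBL_MIN, DBL_DIG);
    return 0;
}
```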
How Do Bits Represent Fractions?
Sign: 0 or 1 → easy
Exponent: signed integer → also easy
Significand: unsigned fraction → not obvious!
How do we represent integers?
Sums of positive powers of two
S-bit unsigned integer A: A_(S-1)*2^(S-1) + A_(S-2)*2^(S-2) + ... + A_1*2^1 + A_0*2^0
So how can we represent fractions?
Sums of negative powers of two
S-bit unsigned fraction A: A_(S-1)*2^0 + A_(S-2)*2^(-1) + ... + A_1*2^(-S+2) + A_0*2^(-S+1)
More significant bits correspond to larger multipliers
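A minimal C sketch of this weighting scheme (the function name and the 3-bit example are ours, for illustration only):

```c
#include <stdio.h>
#include <stdint.h>

// Interpret an S-bit pattern as an unsigned fraction: the MSB has
// weight 2^0, the next bit 2^-1, and so on (as in the formula above).
double fraction_value(uint32_t bits, int s) {
    double value = 0.0, weight = 1.0;  // weight of bit S-1 is 2^0
    for (int i = s - 1; i >= 0; i--) {
        if (bits & (1u << i))
            value += weight;
        weight /= 2.0;                 // each lower bit is worth half as much
    }
    return value;
}

int main(void) {
    // 1.25 = 1*2^0 + 0*2^-1 + 1*2^-2 -> bit pattern 101
    printf("%f\n", fraction_value(0x5, 3));  // prints 1.250000
    return 0;
}
```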
Some Examples
What is 5 in floating point?
Sign: 0
5 = 1.25 * 2^2
Significand: 1.25 = 1*2^0 + 1*2^-2 = 101 0000 0000 0000 0000 0000
Exponent: 2 = 0000 0010
What is -0.5 in floating point?
Sign: 1
0.5 = 0.5 * 2^0
Significand: 0.5 = 1*2^-1 = 010 0000 0000 0000 0000 0000
Exponent: 0 = 0000 0000
Normalized Numbers
Notice
5 is 1.25 * 2^2
But isn't it also 0.625 * 2^3 and 0.3125 * 2^4 and ...?
With 8-bit exponent, we can have 8 representations of 5
Multiple representations of one number is a bad idea
Would lead to computational errors
Would waste bits
Solution: choose normal (canonical) form
Disallow de-normalized numbers
IEEE 754 normal form: coefficient of 2^0 is always 1
Similar to scientific notation: one non-zero digit left of decimal
Normalized representation of 5 is 1.25 * 2^2 (1.25 = 1*2^0 + 1*2^-2)
0.625 * 2^3 is de-normalized (0.625 = 0*2^0 + 1*2^-1 + 1*2^-3)
More About Normalization
What is -0.5 in normalized floating point?
Sign: 1
0.5 = 1 * 2^-1
Significand: 1 = 1*2^0 = 100 0000 0000 0000 0000 0000
Exponent: -1 = 1111 1111 (assuming 2's complement for now)
IEEE 754: no need to represent coefficient of 2^0 explicitly
It's always 1
+ Buy yourself an extra bit of precision
Pretty cute trick
Problem: what about 0?
How can we represent 0 if 2^0 is always implicitly 1?
IEEE 754: The Whole Story
Exponent: signed integer → not so fast
Exponent represented in excess or bias notation
N bits typically can represent signed numbers from -2^(N-1) to 2^(N-1)-1
But in IEEE 754, they represent exponents from -2^(N-1)+2 to 2^(N-1)-1
And they represent those as unsigned with an implicit 2^(N-1)-1 added
Implicit added quantity is called the bias
Actual exponent is E - (2^(N-1)-1)
Example: single precision (8-bit exponent)
Bias is 127, exponent range is -126 to 127
-126 is represented as 1 = 0000 0001
127 is represented as 254 = 1111 1110
0 is represented as 127 = 0111 1111
1 is represented as 128 = 1000 0000
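A small C sketch that pulls these fields out of a real single-precision value and undoes the bias (assumes the usual 32-bit IEEE 754 float layout):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float f = 5.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);       // reinterpret the float's bits

    uint32_t sign = bits >> 31;
    uint32_t exp  = (bits >> 23) & 0xFF;  // 8-bit biased exponent
    uint32_t frac = bits & 0x7FFFFF;      // 23 stored significand bits

    // Actual exponent is the stored field minus the bias of 127
    printf("sign=%u  biased exp=%u  actual exp=%d  frac=0x%06X\n",
           sign, exp, (int)exp - 127, frac);
    // For 5.0 = 1.25 * 2^2: sign=0, biased exp=129, actual exp=2, frac=0x200000
    return 0;
}
```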
IEEE 754: Continued
Notice: two exponent bit patterns are unused
0000 0000: represents de-normalized numbers
Numbers that have implicit 0 (rather than 1) in the 2^0 place
Zero is a special kind of de-normalized number
+Exponent is all 0s, significand is all 0s
There are both +0 and -0, but they are considered the same
De-normals also represent numbers smaller than the smallest normalized numbers
1111 1111: represents infinity and NaN
±infinities have 0s in the significand
NaNs do not
IEEE 754: To Infinity and Beyond
What are infinity and NaN used for?
To allow operations to proceed past overflow/underflow situations
Overflow: operation yields exponent greater than 2^(N-1)-1
Underflow: operation yields exponent less than -2^(N-1)+2
IEEE 754 defines operations on infinity and NaN
N / 0 = infinity
N / infinity = 0
0 / 0 = NaN
Infinity / infinity = NaN
Infinity - infinity = NaN
Anything and NaN = NaN
Will not test you on these rules
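You can watch these rules in action from C (assuming the IEEE 754 default mode, where division by zero produces infinity rather than a trap):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double zero = 0.0;
    double inf  = 1.0 / zero;   // N / 0 = infinity
    double nan  = zero / zero;  // 0 / 0 = NaN

    printf("1/0       = %f\n", inf);         // inf
    printf("1/inf     = %f\n", 1.0 / inf);   // 0.000000
    printf("inf - inf = %f\n", inf - inf);   // nan
    printf("isnan     = %d\n", isnan(nan));  // 1
    // NaN is the one value that compares unequal even to itself
    printf("nan==nan  = %d\n", nan == nan);  // 0
    return 0;
}
```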
IEEE 754: Final Format
Biased exponent
Normalized significand
Exponent uses more significant bits than significand
Helps when comparing FP numbers
Exponent bias notation helps there too (why?)
Every computer since about 1980 supports this standard
Makes code portable (at the source level at least)
Makes hardware faster (stand on each other's shoulders)
[Figure: bit layout: sign | exp | significand]
Floating Point Arithmetic
We will look at
Addition/subtraction
Multiplication/division
Implementation
Basically, integer arithmetic on significand and exponent
Using integer ALUs
Plus extra hardware for normalization
To help us here, look at a toy "quarter precision" format
8 bits: 1-bit sign + 3-bit exponent + 4-bit significand
Bias is 3 (= 2^(N-1) - 1)
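A sketch of a decoder for this toy format (quarter_to_double is our own hypothetical helper; it handles normalized values only, and the hex patterns store just the 4 explicit fraction bits):

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

// Decode the toy 8-bit format: 1-bit sign, 3-bit exponent (bias 3),
// 4 stored significand bits with an implicit leading 1.
double quarter_to_double(uint8_t q) {
    int sign = (q >> 7) & 0x1;
    int exp  = (q >> 4) & 0x7;             // biased exponent field
    int frac = q & 0xF;                    // stored significand bits
    double sig = 1.0 + frac / 16.0;        // implicit 1 plus 4 fraction bits
    double val = sig * pow(2.0, exp - 3);  // subtract the bias of 3
    return sign ? -val : val;
}

int main(void) {
    // 0 101 1110 (stored bits) = 1.875 * 2^2 = 7.5
    printf("%f\n", quarter_to_double(0x5E));  // prints 7.500000
    return 0;
}
```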
FP Addition
Assume
A represented as bit pattern [S_A, E_A, F_A]
B represented as bit pattern [S_B, E_B, F_B]
What is the bit pattern for A+B: [S_A+B, E_A+B, F_A+B]?
[S_A + S_B, E_A + E_B, F_A + F_B]? Nope!
So what is it then?
FP Addition Decimal Example
Let's look at a decimal example first: 99.5 + 0.8
9.95*10^1 + 8.0*10^-1
Step I: align exponents (if necessary)
Temporarily de-normalize operand with smaller exponent
Add 2 to its exponent → must shift significand right by 2
8.0*10^-1 → 0.08*10^1
Step II: add significands
9.95*10^1 + 0.08*10^1 → 10.03*10^1
Step III: normalize result
Shift significand right by 1 and then add 1 to exponent
10.03*10^1 → 1.003*10^2
FP Addition (Quarter Precision) Example
Now a binary quarter-precision example: 7.5 + 0.5
7.5 = 1.875*2^2 = 0 101 11110 (the leading 1 is the implicit 1)
1.875 = 1*2^0 + 1*2^-1 + 1*2^-2 + 1*2^-3
0.5 = 1*2^-1 = 0 010 10000
Step I: align exponents (if necessary)
0 010 10000 → 0 101 00010
Add 3 to exponent → shift significand right by 3
Step II: add significands
0 101 11110 + 0 101 00010 = 0 101 100000
Step III: normalize result
0 101 100000 → 0 110 10000
Shift significand right by 1 → add 1 to exponent
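The same three steps in code: a minimal C sketch for the toy format, assuming positive operands only (signs, rounding, overflow, and denormals deliberately ignored so the steps stay visible):

```c
#include <stdio.h>
#include <stdint.h>

// Add two positive toy-format values: 3-bit exponent (bias 3),
// 4 stored significand bits plus the implicit leading 1.
uint8_t quarter_add(uint8_t a, uint8_t b) {
    int ea = (a >> 4) & 0x7, eb = (b >> 4) & 0x7;
    int fa = 0x10 | (a & 0xF);   // restore implicit 1 -> 5-bit significand
    int fb = 0x10 | (b & 0xF);

    // Step I: align exponents -- shift smaller operand's significand right
    if (ea < eb)      { fa >>= (eb - ea); ea = eb; }
    else if (eb < ea) { fb >>= (ea - eb); }

    // Step II: add significands
    int f = fa + fb;

    // Step III: normalize -- if the sum carried out, shift right, bump exponent
    if (f & 0x20) { f >>= 1; ea += 1; }

    return (uint8_t)((ea << 4) | (f & 0xF));  // drop the implicit 1 again
}

int main(void) {
    // 7.5 (0 101 1110) + 0.5 (0 010 0000) = 8.0 (0 110 0000), stored bits
    printf("0x%02X\n", quarter_add(0x5E, 0x20));  // prints 0x60
    return 0;
}
```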
FP Addition Hardware
[Figure: FP adder datapath. Exponents E1, E2 and significands F1, F2 enter; an exponent subtractor drives a right-shifter (>>) that aligns the smaller significand; an adder (+) sums the significands; a final shifter and exponent adjust normalize the result (E, F), all under control logic (ctrl).]
What About FP Subtraction?
Or addition of negative quantities for that matter
How to subtract significands that are not in two's complement (TC) form?
Can we still use an adder?
Trick: internally and temporarily convert to TC
Add phantom 2 in front (1*2^1)
Use standard negation trick
Add as usual
If phantom 2 bit is 1, result is negative
Negate it using standard trick again, flip result sign bit
Then ignore phantom bit (which is now 0 anyway)
You'll want to try this at home!
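If you do try it at home, here is one way the trick might look in C for 5-bit significands that are already aligned and have their implicit 1 restored (a sketch under those assumptions, not production FP hardware):

```c
#include <stdio.h>

// Subtract 5-bit significand magnitudes via the "phantom bit" trick.
// Returns the magnitude; *neg is set if the true result is negative.
int sub_significands(int fa, int fb, int *neg) {
    const int MASK = 0x3F;  // 6 bits: phantom bit (weight 2^1) + 5 significand bits
    int diff = (fa + ((~fb + 1) & MASK)) & MASK;  // fa + TC(fb) on a plain adder
    if (diff & 0x20) {                  // phantom bit set -> result is negative
        diff = (~diff + 1) & MASK;      // negate back using the same trick
        *neg = 1;                       // caller flips the result's sign bit
    } else {
        *neg = 0;
    }
    return diff & 0x1F;                 // drop the phantom bit (now 0 anyway)
}

int main(void) {
    int neg;
    printf("%d neg=%d\n", sub_significands(0x1E, 0x02, &neg), neg);  // 28 neg=0
    printf("%d neg=%d\n", sub_significands(0x02, 0x1E, &neg), neg);  // 28 neg=1
    return 0;
}
```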
FP Multiplication
Assume
A represented as bit pattern [S_A, E_A, F_A]
B represented as bit pattern [S_B, E_B, F_B]
What is the bit pattern for A*B: [S_A*B, E_A*B, F_A*B]?
This one is actually a little easier (conceptually) than addition
Scientific notation is logarithmic
In logarithmic form: multiplication is addition
[S_A XOR S_B, E_A + E_B, F_A * F_B]? Pretty much, except for
Normalization
Addition of exponents in biased notation (must subtract bias)
Tricky: when multiplying two normalized F-bit significands
Where is the binary point?
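A sketch of the multiply recipe for the toy format (again ignoring rounding, overflow, and denormals; the binary-point bookkeeping is in the comments):

```c
#include <stdio.h>
#include <stdint.h>

// Multiply two toy-format values: XOR signs, add exponents minus bias,
// multiply significands, then normalize.
uint8_t quarter_mul(uint8_t a, uint8_t b) {
    int sign = ((a >> 7) ^ (b >> 7)) & 0x1;              // S = S_A XOR S_B
    int exp  = ((a >> 4) & 0x7) + ((b >> 4) & 0x7) - 3;  // add exponents, subtract bias
    int fa = 0x10 | (a & 0xF), fb = 0x10 | (b & 0xF);    // restore implicit 1s

    // 5-bit * 5-bit significand product has 10 bits; with 4 fraction
    // bits per operand, the binary point sits 8 places from the right
    int prod = fa * fb;
    if (prod & 0x200) { prod >>= 1; exp += 1; }  // product in [2,4): renormalize

    // Keep bits 7..4 as the stored fraction (low bits truncated, no rounding)
    return (uint8_t)((sign << 7) | ((exp & 0x7) << 4) | ((prod >> 4) & 0xF));
}

int main(void) {
    // 1.5 * 2.5 = 3.75:  0 011 1000 * 0 100 0100 -> 0 100 1110
    printf("0x%02X\n", quarter_mul(0x38, 0x44));  // prints 0x4E
    return 0;
}
```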
FP Division
Assume
A represented as bit pattern [S_A, E_A, F_A]
B represented as bit pattern [S_B, E_B, F_B]
What is the bit pattern for A/B: [S_A/B, E_A/B, F_A/B]?
[S_A XOR S_B, E_A - E_B, F_A / F_B]? Pretty much, again except for
Normalization
Subtraction of exponents in biased notation (must add bias)
Binary point placement
No need to worry about remainders, either
A little bit of irony
Multiplication/division roughly same complexity for FP and integer
Addition/subtraction much more complicated for FP than integer
Accuracy
Remember our decimal addition example?
9.95*10^1 + 8.00*10^-1 → 1.003*10^2
Extra decimal place caused by de-normalization
But what if our representation only has two digits of precision?
What happens to the 3?
Corresponding binary question: what happens to extra 1s?
Solution: round
Option I: round down (truncate), no hardware necessary
Option II: round up (round), need an incrementer
Why is rounding up called "round"?
Because an extra 1 is exactly half-way, and half-way rounds up
More About Accuracy
Problem with both truncation and rounding
They cause errors to accumulate
E.g., if always round up, result will gradually crawl upwards
One solution: round to nearest even
If un-rounded LSB is 1 → round up (011 → 10)
If un-rounded LSB is 0 → round down (001 → 00)
Round up half the time, down the other half → overall error is stable
Another solution: multiple intermediate precision bits
IEEE 754 defines 3: guard + round + sticky
Guard and round are shifted by de-normalization as usual
Sticky is 1 if any shifted out bits are 1
Round up if 101 or higher, round down if 011 or lower
Round to nearest even if 100
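The guard/round/sticky decision is just a 3-bit comparison; a minimal sketch (round_up is our own hypothetical helper, not part of any standard):

```c
#include <stdio.h>

// Decide rounding from the 3 intermediate bits (guard=4, round=2,
// sticky=1), per the rule above. Returns 1 if the kept significand
// should be incremented.
int round_up(unsigned grs, unsigned kept_lsb) {
    if (grs > 4) return 1;   // 101 or higher: round up
    if (grs < 4) return 0;   // 011 or lower: round down
    return kept_lsb & 1;     // exactly 100: round to nearest even
}

int main(void) {
    printf("%d\n", round_up(0x5, 0));  // 101 -> 1 (round up)
    printf("%d\n", round_up(0x3, 1));  // 011 -> 0 (round down)
    printf("%d\n", round_up(0x4, 1));  // 100, odd LSB  -> 1 (to even)
    printf("%d\n", round_up(0x4, 0));  // 100, even LSB -> 0 (to even)
    return 0;
}
```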
Numerical Analysis
Accuracy problems sometimes get bad
Addition of big and small numbers
Subtraction of big numbers
Example: what's 1*10^30 + 1*10^0 - 1*10^30?
Intuitively: 1*10^0 = 1
But: (1*10^30 + 1*10^0) - 1*10^30 = (1*10^30 - 1*10^30) = 0
Numerical analysis: field formed around this problem
Bounding error of numerical algorithms
Re-formulating algorithms in a way that bounds numerical error
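The same catastrophe is easy to reproduce in C with doubles (assuming the compiler does not reassociate FP expressions, which is the default):

```c
#include <stdio.h>

int main(void) {
    double big = 1e30, small = 1.0;
    // Adding 1 to 1e30 is lost: a double carries only ~15 significant
    // decimal digits, so the nearest representable sum is 1e30 itself
    printf("%f\n", (big + small) - big);  // prints 0.000000
    // Re-formulating (reassociating by hand) recovers the intuitive answer
    printf("%f\n", (big - big) + small);  // prints 1.000000
    return 0;
}
```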
One Last Thing About Accuracy
Suppose you added two numbers and came up with
0 101 11111 101
What happens when you round?
Number becomes de-normalized → arrrrgggghhh
FP adder actually has six steps, not three
Align exponents
Add/subtract significands
Re-normalize
Round
Potentially re-normalize again
Potentially round again
Accuracy, Shmaccuracy?
Only scientists care? Au contraire
Intel 486 used equivalent of modified Booth's for division
Generate multiple quotient bits per step
Requires you to guess quotient bits and adjust later
Guess taken from a lookup table implemented as a PLA
Along came Pentium
PLA was optimized to return 0 for impossible table indices
Which turned out not to be impossible after all
Result: precision errors in the 4th to 15th decimal places for some divisors
Pentium fdiv bug is born
Pentium FDIV Bug
Pentium shipped in August 1994
Intel actually knew about the bug in July
But calculated that delaying the project a month would cost ~$1M
And that in reality only a dozen or so people would encounter it
They were right, but one of them took the story to EE Times
By November 1994, firestorm was full on
IBM said that typical Excel user would encounter bug every month
Assumed 5K divisions per second around the clock
People believed the story
IBM stopped shipping Pentium PCs
By December 1994, Intel promised a full recall
Total cost: ~$550M
All for a bug that in reality affected maybe a dozen people
Summary of Floating Point
FP representation
S * F * 2^E
IEEE 754 standard
Representing fractions
Normalized numbers
FP operations
Addition/subtraction: hard
Multiplication/division: logarithmic → no harder than integer
Accuracy problems
Rounding and truncation
Upshot: FP hardware is tough
Thank your lucky stars that the ECE 152 project has no FP
Unit Recap: Arithmetic and ALU Design
Integer Arithmetic and ALU
Binary number representations
Addition and subtraction
The integer ALU
Shifting and rotating
Multiplication
Division
Floating Point Arithmetic
Binary number representations
FP arithmetic
Accuracy