CS220 Lecture 12

The document discusses various number representation systems in computer arithmetic, including positional, fixed-point, and floating-point systems, as well as signed and unsigned representations. It explains the complexities of arithmetic operations and overflow conditions in different systems, particularly focusing on signed-magnitude, biased, and complement representations, including 1's and 2's complement. The document also covers hardware implementations for these representations and the implications of overflow in integer operations.

Computer Organization: Computer Arithmetic: Number Representation

Debapriya Basu Roy


Department of Computer Science & Engineering
Indian Institute of Technology Kanpur
dbroy@cse.iitk.ac.in
dbroy24@gmail.com
Number System
• The oldest method for representing numbers consisted of the use of stones
or sticks.
• Symbolic Representation: Roman numeral system: the units of this system are 1, 5, 10, 50, 100, 500, 1000, 10,000, and 100,000, denoted by the symbols I, V, X, L, C, D, M, ((I)), and (((I))).
• Not suitable for representing very large numbers. Furthermore, it is difficult to do arithmetic on numbers represented with this notation.
• Positional Number System: the value represented by each symbol depends
not only on its shape but also on its position relative to other symbols. Our
conventional method of representing numbers is based on a positional
system.
• In digital systems, numbers are encoded by means of binary digits or bits.
• If we have 4 bits to represent numbers, there are 16 possible codes.
• We are free to assign the 16 codes to numbers as we please. However, since number representation has significant effects on algorithm and circuit complexity, only some of the wide range of possibilities have found applications.
• The assignment of codes to numbers must be done in a logical and systematic manner. For example, if we assign codes to 2 and 3 but not to 5, then adding 2 and 3 will cause an "overflow" (yields an unrepresentable value) in the number system.
Some Examples

• Unsigned: interpret the 4-bit patterns as 4-bit binary numbers, covering the range [0, 15].
• The signed-magnitude scheme results in integers in the range [−7, 7] being represented, with 0 having two representations.
• The 3-plus-1 fixed-point number system (3 whole bits, 1 fractional bit) gives us numbers from 0 to 7.5 in increments of 0.5. Viewing the 4-bit codes as signed fractions gives us a range of [−0.875, +0.875].
• The 2-plus-2 unsigned floating-point number system, with its 2-bit exponent e in {−2, −1, 0, 1} and 2-bit integer significand s in {0, 1, 2, 3}, can represent certain values s × 2^e in [0, 6].
Fixed Radix Positional Number System

• A fixed-point positional number system is based on a positive integer radix (base) r and an implicit digit set {0, 1, ··· , r − 1}. Each unsigned number is represented by a digit vector of length k + l, with k digits for the whole part and l digits for the fractional part.

• Digit sets composed of consecutive integers: {−α, −α + 1, ··· , β − 1, β} = [−α, β]

• Example: Balanced ternary number system: r = 3, digit set = [−1, 1].

• Example: Negative-radix number systems: radix −r, digit set = [0, r − 1]. With r = 2 and digit set [0, 1], we get the negabinary number system.
Some More Examples:
• Nonredundant signed-digit number systems: digit set [−α, r − 1 − α] for radix r. As an example, one can use the digit set [−4, 5] for r = 10: the digit vector (3 −1 5)ten represents 3×100 + (−1)×10 + 5 = 295.

• Redundant signed-digit number systems: digit set [−α, β], with α + β ≥ r, for radix r. For example, one can use the digit set [−7, 7] for r = 10; the value 295 then has several representations: (3 −1 5)ten = (3 0 −5)ten = (1 −7 0 −5)ten.

• Fractional-radix number system: r = 0.1 with digit set [0, 9].

• Irrational-radix number system: r = √2, with digit set [0, 1].


Signed and Unsigned Representation

• In radix r, with the standard digit set [0, r − 1], the number of digits needed to represent the natural numbers in [0, max] is
  k = ⌈log_r(max + 1)⌉ = ⌊log_r max⌋ + 1

• If a computation generates a number that cannot fit within these k digits (that is, greater than max), an overflow is said to have occurred.

• Non-negative integers (often called unsigned integers) can be represented using standard binary.
• Negative integers require the sign to be encoded appropriately.

• Four possibilities have been tried: signed magnitude, two's complement, one's complement, and biased.

• Signed magnitude is the simplest: reserve the most significant bit to represent the sign of the integer. However, it gives an ambiguous representation of zero (00…0 and 100…0), and addition requires extra logic to set the result's sign.
Signed Magnitude Representation

• Sign-and-magnitude format: 1 bit is devoted to the sign.
• 1 denotes a negative sign and 0 a positive sign.
• For radix-2 numbers with a total width of k bits, k − 1 bits are available to represent the magnitude or absolute value of the number. The range of k-bit signed-magnitude binary numbers is thus [−(2^{k−1} − 1), 2^{k−1} − 1].
• Addition of numbers with unlike signs (subtraction) must be handled differently from that of same-sign operands.
• Added overhead for detecting "−0" and changing it to "+0".
Signed Magnitude: Hardware

• The control circuit receives as inputs the operation to be performed (0 = add, 1 = subtract), the signs of the two operands x and y, the carry-out of the adder, and the sign of the addition result.
• It produces signals for the adder's carry-in, complementation of x, complementation of the addition result, and the sign of the result.
• Note that complementation hardware is provided only for the x operand. This is because x − y can be obtained by first computing y − x and then changing the sign of the result.
Biased Representation

• The biased representation is based on adding a positive value bias to all numbers, allowing us to represent the integers from −bias to max − bias using unsigned values from 0 to max.
• Signed integers in the range [−8, +7] can be encoded as unsigned values 0 through 15 by using a bias of 8.
  x + y + bias = (x + bias) + (y + bias) − bias
  x − y + bias = (x + bias) − (y + bias) + bias
• Multiplication and division become very complicated. For this reason, the practical use of biased representation is limited to floating-point numbers (a sketch of biased addition follows).
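A minimal C sketch of bias-8 encoding and the addition rule above (the helper names encode and decode are illustrative, not from the lecture):

#include <stdio.h>

#define BIAS 8  /* 4-bit field: represents [-8, +7] as unsigned codes 0..15 */

unsigned encode(int x)    { return (unsigned)(x + BIAS); } /* signed -> biased code */
int      decode(unsigned c) { return (int)c - BIAS; }      /* biased code -> signed */

int main(void) {
    int x = -3, y = 5;
    /* x + y + bias = (x + bias) + (y + bias) - bias */
    unsigned sum = encode(x) + encode(y) - BIAS;
    printf("%d + %d -> code %u -> value %d\n", x, y, sum, decode(sum)); /* value 2 */
    return 0;
}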
Complement Representation

• A suitably large complementation constant M is selected, and the negative value −x is represented as the unsigned value M − x.
• To represent integers in the range [−N, +P] unambiguously, the complementation constant M must satisfy M ≥ N + P + 1.
• M = N + P + 1 yields maximum efficiency.
Complement Representation

• Two auxiliary operations are required for complement representations to be effective: complementation or change of sign (computing M − x) and computation of residues mod M.

• If finding M − x requires subtraction and finding residues mod M implies division, then complement representation becomes quite inefficient.

• In radix-complement representations, modulo-M reduction is done by ignoring the carry-out from digit position k − 1 in a (k + l)-digit radix-r addition. For digit-complement representations, computing the complement of x (i.e., M − x) is done by simply replacing each nonzero digit x_i by r − 1 − x_i. This is particularly easy if r is a power of 2.
2’s Complement

• For r = 2, the radix-complement representation that corresponds to M = 2^k is known as 2's complement.
• The 2's complement of a number x can be found via bitwise complementation of x and the addition of ulp (unit in the least significant position). For integers, ulp = 1.
  2^k − x = [(2^k − ulp) − x] + ulp = x_compl + ulp
• To add numbers modulo 2^k, we simply drop a carry-out of 1 produced by position k − 1.
• The range of representable numbers in a 2's-complement number system with k whole bits is from −2^{k−1} to 2^{k−1} − ulp.
• If complementation is done as a separate sign change operation, it must include overflow detection (see the sketch below).
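As a sketch, two's-complement negation of an 8-bit value in C (the helper negate8 is illustrative; narrowing casts are assumed to wrap, as they do on typical machines):

#include <stdint.h>
#include <stdio.h>

/* Negate a k=8-bit two's-complement value: bitwise complement plus ulp (= 1).
   The single overflow case is negating -128, since +128 is unrepresentable. */
int8_t negate8(int8_t x, int *overflow) {
    *overflow = (x == INT8_MIN);
    return (int8_t)(~x + 1);  /* wraps back to -128 for x = -128 on typical machines */
}

int main(void) {
    int ovf;
    printf("%d\n", negate8(-37, &ovf));  /* 37, ovf = 0 */
    negate8(INT8_MIN, &ovf);
    printf("overflow = %d\n", ovf);      /* overflow = 1 */
    return 0;
}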
Digit or Diminished Radix Complement: 1’s complement
• The digit- or diminished-radix-complement representation is known as 1's complement in the special case of r = 2.

• The complementation constant is M = 2^k − ulp.

• Note that compared with the 2's-complement representation, the representation for −8 has been eliminated and instead an alternate code has been assigned to 0 (technically, −0).

• This may somewhat complicate 0 detection in that both the all-0s and the all-1s patterns represent 0.

• The arithmetic circuits can be designed such that the all-1s pattern is detected and automatically converted to the all-0s pattern.

• The 1's complement of a number x can be found by bitwise complementation: (2^k − ulp) − x = x_compl
1’s Complement
• To add numbers modulo 2^k − ulp, the carry-out of our (k + l)-bit adder should be directly connected to its carry-in; this is known as end-around carry.
• The foregoing scheme properly handles any sum that equals or exceeds 2^k. When the sum is 2^k − ulp, however, the carry-out will be zero and modular reduction is not accomplished. As suggested earlier, such an all-1s result can be interpreted as an alternate representation of 0 that is either kept intact (making 0 detection more difficult) or is automatically converted by hardware to +0.
• The range of representable numbers in a 1's-complement number system with k whole bits is from −(2^{k−1} − ulp) to 2^{k−1} − ulp.
• Note that for integers, ulp = 1. A sketch of end-around-carry addition follows.
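A sketch of 4-bit 1's-complement addition with end-around carry in C (the 4-bit width is an assumption chosen for illustration):

#include <stdio.h>

/* Add two 4-bit 1's-complement values modulo 2^4 - ulp = 15. */
unsigned add1c(unsigned a, unsigned b) {
    unsigned s = a + b;                 /* raw sum, up to 5 bits */
    if (s > 0xF) s = (s & 0xF) + 1;     /* end-around carry: carry-out feeds carry-in */
    return s & 0xF;
}

int main(void) {
    /* -3 is 1100 and +5 is 0101; the raw sum 1 0001 wraps to 0010 = +2 */
    printf("0x%X\n", add1c(0xC, 0x5));  /* 0x2 */
    return 0;
}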
2’s complement for fraction: ulp

• Consider the number −101.703125 (decimal).
• The binary equivalent of its magnitude, 101.703125, is 01100101.10110100.
• The 1's complement is 10011010.01001011.
• The 2's complement is then 10011010.01001011 + ulp.
• ulp = unit in the least significant position, representing the smallest possible value that can be represented within the given bit precision for the fractional component of a number.
• So here ulp = .00000001, and the 2's-complement value is 10011010.01001011 + .00000001 = 10011010.01001100.
• Check the cases of 101.000 and 101: for 101.000, ulp is .001, and for 101, ulp is 1.
Sign Extension
• Occasionally we need to extend the number of digits in an operand to make it the same length as another operand.
• For example, if a 16-bit number is to be added to a 32-bit number, the former is first converted to 32-bit format, with the two 32-bit numbers then added using a 32-bit adder.
• Unsigned or signed-magnitude fixed-point binary numbers can be extended from the left (whole part) or the right (fractional part) by simply padding them with 0s.
• Given a 2's-complement number x_{k−1}x_{k−2} ··· x_1x_0 . x_{−1}x_{−2} ··· x_{−l}, extension is achieved from the left by replicating the sign bit (sign extension) and from the right by padding with 0s:
  ··· x_{k−1}x_{k−1}x_{k−1}x_{k−1}x_{k−2} ··· x_1x_0 . x_{−1}x_{−2} ··· x_{−l}000 ···
• When the number of whole (fractional) digits is increased from k (l) to k' (l'), the complementation constant increases from M = 2^k to M' = 2^{k'}.
• M' − M = 2^{k'} − 2^k = 2^k(2^{k'−k} − 1). This difference is a binary integer consisting of k' − k 1s followed by k 0s; hence the need for sign extension.
• Exercise: find out the sign extension rule for one's complement. (A C sketch of two's-complement sign extension follows.)
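In C, sign extension happens implicitly when casting a narrower signed type to a wider one; the manual helper sext below (illustrative, not from the lecture) extends an n-bit field by replicating its sign bit:

#include <stdint.h>
#include <stdio.h>

/* Sign-extend the n-bit two's-complement field in the low n bits of v. */
int32_t sext(uint32_t v, int n) {
    uint32_t m = 1u << (n - 1);     /* mask selecting the field's sign bit */
    return (int32_t)((v ^ m) - m);  /* flips the sign bit, then subtracts its weight */
}

int main(void) {
    int16_t x = -1234;              /* 16-bit pattern 0xFB2E */
    int32_t y = x;                  /* the cast sign-extends: 0xFFFFFB2E */
    printf("%d %d\n", y, sext(0xFB2E, 16));  /* -1234 -1234 */
    return 0;
}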
2's complement in a 32-bit number
• The positive half of the numbers, from 0 to 2,147,483,647 (2^31 − 1), uses the normal binary representation.
• The bit pattern (1000 . . . 0000) represents the most negative number, −2,147,483,648 (−2^31). It is followed by a declining set of negative numbers: −2,147,483,647 (1000 . . . 0001) down to −1 (1111 . . . 1111).
• Two's complement does have one negative number, −2,147,483,648, that has no corresponding positive number.
• Two's-complement representation has the advantage that all negative numbers have a 1 in the most significant bit. Consequently, hardware needs to test only this bit to see if a number is positive or negative (with the number 0 considered positive).
• Value = (x_31 × −2^31) + (x_30 × 2^30) + (x_29 × 2^29) + . . . + (x_1 × 2^1) + (x_0 × 2^0)
Adder hardware in 2’s complement
• Two's complement adder/subtractor
  – For A + B, the add/sub input wire is set to 0, and for A − B, the add/sub input wire is set to 1
  – Both A and B must be in two's-complement representation
  – The figure assumes n = 4; Cout is usually ignored
    • It is actually used to detect overflows (discussed later)

add/sub
a3 b3 a2 b2 a1 b1 a0 b0
XOR XOR XOR XOR
Cout FA3 FA2 FA1 FA0 Cin
S3 S2 S1 S0
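A bit-level C simulation of this circuit (function and variable names are illustrative): each b_i is XORed with the add/sub line, which also supplies the initial carry-in, so subtraction is computed as A + ~B + 1:

#include <stdio.h>

/* 4-bit ripple-carry adder/subtractor: sub = 0 gives A+B, sub = 1 gives A-B. */
unsigned addsub4(unsigned a, unsigned b, unsigned sub, unsigned *cout) {
    unsigned c = sub, s = 0;                   /* add/sub line drives Cin */
    for (int i = 0; i < 4; i++) {
        unsigned ai = (a >> i) & 1;
        unsigned bi = ((b >> i) & 1) ^ sub;    /* the XOR gate on each b input */
        s |= ((ai ^ bi ^ c) & 1) << i;         /* full-adder sum bit */
        c = (ai & bi) | (ai & c) | (bi & c);   /* full-adder carry-out */
    }
    *cout = c;                                  /* usually ignored */
    return s;
}

int main(void) {
    unsigned cout;
    printf("0x%X\n", addsub4(0x6, 0x3, 1, &cout)); /* 6 - 3 = 0x3 */
    return 0;
}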
Overflow in integer operations
• Overflow occurs if the result of a computation cannot fit within the given number of bits.
• In the usual unsigned binary representation, if there is a carry-out from the MSB position of an addition, overflow is said to occur.
• In two's-complement representation, overflow occurs if and only if the carry into the MSB is different from the carry out of the MSB.
• Let the carry into the MSB be c_{n−1} and the carry out of the MSB be c_n when adding two n-bit numbers.
• Suppose that we are adding A (a_{n−1}a_{n−2}…a_0) and B (b_{n−1}b_{n−2}…b_0), both in two's-complement representation, and let the sum be S (s_{n−1}s_{n−2}…s_0).

• Case I: c_{n−1} = 0, c_n = 1
• Therefore, a_{n−1} = b_{n−1} = 1 and s_{n−1} = 0.
• This is an overflow condition because adding two negative numbers cannot yield a positive number.
Overflow in integer operations

• Case II: c_{n−1} = 1, c_n = 0
• Therefore, a_{n−1} = b_{n−1} = 0 and s_{n−1} = 1.
• This is an overflow condition because adding two positive numbers cannot yield a negative number.
• Case III: c_{n−1} = 0, c_n = 0
• Therefore, at most one of a_{n−1} and b_{n−1} is 1.
• Case IIIA: a_{n−1} = b_{n−1} = 0
• Since c_{n−1} is 0, the magnitudes of two positive numbers are added without any overflow.
• Therefore, there is no overflow.
• Case IIIB: a_{n−1} = 0, b_{n−1} = 1 (a_{n−1} = 1, b_{n−1} = 0 is similar)
• Therefore, A + B = −2^{n−1} + (a_{n−2}…a_0 + b_{n−2}…b_0).
• Since c_{n−1} is 0, there is no overflow in adding the lower n − 1 bits of A and B; hence, s_{n−2}s_{n−3}…s_0 = a_{n−2}…a_0 + b_{n−2}…b_0.
• So, A + B = −2^{n−1} + s_{n−2}s_{n−3}…s_0 = 1s_{n−2}s_{n−3}…s_0 (no overflow).
Overflow in integer operations
• Case IV: c_{n−1} = 1, c_n = 1
• Therefore, at most one of a_{n−1} and b_{n−1} is 0.
• Case IVA: a_{n−1} = 1, b_{n−1} = 0 (a_{n−1} = 0, b_{n−1} = 1 is similar)
• Therefore, A + B = −2^{n−1} + (a_{n−2}…a_0 + b_{n−2}…b_0).
• Since c_{n−1} is 1, the result of adding the lower n − 1 bits of A and B is 1s_{n−2}s_{n−3}…s_0, treated as a positive value, i.e., 2^{n−1} + s_{n−2}s_{n−3}…s_0.
• So, A + B = −2^{n−1} + 2^{n−1} + s_{n−2}s_{n−3}…s_0 = 0s_{n−2}s_{n−3}…s_0.
• Since in this case a_{n−1} + b_{n−1} + c_{n−1} = (c_n s_{n−1})_2 = (10)_2, the correct result is obtained by ignoring c_n (hence, no overflow).
• Case IVB: a_{n−1} = 1, b_{n−1} = 1
• Therefore, A + B = −2^n + (a_{n−2}…a_0 + b_{n−2}…b_0).
• Since c_{n−1} is 1, the result of adding the lower n − 1 bits of A and B is 1s_{n−2}s_{n−3}…s_0, treated as a positive value, i.e., 2^{n−1} + s_{n−2}s_{n−3}…s_0.
• So, A + B = −2^n + 2^{n−1} + s_{n−2}s_{n−3}…s_0 = −2^{n−1} + s_{n−2}s_{n−3}…s_0 = 1s_{n−2}s_{n−3}…s_0 in two's complement.
• Since in this case a_{n−1} + b_{n−1} + c_{n−1} = (c_n s_{n−1})_2 = (11)_2, the correct result is obtained by ignoring c_n (hence, no overflow).
Generic Overflow Condition
1. Two positive numbers result in a negative number.
2. Two negative numbers result in a positive number.

1's-complement addition rule:

C := A + B
if num_bits(C) == num_bits(A) + 1   -- a carry-out was produced
    C := C + 1                      -- end-around carry
check_overflow(A, B, C)
C := truncate(C)

2's-complement addition rule:

C := A + B
check_overflow(A, B, C)             -- the carry-out, if any, is simply dropped
C := truncate(C)
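The two's-complement condition can be checked in C in two equivalent ways, by signs or by carries; a sketch for 8-bit operands (assumes the usual wrap-around narrowing behavior):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 iff a + b overflows 8-bit two's complement. */
int add_overflows(int8_t a, int8_t b) {
    int8_t s = (int8_t)(a + b);               /* wraps on typical machines */
    /* sign rule: operands agree in sign and the result's sign differs */
    int by_sign = ((a ^ s) & (b ^ s)) < 0;
    /* carry rule: carry into the MSB differs from carry out of the MSB */
    unsigned ua = (uint8_t)a, ub = (uint8_t)b;
    unsigned c7 = (((ua & 0x7F) + (ub & 0x7F)) >> 7) & 1;  /* carry into bit 7 */
    unsigned c8 = ((ua + ub) >> 8) & 1;                    /* carry out of bit 7 */
    assert(by_sign == (int)(c7 ^ c8));        /* the two rules always agree */
    return by_sign;
}

int main(void) {
    printf("%d %d\n", add_overflows(100, 100), add_overflows(100, -100)); /* 1 0 */
    return 0;
}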
Floating Point Representation
• How to represent real numbers:
• Fixed point
• Rational number systems: approximate a real value by the ratio of two integers.
• Floating-point number systems
• Logarithmic number systems
• Fixed-point representation leads to equal spacing in the set of representable numbers.
• The maximum absolute error is the same throughout (ulp with truncation and ulp/2 with rounding).
• x = (0000 0000 . 0000 1001)two, y = (1001 0000 . 0000 0000)two
• The error due to truncation or rounding is quite significant for x, while it is much less severe for y.
• Both x² and y² are unrepresentable, because their computations lead to underflow (number too small) and overflow (number too large), respectively.
Floating Point Representation

• Floating-point representation: x = (1.001)two × 2^−5 and y = (1.001)two × 2^+7
• Four components: the sign, the significand s, the exponent base b, and the exponent e.
• The exponent base b is usually implied (not explicitly represented) and is usually a power of 2.
• x = ±s × b^e, i.e., ±significand × base^exponent
• The sign is a single bit indicating a positive or negative number.
• The exponent sign is embedded in the biased exponent. When the bias is a power of 2 (e.g., 128 with an 8-bit exponent), the exponent sign is the complement of its most-significant bit (MSB). (A field-extraction sketch in C follows.)
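A sketch that extracts the three fields of an IEEE-754 single in C (the helper name fields is illustrative):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Print sign, biased exponent, and stored fraction of a single-precision value. */
void fields(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);           /* reinterpret the bits portably */
    unsigned sign = u >> 31;
    unsigned exp  = (u >> 23) & 0xFF;   /* biased exponent, bias = 127 */
    unsigned frac = u & 0x7FFFFF;       /* 23 stored fraction bits */
    printf("%g: sign=%u exp=%u (actual %d) frac=0x%06X\n",
           f, sign, exp, (int)exp - 127, frac);
}

int main(void) {
    fields(-6.25f);  /* -1.1001two x 2^2: sign=1 exp=129 frac=0x480000 */
    return 0;
}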
Floating Point Representation
The range of values in a floating-point number representation format is composed of the intervals [−max, −min] and [min, max], where
  max = largest significand × b^(largest exponent)
  min = smallest significand × b^(smallest exponent)

The biased exponent format has virtually no effect on the speed or cost of exponent arithmetic, given the small number of bits involved. It does, however, facilitate zero detection and magnitude comparison.

There are three special or singular values: −∞, 0, and +∞ (0 is special because it cannot be represented with a normalized significand).

Overflow: when a result is less than −max or greater than max.
Underflow: results in the range (−min, 0) or (0, min).
Normalized Floating Point and IEEE754
• A representation of normalized floating-point numbers fixes the number of bits needed for the mantissa and the exponent, given a fixed total number of bits.
• Trade-off between range and precision:
• If the mantissa has more bits, then the exponent has fewer.
• Normalized scientific representation has three advantages:
• Simplifies exchange of floating-point data between computers due to a standard representation.
• Simplifies floating-point arithmetic algorithms because all operands are in the same format.
• Compacts the representation by discarding leading zeros on the left-hand side.
• 0.0000000010101 is the same as 1.0101 × 2^−9, and it is enough to store s = 0101 and e = −9.
IEEE-754 floating point representation
IEEE 754 single-precision format
• The exponent field needs to encode both positive and negative exponents.
• It is necessary to choose an encoding such that the entire 31-bit magnitude field is monotonic in the value of the floating-point number.
• A larger-magnitude floating-point number should have a larger 31-bit magnitude field compared to a smaller-magnitude floating-point number.
• Helps to sort numbers quickly by magnitude.
• The mantissa is already monotonic in the value of the fraction.
• Cannot use two's-complement encoding for the exponent:
• Negative exponents would be represented by large numbers.
• Biased encoding of the exponent, with the bias set to 127:
• Encoded exponent = 127 + actual exponent
• The actual exponent is allowed to range from −126 to 127, i.e., the encoded exponent can range from 00000001 to 11111110.
• The encoded exponent cannot be 00000000 or 11111111 because these encodings are reserved to represent some special numbers.
• Notice that a larger encoded exponent now represents a larger actual exponent because biased encoding preserves monotonicity.
• Two's-complement encoding does not preserve monotonicity.
• IEEE 754 single-precision format
  – Largest non-negative number
    0 11111110 11111111111111111111111
    • +1.111…1 × 2^127 = (2 − 2^−23) × 2^127
  – Smallest positive normalized number
    0 00000001 00000000000000000000000
    • +1.000…0 × 2^−126
  – Negative normalized number with smallest magnitude
    1 00000001 00000000000000000000000
    • −1.000…0 × 2^−126
  – Negative number with largest magnitude
    1 11111110 11111111111111111111111
    • −1.111…1 × 2^127 = −(2 − 2^−23) × 2^127
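These extremes can be cross-checked against the C library's float.h constants; a small sketch (compile with -lm):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    /* largest single: (2 - 2^-23) x 2^127 should equal FLT_MAX */
    double big = (2.0 - ldexp(1.0, -23)) * ldexp(1.0, 127);
    printf("%d\n", (float)big == FLT_MAX);          /* 1 */
    /* smallest positive normalized single: 2^-126 should equal FLT_MIN */
    printf("%d\n", ldexpf(1.0f, -126) == FLT_MIN);  /* 1 */
    return 0;
}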
IEEE-754 floating point representation

• In IEEE 754, zero has the all-0s representation (apart from the sign bit), with positive or negative sign.
• Special codes are also needed for representing ±∞ and NaN (not-a-number).
• The NaN special value is useful for representing undefined results such as 0/0.
• When one of these special values appears as an operand in an arithmetic operation, the result of the operation is specified according to defined rules:
• ordinary number / ±∞ = ±0
• ±∞ × ordinary number = ±∞
• NaN + ordinary number = NaN
• The special codes thus allow exceptions to be propagated to the end of a computation rather than bringing it to a halt.
• IEEE 754 single-precision format
  – Special numbers
    • Encoding of zero (two possible representations):
      X 00000000 00000000000000000000000
    • Encoding of +infinity:
      0 11111111 00000000000000000000000
    • Encoding of −infinity:
      1 11111111 00000000000000000000000
    • Encoding of NaN (result of 0/0, sqrt(−n), 0×inf, etc.):
      X 11111111 anything non-zero
IEEE 754 - Subnormals
• Subnormals, or subnormal values, are defined as numbers without a hidden 1 and with the smallest possible exponent.
• Certain small values that are not representable as normalized numbers, and hence would have to be rounded to 0 if encountered in the course of computations, can be represented more precisely as subnormals.
• For example, (0.0001)two × 2^−126 is a subnormal that does not have a normalized representation in the IEEE single format.

+0.111…1 × 2^−126 = (1 − 2^−23) × 2^−126 : largest positive subnormal value
+0.000…01 × 2^−126 = 2^−149 : smallest positive subnormal value


IEEE 754 single-precision format
• Representable magnitudes:
• 0
• Denormalized: 2^−149 to (1 − 2^−23) × 2^−126
• Normalized: 2^−126 to (2 − 2^−23) × 2^127
• Infinity: anything with an exponent bigger than 127
• Note: a mantissa that is larger than the largest representable mantissa (i.e., 1111…1) does not make the magnitude infinity
• NaN
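The subnormal boundary can be probed directly from the bit pattern; a sketch (the all-zeros exponent with fraction 0…01 encodes the smallest subnormal, 2^−149):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint32_t u = 0x00000001;   /* exponent 00000000, fraction 0...01: 2^-149 */
    float f;
    memcpy(&f, &u, sizeof f);
    printf("%g\n", f);         /* about 1.4e-45 */
    printf("%g\n", f / 2);     /* nothing smaller exists: underflows to 0 */
    return 0;
}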

IEEE 754 double-precision format

• 64-bit representation
• The MSB is the sign bit; the least significant 52 bits represent the mantissa.
• The middle 11 bits represent the biased exponent, with a bias of 1023.
• The actual exponent can range from −1022 to 1023.
• Largest representable normalized magnitude
• 1.111…1 × 2^1023 = (2 − 2^−52) × 2^1023
• Smallest representable normalized magnitude
• 1.000…0 × 2^−1022 = 2^−1022
Floating-point numbers
• IEEE 754 double-precision format
• Largest representable denormalized magnitude
• 0.111…1 × 2^−1022 = (1 − 2^−52) × 2^−1022
• Smallest representable denormalized magnitude
• 0.000…01 × 2^−1022 = 2^−1074
• Zero
• Exponent = 0, Mantissa = 0
• Infinity
• Exponent = 11111111111, Mantissa = 0
• NaN
• Exponent = 11111111111, Mantissa = non-zero
• Use of IEEE 754 double-precision format
• The C data type “double” translates to 64-bit double-precision
format
Floating-point numbers
• IEEE 754 half-precision format
• Lower range and precision compared to single precision, for computers having narrow buses/operands
• 16-bit representation
• The MSB is the sign bit
• The least significant ten bits represent the mantissa
• The middle five bits represent the biased exponent, with a bias of 15
• The actual exponent can range from −14 to 15
• Largest representable normalized magnitude
• 1.1111111111 × 2^15 = (2 − 1/1024) × 2^15
• Smallest representable normalized magnitude
• 1.0000000000 × 2^−14 = 2^−14
Floating-point numbers
• IEEE 754 half-precision format
• Largest representable denormalized magnitude
• 0.1111111111 × 2^−14 = (1 − 1/1024) × 2^−14
• Smallest representable denormalized magnitude
• 0.000…01 × 2^−14 = 2^−24
• Zero
• Exponent = 0, Mantissa = 0
• Infinity
• Exponent = 11111, Mantissa = 0
• NaN
• Exponent = 11111, Mantissa = non-zero
Floating-point numbers
• IEEE 754 quadruple-precision format
• Much higher range and precision compared to double precision
• Hardware support is still rare
• 128-bit representation
• The MSB is the sign bit
• The least significant 112 bits represent the mantissa
• The middle 15 bits represent the biased exponent, with a bias of 16383
• The actual exponent can range from −16382 to 16383
• Largest representable normalized magnitude
• 1.111…1 × 2^16383 = (2 − 2^−112) × 2^16383
• Smallest representable normalized magnitude
• 1.000…0 × 2^−16382 = 2^−16382
Floating-point numbers
• IEEE 754 quadruple-precision format
• Largest representable denormalized magnitude
• 0.111…1 × 2^−16382 = (1 − 2^−112) × 2^−16382
• Smallest representable denormalized magnitude
• 0.000…01 × 2^−16382 = 2^−16494
• Zero
• Exponent = 0, Mantissa = 0
• Infinity
• Exponent = 111…1, Mantissa = 0
• NaN
• Exponent = 111…1, Mantissa = non-zero
Arithmetic Operations in IEEE-754
• Consider the addition (±s1 × b^e1) + (±s2 × b^e2) = ±s × b^e.
• Assuming e1 ≥ e2, we begin by aligning the two operands through right-shifting of the significand s2 of the number with the smaller exponent:
  ±s2 × b^e2 = (±s2 / b^{e1−e2}) × b^e1
• If the exponent base b and the number representation radix r are the same, we shift s2 to the right by e1 − e2 digits. When b = r^a, the shift amount is multiplied by a.
• (±s1 × b^e1) + (±s2 × b^e2) = (±s1 × b^e1) + (±s2 / b^{e1−e2}) × b^e1 = (±s1 ± s2/b^{e1−e2}) × b^e1 = ±s × b^e
• When the operand signs are alike, a 1-digit normalizing shift is always enough.
• When the operands have different signs, the resulting significand may be very close to 0, and left shifting by many positions may be needed for normalization.
• Overflow/underflow can occur during the addition step as well as due to normalization.
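A toy C sketch of the alignment step with integer significands and b = 2 (a hypothetical toy format, not the IEEE algorithm; normalization and rounding are omitted):

#include <stdio.h>

typedef struct { long sig; int exp; } fp;  /* value = sig x 2^exp */

fp fp_add(fp x, fp y) {
    if (x.exp < y.exp) { fp t = x; x = y; y = t; }  /* ensure x.exp >= y.exp */
    int d = x.exp - y.exp;
    long ys = (d < 63) ? (y.sig >> d) : 0;  /* align y: shift out d low bits (truncating) */
    fp r = { x.sig + ys, x.exp };
    /* a real implementation would now normalize r.sig and round the shifted-out bits */
    return r;
}

int main(void) {
    fp a = { 9, -5 }, b = { 9, 7 };         /* 9 x 2^-5 and 9 x 2^7 */
    fp r = fp_add(a, b);
    printf("%ld x 2^%d\n", r.sig, r.exp);   /* 9 x 2^7: the small operand is shifted away */
    return 0;
}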
Arithmetic Operations in IEEE-754
• Floating-point multiplication is simpler than floating-point addition; it is performed by multiplying the significands and adding the exponents:
  (±s1 × b^e1) × (±s2 × b^e2) = ±(s1 × s2) × b^{e1+e2}
• Post-shifting may be needed, since the product s1 × s2 of the two significands can be unnormalized.
• The computed exponent needs adjustment if the exponents are biased or if a normalization shift is performed.
• Overflow/underflow is possible during multiplication if e1 and e2 have like signs; overflow is also possible due to normalization.
• Floating-point division is performed by dividing the significands and subtracting the exponents:
  (±s1 × b^e1) / (±s2 × b^e2) = ±(s1/s2) × b^{e1−e2}
• Other operations: fused multiply-add (FMA), which computes a×x + b with a single rounding (see the sketch below), and square root.
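C99 exposes fused multiply-add as fma() in math.h; because it rounds only once, the sketch below can recover the rounding error of a plain product (compile with -lm):

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + ldexp(1.0, -27);  /* exactly representable in a double */
    /* a*a needs more than 53 significand bits; fma(a, a, -(a*a)) subtracts the
       rounded product from the exact one, exposing the rounding error 2^-54 */
    printf("%.17g\n", fma(a, a, -(a * a)));  /* 5.5511151231257827e-17 */
    return 0;
}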
Overflow in floating-point ops
• What happens if the magnitude of a floating-point number exceeds the largest representable in the corresponding format?
• E.g., float x > (2 − 2^−23) × 2^127
• May cause an overflow if the exponent needed is larger than the maximum allowed (e.g., 127 in single precision).
• The number is treated as +infinity or −infinity depending on the sign of the number.
• What if the fraction cannot fit within the mantissa field, but the exponent is within range?
• Not an overflow.
• Handled by rounding the mantissa (the default is round to nearest, with round to nearest even for halfway cases).
• Loss of precision.
Underflow in floating-point ops

• What happens if the magnitude of a floating-point number is less than the smallest representable in the corresponding format?
• E.g., float x < 2^−149
• May cause an underflow if an exponent smaller than the minimum (e.g., −126 in single precision) is required to represent the number.
• E.g., 2^−150 in single precision or 2^−1075 in double precision.
• The number is treated as zero in this case.
• Denormalized numbers are said to undergo gradual underflow: as more and more leading zeros appear on the right side of the binary point, precision degrades gradually, with the value ultimately becoming zero below 2^−149.
Rounding
• Whenever a number with higher precision is to be converted to a format offering lower precision (e.g., double precision or extended single to single precision), rounding is required as part of the conversion process.
• The same applies to conversions between integer and floating-point formats.
• IEEE 754-2008 includes five rounding modes: two round-to-nearest modes, with different rules for breaking ties, and three directed rounding modes (C exposes round-to-nearest and the directed modes through fenv.h; see the sketch below):
• Round to nearest, ties to even (rtne)
• Round to nearest, ties away from zero (rtna)
• Round toward zero (inward)
• Round toward +∞ (upward)
• Round toward −∞ (downward)
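A sketch of selecting these modes in C via fenv.h (strictly, portable code also needs #pragma STDC FENV_ACCESS ON, which some compilers do not implement; rtna has no fenv mode, but round() behaves that way):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    double x = 2.5;  /* a halfway case */
    fesetround(FE_TONEAREST);  printf("%g\n", rint(x));  /* 2: ties go to even */
    fesetround(FE_UPWARD);     printf("%g\n", rint(x));  /* 3 */
    fesetround(FE_DOWNWARD);   printf("%g\n", rint(x));  /* 2 */
    fesetround(FE_TOWARDZERO); printf("%g\n", rint(x));  /* 2 */
    printf("%g\n", round(x));  /* 3: round() is ties-away (rtna), independent of mode */
    return 0;
}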
Rounding

• An unsigned number with integer and fractional digits is to be rounded to an integer: x_{k−1}x_{k−2} ··· x_1x_0 . x_{−1}x_{−2} ··· x_{−l} → y_{k−1}y_{k−2} ··· y_1y_0
• Rounding to a destination format that has l' fractional digits, with 0 < l' < l, is equivalent to the above, with the radix point moved to the left by l' positions on both sides.

(The accompanying figures illustrate rounding in signed-magnitude form and in 2's complement.)
Rounding: Truncate

Average error is −0.375 when two fractional bits are truncated (the errors 0, −0.25, −0.5, −0.75 averaged).

Average error = −(2^−L' − 2^−L)/2, where L' = new number of fractional bits and L = original number of fractional bits.
Rounding: rtna
• With the rtna scheme, a fractional part of less than 1/2 is dropped, while a fractional part of 1/2 or more (.1xxx ··· in binary) leads to rounding to the next higher integer, i.e., away from zero.

x_{−1}x_{−2} = 00: round down, error = 0
x_{−1}x_{−2} = 01: round down, error = −0.25
x_{−1}x_{−2} = 10: round up, error = +0.5
x_{−1}x_{−2} = 11: round up, error = +0.25

Average error = 0.125
Rounding: rtne

Average error is 0.
Requires more hardware.
Rounding modes
• IEEE 754 rounding modes
• Round to nearest (default behavior)
• Round to nearest even for halfway rounding:
• After rounding, the least significant representable mantissa bit should be even (see the following single-precision examples).
• Example: 1.1111…1 (with 24 1s in the mantissa) is rounded to 2.0 (in decimal), i.e., 1.0 × 2^1 in binary.
• Example: 1.111…101 (22 1s followed by a 0 and a 1 in the mantissa) is rounded to 1.111…10 (22 1s followed by a 0 in the mantissa).
• Round toward zero
• Round toward +infinity
• Round toward −infinity
• A computer can choose one of the modes.
Other Rounding Methods

• Round to the nearest odd
• Von Neumann or jam rounding
• ROM rounding
Reference

• Read Section 2.4 from Chapter 2 and the subsection "Floating-Point Representation" from Section 3.5 of Chapter 3 of Patterson and Hennessy, Computer Organization and Design: The Hardware/Software Interface. Practice exercises from the end of Chapter 3: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.20, 3.22, 3.23, 3.24, 3.25, 3.26, 3.27, 3.28, 3.41, 3.42, 3.43.
• Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs.
Acknowledgement

• Dr. Mainak Chaudhuri

• Dr. Urbi Chatterjee
