
Arithmetic Circuits

Dinesh Sharma

EE Department
IIT Bombay, Mumbai

October 16, 2022

Dinesh Sharma (IIT B) Arithmetic Circuits October 16, 2022 1 / 175


Part I

Adders
1 Half and Full Adders
2 Ripple Carry adder
3 Carry Look Ahead
Manchester Carry Chain
4 Carry Bypass Adder
5 Carry Select Adder
Stacking Carry Select Adders
6 Tree Adders
Brent Kung adder
Tutorial: 32 bit Brent Kung Logarithmic Adder
7 Serial Adders



Half and Full Adders

Half Adder

The truth table for addition of two bits is:

A  B  |  Sum  Carry
0  0  |   0     0
0  1  |   1     0
1  0  |   1     0
1  1  |   0     1

$sum = \overline{A} \cdot B + \overline{B} \cdot A$
$carry = A \cdot B$

What do we do with the carry?


Obviously, it must be added to more significant bits.
So we need an adder with three inputs.


Full Adder

The truth table for the addition of three bits is:

A  B  Cin  |  Sum  Cout
0  0   0   |   0     0
0  1   0   |   1     0
1  0   0   |   1     0
1  1   0   |   0     1
0  0   1   |   1     0
0  1   1   |   0     1
1  0   1   |   0     1
1  1   1   |   1     1

Which leads to the following Karnaugh maps:

SUM
Cin \ AB |  00  01  11  10
   0     |   0   1   0   1
   1     |   1   0   1   0

CARRY
Cin \ AB |  00  01  11  10
   0     |   0   0   1   0
   1     |   0   1   1   1

$sum = \overline{A} \cdot \overline{B} \cdot C_{in} + \overline{A} \cdot B \cdot \overline{C_{in}} + A \cdot \overline{B} \cdot \overline{C_{in}} + A \cdot B \cdot C_{in}$

$C_{out} = A \cdot B + B \cdot C_{in} + C_{in} \cdot A = A \cdot B + C_{in} \cdot (A + B)$
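As a quick check of the full adder expressions, the following Python sketch evaluates the sum-of-products form for sum and the factored form for Cout over all eight input combinations. The function name `full_adder` is only illustrative and is not part of the original slides.

```python
from itertools import product

def full_adder(a, b, cin):
    """Return (sum, cout) for three input bits, using the expressions above."""
    na, nb, nc = 1 - a, 1 - b, 1 - cin          # complemented inputs
    s = (na & nb & cin) | (na & b & nc) | (a & nb & nc) | (a & b & cin)
    cout = (a & b) | (cin & (a | b))
    return s, cout

for a, b, cin in product((0, 1), repeat=3):
    s, cout = full_adder(a, b, cin)
    # The two output bits must encode the arithmetic total a + b + cin.
    assert 2 * cout + s == a + b + cin
    print(a, b, cin, "->", s, cout)
```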



Ripple Carry adder

Ripple Carry adder

[Figure: three full adders in cascade with inputs (A2, B2), (A1, B1), (A0, B0) and sum outputs S2, S1, S0, linked through their Cout terminals.]

Carry out of one bit becomes Carry in of the next.


This architecture is therefore called ripple carry adder.
The critical delay path of the adder is the carry rippling from one bit to the
next.
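The rippling of the carry can be sketched behaviourally in Python. This is only a bit-level model, not a circuit description; the helper names `to_bits` and `from_bits` and the least-significant-bit-first list convention are assumptions made for the example.

```python
def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a | b))
    return s, cout

def ripple_carry_add(a_bits, b_bits, cin=0):
    """Add two equal-length bit lists given least significant bit first."""
    sums = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        # The carry out of this bit becomes the carry in of the next bit.
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums, carry

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))

# 5 + 3 = 8 does not fit in 3 bits: the sum bits are all 0 and the carry out is 1.
sums, cout = ripple_carry_add(to_bits(5, 3), to_bits(3, 3))
print(from_bits(sums), cout)
```

The loop is strictly sequential: bit i cannot finish before bit i-1 has produced its carry, which is the software analogue of the critical path described above.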


Sum derived from carry

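Because carry is on the critical path, Carry-out must be generated as quickly as possible.
We need not optimize the delay of generating sum.
We can in fact generate sum from Carry out.

$C_{out} = A \cdot B + C_{in} \cdot (A + B)$
$\overline{C_{out}} = (\overline{A} + \overline{B}) \cdot (\overline{C_{in}} + \overline{A} \cdot \overline{B})$
$\quad = \overline{A} \cdot \overline{C_{in}} + \overline{B} \cdot \overline{C_{in}} + \overline{A} \cdot \overline{B}$
$\overline{C_{out}} \cdot (A + B + C_{in}) = A \cdot \overline{B} \cdot \overline{C_{in}} + \overline{A} \cdot B \cdot \overline{C_{in}} + \overline{A} \cdot \overline{B} \cdot C_{in}$

$sum = A \cdot \overline{B} \cdot \overline{C_{in}} + \overline{A} \cdot B \cdot \overline{C_{in}} + \overline{A} \cdot \overline{B} \cdot C_{in} + A \cdot B \cdot C_{in}$
$\quad = \overline{C_{out}} \cdot (A + B + C_{in}) + A \cdot B \cdot C_{in}$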
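The identity for sum in terms of the complemented carry output can be checked exhaustively. The sketch below compares it with the three-input XOR; the function names are illustrative only.

```python
from itertools import product

def carry_out(a, b, cin):
    return (a & b) | (cin & (a | b))

def sum_from_carry(a, b, cin):
    """Sum built from the complemented carry output, as in the last line above."""
    not_cout = 1 - carry_out(a, b, cin)
    return (not_cout & (a | b | cin)) | (a & b & cin)

# Collect every input combination where the derived sum differs from a XOR b XOR cin.
mismatches = [(a, b, c) for a, b, c in product((0, 1), repeat=3)
              if sum_from_carry(a, b, c) != a ^ b ^ c]
print("mismatches:", mismatches)
```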


CMOS Implementation

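[Figure: transistor-level CMOS full adder between VDD and Gnd, with inputs A, B, Cin and outputs Cout and Sum. The sum circuit is driven by the complemented carry output.]

$C_{out} = A \cdot B + C_{in} \cdot (A + B)$
$Sum = \overline{C_{out}} \cdot (A + B + C_{in}) + A \cdot B \cdot C_{in}$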


Complementation Property
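Both Sum and Carry show an interesting symmetry:

$sum = A \cdot B \cdot C_{in} + A \cdot \overline{B} \cdot \overline{C_{in}} + \overline{A} \cdot \overline{B} \cdot C_{in} + \overline{A} \cdot B \cdot \overline{C_{in}}$

$\overline{sum} = (\overline{A} + \overline{B} + \overline{C_{in}}) \cdot (\overline{A} + B + C_{in}) \cdot (A + B + \overline{C_{in}}) \cdot (A + \overline{B} + C_{in})$
$\quad = (\overline{A} + \overline{A} \cdot B + \overline{A} \cdot C_{in} + \overline{A} \cdot \overline{B} + \overline{B} \cdot C_{in} + \overline{C_{in}} \cdot \overline{A} + \overline{C_{in}} \cdot B) \cdot$
$\qquad (A + A \cdot \overline{B} + A \cdot C_{in} + A \cdot B + B \cdot C_{in} + \overline{C_{in}} \cdot A + \overline{C_{in}} \cdot \overline{B})$
$\quad = (\overline{A} + B \cdot \overline{C_{in}} + \overline{B} \cdot C_{in}) \cdot (A + B \cdot C_{in} + \overline{B} \cdot \overline{C_{in}})$
$\quad = \overline{A} \cdot B \cdot C_{in} + \overline{A} \cdot \overline{B} \cdot \overline{C_{in}} + A \cdot B \cdot \overline{C_{in}} + A \cdot \overline{B} \cdot C_{in}$

Thus

$sum = A \cdot B \cdot C_{in} + A \cdot \overline{B} \cdot \overline{C_{in}} + \overline{A} \cdot \overline{B} \cdot C_{in} + \overline{A} \cdot B \cdot \overline{C_{in}}$

$\overline{sum} = \overline{A} \cdot \overline{B} \cdot \overline{C_{in}} + \overline{A} \cdot B \cdot C_{in} + A \cdot B \cdot \overline{C_{in}} + A \cdot \overline{B} \cdot C_{in}$

This shows that the same hardware that produces $sum$ from $A$, $B$ and $C_{in}$ will produce $\overline{sum}$ if the inputs are changed to $\overline{A}$, $\overline{B}$ and $\overline{C_{in}}$.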


Carry also has the same complementation property.

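$C_{out} = A \cdot B + C_{in} \cdot (A + B)$

Hence, $\overline{C_{out}} = \overline{A \cdot B + C_{in} \cdot (A + B)} = (\overline{A} + \overline{B}) \cdot (\overline{C_{in}} + \overline{A} \cdot \overline{B})$
$\quad = \overline{A} \cdot \overline{C_{in}} + \overline{B} \cdot \overline{C_{in}} + \overline{A} \cdot \overline{B}$

Thus $C_{out} = A \cdot B + C_{in} \cdot (A + B)$

while $\overline{C_{out}} = \overline{A} \cdot \overline{B} + \overline{C_{in}} \cdot (\overline{A} + \overline{B})$

So the same hardware which produces $C_{out}$ from $A$, $B$ and $C_{in}$ will produce $\overline{C_{out}}$ from $\overline{A}$, $\overline{B}$ and $\overline{C_{in}}$.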
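The complementation property says that complementing every input complements the output. A small Python sketch can test this for sum and carry and, for contrast, for a plain 3-input AND, which does not have the property. The helper `is_self_dual` and the comparison function `and3` are names introduced here for illustration.

```python
from itertools import product

def sum_bit(a, b, cin):
    return a ^ b ^ cin

def carry_bit(a, b, cin):
    return (a & b) | (cin & (a | b))

def and3(a, b, cin):
    return a & b & cin

def is_self_dual(f):
    """True if complementing every input complements the output."""
    return all(f(1 - a, 1 - b, 1 - c) == 1 - f(a, b, c)
               for a, b, c in product((0, 1), repeat=3))

print("sum:  ", is_self_dual(sum_bit))
print("carry:", is_self_dual(carry_bit))
print("and3: ", is_self_dual(and3))
```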


Making use of the symmetry property

In CMOS implementation, we interchange series and parallel


configurations for the n and p channel transistors.
This is to ensure that the pull up and pull down circuits are
complementary.
However, the sum and carry functions have the complementation property shown earlier: complementing all the inputs complements the output.
Therefore, for implementing sum and carry, we can use the same
configuration for n and p channel transistors.
We use this to reduce the number of series connected transistors in pull
up/pull down networks.


Mirror gates for Adders

By making use of the symmetry property of sum and carry, it is possible to simplify the implementations.

[Figure: mirror gate schematics for the carry and sum circuits between VDD and Gnd, with inputs A, B, Cin and outputs Cout and Sum. The pull-up and pull-down networks have the same topology.]

$C_{out} = A \cdot B + C_{in} \cdot (A + B)$
$Sum = \overline{C_{out}} \cdot (A + B + C_{in}) + A \cdot B \cdot C_{in}$

These are called mirror gates because the n and p transistors have the same series parallel combination.
This is highly unusual.


Speeding up the Ripple Carry Adder

The worst case delay of the ripple carry adder is linear in the number of bits to be added.
To reduce the delay per stage, we can eliminate the inverter from the carry output.
All even bit adders accept $A$, $B$ and $C_{in}$ as inputs. The mirror gate without inverter gives $\overline{C_{out}}$ as the output.
All odd bit adders accept $\overline{A}$, $\overline{B}$ and $\overline{C_{in}}$ as inputs and thus produce $C_{out}$ as output.
Outputs of all bits are now compatible with inputs of the next stage.


Extra inverters are required to produce $\overline{A}$ and $\overline{B}$ for the odd bits, and at the outputs to produce the proper result. However, these are not on the critical path, and do not add to the worst case delay.
Extreme care needs to be taken in layout to ensure that the loading on
the tree gate producing carry output is as small as possible.
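The alternating-polarity scheme can be checked with a small behavioural model. In this sketch the mirror gate without its inverter is modelled as the complement of the majority function; the function names and the least-significant-bit-first lists are assumptions made for illustration.

```python
def mirror_carry_gate(x, y, z):
    """Carry mirror gate without its output inverter: NOT majority(x, y, z)."""
    return 1 - ((x & y) | (z & (x | y)))

def inverterless_ripple_carries(a_bits, b_bits, c0=0):
    """Return the true-polarity carries c1..cn for LSB-first bit lists."""
    carries = []
    wire = c0  # carry wire: true polarity into even bits, complemented into odd bits
    for i, (a, b) in enumerate(zip(a_bits, b_bits)):
        if i % 2 == 0:
            wire = mirror_carry_gate(a, b, wire)          # wire now holds NOT c(i+1)
            carries.append(1 - wire)
        else:
            wire = mirror_carry_gate(1 - a, 1 - b, wire)  # wire now holds c(i+1)
            carries.append(wire)
    return carries

def reference_carries(a_bits, b_bits, c0=0):
    """Carries obtained by ordinary arithmetic, for comparison."""
    carries, c = [], c0
    for a, b in zip(a_bits, b_bits):
        c = (a + b + c) >> 1
        carries.append(c)
    return carries

a_bits, b_bits = [1, 1, 0, 1], [1, 0, 1, 1]
print(inverterless_ripple_carries(a_bits, b_bits) == reference_carries(a_bits, b_bits))
```

Only one gate sits between consecutive carry wires in this model, which is the point of dropping the inverter; the `1 - wire` used to record even-bit carries corresponds to the off-critical-path inverters.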



Carry Look Ahead

Terms Independent of Carry

Carry propagation is the critical path for a multi-bit adder.


To speed up the adder, we would like an architecture where logic terms are classified into those which depend on carry and those which do not.
We would then pre-compute all terms which do not depend on carry.
Now when the carry arrives, we quickly compute the output carry and pass it on to the next stage.


Carry Independent Terms

We would like to analyze what information can be pre-computed from $A_i$ and $B_i$, which will help us in generating $C_{out}$ quickly from $C_{in}$.
When $A_i = 0$ and $B_i = 0$, $C_{out}$ is 0, independent of $C_{in}$. We define this condition as ‘Kill’: $K = \overline{A} \cdot \overline{B}$.
Similarly, when $A_i = 1$ and $B_i = 1$, $C_{out}$ is 1, independent of $C_{in}$. We define this condition as ‘Generate’: $G = A \cdot B$.
Only when $A_i = 0$ and $B_i = 1$, or when $A_i = 1$ and $B_i = 0$, do we need to wait for $C_{in}$ to compute $C_{out}$.
In both these cases, $C_{out} = C_{in}$.
We call this condition ‘Propagate’, and define $P = A \cdot \overline{B} + \overline{A} \cdot B$.


Using Carry Independent Terms

We define $K \equiv \overline{A} \cdot \overline{B}$, $G \equiv A \cdot B$ and $P \equiv A \oplus B$.


Exactly one of K, G or P is true at any time.
When K = 1, Cout is 0, independent of Cin .
When G = 1, Cout is 1, independent of Cin .
When P = 1, Cout = Cin .
P needs to be computed using an xor gate, which can be slow. However, the
only difference between xor and or logic is when both inputs are 1, i.e. G = 1.
If we can ensure that G forces Cout to 1 irrespective of P, we can use the
simpler ‘or’ logic to compute P.


Carry Look Ahead

$C_{in}$ for bit $i+1$ is the $C_{out}$ of bit $i$.
So we can write $C_{i+1} = G_i + P_i \cdot C_i$.
Notice that the Kill signal is not required.
If $G_i = 0$, then $A \cdot B = 0$, and in that case $A \oplus B = A + B$, so $C_{i+1} = P_i \cdot C_i$ is the same for either form of $P_i$.
If $G_i = 1$, $C_{i+1} = 1$, and the value of $P_i$ does not matter anyway.
So we can use $P = A + B$ instead of $P = A \oplus B$.
Now, we have the sequence:

$C_{i+1} = G_i + P_i \cdot C_i = G_i + P_i \cdot G_{i-1} + P_i \cdot P_{i-1} \cdot C_{i-1} = \cdots$

and so on, till we reach $C_0$.

Since all $G_i$ and $P_i$ can be computed in parallel on arrival of the inputs, and $C_0$ is known at the start, we can compute all sum and carry terms independently if we do not mind the added complexity.
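The fully expanded look-ahead expression can be sketched in Python as follows. Each carry is computed directly from the G and P values and C0, without using any other carry; the `use_or` flag (a name introduced here) switches between P = A + B and P = A ⊕ B so that the two can be compared.

```python
def lookahead_carries(a_bits, b_bits, c0, use_or=True):
    """Carries c1..cn from the expanded expression
    c(i+1) = G_i + P_i.G_(i-1) + ... + P_i.P_(i-1)...P_0.c0  (LSB-first inputs)."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [(a | b) if use_or else (a ^ b) for a, b in zip(a_bits, b_bits)]
    carries = []
    for i in range(len(g)):
        c = 0
        for j in range(i, -2, -1):          # j = -1 stands for the c0 term
            term = g[j] if j >= 0 else c0
            for k in range(j + 1, i + 1):   # AND with every P above position j
                term &= p[k]
            c |= term
        carries.append(c)
    return carries

a_bits, b_bits = [1, 0, 1, 1], [1, 1, 0, 1]
print(lookahead_carries(a_bits, b_bits, 0, use_or=True))
print(lookahead_carries(a_bits, b_bits, 0, use_or=False))
```

The number of product terms for bit i grows with i, which illustrates the added complexity mentioned above.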



Unfortunately, static implementation of these gates has almost as much delay
as the ripple carry implementation.
Therefore, the static implementation of computation of sum and carry terms
as a logic expression depending on all Ai , Bi and C0 is rarely used.
We can use these expressions for blocks of a small number of bits (say 4) and
then propagate carry over these blocks.



Carry Look Ahead Manchester Carry Chain

Manchester Carry Chain

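Static implementation of look ahead carry is not really fast if we try to look ahead by a large number of bits, because the logic becomes very complex.
A dynamic implementation is useful and is widely used. It is known as the Manchester Carry Chain.

[Figure: one stage of a Manchester carry chain between VDD and Gnd, with a pass transistor controlled by P between Cin and Cout, a pull-down transistor controlled by G, and clock Ck.]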


When the clock is low, the output is unconditionally charged by the pMOS.
When the clock goes high, the output will be pulled low if G = 1, or if P = 1 and the carry input node is at 0.
In all other cases, the output will remain high. Thus this circuit implements the required logic.

This circuit can be concatenated for all bits, and since P and G are ready before $C_{in}$ arrives, the carry quickly ripples through from bit to bit.


Manchester Carry Chain as Carry Look Ahead

Notice that the nMOS logic can be interpreted as:

$P \cdot C_{in} + G$

where $C_{in}$ itself has been recursively generated by similar logic.

As in the static case, there is a limit to the number of bits which can be so connected.
If P = 1 for many successive bits, the discharge path is through series connected pass transistors of all these gates. The discharge time for this critical path has an $n^2$ dependence.



The circuit below shows a Manchester carry chain over 4 bits.

[Figure: four Manchester stages in series. Pass transistors P0 to P3 connect Cin0 to Cout0, Cout1, Cout2 and Cout3; pull-down transistors G0 to G3 sit on the corresponding output nodes; all nodes are precharged from VDD and evaluated under clock Ck.]

If G = 1 for any bit, the output is brought to ‘0’. (Recall that $\overline{Carry}$ propagates, not Carry.)
The carry arrival time at all subsequent bits is measured from the last bit where P = 0.
The worst case for delay occurs when P = 1 for all bits. In this case, all load capacitors are shorted together, so load capacitance ∝ n.
The discharge of capacitors is through n series connected pass transistors, so average R is ∝ n.
Thus in the worst case, the delay ∝ RC ∝ $n^2$.
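The quadratic dependence can be illustrated numerically with the Elmore delay of a uniform RC ladder. This is a standard first-order estimate and is an assumption of this sketch rather than a model taken from the slides; `r` and `c` are hypothetical per-stage resistance and capacitance values in arbitrary units.

```python
def elmore_delay(n, r=1.0, c=1.0):
    """Elmore delay to the far end of an n-stage uniform RC ladder.

    The capacitance c at node k discharges through the k series resistances
    between it and the driven end, contributing k * r * c to the delay."""
    return sum(k * r * c for k in range(1, n + 1))

for n in (1, 2, 4, 8, 16):
    # n * (n + 1) / 2 grows as n**2 / 2 for large n.
    print(n, elmore_delay(n), n * (n + 1) / 2)
```

Doubling the chain length roughly quadruples the estimate for long chains, which is why the chain is limited to a few bits before buffering or bypassing.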
Carry Bypass Adder

Carry Bypass Adder

The worst case for addition occurs when P = 1 for all bits and carry has to
ripple through all bits.
In carry bypass adder, we form groups of bits and if P = 1 for all members
of a group, we pass on the carry input to this group directly to the input of
the next group, without having to ripple through each bit.
This improves the worst case delay of the adder.
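$bypass = P_0 \cdot P_1 \cdot P_2 \cdot P_3$

[Figure: 4 bit Manchester carry chain (pass transistors P0 to P3, pull-downs G0 to G3, outputs Cout0 to Cout3, clock Ck) with an additional bypass path from Cin0 to the group output.]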
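A behavioural sketch of carry bypass addition is given below. It models only the logic, not the timing: within each group the sum bits are still produced by rippling, while the carry handed to the next group is taken directly from the group input whenever every P in the group is 1. The function name, the default widths and the `bypasses` counter are assumptions for illustration.

```python
def carry_bypass_add(a, b, width=16, group=4):
    """Return (sum, carry_out, number_of_groups_bypassed) for unsigned a + b."""
    result, carry, bypasses = 0, 0, 0
    for lo in range(0, width, group):
        p = [((a >> i) ^ (b >> i)) & 1 for i in range(lo, lo + group)]
        g = [((a >> i) & (b >> i)) & 1 for i in range(lo, lo + group)]
        c = carry
        for k in range(group):               # ripple inside the group for the sum bits
            result |= (p[k] ^ c) << (lo + k)
            c = g[k] | (p[k] & c)
        if all(p):                           # bypass = P0.P1.P2.P3
            bypasses += 1                    # group carry in is forwarded unchanged
        else:
            carry = c                        # carry produced inside the group
    return result, carry, bypasses

total, cout, bypasses = carry_bypass_add(0x0FFF, 0x0001)
print(hex(total), cout, bypasses)
```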



Carry Select Adder

Single bit Carry Select Adder

One can make a fast adder at the cost of some added complexity, by
implementing two adders, one assuming that Cin = 0 and the other
assuming that Cin = 1.
When the actual carry input arrives at this bit, it chooses the correct one
using a multiplexer, depending on its value.
Since Cout = G + P · Cin , the two cases are:
For Cin = 0, Cout = G = A · B
For Cin = 1, Cout = G + P = A · B + A ⊕ B = A + B
Thus the two candidates for Cout are quite easy to generate, being just the
AND/OR of A and B.
This concept can be extended to multi-bit carry select adders.


An m bit carry select adder can be constructed as follows:


We first compute the generate/propagate/kill signals for each bit (in
parallel) from the input bits. Assuming unit gate delay model, this takes
one unit of time.
We use two m bit carry bypass adders. One of the adders assumes the
carry input Cin to be 0, while the other assumes Cin to be 1. The two
adders work in parallel and each takes m units of time.
We now use a multiplexer controlled by the actual Cin to select the correct
Cout . This takes one unit of time.
The Cout of one such m bit adder will be used as the select input of the
multiplexer of the next.
The sum output of each bit is derived from the P and $C_{out}$ signals for the corresponding bit and appears one unit of time after $C_{out}$ is available.
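The construction above can be sketched functionally in Python. For simplicity this sketch uses plain ripple sub-adders and selects whole sum vectors with the multiplexer; both simplifications, together with all function names, are assumptions of the example and not the implementation described in the slides.

```python
def ripple(a_bits, b_bits, cin):
    """m bit sub-adder (LSB-first bit lists) returning (sum_bits, carry_out)."""
    sums, c = [], cin
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ c)
        c = (a & b) | (c & (a | b))
    return sums, c

def carry_select_block(a_bits, b_bits, actual_cin):
    # Both candidates are computed without waiting for the actual carry in.
    sums0, cout0 = ripple(a_bits, b_bits, 0)
    sums1, cout1 = ripple(a_bits, b_bits, 1)
    # Multiplexer controlled by the actual carry in.
    return (sums1, cout1) if actual_cin else (sums0, cout0)

def carry_select_add(a, b, width=16, m=4):
    result, carry = 0, 0
    for lo in range(0, width, m):
        a_bits = [(a >> i) & 1 for i in range(lo, lo + m)]
        b_bits = [(b >> i) & 1 for i in range(lo, lo + m)]
        # The carry out of one block is the select input of the next.
        sums, carry = carry_select_block(a_bits, b_bits, carry)
        for k, s in enumerate(sums):
            result |= s << (lo + k)
    return result, carry

print(carry_select_add(40000, 30000))
```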


Multi-bit Carry Select Adders

The two m bit sub-adders assume the carry to be 0 or 1 respectively.
Times of availability of various signals (in unit delay times) are noted in parentheses in the diagram.

[Figure: m bit carry select block. Inputs a, b (0) feed the G, P, K generation (1), which drives two m bit adders, one with Cin = 0 and one with Cin = 1, whose outputs are ready at (m+1). A Mux controlled by the actual Cin selects Cout at (m+2).]

The two alternatives for the carry output are ready at (m+1) units of time.
If the actual $C_{in}$ is available at n units of time, the output will be available at (m+2) or (n+1), whichever is later.
In case of 4 bit adders, this is at 6 units of time or at $C_{in}$ arrival + 1, whichever is later.



Carry Select Adder Stacking Carry Select Adders

Stacking in Carry Select adders

The sub-adders in carry select adder can use any architecture.


They could be Manchester carry chains, carry bypass or ripple carry
adders.
Obviously, these sub adders should not be very long, otherwise their outputs will be ready after a long time and we shall lose the advantage of carry select addition.
Then, how do we make long adders using carry select?
This is done by stacking several smaller carry select adders.


First stage of Carry Select adders

The first stage of stacked Carry Select adders is different from the rest.
In this case, we do not have to wait for Cin to arrive – it is already known.
Therefore we do not have to use redundant adders – a single m bit adder
will do.
Since no multiplexing is required, the output of the first stage is ready at
(m + 1) units of time, rather than at (m + 2).
This is convenient – because the two alternatives of the second stage are
also ready at (m + 1) units of time.


Linear Stacking

We could stack several identical carry select adders.


There is no need for carry select in the first stage, as Cin for this stage is
available simultaneously with Ai and Bi .
Every subsequent stage will have two sub-adders, one assuming Cin = 0,
the other assuming Cin = 1.
The correct output will be selected by the actual Cin when it arrives.
Thus, after the first stage, each group of m bit adders will add only one
unit of delay.
This is much faster. However, the delay is still linear in number of bits.


Linear stacking: Example

A 32-bit adder made by cascading 8 4-bit carry select adders.

[Figure: eight 4 bit groups with inputs a(0-3) b(0-3), a(4-7) b(4-7), ..., a(28-31) b(28-31), all available at (0). Each group generates G, P, K at (1). The first group has a single 4 bit adder; every later group has two 4 bit adders (Cin = ‘0’ and Cin = ‘1’) ready at (5), followed by a Mux. The carry out appears at (5), (6), ..., (11), (12).]

Bits    cy in   alt cy.s   cy out
0-3       0        -          5
4-7       5        5          6
8-11      6        5          7
12-15     7        5          8
16-19     8        5          9
20-23     9        5         10
24-27    10        5         11
28-31    11        5         12

The sum generation will take another unit of time, so the overall results will be available in 13 units of time.
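The timing table can be regenerated from the unit delay rules stated earlier: one unit for G, P, K, m units for an m bit sub-adder, and one unit for each multiplexer. The function below is a small sketch under exactly those assumptions; its name and return format are illustrative.

```python
def carry_select_timing(group_sizes):
    """Unit-delay schedule for stacked carry select groups.

    Returns one (carry_in_time, alternatives_ready_time, carry_out_time) tuple
    per group. The first group has no alternatives (None) because its carry
    input is known at the start."""
    first = group_sizes[0]
    carry = first + 1                       # 1 unit for G/P/K + m units for the adder
    rows = [(0, None, carry)]
    for m in group_sizes[1:]:
        alternatives = m + 1                # both candidate carries ready
        carry_in = carry
        carry = max(carry_in, alternatives) + 1   # one unit for the multiplexer
        rows.append((carry_in, alternatives, carry))
    return rows

rows = carry_select_timing([4] * 8)
for row in rows:
    print(row)
print("sum ready at", rows[-1][2] + 1)      # one more unit for sum generation
```

Passing a list of unequal group sizes to the same function gives the schedule for non-uniform stacking.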


Square-root Stacking

Can we speed up the adder if we don’t use the same no. of bits in every
stage?
In linear stacking, since all adders are identical, they are ready with their
alternative outputs at the same time.
But the carry arrives later and later at each successive group of carry
select adders.
We could have used this extra time to add up more bits in the later
stages, and still be ready with the alternative results before carry arrives!
Since the carry arrives one unit of time later at each successive group,
each successive group could be longer by one bit.


We can do more bits of addition in the same time, if each successive stage is 1 bit longer than the previous one.
Thus, the number of bits which can be added is given by

$n = m_0 + m_0 + (m_0 + 1) + (m_0 + 2) + \cdots = m_0 + \frac{s\,(m_0 + m_0 + s - 1)}{2}$

where s is the number of stages following the first one without carry select.
The total delay will be $m_0 + 1$ for the first stage. Each subsequent stage takes just 1 unit of time since the candidates for selection are available just in time.
The time taken is just $m_0 + s + 1$ units. When $s \gg m_0$, we have $n \approx s^2/2$, while the time taken is nearly s.

Thus the time taken to add n bits is $\approx \sqrt{2n}$.


Square-root Stacking: Example

For a 32 bit adder, we could use a distribution like: 4,4,5,6,7,6.

Bits     carry in   carry alternatives   carry out
0-3         0              -                 5
4-7         5              5                 6
8-12        6              6                 7
13-18       7              7                 8
19-25       8              8                 9
26-31       9              7                10

Our sum will be ready at 11 units of time, which is faster than the 13 units needed with linear stacking. This gain will be much higher for wider additions.
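The group distribution used in this example follows the pattern m0, m0, m0+1, m0+2, ..., with the last group shortened to fit the word length. A minimal Python sketch of that pattern and of the delay m0 + s + 1 is shown below; the function names and the default m0 = 4 are assumptions for illustration.

```python
import math

def sqrt_group_sizes(n_bits, m0=4):
    """Group sizes m0, m0, m0+1, m0+2, ... truncated so that they total n_bits."""
    sizes = [m0]
    remaining = n_bits - m0
    size = m0
    while remaining > 0:
        take = min(size, remaining)   # the last group may be shorter
        sizes.append(take)
        remaining -= take
        size += 1
    return sizes

def carry_delay(sizes):
    """m0 + s + 1 units, where s is the number of groups after the first."""
    return sizes[0] + (len(sizes) - 1) + 1

for n in (32, 64, 128):
    sizes = sqrt_group_sizes(n)
    # Compare the unit-delay count with the asymptotic estimate sqrt(2n).
    print(n, sizes, carry_delay(sizes) + 1, round(math.sqrt(2 * n), 1))
```

For small n the constant m0 + 1 dominates, so the count is visibly larger than the estimate; the square-root behaviour only shows up as n grows.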



Tree Adders

Tree Adders

Tree adders use the idea of carry look ahead addition.


However, these do not try to implement the complex logic expressions
which would result if we try to generate each carry directly from input
operands.
Instead, these build up the logic in a tree like structure, where each node
performs simple logic operations on the results of the previous node.
Because of the tree structure used in this, the delay is of the order of log n
for an n bit adder.


Carry Look Ahead

For carry look ahead, we had defined
$K = \overline{A} \cdot \overline{B}$, $G = A \cdot B$ and $P = A \oplus B$.
P, G and K can be computed without waiting for $C_{in}$.
When K = 1, $C_{out} = 0$ irrespective of $C_{in}$.
When G = 1, $C_{out} = 1$ irrespective of $C_{in}$.
When P = 1, $C_{out} = C_{in}$: this is the only case when we must wait for $C_{in}$ in order to compute $C_{out}$.
Exactly one of P, G and K will be true for any combination of A and B.
Therefore we do not have to compute all three. Most adders just use G and P.


Terminology

Let us first establish the terminology used for this section.


[Figure: N bit adder drawn as a row of bit cells N-1, ..., i, ..., 1, 0. Cell i takes $a_i$, $b_i$ and the carry $c_i$, computes $G_i$, $P_i$, and produces $s_i$ and $c_{i+1}$. The carries run from $c_0$ on the right to $c_N$ on the left.]

The least significant bit is indexed as 0 and the most significant bit as N − 1.
The input operands to the adder are $A = (a_{N-1} \cdots a_0)$ and $B = (b_{N-1} \cdots b_0)$, with a possible input carry $c_0$. All these bits are available at the start.
$c_i$ represents the input carry to the i’th bit.
The output carry from bit i is $c_{i+1}$, which is the input carry for bit (i+1).
Thus $c_0$ represents the overall input carry for the addition and $c_N$ represents the final output carry.
$s_i$ represents the sum output from the i’th bit.


P and G signals over blocks of multiple bits


The Generate and Propagate signals are derived exclusively from the $a_i$ and $b_i$ inputs and are independent of carry input. These can thus be generated in constant time and in parallel for all the bits.
The output carry for the i’th bit is generated from the incoming carry using the relation: $c_{i+1} = G_i + P_i \cdot c_i$. Similarly, $c_i = G_{i-1} + P_{i-1} \cdot c_{i-1}$.
Substituting for $c_i$ in the relation for $c_{i+1}$, we get
$c_{i+1} = G_i + P_i \cdot (G_{i-1} + P_{i-1} \cdot c_{i-1}) = (G_i + P_i \cdot G_{i-1}) + (P_i \cdot P_{i-1}) \cdot c_{i-1}$
If we define $G_{i:i-1} \equiv G_i + P_i \cdot G_{i-1}$ and $P_{i:i-1} \equiv P_i \cdot P_{i-1}$, we get the relation: $c_{i+1} = G_{i:i-1} + P_{i:i-1} \cdot c_{i-1}$
This is the same relation as the one used for single bit carry generation, but permits us to compute $c_{i+1}$ directly from $c_{i-1}$.
Thus $G_{i:i-1}$ and $P_{i:i-1}$ are effectively the Generate and Propagate values for a block of 2 bits (i and i − 1).
Like $G_i$ and $P_i$, $G_{i:i-1}$ and $P_{i:i-1}$ are independent of carry and can be computed in constant time from A and B in parallel.
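The block relations can be captured in a single combining function on (G, P) pairs. The sketch below checks exhaustively that using the combined 2 bit pair gives the same carry as applying the single bit relation twice; `combine`, `bit_gp` and `carry` are names introduced for this illustration.

```python
from itertools import product

def combine(upper, lower):
    """Merge the (G, P) pair of an upper block with that of the adjacent lower block."""
    g_u, p_u = upper
    g_l, p_l = lower
    return g_u | (p_u & g_l), p_u & p_l

def bit_gp(a, b):
    """Single bit (G, P) pair."""
    return a & b, a ^ b

def carry(gp, cin):
    """Carry out of a block with the given (G, P) pair and carry in."""
    g, p = gp
    return g | (p & cin)

ok = all(
    carry(combine(bit_gp(a1, b1), bit_gp(a0, b0)), c)
    == carry(bit_gp(a1, b1), carry(bit_gp(a0, b0), c))
    for a1, b1, a0, b0, c in product((0, 1), repeat=5)
)
print("2 bit block relation holds:", ok)
```

Because `combine` takes and returns the same kind of pair, it can be applied again to its own results, which is what allows blocks of 4, 8 and more bits to be built up.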


Higher order P and G


Just as we combined single bit G and P values to get new G and P values
which operate over two bits, we can combine these 2 bit G and P values
to get G and P values which operate over 4 bits and so on.
In general, if we take two contiguous ranges u and l each of size $2^n$, we can write for the combined range (u : l) of size $2^{n+1}$, the recursive relation:
$G_{u:l} = G_u + P_u \cdot G_l$ and $P_{u:l} = P_u \cdot P_l$
This suggests a tree structure for computation of successive G and P values which operate over bigger and bigger ranges of bits.
To distinguish G and P values operating over ranges of different sizes, we’ll use a superscript which gives the “order” of computation of these.
Thus single bit G and P values will carry a superscript of 0, 2 bit values will use a superscript of 1 and so on. Eventually, G and P values covering a range of $2^m$ bits will carry a superscript of m.
As before, G and P values will carry a subscript which gives the range of
bit indices over which these operate.

Once the highest order P and G values have been generated, the final
carry can be computed in one step from the input carry.
The final result contains all the sum bits and the final carry. So it may
appear that we do not need the intermediate carries at each bit.
However, the sum bits depend on internal carries. The sum bits are given
by:
Si = Ai ⊕ Bi ⊕ Ci = Pi ⊕ Ci
Thus we do need the internal bit-wise carries for sum generation.
The group size over which the carry can be computed directly multiplies
by two each time we use a higher order for G and P values.
On the other hand, the time to compute the required higher order G and
P values increments by one gate delay.
(time to compute A + B · C for G and A · B for P).
As a result, the total time to generate all the P and G values is logarithmic in the number of bits being added.


Logarithmic Adders

Using P and G values of different orders, we can compute the bit wise
carry and sum values.
Notice that in logarithmic adders, internal bit-wise sum and carry values
may be available after the final carry.
Thus the critical path is not the generation of the final carry, but that of
bit-wise sums.
Different architectures have been described in literature for the order of
computation of G, P, Cout and Sum bits.
All of these compute the final result in times which are logarithmic
functions of the number of bits.
For wide adders, these can be much faster than other architectures.



Tree Adders Brent Kung adder

Brent Kung adder

The Brent Kung tree adder is a logarithmic adder of low complexity.


Generate and Propagate signals are successively computed over groups of 1 bit, 2 bits, 4 bits, . . . in a tree structure.
Since the number of bits covered in every step doubles, the total time
taken for this is a logarithmic function of the number of bits.
Values of multiple orders of G and P so computed are then used to
compute the internal carry values at each internal bit, from which sum
values for every bit are derived.
This step is called a back trace and also takes logarithmic time.


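The figure below shows the generation of P and G values for an 8 bit adder.

Inputs:   a7 b7 | a6 b6 | a5 b5 | a4 b4 | a3 b3 | a2 b2 | a1 b1 | a0 b0
Order 0:  $P_7^0 G_7^0$, $P_6^0 G_6^0$, $P_5^0 G_5^0$, $P_4^0 G_4^0$, $P_3^0 G_3^0$, $P_2^0 G_2^0$, $P_1^0 G_1^0$, $P_0^0 G_0^0$
Order 1:  $P_{7:6}^1 G_{7:6}^1$, $P_{5:4}^1 G_{5:4}^1$, $P_{3:2}^1 G_{3:2}^1$, $P_{1:0}^1 G_{1:0}^1$
Order 2:  $P_{7:4}^2 G_{7:4}^2$, $P_{3:0}^2 G_{3:0}^2$
Order 3:  $P_{7:0}^3 G_{7:0}^3$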


We first calculate $P_i^0$, $G_i^0$, with $i = 0 \cdots 7$.

$G_i^0 = A_i \cdot B_i$, $P_i^0 = A_i \oplus B_i$

Next, using these values, we can generate $P_{2i+1:2i}^1$, $G_{2i+1:2i}^1$ with $i = 0 \cdots 3$.

$G_{2i+1:2i}^1 = G_{2i+1}^0 + P_{2i+1}^0 \cdot G_{2i}^0$, $P_{2i+1:2i}^1 = P_{2i+1}^0 \cdot P_{2i}^0$

In the next step, we use the 2 bit P, G values to generate the 4 bit values $P_{4i+3:4i}^2$, $G_{4i+3:4i}^2$ with $i = 0, 1$.

$G_{7:4}^2 = G_{7:6}^1 + P_{7:6}^1 \cdot G_{5:4}^1$, $P_{7:4}^2 = P_{7:6}^1 \cdot P_{5:4}^1$
$G_{3:0}^2 = G_{3:2}^1 + P_{3:2}^1 \cdot G_{1:0}^1$, $P_{3:0}^2 = P_{3:2}^1 \cdot P_{1:0}^1$


Finally, using $G_{4i+3:4i}^2$ and $P_{4i+3:4i}^2$ (with $i = 0, 1$) we can compute $P_{7:0}^3$, $G_{7:0}^3$.

$G_{7:0}^3 = G_{7:4}^2 + P_{7:4}^2 \cdot G_{3:0}^2$
$P_{7:0}^3 = P_{7:4}^2 \cdot P_{3:0}^2$


Once P and G terms of various orders are known, we can compute the values
of carry outputs which depend on these and the input carry C0 , which is
available at t = 0.

$C_1 = G_0^0 + P_0^0 \cdot C_0$, $C_2 = G_{1:0}^1 + P_{1:0}^1 \cdot C_0$
$C_4 = G_{3:0}^2 + P_{3:0}^2 \cdot C_0$, $C_8 = G_{7:0}^3 + P_{7:0}^3 \cdot C_0$
When these carry values are valid, the other carry values which depend on
these can be generated.


Once $C_1$, $C_2$, $C_4$ and $C_8$ have been generated, we can produce internal carries which depend on these.

$C_3 = G_2^0 + P_2^0 \cdot C_2$, $C_5 = G_4^0 + P_4^0 \cdot C_4$
$C_6 = G_{5:4}^1 + P_{5:4}^1 \cdot C_4$

Finally, $C_7$ can be generated from $C_6$.

$C_7 = G_6^0 + P_6^0 \cdot C_6$

With all carry values generated, the corresponding sum values can be calculated using the relation $Sum_i = P_i^0 \oplus C_i$.
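The complete 8 bit procedure (tree of P and G values, carries at positions 1, 2, 4 and 8, then the back trace for 3, 5, 6 and 7) can be sketched in Python as below. The helper names and the use of Python integers for the operands are assumptions made for this illustration; it is a behavioural model, not a gate-level design.

```python
def combine(upper, lower):
    """(G, P) of a range from the (G, P) pairs of its upper and lower halves."""
    (g_u, p_u), (g_l, p_l) = upper, lower
    return g_u | (p_u & g_l), p_u & p_l

def carry(gp, cin):
    g, p = gp
    return g | (p & cin)

def brent_kung_add8(a, b, c0=0):
    """Return (8 bit sum, carry out c8) following the steps described above."""
    # Single bit (G, P) pairs.
    gp0 = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(8)]
    gp1 = [combine(gp0[2 * i + 1], gp0[2 * i]) for i in range(4)]   # bits 2i+1:2i
    gp2 = [combine(gp1[2 * i + 1], gp1[2 * i]) for i in range(2)]   # bits 4i+3:4i
    gp3 = combine(gp2[1], gp2[0])                                   # bits 7:0

    c = [0] * 9
    c[0] = c0
    # Carries obtained directly from the input carry.
    c[1] = carry(gp0[0], c[0])
    c[2] = carry(gp1[0], c[0])
    c[4] = carry(gp2[0], c[0])
    c[8] = carry(gp3, c[0])
    # Back trace: remaining internal carries.
    c[3] = carry(gp0[2], c[2])
    c[5] = carry(gp0[4], c[4])
    c[6] = carry(gp1[2], c[4])
    c[7] = carry(gp0[6], c[6])

    total = sum((gp0[i][1] ^ c[i]) << i for i in range(8))          # sum_i = P_i xor c_i
    return total, c[8]

print(brent_kung_add8(200, 100))
```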



Tree Adders Tutorial: 32 bit Brent Kung Logarithmic Adder

Logarithmic Adders with a tree architecture


We illustrate the operation of a 32 bit Brent Kung adder with a numerical
example.
Recall that if we represent the indices of the upper half of a range by u and those of the lower half by l, we can write:

$G_{(u:l)} = G_u + P_u \cdot G_l$, whereas $C_{next} = G_{(u)} + P_{(u)} \cdot C_{prev}$

Notice that G values are computed by the same logic relation as carry
outputs.
The input carry C0 is known at the start itself.
Whenever the carry is already known, we can replace Gl by this carry.
The computed value of G(u:l) will then be the carry output, rather than the
G value. This value can be used for further G calculations and will directly
give the carry each time.
This can reduce the computation required to generate the carry and sum
values since some of the carry values are already available.


32 bit Brent Kung adder: Order 0

We use a unit time model in which we assume that logic functions AND,
XOR, A + B.C as well as A.B + C.(A+B) take the same amount of time,
which defines 1 slot of time for this tutorial.
The single bit G and P values (designated as order 0) are given by

$P_i^0 = a_i \oplus b_i$, $G_i^0 = a_i \cdot b_i$, except $G_0^0 = a_0 \cdot b_0 + c_0 \cdot (a_0 + b_0)$

An exception is made for the least significant bit of G because for this bit, the input carry is known at the start.
We make use of this and effectively compute the carry output from bit 0 ($c_1$), treating it as if it were due to a generate signal at this position. Thus,
$G_0^0 = c_1 = a_0 \cdot b_0 + c_0 \cdot (a_0 + b_0)$
All these functions can be computed in one unit of time directly from $a_i$, $b_i$ and input carry $c_0$. So these are all ready at the end of the first time slot.
Since $c_1 = G_0^0$, $c_1$ is also ready at the end of the first slot.


Brent Kung adder: higher orders

We can define G and P functions which operate over multiple bits. Higher
order G and P values are computed as

G = Gu + Pu · Gl , P = Pu · Pl

where u and l stand for upper half range and lower half range for a range
of bit indices.
These can be computed within one time slot from the next lower order G
and P values. Thus higher orders of G and P values, (successively
covering twice the range of indices for the previous order) will be
available in each time slot.
Internal carries are computed using functions like C = G + P · Cin .
Depending on the order of G and P values, we can compute carry values
whose indices are 1, 2, 4, 8 . . . bits higher than the input carry. This
computation also takes one time slot, but can be performed only after the
needed Cin , P and G values are available.
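
The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names `combine` and `carry_out`, and the choice of representing a range by a `(G, P)` tuple, are assumptions made here and are not part of the slides.

```python
def combine(upper, lower):
    """Merge (G, P) of the upper half range with (G, P) of the lower half range."""
    g_u, p_u = upper
    g_l, p_l = lower
    # G = Gu + Pu.Gl,  P = Pu.Pl
    return (g_u | (p_u & g_l), p_u & p_l)


def carry_out(gp, c_in):
    """C = G + P.Cin for a range described by its (G, P) pair."""
    g, p = gp
    return g | (p & c_in)


# Upper bit propagates (G=0, P=1), lower bit generates (G=1, P=0).
pair = combine((0, 1), (1, 0))
```

Here `pair` describes a 2 bit range which generates a carry regardless of its carry input.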


32 bit Brent Kung adder

G and P values for single bits are available at the end of first slot.
G and P values spanning groups of 2 bits are available at the end of
second slot. G and P values spanning groups of 4 bits are available at
the end of third slot. G and P values spanning groups of 8 bits are
available at the end of fourth slot. G and P values spanning groups of 16
bits are available at the end of fifth slot.
Finally, G and P values spanning the full word of 32 bits are available at
the end of sixth slot.
G and P values are available over spans of $2^n$ bits. The start bit for these
spans has a granularity of $2^n$ bits. For example, second order values
connect 0 → 4, 4 → 8 etc. We cannot connect using these from 1 → 5 in
a Brent Kung adder.
The lowest index G value for any order i is automatically the carry value
for bit index $2^i$.


32 bit Brent Kung adder

at time = 0, all $a_i$, $b_i$ and $c_0$ are available.
at time = 1, all $P_i^0$ and $G_i^0$ are available. $c_1 = G_0^0$ is also available.
at time = 2, all 2 bit P and G values ($P^1$ and $G^1$) are available.
$c_2 = G^1_{(1:0)}$ has been computed.
at time = 3, all 4 bit P and G values ($P^2$ and $G^2$) are available.
$c_4 = G^2_{(3:0)}$, and
$c_3 \leftarrow c_2$ using $G_2^0$, $P_2^0$ and $c_2$, have also been computed.
at time = 4, all 8 bit P and G values ($P^3$ and $G^3$) are available.
$c_8 = G^3_{(7:0)}$ is also available.
$c_5 \leftarrow c_4$ using $G_4^0$, $P_4^0$ and $c_4$; as well as
$c_6 \leftarrow c_4$ using $G^1_{(5:4)}$, $P^1_{(5:4)}$ and $c_4$ have been computed.


32 bit Brent Kung adder

at time = 5, all 16 bit P and G values ($P^4$ and $G^4$) have been computed.
$c_{16} = G^4_{(15:0)}$ is also available.
$c_7 \leftarrow c_6$ using $G_6^0$, $P_6^0$ and $c_6$;
$c_9 \leftarrow c_8$ using $G_8^0$, $P_8^0$ and $c_8$;
$c_{10} \leftarrow c_8$ using $G^1_{(9:8)}$, $P^1_{(9:8)}$ and $c_8$;
$c_{12} \leftarrow c_8$ using $G^2_{(11:8)}$, $P^2_{(11:8)}$ and $c_8$ are all available.
at time = 6, $G^5_{(31:0)}$ is generated. This is the value of $c_{32} = C_{out}$.
$P^5_{(31:0)}$ is not required.
$c_{11} \leftarrow c_{10}$ using $G_{10}^0$, $P_{10}^0$ and $c_{10}$;
$c_{13} \leftarrow c_{12}$ using $G_{12}^0$, $P_{12}^0$ and $c_{12}$;
$c_{14} \leftarrow c_{12}$ using $G^1_{(13:12)}$, $P^1_{(13:12)}$ and $c_{12}$;
$c_{17} \leftarrow c_{16}$ using $G_{16}^0$, $P_{16}^0$ and $c_{16}$;
$c_{18} \leftarrow c_{16}$ using $G^1_{(17:16)}$, $P^1_{(17:16)}$ and $c_{16}$;
$c_{20} \leftarrow c_{16}$ using $G^2_{(19:16)}$, $P^2_{(19:16)}$ and $c_{16}$; and
$c_{24} \leftarrow c_{16}$ using $G^3_{(23:16)}$, $P^3_{(23:16)}$ and $c_{16}$ have all been computed.


32 bit Brent Kung adder


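at time = 7, all G and P values for groups of 1, 2, 4, 8 and 16 bits are
available. We have computed:
$c_{15} \leftarrow c_{14}$ using $G_{14}^0$, $P_{14}^0$ and $c_{14}$.
$c_{19} \leftarrow c_{18}$ using $G_{18}^0$, $P_{18}^0$ and $c_{18}$.
$c_{21} \leftarrow c_{20}$ using $G_{20}^0$, $P_{20}^0$ and $c_{20}$.
$c_{22} \leftarrow c_{20}$ using $G^1_{(21:20)}$, $P^1_{(21:20)}$ and $c_{20}$.
$c_{25} \leftarrow c_{24}$ using $G_{24}^0$, $P_{24}^0$ and $c_{24}$.
$c_{26} \leftarrow c_{24}$ using $G^1_{(25:24)}$, $P^1_{(25:24)}$ and $c_{24}$.
$c_{28} \leftarrow c_{24}$ using $G^2_{(27:24)}$, $P^2_{(27:24)}$ and $c_{24}$.
at time = 8, we have computed:
$c_{23} \leftarrow c_{22}$ using $G_{22}^0$, $P_{22}^0$ and $c_{22}$.
$c_{27} \leftarrow c_{26}$ using $G_{26}^0$, $P_{26}^0$ and $c_{26}$.
$c_{29} \leftarrow c_{28}$ using $G_{28}^0$, $P_{28}^0$ and $c_{28}$.
$c_{30} \leftarrow c_{28}$ using $G^1_{(29:28)}$, $P^1_{(29:28)}$ and $c_{28}$.
at time = 9, we have computed:
$c_{31} \leftarrow c_{30}$ using $G_{30}^0$, $P_{30}^0$ and $c_{30}$.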
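
The time slots listed for the carries follow one rule, which can be checked with a short Python sketch. The function name `carry_ready_slot` is hypothetical, and the rule encoded here is only a reading of the schedule given in this tutorial under its unit time model: a carry whose index is a power of two is a G value itself, and any other carry is derived from the nearest lower carry on a span boundary.

```python
def carry_ready_slot(k):
    """Time slot at which carry c_k is available (c_0 is available at slot 0)."""
    if k == 0:
        return 0
    low = k & -k                  # size of the span ending just below bit k
    order = low.bit_length() - 1  # order of the G/P pair covering that span
    if k == low:
        # k is a power of two: c_k is the lowest-index G value of this order
        return order + 1
    # Otherwise c_k = G + P.c_(k-low); G and P of this order are ready at order + 1
    return max(carry_ready_slot(k - low), order + 1) + 1


slots = {k: carry_ready_slot(k) for k in range(33)}
```

Under these assumptions the last carry to arrive is the one into bit 31, so the final sum bit needs one more slot after that.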


32 bit Brent Kung adder

We can show the sequence of generation of carry values by the following
diagram:
[Figure: schedule of carry generation. Horizontal axis: carry input to bit
number, from 32 (Cout) down to 00 (Cin). Vertical axis: time slot, 0 to 9.
G and P values of order 0 to 4 appear at slots 1 to 5, and G of order 5 at slot 6.]


32 bit Brent Kung adder: Numerical Example

Taking the example of adding B7A56893H to 506A980CH with an input carry
of ‘1’, let us list the P, G, carry and sum bits generated in each time slot.
In the first slot, we generate the single bit P and G values.
a     1011 0111 1010 0101 0110 1000 1001 0011
b     0101 0000 0110 1010 1001 1000 0000 1100
P^0   1110 0111 1100 1111 1111 0000 1001 1111
G^0   0001 0000 0010 0000 0000 1000 0000 0001†

$P_i^0 = a_i \oplus b_i, \quad G_i^0 = a_i \cdot b_i$
† $G_0^0$ is generated as $a_0 \cdot b_0 + c_0 \cdot (a_0 + b_0)$
$c_1 = G_0^0 = 1$


32 bit Brent Kung adder: Numerical Example

In the second slot, we generate P and G values spanning two bits each.
From now on,

$P^{m+1}_{range} = P^m_u \cdot P^m_l, \quad G^{m+1}_{range} = G^m_u + P^m_u \cdot G^m_l,$

where u represents the upper half range and l represents the lower half range.

P^0   1110 0111 1100 1111 1111 0000 1001 1111
G^0   0001 0000 0010 0000 0000 1000 0000 0001
P^1   10   01   10   11   11   00   00   11
G^1   01   00   01   00   00   10   00   01

$c_2 = G^1_{1-0} = 1$
$s_0 = P_0^0 \oplus c_0 = 1 \oplus 1 = 0, \quad s_1 = P_1^0 \oplus c_1 = 1 \oplus 1 = 0.$


32 bit Brent Kung adder: Numerical Example

In the third slot, we calculate P and G values spanning 4 bits each.

P^1   10   01   10   11   11   00   00   11
G^1   01   00   01   00   00   10   00   01
P^2   0    0    0    1    1    0    0    1
G^2   1    0    1    0    0    1    0    1

$c_4 = G^2_{3-0} = 1$. We can also compute
$c_3 = G_2^0 + P_2^0 \cdot c_2 = 0 + 1 \cdot 1 = 1$,
$s_2 = P_2^0 \oplus c_2 = 1 \oplus 1 = 0$


32 bit Brent Kung adder: Numerical Example

In the fourth slot, we calculate P and G values spanning 8 bits each.

P^2   0    0    0    1    1    0    0    1
G^2   1    0    1    0    0    1    0    1
P^3   0         0         0         0
G^3   1         1         1         0

$c_8 = G^3_{7-0} = 0$. We can also compute
$c_5 = G_4^0 + P_4^0 \cdot c_4 = 0 + 1 \cdot 1 = 1$,
$c_6 = G^1_{5-4} + P^1_{5-4} \cdot c_4 = 0 + 0 \cdot 1 = 0$.
$s_3 = P_3^0 \oplus c_3 = 1 \oplus 1 = 0, \quad s_4 = P_4^0 \oplus c_4 = 1 \oplus 1 = 0.$


32 bit Brent Kung adder: Numerical Example

In the fifth slot, we calculate P and G values spanning 16 bits each.

P^3   0    0    0    0
G^3   1    1    1    0
P^4   0         0
G^4   1         1

$c_{16} = G^4_{15-0} = 1$. We can also compute
$c_7 = G_6^0 + P_6^0 \cdot c_6 = 0 + 0 \cdot 0 = 0$,
$c_9 = G_8^0 + P_8^0 \cdot c_8 = 0 + 0 \cdot 0 = 0$,
$c_{10} = G^1_{9-8} + P^1_{9-8} \cdot c_8 = 0 + 0 \cdot 0 = 0$,
$c_{12} = G^2_{11-8} + P^2_{11-8} \cdot c_8 = 1 + 0 \cdot 0 = 1$.
$s_5 = P_5^0 \oplus c_5 = 0 \oplus 1 = 1, \quad s_6 = P_6^0 \oplus c_6 = 0 \oplus 0 = 0,$
$s_8 = P_8^0 \oplus c_8 = 0 \oplus 0 = 0.$


32 bit Brent Kung adder: Numerical Example

In the sixth slot, we compute $G^5_{31-0} = G^4_{31-16} + P^4_{31-16} \cdot G^4_{15-0}$.
$P^5_{31-0}$ is not required.
This gives $C_{out} = c_{32} = G^5_{31-0} = 1$. We can further compute:
$c_{11} = G_{10}^0 + P_{10}^0 \cdot c_{10} = 0 + 0 \cdot 0 = 0$,
$c_{13} = G_{12}^0 + P_{12}^0 \cdot c_{12} = 0 + 1 \cdot 1 = 1$,
$c_{14} = G^1_{13-12} + P^1_{13-12} \cdot c_{12} = 0 + 1 \cdot 1 = 1$,
$c_{17} = G_{16}^0 + P_{16}^0 \cdot c_{16} = 0 + 1 \cdot 1 = 1$,
$c_{18} = G^1_{17-16} + P^1_{17-16} \cdot c_{16} = 0 + 1 \cdot 1 = 1$,
$c_{20} = G^2_{19-16} + P^2_{19-16} \cdot c_{16} = 0 + 1 \cdot 1 = 1$,
$c_{24} = G^3_{23-16} + P^3_{23-16} \cdot c_{16} = 1 + 0 \cdot 1 = 1$
$s_7 = P_7^0 \oplus c_7 = 1 \oplus 0 = 1, \quad s_9 = P_9^0 \oplus c_9 = 0 \oplus 0 = 0,$
$s_{10} = P_{10}^0 \oplus c_{10} = 0 \oplus 0 = 0, \quad s_{12} = P_{12}^0 \oplus c_{12} = 1 \oplus 1 = 0,$
$s_{16} = P_{16}^0 \oplus c_{16} = 1 \oplus 1 = 0.$


32 bit Brent Kung adder: Numerical Example

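In the seventh slot, all the required values of P and G are already available.
We can compute:
$c_{15} = G_{14}^0 + P_{14}^0 \cdot c_{14} = 0 + 1 \cdot 1 = 1$,
$c_{19} = G_{18}^0 + P_{18}^0 \cdot c_{18} = 0 + 1 \cdot 1 = 1$,
$c_{21} = G_{20}^0 + P_{20}^0 \cdot c_{20} = 0 + 0 \cdot 1 = 0$,
$c_{22} = G^1_{21-20} + P^1_{21-20} \cdot c_{20} = 1 + 0 \cdot 1 = 1$,
$c_{25} = G_{24}^0 + P_{24}^0 \cdot c_{24} = 0 + 1 \cdot 1 = 1$,
$c_{26} = G^1_{25-24} + P^1_{25-24} \cdot c_{24} = 0 + 1 \cdot 1 = 1$,
$c_{28} = G^2_{27-24} + P^2_{27-24} \cdot c_{24} = 0 + 0 \cdot 1 = 0$
$s_{11} = P_{11}^0 \oplus c_{11} = 0 \oplus 0 = 0, \quad s_{13} = P_{13}^0 \oplus c_{13} = 1 \oplus 1 = 0,$
$s_{14} = P_{14}^0 \oplus c_{14} = 1 \oplus 1 = 0, \quad s_{17} = P_{17}^0 \oplus c_{17} = 1 \oplus 1 = 0,$
$s_{18} = P_{18}^0 \oplus c_{18} = 1 \oplus 1 = 0, \quad s_{20} = P_{20}^0 \oplus c_{20} = 0 \oplus 1 = 1,$
$s_{24} = P_{24}^0 \oplus c_{24} = 1 \oplus 1 = 0.$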


32 bit Brent Kung adder: Numerical Example

In the eighth slot, we can compute:
$c_{23} = G_{22}^0 + P_{22}^0 \cdot c_{22} = 0 + 1 \cdot 1 = 1$,
$c_{27} = G_{26}^0 + P_{26}^0 \cdot c_{26} = 0 + 1 \cdot 1 = 1$,
$c_{29} = G_{28}^0 + P_{28}^0 \cdot c_{28} = 1 + 0 \cdot 0 = 1$,
$c_{30} = G^1_{29-28} + P^1_{29-28} \cdot c_{28} = 1 + 0 \cdot 0 = 1$.
Sums corresponding to carries computed in the previous slot can also be
evaluated as:
$s_{15} = P_{15}^0 \oplus c_{15} = 1 \oplus 1 = 0$,
$s_{19} = P_{19}^0 \oplus c_{19} = 1 \oplus 1 = 0$,
$s_{21} = P_{21}^0 \oplus c_{21} = 0 \oplus 0 = 0$,
$s_{22} = P_{22}^0 \oplus c_{22} = 1 \oplus 1 = 0$,
$s_{25} = P_{25}^0 \oplus c_{25} = 1 \oplus 1 = 0$,
$s_{26} = P_{26}^0 \oplus c_{26} = 1 \oplus 1 = 0$,
$s_{28} = P_{28}^0 \oplus c_{28} = 0 \oplus 0 = 0$.


32 bit Brent Kung adder: Numerical Example

In the ninth slot, we can compute $c_{31} = G_{30}^0 + P_{30}^0 \cdot c_{30} = 0 + 1 \cdot 1 = 1$,
and the sum values
$s_{23} = P_{23}^0 \oplus c_{23} = 1 \oplus 1 = 0$,
$s_{27} = P_{27}^0 \oplus c_{27} = 0 \oplus 1 = 1$,
$s_{29} = P_{29}^0 \oplus c_{29} = 1 \oplus 1 = 0$,
$s_{30} = P_{30}^0 \oplus c_{30} = 1 \oplus 1 = 0$.
Finally in the tenth slot, we can evaluate $s_{31}$ as $s_{31} = P_{31}^0 \oplus c_{31} = 1 \oplus 1 = 0$.
Thus we have
Cin   1110 1111 1101 1111 1111 0000 0011 1111
a     1011 0111 1010 0101 0110 1000 1001 0011
b     0101 0000 0110 1010 1001 1000 0000 1100
sum   0000 1000 0001 0000 0000 0000 1010 0000

Final carry out is 1.
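
The whole numerical example can be reproduced with a small behavioural model. The Python sketch below is an illustration only, not a hardware description: the function name `brent_kung_add` and its data layout are assumptions, and the word width is assumed to be a power of two. It builds the G and P values order by order, folds the input carry into G of bit 0 as in the tutorial, and then derives each carry from the nearest lower carry on a span boundary.

```python
def brent_kung_add(a, b, c_in, width=32):
    """Return (sum, carry_out); width is assumed to be a power of two."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    p0 = [x ^ y for x, y in zip(a_bits, b_bits)]
    # Order 0: (G, P) per bit, with the input carry folded into G of bit 0
    level = [(x & y, x ^ y) for x, y in zip(a_bits, b_bits)]
    g00 = (a_bits[0] & b_bits[0]) | (c_in & (a_bits[0] | b_bits[0]))
    level[0] = (g00, p0[0])
    levels = [level]
    # Higher orders: G = Gu + Pu.Gl, P = Pu.Pl over aligned spans of 2^m bits
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([
            (prev[2 * j + 1][0] | (prev[2 * j + 1][1] & prev[2 * j][0]),
             prev[2 * j + 1][1] & prev[2 * j][1])
            for j in range(len(prev) // 2)
        ])
    # Carries: c_k = G + P.c_(k-low), using the span of size low ending below bit k
    carry = [0] * (width + 1)
    carry[0] = c_in
    for k in range(1, width + 1):
        low = k & -k
        m = low.bit_length() - 1
        g, p = levels[m][(k - low) >> m]
        carry[k] = g | (p & carry[k - low])
    total = sum((p0[i] ^ carry[i]) << i for i in range(width))
    return total, carry[width]


result, c_out = brent_kung_add(0xB7A56893, 0x506A980C, 1)
```

For the operands of this example the model gives the sum 081000A0H with a carry out of 1, which agrees with the table above.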


Serial Adders

Up to now, we have been concerned with making fast adders, even at the cost
of increased complexity and power.
In many applications, speed is not as important as low power consumption
and low cost.
Serial adders are an attractive option in such cases.
A single full adder is used.
If numbers to be added are available in parallel form, these can be serialized
using shift registers.


Serial Adders

A single full adder adds the incoming bits. Bits to be added are fed to it
serially, LSB first.
The sum bit goes to the output while carry is stored in a flip-flop.
Carry then gets added to the more significant bits which arrive next.
Output can be converted to parallel form if needed, using another shift
register.
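[Figure: serial adder. Blocks shown: shift registers for the A and B operands,
a full adder, a carry mux (Cy Mux) with Cin, Load and Csel inputs, a D latch
holding the previous carry (Cprev), and an output shift register for Sum.]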
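
The behaviour of the serial adder can be sketched as a loop with one iteration per clock. This Python model is only an illustration; the function name `serial_add` and its parameters are assumptions, and the shift registers are modelled simply by indexing bits of integers.

```python
def serial_add(a, b, width, c_in=0):
    """Add two width-bit numbers one bit per clock, LSB first."""
    carry = c_in   # contents of the carry flip-flop
    result = 0     # output shift register
    for clock in range(width):
        a_bit = (a >> clock) & 1
        b_bit = (b >> clock) & 1
        s = a_bit ^ b_bit ^ carry                          # full adder sum
        carry = (a_bit & b_bit) | (carry & (a_bit | b_bit))  # stored for next clock
        result |= s << clock
    return result, carry
```

A single full adder is reused for every bit, so the addition takes `width` clock cycles.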



Part II

Shift and Rotate Operations

8 Shift and Rotate Operations

9 Barrel Shifters
Logarithmic Barrel Shifters
Combining Rotate and Shift Operations
Bidirectional Shift and Rotate Operations


Shift and Rotate Operations

Operation        Data movement
SHL op, count    CF ← operand ← 0
SAL op, count    Same as SHL
SHR op, count    0 → operand → CF
SAR op, count    operand → CF (sign bit retained)
ROL op, count    CF ← operand (rotate left)
ROR op, count    operand → CF (rotate right)
RCL op, count    CF ← operand (rotate left through CF)
RCR op, count    operand → CF (rotate right through CF)

This can be implemented by a bi-directional shift register.
We have to add circuits to choose the value and point of entry of new bits.
This can be very slow for a large number of shifts.
Ideally, we would like a shifter which produces the result in a single clock cycle.



Barrel Shifters

Shift/Rotate as Select Operations

No “computation” is being carried out during shift/rotate. So why should
we spend a large number of clock cycles for it?
We can view Shift/Rotate as a selection operation.

[Figure: two n bit words, B on the left and A on the right, placed side by side;
n contiguous bits are selected from the combined 2n bits.]

We just have to choose B and A appropriately to implement a particular shift
or rotate operation.


Shift/Rotate as Select Operations

For all rotate operations, B = A = data.
For Shift Left, B = data, A = 0.
For Logical Shift Right, B = 0, A = data.
For Arithmetic Shift Right, B = replicated MSB of data, A = data.
Of course we do not actually copy data bits to A/B.
Each output bit is produced by a mux which picks out the correct input data bit.
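
The selection view can be sketched in Python by forming the 2n bit word B:A and picking a window of n contiguous bits. The function name `select_shift`, the operation mnemonics used as strings and the default width of 8 are all assumptions made for this illustration; the shift count is assumed to lie between 0 and n.

```python
def select_shift(data, count, op, n=8):
    """Shift or rotate by selecting n contiguous bits from the pair B:A."""
    mask = (1 << n) - 1
    data &= mask
    sign_fill = mask if (data >> (n - 1)) & 1 else 0   # replicated MSB
    if op in ("ROL", "ROR"):
        b_word, a_word = data, data
    elif op == "SHL":
        b_word, a_word = data, 0
    elif op == "SHR":
        b_word, a_word = 0, data
    elif op == "SAR":
        b_word, a_word = sign_fill, data
    else:
        raise ValueError(op)
    wide = (b_word << n) | a_word          # B on the left, A on the right
    if op in ("ROL", "SHL"):
        return (wide >> (n - count)) & mask   # window starts count bits into A
    return (wide >> count) & mask             # window starts count bits above LSB of A
```

Only the choice of B and A differs between operations; the selection step itself is the same.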


Barrel Shifters

Shifters which produce outputs as select operations are called barrel
shifters.
The name comes from viewing the inputs as well as outputs as a circular
arrangement of bits.
The shifter then connects the input circle to the output circle like the
sections of a barrel.
A brute force implementation will require n multiplexers with n inputs each,
where the control inputs for each multiplexer are generated from the
amount and type of shift/rotate.
This is quite complex and puts a heavy load on data bits.



Barrel Shifters Logarithmic Barrel Shifters

Logarithmic Barrel Shifters

The brute force barrel shifter places a heavy load on input data lines
because each input bit is a candidate for each output position.
The control logic is complex because the amount of shift is variable.
The loading on data lines and control logic complexity can be reduced if
we break up the shift process into parts.
We can carry out shifts in different stages, each stage corresponding to a
single bit of the binary representation of the shift amount.
Thus a shift by 6 (binary: 110) will be carried out by first doing a 4 bit shift
and then a 2 bit shift.


Logarithmic Barrel Shifters

We need n bits to represent a maximum shift amount of $2^n - 1$ places.
So the number of bits to express the shift amount (and hence the number
of shift stages required) is logarithmic in the maximum shift desired.
That is why such shifters are called Logarithmic Barrel Shifters.
We can optionally buffer the outputs after each stage.


Logarithmic Barrel Shifter Stages

Bit i of the shift amount represents
    no shift (if it is 0)
    a constant shift by $2^i$ places (if it is 1).
If the shift amount is fixed, we do not need any electronics. The output
can just be wired from the input bits.
Using a 2 way mux controlled by bit i of shift amount, we can choose
either the unshifted operand bit or the operand bit $2^i$ places away from it.
This can be done for all bits of the operand in parallel.
This constitutes one stage of the logarithmic shifter.
The output can then be shifted again in the next stage, controlled by the
next significant bit.
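
The staged structure described above can be sketched in Python for a right rotate. The function name `rotate_right_log` is hypothetical, and the operand width is assumed to be 2 raised to the number of stages. Each pass of the loop models one row of 2 way muxes.

```python
def rotate_right_log(x, amount, n_stages=3):
    """Right rotate using one mux row per bit of the shift amount."""
    width = 1 << n_stages
    bits = [(x >> k) & 1 for k in range(width)]   # bits[k] is bit k of the operand
    for i in reversed(range(n_stages)):           # stage for 2^i places
        select = (amount >> i) & 1
        step = 1 << i
        # Each mux picks the unshifted bit or the bit 2^i places away
        bits = [bits[(k + step) % width] if select else bits[k]
                for k in range(width)]
    return sum(bit << k for k, bit in enumerate(bits))
```

With three stages this covers rotate amounts from 0 to 7 on an 8 bit operand.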


Right Rotate for an 8 bit Operand

[Figure: three rows of 2 input muxes for an 8 bit right rotate. The first row,
controlled by n2, selects X_k or the bit 4 places away and produces p7..p0.
The second row, controlled by n1, selects p_k or the bit 2 places away and
produces q7..q0. The third row, controlled by n0, selects q_k or the bit
1 place away and produces Y7..Y0.]

Each input bit drives just two muxes, each with just 2 inputs.
At each stage, the muxes select either the unshifted bit or a bit $2^n$ places
from it.
3 stages are required for 0 to 7 bits of shift.


8 bit Logical Shift Right

[Figure: the same three rows of muxes as for the right rotate, with ‘0’ fed in
place of the wrapped-around bit at the four leftmost muxes of the first row,
the two leftmost muxes of the second row and the leftmost mux of the last row.]

If we need a shift instead of a rotate, we feed a 0 instead of the
corresponding bit.
This has to be done for 4 muxes in the first stage, 2 in the second stage and
1 in the last stage.



Barrel Shifters Combining Rotate and Shift Operations

Combining Rotate and Shift

[Figure: the 4-bit, 2-bit and 1-bit shift/rotate rows. At the positions where
bits enter, muxes controlled by Rotate choose between the wrapped-around bits
(X3..X0 in the first row) and a fill bit; a mux controlled by ASR chooses the
fill bit as X7 or ‘0’.]

We can combine the circuits for rotate and shift functions by putting muxes
where different inputs need to be presented for the two functions.
We can include the Arithmetic Shift function by choosing between 0 or X7 as
the bit to be inserted.


Rotate and Shift by Masking

We can also combine the rotate and shift functions by masking.
We use the rotate function, which does not lose any information.
Now, for a shift amount of n, we can mask n bits at the left to 0 if a right
shift operation was desired instead.
In case of an arithmetic shift, n bits on the left have to be set to the same
value as X7.
Shift/Rotate Left case is similar, except that the Logical and Arithmetic
shifts are the same.
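
A minimal Python sketch of the masking approach for right shifts follows. The function name `shift_right_by_masking` and its parameters are assumptions made for illustration, and the shift count is assumed to be smaller than the operand width.

```python
def shift_right_by_masking(x, count, arithmetic=False, width=8):
    """Right shift obtained from a right rotate followed by a mask."""
    full = (1 << width) - 1
    x &= full
    rotated = ((x >> count) | (x << (width - count))) & full
    keep = full >> count               # 0s in the count leftmost positions
    result = rotated & keep            # logical shift: leftmost bits forced to 0
    if arithmetic and (x >> (width - 1)) & 1:
        result |= full & ~keep         # arithmetic shift: leftmost bits copy the MSB
    return result
```

The rotate hardware is unchanged; only the mask step depends on whether a rotate, a logical shift or an arithmetic shift was requested.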



Barrel Shifters Bidirectional Shift and Rotate Operations

Combining Left and Right Shift/Rotate

[Figure: a row of 2 input muxes controlled by Left selects either X7..X0 or the
bit-reversed order X0..X7 at the input of the shifter; a similar row at the
output selects either Y7..Y0 or Y0..Y7.]

We can use the same hardware for left and right shift/rotate operations.
This can be done by adding rows of muxes at the input and output which
reverse the order of bits.


Combining Left and Right Shift/Rotate

We can also make use of the fact that a left rotate by m places is the
same as a right rotate by $2^n - m$ places, where $2^n$ is the width of the
operand (data being rotated).
$2^n - m$ is just the 2’s complement of m in an n bit representation.
By presenting the 2’s complement of m at the mux controls, we can
convert a right rotate to a left rotate.
This can be followed by a mask operation, if a shift operation was
required, rather than a rotate.
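
This equivalence is easy to check with a short Python sketch. The function names are hypothetical, and the width is assumed to be a power of two so that the 2’s complement of m fits in the mux control bits.

```python
def rotate_right(x, m, width=8):
    """Reference right rotate of a width-bit word."""
    full = (1 << width) - 1
    m %= width
    return ((x >> m) | (x << (width - m))) & full


def rotate_left_via_right(x, m, width=8):
    """Left rotate by m, done as a right rotate by the 2's complement of m."""
    m_complement = (-m) % width    # 2's complement of m in log2(width) bits
    return rotate_right(x, m_complement, width)
```

For an 8 bit operand, a left rotate by 3 becomes a right rotate by 5.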



Part III

Multiplier Circuits
10 Shift and Add Multipliers
11 Array Multipliers
12 Speeding up Multipliers
Booth Encoding
Adding Partial Products
Wallace Multipliers
Dadda Multipliers
13 Multiply and Accumulate circuits
14 Serial Multipliers
Bit Serial Multipliers
Row Serial multipliers


Shift and Add Multipliers

An obvious way for implementing multipliers is to replicate the paper and
pencil procedure in hardware.
Initialize the product to 0, extend multiplicand to left by n bits filled with 0s.
If the least significant bit of the multiplier is 1, add the multiplicand to
product, else do nothing.
Shift the multiplier right by one bit.
Shift the multiplicand left by one bit.
Repeat for n bits.
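
The steps above translate directly into a loop. The Python sketch below is an illustration for unsigned operands; the function name `shift_add_multiply` is an assumption.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Multiply two unsigned n-bit numbers by shift and add."""
    product = 0
    for _ in range(n):
        if multiplier & 1:            # LSB of the multiplier decides the addition
            product += multiplicand
        multiplier >>= 1              # shift the multiplier right by one bit
        multiplicand <<= 1            # shift the multiplicand left by one bit
    return product
```

Python integers grow as needed, so the extension of the multiplicand to 2n bits is implicit here.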


Shift and Add Multipliers

Each term being added to form the product is called a partial product.
The name “partial product” is also used for individual bits of the terms
being added - so beware!
The paper-pencil procedure requires n-1 additions to a 2n bit
accumulator.
This uses a single adder, but takes long to complete the multiplication. A
32 x 32 multiplication will require 31 addition steps to a 64 bit
accumulator.
Multiplication can be made faster by using multiple adders and adding
terms in a tree structure.


Array Multiplier

Suppose we want to multiply two n-bit numbers A and B, where

$A = \sum_{i=0}^{n-1} 2^i a_i, \qquad B = \sum_{j=0}^{n-1} 2^j b_j$

We can regard all bits of the partial products as an array, whose (i,j)th
element is $a_i \cdot b_j$. Notice that each element is just the AND of $a_i$ and $b_j$.
All elements of the array are available in parallel, within one gate delay of
arrival of A and B.
We can now use an array of full adders to produce the result. One input
of each adder is the sum from the previous row, the other is the AND of
appropriate $a_i$ and $b_j$.
This architecture is called an array multiplier.
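
The array of partial product bits and its row by row accumulation can be sketched in Python. This models only the arithmetic, not the individual full adders; the function name `array_multiply` and the list-of-rows layout are assumptions.

```python
def array_multiply(a, b, n):
    """Return (partial product bit array, product) for unsigned n-bit a and b."""
    # pp[j][i] is the AND of bit i of a with bit j of b, of weight 2^(i+j)
    pp = [[((a >> i) & 1) & ((b >> j) & 1) for i in range(n)]
          for j in range(n)]
    running = 0
    for j, row in enumerate(pp):
        row_value = sum(bit << i for i, bit in enumerate(row))
        running += row_value << j      # each row adds to the sum of previous rows
    return pp, running
```

Every entry of `pp` depends on just one bit of each operand, which is why all of them are available together.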


Array Multiplier

A 4X4 array multiplier is shown below.

[Figure: 4X4 array multiplier. The AND terms a3b0..a0b0 and a3b1..a0b1 feed
the first row of adders (FA, FA, FA, HA). Two further rows of adders take the
terms a3b2..a0b2 and a3b3..a0b3 along with the sum (s) and carry (c) outputs
of the row above.]

Half adders can be used at the right end.


Critical Path through Array Multiplier

The critical path through a 4X4 array multiplier is shown below.

[Figure: the 4X4 array multiplier of the previous slide, with the critical
path marked through its rows of adders.]

The critical path involves carry as well as sum outputs!


Speeding up Multipliers

The array multiplier has a regular layout with relatively short connections.
However, it is still rather slow.
How can we speed up a multiplier?
There are two possibilities:
Somehow reduce the number of partial products to be added. For
example, could we multiply 2 bits at a time rather than 1?
Since we have to add more than two terms at a time, use an adder
architecture which is optimized for this.



Speeding up Multipliers Booth Encoding

Booth Encoding

Booth Encoding reduces the number of partial products by multiplying 2 bits
at a time.
Let the multiplicand be A and the multiplier B.
Rather than multiplying A with successive bits of B, we can multiply it with
two bits of B at a time.
Depending on the two bits being 00, 01, 10 or 11, the partial product will
be 0, A, 2A or 3A.
0 and A can be produced trivially.
2A can be produced easily by a left shift of A.
Generating 3A presents a problem!


Booth Encoding

Multiplying the multiplicand A by 2 bits of the multiplier at a time requires the
generation of 0, A, 2A or 3A as partial products. Generating 0, A or 2A is
easy.
3A cannot be generated directly. However, 3A can be expressed as 4A -
A.
The task of adding 4A is passed on to the next group of 2 bits of the
multiplier.
Since the place value of the next group of 2 bits is 4 times the current
one, adding 4A to the product is equivalent to adding 1 to the next group
of 2 bits of the multiplier.
-A can be generated from A, using an adder/subtracter rather than an
adder for accumulating the sum of partial products.


Modified Booth Encoding

To simplify the logic for deciding whether an additional 4A should be added on
behalf of the less significant 2 bits in the multiplier, we express 2A also as 4A -
2A.
Since we anyway have an adder-subtracter, this requires no additional
resources.
The modified logic is:
For 00, do nothing.
For 01, add A.
For 10, subtract 2A, ask the next group to add 4A.
For 11, subtract A, ask the next group to add 4A.
Now the next group can just look at the more significant bit of the
previous group and add 1 to the multiplier if it is ‘1’.


Modified Booth Encoding

The partial product generator looks at the current 2 bits and the MSB of
the previous group of 2 bits to decide its action.
Thus, we scan the multiplier 3 bits at a time, with one bit overlapping.
For the first group of 2 bits, we assume a 0 to the right of it.
After handling the previous group, the multiplicand is shifted left by 2
positions. Thus, it has already been multiplied by 4.
Therefore, adding 4 A on behalf of the previous group is equivalent to
adding 1 to the multiplier corresponding to the current group.


Modified Booth Encoding

The following table summarizes the effective multiplier for generating the
partial product.

Current   Multiplier   Previous   Pending     Total
2-bits    for these    MSBit      Increment   Multiplier
00         0           0           0           0
01        +1           0           0          +1
10        -2           0           0          -2
11        -1           0           0          -1
00         0           1          +1          +1
01        +1           1          +1          +2
10        -2           1          +1          -1
11        -1           1          +1           0

Notice that a 111 in the 3 bit group being scanned requires no work at all.
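
The table can be used directly as a lookup when scanning the multiplier. The Python sketch below is an illustration under stated assumptions: the names `DIGIT`, `booth_digits` and `booth_multiply` are hypothetical, the multiplier is treated as an unsigned n-bit number with n even, and a pending increment left over from the most significant group is kept as one extra digit.

```python
# (current 2 bits, previous MSB) -> total multiplier, as in the table above
DIGIT = {
    (0b00, 0): 0, (0b01, 0): +1, (0b10, 0): -2, (0b11, 0): -1,
    (0b00, 1): +1, (0b01, 1): +2, (0b10, 1): -1, (0b11, 1): 0,
}


def booth_digits(multiplier, n):
    """Digits for successive 2-bit groups, least significant group first."""
    digits, prev_msb = [], 0          # a 0 is assumed to the right of the first group
    for k in range(0, n, 2):
        group = (multiplier >> k) & 0b11
        digits.append(DIGIT[(group, prev_msb)])
        prev_msb = group >> 1
    digits.append(prev_msb)           # pending increment from the top group, if any
    return digits


def booth_multiply(a, b, n):
    """Sum of partial products digit * A, each weighted by 4^k."""
    return sum(d * a * 4 ** k for k, d in enumerate(booth_digits(b, n)))
```

For the multiplier 01111110, which contains a string of six ‘1’s, the digits are -2, 0, 0, +2, so only two of the partial products are non-zero.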


Modified Booth Encoding

Curr.     Prev.   Multi-
2 bits    MSB     plier
00        0        0
00        1       +1
01        0       +1
01        1       +2
10        0       -2
10        1       -1
11        0       -1
11        1        0

What happens if there is a string of ‘1’s in the multiplier?
There will be a -1 in the beginning, because the group begins with 110.
Similarly, there will be a +1 at the end, because it will end with 011.
However, for the length of continuous ‘1’s, nothing needs to be done (add zeros).
Thus Booth encoding reduces the number of partial products to half
(multiplying 2 bits at a time).
It makes addition in columns of partial products fast because carry
propagation during addition will be reduced.



Speeding up Multipliers Adding Partial Products

Adding Partial Products

Multipliers can be speeded up by using special adder architectures which are
optimized for adding more than two numbers.
One option is to use tree adders rather than an accumulator.
Several additions proceed in parallel, since all partial products are
generated together.

[Figure: partial products PP1 ... PPn added one after another through a chain
of adders, compared with tree addition in which pairs of partial products are
added in parallel.]
Sequential addition: Time = (n-1) Tadd, adders required: 1.
Tree addition: Time = (log2 n) Tadd, adders required: n/2.


Carry Save Adders

Ordinary adders are large and complex. Also, these are slow due to
rippling of carry.
Let us consider an adder which presents its output not as one word - but
two. The actual result is the sum of these.
Obviously, an adder of this type is of no use for adding just two numbers!
But it can be useful in a multiplier where we are adding multiple terms.
For each bit column, the sum goes into one output word, while carry outs
go into the other (without being added to the next more significant
column).
Now there is no rippling of carry and the output is available in constant
time.
We need a conventional adder in the end to add these two words.
This type of adder is called a “Carry Save Adder” or CSA.
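
The idea of keeping sum and carry in separate words can be sketched at word level in Python. This sketch uses the common three input form of a carry save adder, which is an assumption made here for brevity; the names `carry_save_add` and `add_many` are hypothetical.

```python
def carry_save_add(x, y, z):
    """Add three words; return (sum word, carry word) without propagating carries."""
    sum_word = x ^ y ^ z                              # per-column sum bits
    carry_word = ((x & y) | (y & z) | (x & z)) << 1   # per-column carries, one weight up
    return sum_word, carry_word


def add_many(terms):
    """Accumulate terms in carry save form, then do one conventional addition."""
    s, c = 0, 0
    for t in terms:
        s, c = carry_save_add(s, c, t)
    return s + c      # the only step in which carries ripple
```

Each call to `carry_save_add` works on all columns independently, which corresponds to the constant time behaviour described above.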


Carry Save Adders

A Carry Save Adder (whose output is two words which must be added to
produce the result) is of no use for adding just two words!
However, we can construct a useful CSA for adding 4 bits in the same
column.
We make use of the fact that all partial product bits are available in
constant time after the application of inputs.
Since there are multiple bits to be added, we can feed three of them to a
full adder.
The sum and carry output of this adder is then available in constant time.


Carry Save Adders

The 4 input 2 output CSA uses two full adders as shown below:

[Figure: two full adders. The first FA takes inputs a, b and c. Its sum s goes
to the second FA, which also receives d and the cy1 input from the column to
the right. Outputs are cy1 (out), cy2 and sum.]

The first full adder uses 3 bits of partial products of the same weight (bits in
the same column).
These are available in parallel in constant time.
The sum output of the first FA goes to the second FA.
The carry output (cy1) of the first FA goes as intermediate input to the CSA
used in the column to the left of this one.


Carry Save Adders

The second FA accepts one additional bit from the partial product column, the
sum output of the first full adder (FA1) and the cy1 output coming from the
CSA to its right.
All inputs to this full adder are also available in constant time.
Notice that even though cy1 goes from one column to the next significant
column, it does not ripple all the way horizontally.
It goes to FA2 of the more significant column, whose output is not required by
the next column.


Tiling Carry Save Adders

The figure below shows how we can add 4 columns of 4 bits each.
Rows are labeled as a, b, c and d. Columns are 0, 1, 2 and 3.

[Figure: four CSA cells side by side, Column 3 to Column 0. Each cell takes
the inputs a, b, c and d of its column and contains two full adders; cy1 of
each cell goes to the second FA of the column to its left; each cell produces
outputs c and s.]

Outputs are collected in two separate registers (shown in dotted lines).
These must be added using a conventional adder.


Critical Path of Carry Save Adders

Critical path of a 4x4 Carry Save Adder is shown below.

[Figure: the four CSA cells of the previous slide (Column 3 to Column 0), with
the critical path marked.]

One can see that the critical path has been broken up.
Addition of 4 words of 32 bits each will also have a critical path of the same
length.

Speeding up Multipliers Wallace Multipliers

Wallace Multipliers

Multipliers do not have the same number of bits in every column.


In 1964, Wallace proposed a method for a carry save like reduction
scheme which is valid for columns of variable length.
Wallace multiplier uses adders which take multiple inputs of the same
weight and produce sum outputs of the same weight and carry outputs
with higher weights.
These are combined in stages to reduce the number of terms at each
weight to 2 or less.
These two terms are then added by a conventional adder to produce the
final result.


Wallace Multipliers

Wallace multipliers act in three stages:


1 Generate all bits of the partial products in parallel.
2 Collect all partial products bits with the same place value in bunches of
wires and reduce these in several layers of adders till each weight has no
more than two wires.
3 For all bit positions which have two wires, take one wire at corresponding
place values to form one number, and the other wire to form another
number.
Add these two numbers using a fast adder of appropriate size.


Reduction Stage of Wallace Multipliers

We assume that Full adders and Half adders will be used.


A full adder takes 3 inputs and produces one output of the same weight
(sum) and another of higher weight (carry). This is called a (3,2) adder. It
reduces the number of wires at its own weight by 2 and adds one wire at
the higher weight.
A half adder takes 2 inputs and produces one output of the same weight
(sum) and another of higher weight (carry). This is a (2,2) adder. It
reduces the number of wires at its own weight by 1 and adds one wire at
the higher weight.
The reduction algorithm is general and can be used with any adders of
type (n,m). For example, a carry save adder is of type (4,2).
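
As a small check of the wire bookkeeping, the two cells can be modelled on single bits. This is a sketch with illustrative function names; the sum output keeps the weight of the inputs and the carry output has twice that weight.

```python
def full_adder(a, b, c):
    """(3,2) adder: three wires of one weight in, sum (same weight) and carry out."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def half_adder(a, b):
    """(2,2) adder: two wires of one weight in, sum (same weight) and carry out."""
    return a ^ b, a & b
```

In both cases the weighted value is preserved: inputs add up to sum + 2 × carry.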


Reduction Stage of Wallace Multipliers

Each reduction stage looks at the number of wires for each weight and if
any weight has more than 2 wires, it adds a layer of adders.
When the numbers of wires for each weight have been reduced to 2 or
less, we form one number with one of the wires at corresponding place
values and another with the other wire (if present).
These two numbers are added using a fast adder of appropriate size to
generate the final product.


Reduction Stage of Wallace Multipliers

Partial products are grouped in multiples of three rows.


Rows which are additional over multiples of three are just passed on to
the next stage.
Wires are now reduced for each group of 3 rows.
For any group of 3 rows, if we find 3 wires for any weight, we place a Full
Adder, which generates 1 wire of the same weight and 1 wire with the
next higher weight.
If there are two wires left, we place a half adder to reduce these.
If only one wire is left, it is carried through to the next layer.


Reduction Stage of Wallace Multipliers

The reduction procedure is carried out for all weights, starting from the
least significant weights.
At the end of each layer, we count wires for each weight again, and if
none has more than 2 wires, we proceed to the final addition stage.
If any weight has 3 or more wires, we add another layer, and repeat this
procedure till the number of wires for all weights is reduced to 2 or less.
Now we compose one number from one of the left over wires at
corresponding weights and another from the remaining wires.
Finally, we use a conventional fast adder of appropriate size to add the
two numbers.
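
The procedure above can be sketched as a wire-counting simulation. This is a minimal sketch, assuming an n x n multiplier whose partial product row j occupies bits j to j+n-1; the function name `wallace_rows` and its return format are illustrative only, and no logic values are computed.

```python
def wallace_rows(n):
    """Row-grouped Wallace reduction of an n x n multiplier (wire counts only).

    Returns (full adders, half adders, layers, final wires per bit, LSB first).
    """
    rows = [set(range(j, j + n)) for j in range(n)]   # bits occupied by each row
    fa = ha = layers = 0

    def width(rs):
        return 1 + max((max(r) for r in rs if r), default=0)

    def counts(rs):
        return [sum(bit in r for r in rs) for bit in range(width(rs))]

    while max(counts(rows)) > 2:
        new_rows = []
        usable = len(rows) - len(rows) % 3            # rows that form full groups of 3
        for g in range(0, usable, 3):
            group = rows[g:g + 3]
            sums, carries = set(), set()
            for bit in range(width(group)):
                k = sum(bit in r for r in group)
                if k == 3:                            # full adder
                    fa += 1
                    sums.add(bit)
                    carries.add(bit + 1)
                elif k == 2:                          # half adder
                    ha += 1
                    sums.add(bit)
                    carries.add(bit + 1)
                elif k == 1:                          # single wire passed through
                    sums.add(bit)
            new_rows += [sums, carries]
        new_rows += rows[usable:]                     # extra rows go to the next layer
        rows = new_rows
        layers += 1
    return fa, ha, layers, counts(rows)
```

For n = 4 this reproduces the adder counts of the 4x4 example: 5 full adders and 3 half adders in 2 layers.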


Wallace Multiplier Example

Consider a multiplier for 4X4 bits.


                      a3    a2    a1    a0
                 x    b3    b2    b1    b0
                   a3.b0 a2.b0 a1.b0 a0.b0
             a3.b1 a2.b1 a1.b1 a0.b1
       a3.b2 a2.b2 a1.b2 a0.b2
 a3.b3 a2.b3 a1.b3 a0.b3

Partial product bits where the sum of indices is the same have the same place
value and need to be added to each other.


Wallace Multiplier Example

Partial products are generated in parallel and we have the following wires:
Bit Terms Wires
0 a0b0 1
1 a0b1, a1b0 2
2 a0b2, a1b1, a2b0 3
3 a0b3, a1b2, a2b1, a3b0 4
4 a1b3, a2b2, a3b1 3
5 a2b3, a3b2 2
6 a3b3 1
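
The wire counts in this table follow directly from the rule that a_i.b_j has place value i + j. A minimal sketch (the function name is illustrative):

```python
def partial_product_wires(m, n):
    """Number of partial product bits at each bit position, LSB first."""
    counts = [0] * (m + n - 1)
    for i in range(m):          # index of the multiplicand bit
        for j in range(n):      # index of the multiplier bit
            counts[i + j] += 1  # a_i.b_j has place value i + j
    return counts
```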


4X4 Wallace Multiplier: First Reduction

The multiplier has 4 rows of partial products, which are divided in groups of 3
and 1.
The bottom row is just passed on to the next stage.
For wires within the top 3 rows:
Bit 0 has a single wire: passed through.
Bit 1 has 2 wires: fed to a half adder.
Bits 2 and 3 have 3 wires: fed to full adders.
Bit 4 has 2 wires (in the group of 3): fed to a half adder.
Bit 5 has 1 wire: passed through.

[Figure: Layer 1 of the reduction tree, with a half adder at bit 1, full adders
at bits 2 and 3 and a half adder at bit 4.]


4X4 Wallace Multiplier: Outputs after First Reduction

After first reduction:
Bits 0, 1 and 6 have a single wire.
Bit 2 has 2 wires: carry of the half adder at bit 1 and the sum of full adder
at bit 2.
Bits 3 and 4 have 3 wires: carry of the full adder at lower weight, the sum
wire from their full/half adder and a passed through wire.
Bit 5 has 3 wires: carry of bit 4 plus 2 fed through wires.


4X4 Wallace Multiplier: Second Reduction

Since Bits 3, 4 and 5 have 3 wires each, we need another reduction layer.
This will be the last reduction layer.
Bits 0 and 1 have single wires. These are fed through.
Bit 2 has 2 wires: these are fed to a half adder.
Bits 3, 4 and 5 have 3 wires: these are fed to full adders.

[Figure: reduction tree with Layer 1 followed by a second layer having a half
adder at bit 2 and full adders at bits 3, 4 and 5.]


4X4 Wallace Multiplier: Outputs from Second Reduction

Bits 0, 1, and 2 have single wires which carry the final result.
Bit 3 has 2 wires: carry of the half adder at bit 2 and sum of the full adder
at bit 3.
Bit 4 has 2 wires: carry of the full adder at bit 3, and sum of the full adder
at bit 4.
Bit 5 has 2 wires, carry of bit 4 and sum of bit 5.
Bit 6 has 2 wires, carry of bit 5 and 1 fed through wire.


4X4 Wallace Multiplier: Final Addition

After the second layer, no bit has more than 2 wires. Single wires at bits 0, 1
and 2 are fed through to the output.
A fast conventional adder is used to add the 2 bits each at Bits 3, 4, 5 and 6.
Notice that we do not need a full width fast adder.
This is because the half adders at low weights have already rippled the carry
while the rest of weights were being reduced.
This makes the final adder smaller and faster.


Redundant MSB in large Wallace Multipliers

The reduction scheme as described above sometimes produces a redundant most
significant bit.
The result is still correct and the redundant bit will always be zero.
To see this effect, let us apply the above scheme to an 8x8 multiplier.

[Figure: dot diagrams of the successive reduction layers of the 8x8 multiplier,
with columns labelled by bit position 15 down to 0.]

As one can see, there are two bits at bit-14, which can produce a carry during
the final addition.
When this carry is added to the bit at bit-15, it could produce another carry
which will go to bit-16.
This would be an extra bit. (Bits 0 to 16 will be 17 bits).
In practice, this does not occur, as multiplying 8 bit operands will generate
at the most a 16 bit result.


Avoiding the Redundant MSB in Wallace Multipliers

One can avoid the redundant bit by modifying the reduction scheme.
We treat all wires in a column as equivalent.
(No groups of 3 rows).
Make bunches of 3 wires and send each to a full adder.
Now we can be left with 0, 1 or 2 wires.
There is nothing to do for 0 wires left.
If one wire is left, it is passed through to next layer.
If two wires are left, we have a more complex decision.
We need to define the capacity of a reduction layer to describe the policy
for reduction of 2 wires.


Wire capacity of a reduction layer

We define the capacity of a layer as the maximum number of wires it can
accommodate. How can we determine it?
We know that the final reduction layer should have no more than 2 wires.
Now we can work backwards from the final layer to the first.
Let dj represent the maximum number of wires for any weight in layer j,
where j = 1 for the final adder. (Thus d1 = 2).
The maximum number of wires which can be handled in layer j+1 (from
the end) is the integral part of (3/2)dj .


Wire capacity of a reduction layer

The maximum number of wires for any weight in layer j+1 (from the end)
is the integral part of (3/2)dj .
j = 1 for the final adder. Thus d1 = 2.
We go up in j, till we reach a number which is just greater than or equal to
the largest bunch of wires in any weight.
The number of reduction layers required is this jfinal − 1.
Capacities of layers starting from last layer and moving towards the top
are 2, 3, 4, 6, 9, 13, 19 . . . .
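
The recurrence can be sketched as follows. The function names are illustrative; the code simply applies d1 = 2 and d(j+1) = integral part of (3/2)dj as stated above.

```python
def layer_capacities(max_wires):
    """Capacities d1, d2, ... up to the first one that is >= max_wires."""
    d = [2]                         # d1 = 2 for the final adder
    while d[-1] < max_wires:
        d.append(3 * d[-1] // 2)    # integral part of (3/2) * dj
    return d


def reduction_layers(max_wires):
    """Number of reduction layers needed: jfinal - 1."""
    return len(layer_capacities(max_wires)) - 1
```

For a 4x4 multiplier (largest column of 4 wires) this gives 2 layers, and for an 8x8 multiplier (largest column of 8 wires) it gives 4 layers.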


Reduction of two wires

Now we can define the policy for reduction of 2 left over wires after deploying
the maximum number of full adders.
If all columns at the right have a single wire, we reduce the two wires
using a half adder. (This helps in reducing the width of final adder).
If there is a column to the right with more than one wire, we pass through
the two wires to the next layer if it can accommodate these. (That is, the
total number of wires does not exceed the capacity of that layer).
If passing through the two wires would exceed the capacity of next layer,
we reduce these with a half adder.
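
The policy above can be sketched as a wire-counting function for one layer. This is a sketch under two assumptions that are not spelled out in the slides: "all columns at the right have a single wire" is taken to mean that every lower bit has at most one input wire in this layer, and the name `wallace_layer` is made up for illustration.

```python
def wallace_layer(wires, next_capacity):
    """One layer of the modified Wallace reduction (wire counts only).

    wires[i] is the number of wires at bit i (LSB first); returns the counts
    seen by the next layer.
    """
    out = []
    carry_in = 0                           # carries arriving from the lower bit
    for i, w in enumerate(wires):
        fa, left = divmod(w, 3)            # maximum number of full adders
        ha = 0
        if left == 2:
            lower_single = all(x <= 1 for x in wires[:i])
            if lower_single or left + fa + carry_in > next_capacity:
                ha = 1                     # reduce the two left over wires
        passed = left - 2 * ha             # wires passed through unchanged
        out.append(passed + fa + ha + carry_in)
        carry_in = fa + ha                 # one carry per adder placed here
    return out


# 8x8 multiplier, bits 0..15, reduced with next-layer capacities 6, 4, 3, 2
wires = [1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1, 0]
for capacity in (6, 4, 3, 2):
    wires = wallace_layer(wires, capacity)
```

After the four layers no bit has more than 2 wires and bit 15 is still empty, in line with the 8x8 tables on the following slides.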


Wallace Reduction without redundant MSB

Max wires in this layer: 8, Capacity of next layer = 6.


Bit No.                15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
In Wires                0  1  2  3  4  5  6  7  8  7  6  5  4  3  2  1
FA                      0  0  0  1  1  1  2  2  2  2  2  1  1  1  0  0
Remaining               0  1  2  0  1  2  0  1  2  1  0  2  1  0  2  1
HA                      0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
PT                      0  1  2  0  1  2  0  1  2  1  0  2  1  0  0  1
Sums                    0  0  0  1  1  1  2  2  2  2  2  1  1  1  1  0
Carries to higher bits  0  0  0  1  1  1  2  2  2  2  2  1  1  1  1  0
Output Wires            0  1  3  2  3  5  4  5  6  5  3  4  3  2  1  1

[Figure: dot diagram of the wires after this layer.]


Wallace Reduction without redundant MSB

Second reduction layer: Capacity of next layer = 4.


Bit No.                15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
In Wires                0  1  3  2  3  5  4  5  6  5  3  4  3  2  1  1
FA                      0  0  1  0  1  1  1  1  2  1  1  1  1  0  0  0
Remaining               0  1  0  2  0  2  1  2  0  2  0  1  0  2  1  1
HA                      0  0  0  0  0  0  0  1  0  0  0  0  0  1  0  0
PT                      0  1  0  2  0  2  1  0  0  2  0  1  0  0  1  1
Sums                    0  0  1  0  1  1  1  2  2  1  1  1  1  1  0  0
Carries to higher bits  0  0  1  0  1  1  1  2  2  1  1  1  1  1  0  0
Output Wires            0  2  1  3  2  4  4  4  3  4  2  3  2  1  1  1

[Figure: dot diagram of the wires after the first two layers.]


Wallace Reduction without redundant MSB

Third reduction layer: Capacity of next layer = 3.


Bit No.                15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
In Wires                0  2  1  3  2  4  4  4  3  4  2  3  2  1  1  1
FA                      0  0  0  1  0  1  1  1  1  1  0  1  0  0  0  0
Remaining               0  2  1  0  2  1  1  1  0  1  2  0  2  1  1  1
HA                      0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
PT                      0  2  1  0  2  1  1  1  0  1  2  0  0  1  1  1
Sums                    0  0  0  1  0  1  1  1  1  1  0  1  1  0  0  0
Carries to higher bits  0  0  0  1  0  1  1  1  1  1  0  1  1  0  0  0
Output Wires            0  2  2  1  3  3  3  3  2  2  3  2  1  1  1  1

[Figure: dot diagram of the wires after the third layer.]


Wallace Reduction without redundant MSB


Final reduction layer: Capacity of next layer = 2.
Bit No.                15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
In Wires                0  2  2  1  3  3  3  3  2  2  3  2  1  1  1  1
FA                      0  0  0  0  1  1  1  1  0  0  1  0  0  0  0  0
Remaining               0  2  2  1  0  0  0  0  2  2  0  2  1  1  1  1
HA                      0  0  0  0  0  0  0  0  1  1  0  1  0  0  0  0
PT                      0  2  2  1  0  0  0  0  0  0  0  0  1  1  1  1
Sums                    0  0  0  0  1  1  1  1  1  1  1  1  0  0  0  0
Carries to higher bits  0  0  0  0  1  1  1  1  1  1  1  1  0  0  0  0
Output Wires            0  2  2  2  2  2  2  2  2  2  2  1  1  1  1  1

Thus we have reached two wires without generating a bit at b15. There are
two wires at b14 and if these produce a carry, it will go to b15 and there is no
redundant b16.

Wallace Reduction without redundant MSB

[Figure: dot diagram of all four reduction layers, bit positions 15 down to 0.]

We have reduced the number of wires at all bit positions to ≤ 2 without
generating a bit at b15.
There are two wires at b14 and if these produce a carry, it will go to b15 and
there is no redundant b16.

Speeding up Multipliers Dadda Multipliers

Dadda Multipliers

Dadda multipliers are very similar to Wallace multipliers and use the same 3
stages:
1 Generate all bits of the partial products in parallel.
2 Collect all partial products bits with the same place value in bunches of
wires and reduce these in several layers of adders till each weight has no
more than two wires.
3 For all bit positions which have two wires, take one wire at corresponding
place values to form one number, and the other wire to form another
number.
Add these two numbers using a fast adder of appropriate size.
The difference is in the reduction stage.


Dadda Multipliers

Wallace multipliers reduce as soon as possible, while Dadda multipliers
reduce as late as possible.
Dadda multipliers plan on reducing the final number of wires for any
weight to 2 with as few and as small adders as possible.
We determine the number of layers required first, beginning from the last
layer, where no more than 2 wires should be left.
The number of layers in Dadda multipliers is the same as in Wallace
multipliers.


Dadda Multipliers: Number of layers

We work back from the final adder to earlier layers till we find that we can
manage all wires generated by the partial product generator.
We know that the final adder can take no more than 2 wires for each
weight.
Let dj represent the maximum number of wires for any weight in layer j,
where j = 1 for the final adder. (Thus d1 = 2).
The maximum number of wires which can be handled in layer j+1 (from
the end) is the integral part of (3/2)dj .
We go up in j, till we reach a number which is just greater than or equal to
the largest bunch of wires in any weight.
The number of reduction layers required is this jfinal − 1.


Wire Reduction in Dadda Multipliers

At each layer we know the maximum number of wires which should be
left for the next layer.
For each weight, we place the least number of smallest adders, such
that the wires going out to the next layer do not exceed the maximum
number of wires it can handle.
At each weight, we must consider all the sum and pass through wires at
this weight, as well as the wires which will be transferred through carry of
the less significant weights, to the next layer.
That is why we must begin with the lowest weight and go towards higher
weights in each layer.
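
The rule above can be sketched as a wire-counting simulation of the whole Dadda reduction. This is a sketch for an n x n multiplier; the function name `dadda` and its return format are illustrative, and only the number of adders is tracked, not their wiring.

```python
def dadda(n):
    """Dadda reduction of an n x n multiplier (wire counts only).

    Returns (full adders, half adders, final wires per bit, LSB first).
    """
    wires = [min(k, 2 * n - 2 - k) + 1 for k in range(2 * n - 1)]
    caps = [2]                                   # d1 = 2, then integral part of 1.5 * dj
    while 3 * caps[-1] // 2 < max(wires):
        caps.append(3 * caps[-1] // 2)
    fa = ha = 0
    for cap in reversed(caps):                   # one reduction layer per capacity
        out, carry_in = [], 0
        for w in wires:                          # lowest weight first
            excess = max(0, w + carry_in - cap)  # wires that must be removed
            f, h = divmod(excess, 2)             # an FA removes 2 wires, an HA removes 1
            fa += f
            ha += h
            out.append(w + carry_in - excess)
            carry_in = f + h                     # carries passed to the next weight
        if carry_in:
            out.append(carry_in)
        wires = out
    return fa, ha, wires
```

For n = 4 this gives 3 full adders and 3 half adders, and for n = 8 it gives 35 full adders and 7 half adders, with every bit except bit 0 ending at 2 wires.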


Dadda Multiplier: Example

Take the example of 4-bit by 4-bit multiplication, multiplying a3a2a1a0 by
b3b2b1b0. As before, partial products are generated in parallel and we have
the following wires:

Weight  Terms                    Wires
1       a0b0                     1
2       a0b1, a1b0               2
4       a0b2, a1b1, a2b0         3
8       a0b3, a1b2, a2b1, a3b0   4
16      a1b3, a2b2, a3b1         3
32      a2b3, a3b2               2
64      a3b3                     1


4x4 Dadda Multiplier: Number of Reduction Layers

Maximum no. of wires for any weight in this example is 4.


d1 = 2, d2 = 3, d3 = 4. So we need 2 layers of reduction.
The first reduction layer should reduce the number of wires at any weight
to a maximum of 3.
The second layer will then reduce these to a maximum of 2 wires.
At each reduction layer, we scan from less significant weights to more
significant ones, keeping track of additional carry wires which will be
transferred at the output from lower weights to higher ones.


4x4 Dadda Multiplier: First Reduction Layer

Weight  Wires
1       1
2       2
4       3
8       4
16      3
32      2
64      1

Weights 1, 2 and 4 have 3 or fewer wires. These are passed through.
Weight 8 has 4 wires. No carry is anticipated from lower weights. A half adder
is used to reduce the output wires to 3. (Half Adder Sum + 2 wires passed
through).
Weight 16 has 3 wires, but we anticipate a carry from the adder at weight 8.
So we should reduce by 1 to keep the total output wires to 3. So this column is
also reduced using a half adder.


4X4 Dadda Multiplier: After First Reduction

Wt. 1 has the single wire which was fed through.
Wt. 2 has 2 fed through wires.
Wt. 4 has 3 wires: all passed through.
Wt. 8 has 3 wires: sum of the half adder at wt. 8, and 2 passed through.
Wt. 16 has 3 wires: carry of wt. 8, sum of half adder at 16 and 1 passed
through.
Wt. 32 has 3 wires: carry of wt. 16 and 2 passed through.
Wt. 64 has 1 fed through wire.

[Figure: Layer 1 of the Dadda tree with half adders at wt. 8 and wt. 16; the
wire counts for weights 1 to 64 change from 1, 2, 3, 4, 3, 2, 1 to
1, 2, 3, 3, 3, 3, 1.]


4X4 Dadda Multiplier: Second Reduction

In the second layer, we should leave no more than 2 wires at any weight, as
this is the last stage.
As before, we anticipate the number of carry wires transferred from the
lower weight when planning reduction using half or full adders.
In Dadda multipliers, we use minimum hardware during reduction. So the
smallest adder which will reduce the output wires to 2 will be used.
At the lowest weights, if the number of wires is less than or equal to 2, we
just pass these through.
So the single wire at Wt. 1, and the 2 wires at Wt. 2 are just fed through.


4X4 Dadda Multiplier: Second Reduction

3 wires at Wt. 4 are reduced to 2 by a half adder: No carry in expected.
Wt. 8 has 3 input wires. Carry will arrive from Wt. 4. Reduced using a full
adder.
Wt. 16 has 3 input wires. Carry will arrive from Wt. 8. Reduced using a full
adder.
Wt. 32 has 3 input wires. Carry will arrive from Wt. 16. Reduced using a full
adder.
Wt. 64 has 1 input wire. Carry will arrive from Wt. 32, making it 2 output
wires, which will be fed through.

[Figure: Layers 1 and 2 of the Dadda tree; layer 2 has a half adder at wt. 4
and full adders at wt. 8, 16 and 32.]


4X4 Multipliers: Final Addition

[Figure: reduction trees of the 4x4 Wallace multiplier and the 4x4 Dadda
multiplier side by side, each with Layer 1 and Layer 2; the Dadda tree feeds a
6 bit adder.]

Notice that the Dadda multiplier uses only 3 Full Adders and 3 Half Adders
during reduction, whereas the Wallace multiplier requires 5 Full Adders and 3
Half Adders.
We require a 6-bit final adder for the Dadda multiplier, whereas the Wallace
multiplier needs only a 4 bit final adder.


Dadda 8X8 Multiplier: Dot diagrams

[Figure: dot diagram of the 8x8 partial products (bit positions 15 down to 0),
followed by the reduction layers with capacities 6, 4, 3 and 2.]


Dadda 8X8 Multiplier: First reduction

Capacity of next layer is 6. Since bits 0-5 have six or fewer wires, these are
just passed through.
Bit 6 has 7 wires. To reduce to six, we place a half adder. (This gives a sum
wire at bit 6 and a carry wire at bit 7).
These outputs are shown by a dot each at bits 6 and 7, joined by a crossed
line.
The remaining 5 wires are passed through.

[Figure: dot diagram, bit positions 15 down to 0, layer capacities 6, 4, 3, 2.]


Dadda 8X8 Multiplier: First reduction

Bit 7 has 8 wires. Of the six output places, one is already occupied by the
carry of half adder at bit 6. So we should produce only 5 outputs at this bit:
a reduction by 3.
This can be done through a full and a half adder. Outputs of the full adder
are shown as a dot at bit 7 (sum) and another at bit 8 (carry) joined by a line.
Outputs of the half adder are also shown by two dots joined by a crossed line
as before.

[Figure: dot diagram, bit positions 15 down to 0, layer capacities 6, 4, 3, 2.]


Dadda 8X8 Multiplier: First reduction

Bit 8 has 7 wires. Of the 6 output places, 2 are occupied by carries of full
and half adder.
So we should produce only 4 outputs at this bit: again a reduction by 3.
This can be done through a full and a half adder as before.
Full and half adder take up 5 wires. The remaining 2 are passed through.

[Figure: dot diagram, bit positions 15 down to 0, layer capacities 6, 4, 3, 2.]


Dadda 8X8 Multiplier: First reduction

Bit 9 has 6 wires, which should be reduced to 4 (since two places are taken up
by carries of full and half adders).
This can be done by a full adder whose outputs are shown by dots at bit 9 and
10 joined by a line.
The remaining 3 wires are passed through.
Wires of all the higher bits can be passed through without exceeding the limit
of 6 outputs.

[Figure: dot diagram, bit positions 15 down to 0, layer capacities 6, 4, 3, 2.]


Dadda 8X8 Multiplier: Second reduction

The output capacity of next layer is 4.
Wires of bits 0-3 can just be passed through.
For each bit position, we reduce the output places available by the incoming
carries of the previous bit.
We place the minimal number of full and half adders to reduce the total output
wires to 4.
Each full adder (FA) reduces wires by 2, half adder (HA) reduces by 1.

[Figure: dot diagram, bit positions 15 down to 0, layer capacities 6, 4, 3, 2.]


Dadda 8X8 Multiplier: Second reduction

Bit 4 has 5 wires. Reduced to 4 by HA.
Bit 5 has 6 wires, reduced to (4-1) by FA+HA.
Bit 6 has 6 wires, reduced to (4-2) by 2 FAs.
This is repeated till bit 10.
Bit 11 has 4 wires, reduced to (4-2) by a FA.
All other wires can be passed through.

[Figure: dot diagram, bit positions 15 down to 0, layer capacities 6, 4, 3, 2.]


Dadda 8X8 Multiplier: Third reduction

The reduction procedure can be repeated at each layer.
If 2 wires (or multiples of 2) are to be reduced, we place FAs till 1 or 0
wires are left.
If 1 wire remains, we place a Half adder.
The previous layer (capacity 4) required a half adder at bit 4, FA + HA at bit
5, 2 FAs at bits 6-10 and a full adder at bit 11.
Applying the same rule with capacity 3, this layer requires a half adder at
bit 3 and a full adder at each of bits 4-12.
Rest of the wires are just passed through.

[Figure: dot diagram, bit positions 15 down to 0, layer capacities 6, 4, 3, 2.]


Dadda 8X8 Multiplier: Final reduction

Capacity of the final layer is 2.
We continue with the same procedure, placing a half adder at bit 2 and full
adders at bits 3-13.
The remaining wires are passed through. Now we can make two words of 14 bit
width and add these using a fast adder to get the final product.
Notice there is no extra bit!

[Figure: dot diagram, bit positions 15 down to 0, layer capacities 6, 4, 3, 2.]


Comparison of Wallace and Dadda Multipliers

Wallace and Dadda multipliers need the same number of layers.


Dadda multiplier minimizes the number of adders, so it has the potential
for lower complexity and power.
Dadda multiplier uses more pass throughs and smaller adders which
have lower delay. Thus it can also minimize the critical delay for reaching
the final 2 wire stage.
However, the final addition in Dadda multiplier needs wider carry
propagating adders, which can slow it down.
A careful evaluation inclusive of wiring and parasitic delays has to be
made to determine which is the faster multiplier for a given configuration and
process.

Multiply and Accumulate circuits

Multiply and Accumulate circuits

A common task during data processing is the evaluation of quantities like
Σ ci Xi .
This can be made easier if we have a dedicated hardware circuit which
can compute A × B + C. Here the size of the operand C is the same as
that of the product A × B.
This circuit is the multiply and accumulate or MAC circuit.
The MAC circuit is not much more complex compared to a multiplier. This
is because during multiplication we are anyway adding multiple bits in
every column. The accumulator just provides an additional wire at each
bit position.
This circuit is much faster than separate multiplication and addition
because the latter requires two steps of addition with rippling carry while
the MAC requires only one.
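
The claim that the accumulator only adds one wire per column can be illustrated with a bit-level model. This is a rough sketch, not the circuit from the slides: it greedily places full adders only (no half adders), the function name `mac` is made up, and the last line stands in for the single carry-propagating adder.

```python
from collections import defaultdict


def mac(a, b, c, n=8):
    """Compute a * b + c by column compression of single-bit wires."""
    cols = defaultdict(list)                 # bit position -> list of wire values
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))   # partial product bit
    for k in range(2 * n):
        cols[k].append((c >> k) & 1)         # one extra wire per bit from the accumulator
    while any(len(bits) > 2 for bits in cols.values()):
        nxt = defaultdict(list)
        for k, bits in cols.items():
            while len(bits) >= 3:            # full adder on three wires of weight k
                x, y, z = bits.pop(), bits.pop(), bits.pop()
                nxt[k].append(x ^ y ^ z)                          # sum, same weight
                nxt[k + 1].append((x & y) | (y & z) | (x & z))    # carry, next weight
            nxt[k].extend(bits)              # left over wires pass through
        cols = nxt
    # Final addition of the (at most) two wires left at each bit position.
    return sum(bit << k for k, bits in cols.items() for bit in bits)
```

Because every full adder preserves the weighted sum of its inputs, the result equals a × b + c for any 8 bit a, b and 16 bit c.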


8x8 MAC with 16 bit Accumulator

Consider for example a MAC circuit which multiplies two 8 bit operands and adds the
product to a 16 bit accumulator.
The number of wires from the partial products of the multiplier is:
Bit 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Wires 0 1 2 3 4 5 6 7 8 7 6 5 4 3 2 1
If we include a wire at each position from the accumulator, we get the wire count as:
Bit 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Wires 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2


8x8 MAC, Wallace reduction

We can now proceed to reduce these wires as a Wallace tree.


Stage 1: Max. wires:9, capacity of next stage = 6
Bit 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
In 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2
FA 0 0 1 1 1 2 2 2 3 2 2 2 1 1 1 0
HA 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
PT 1 2 0 1 2 0 1 0 0 2 1 0 2 1 0 0
Out 1 3 2 3 5 4 6 6 5 6 5 3 4 3 2 1


Wallace reduction: Stage 2

Stage 2: capacity of next stage = 4


Bit 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
In 1 3 2 3 5 4 6 6 5 6 5 3 4 3 2 1
FA 0 1 0 1 1 1 2 2 1 2 1 1 1 1 0 0
HA 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
PT 1 0 2 0 2 1 0 0 0 0 2 0 1 0 0 1
Out 2 1 3 2 4 4 4 4 4 3 4 2 3 2 1 1


Wallace reduction: Stage 3

Stage 3: capacity of next stage = 3


Bit 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
In 2 1 3 2 4 4 4 4 4 3 4 2 3 2 1 1
FA 0 0 1 0 1 1 1 1 1 1 1 0 1 0 0 0
HA 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
PT 2 1 0 2 1 1 1 1 1 0 1 2 0 0 1 1
Out 2 2 1 3 3 3 3 3 3 2 2 3 2 1 1 1

Wallace reduction: Stage 4

Stage 4: capacity of next stage = 2


Bit 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
In 2 2 1 3 3 3 3 3 3 2 2 3 2 1 1 1
FA 0 0 0 1 1 1 1 1 1 0 0 1 0 0 0 0
HA 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0
PT 2 2 1 0 0 0 0 0 0 0 0 0 0 1 1 1
Out 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
Now the maximum wires at any position is 2. We compose two words using one wire
each and add these using a fast conventional adder to get the final result.


Wallace Tree for MAC


[Figure: dot diagram of the Wallace tree for the MAC. The input wire counts for
bits 15 down to 0 are 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 (one wire per bit from the
accumulator, the rest from the partial product generator; max wires = 9).
Stages 2, 3, 4 and 5 have capacities 6, 4, 3 and 2.]


8x8 MAC

Complexity of the MAC circuit is not much higher than a Wallace tree 8x8
multiplier.
This is so for any wire reduction scheme. We could have used the Dadda
scheme and the complexity would not be much more than the plain 8x8
Dadda multiplier.
The result is produced much faster than separate multiplication and
addition.
This is because a traditional adder is used only once in the MAC.
Otherwise, the multiplier will use it once and then the addition to the
accumulator will again involve a traditional addition.

Serial Multipliers

Serial Multipliers

Often, we need multipliers which have very low complexity or very low power
consumption and speed is not very important.

Serial multipliers are a good option in such cases.

Low complexity multipliers can be bit serial or row serial.

Bit serial multipliers require m × n clocks for completing an m × n
multiplication.

Row serial multipliers require only n steps, but we require m full adders rather
than just one.

Serial Multipliers Bit Serial Multipliers

Bit Serial Multipliers: Partial Product Generation

Each bit of the multiplier needs to be ANDed with each bit of the multiplicand.
This requires that all multiplicand bits be presented one after the other, every
time a new bit from the multiplier is taken up.
[Figure: an m bit multiplicand shift register clocked at fclock and an n bit multiplier shift register clocked at fclock / m supply the x input of a full adder (FA) with inputs x, y, cin and outputs s, c.]
This can be managed by using a re-circulating shift register for the
multiplicand, which is clocked at a rate which is m times faster than the
clock to the multiplier shift register.
The inputs y and cin to the full adder have to be appropriately selected
and timed to generate the correct product.


Bit Serial Multipliers: Partial Product Generation

Consider a 4 × 4 bit serial multiplier.

The x input to the Full Adder appears in the following order:

ck  x       ck  x       ck  x       ck  x
 0  a0b0     4  a0b1     8  a0b2    12  a0b3
 1  a1b0     5  a1b1     9  a1b2    13  a1b3
 2  a2b0     6  a2b1    10  a2b2    14  a2b3
 3  a3b0     7  a3b1    11  a3b2    15  a3b3

[Figure: multiplicand shift register clocked at fclock and multiplier shift register clocked at fclock/4, supplying the x input of the full adder (inputs x, y, cin; outputs s, c).]
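
The order in which partial product bits reach the x input follows directly from the two clock rates: the multiplicand index advances every clock and wraps around, while the multiplier index advances once every m clocks. A minimal sketch (the function name `x_schedule` is hypothetical):

```python
def x_schedule(m=4, n=4):
    """Return (clock, i, j) triples: at clock ck the x input is a_i AND b_j."""
    # The multiplicand register recirculates every m clocks (i = ck mod m);
    # the multiplier register shifts once per m clocks (j = ck div m).
    return [(ck, ck % m, ck // m) for ck in range(m * n)]


for ck, i, j in x_schedule():
    print(f"{ck:2d}  a{i}b{j}")
```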


Bit Serial Multipliers: Partial Product addition

The arrival time of partial product bits is:


ck  x       ck  x       ck  x       ck  x
 0  a0b0     4  a0b1     8  a0b2    12  a0b3
 1  a1b0     5  a1b1     9  a1b2    13  a1b3
 2  a2b0     6  a2b1    10  a2b2    14  a2b3
 3  a3b0     7  a3b1    11  a3b2    15  a3b3

We need additions as follows:

                      a3    a2    a1    a0
                   ×  b3    b2    b1    b0
                      a3b0  a2b0  a1b0  a0b0
                a3b1  a2b1  a1b1  a0b1
          a3b2  a2b2  a1b2  a0b2
    a3b3  a2b3  a1b3  a0b3


Bit Serial Multipliers: Partial Product addition

Let us put the arrival time of terms in parentheses next to each term.
                                  a3        a2        a1        a0
                               ×  b3        b2        b1        b0
                                  a3b0(3)   a2b0(2)   a1b0(1)   a0b0(0)
                        a3b1(7)   a2b1(6)   a1b1(5)   a0b1(4)
              a3b2(11)  a2b2(10)  a1b2(9)   a0b2(8)
    a3b3(15)  a2b3(14)  a1b3(13)  a0b3(12)

In every column, each term arrives 3 clock cycles after the term above it, so the
partial sum has to wait for 3 clock cycles before the next term of that column arrives.
We can manage this by putting a 3 bit shift register at the sum output and presenting
the delayed output at the ‘y’ input of the full adder.
The carry output can be added immediately in the next clock, since it should go to the
next column to its left.
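
The 3 clock spacing can be checked by grouping the arrival times by column weight. In this sketch the term a_i b_j is assumed to arrive at clock m*j + i (as in the table of arrival times) and to belong to the column of weight i + j; the function name `column_arrivals` is hypothetical.

```python
from collections import defaultdict


def column_arrivals(m=4, n=4):
    """Map each column weight to the arrival clocks of its terms, in order."""
    cols = defaultdict(list)
    for j in range(n):          # multiplier bit (row)
        for i in range(m):      # multiplicand bit
            cols[i + j].append(m * j + i)
    return dict(cols)


for weight, clocks in sorted(column_arrivals().items()):
    print(weight, clocks)
```

For the 4 × 4 case every pair of consecutive arrivals in a column differs by m - 1 = 3 clocks, which is why a 3 bit delay on the sum output lines the terms up.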


Bit Serial Multipliers: Partial Product addition

A 3 clock delay for sum and a 1 clock delay for carry leads to the following
circuit.

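[Figure: full adder with inputs x, y, ci and outputs s, co. The carry output co returns to ci through a flip-flop (FF) with a Reset input; the sum output s returns to y through a 3 bit shift register.]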

Unfortunately, it does not work!


We have to take care of a few exceptions at row ends.


Bit Serial Multiplier: Exceptions

Let us look at all the exceptions in detail.

In the figure, sum and carry terms are indexed by the clock interval in which
these were generated.

[Figure, shown here as a table: inputs and outputs of the full adder for each partial product term]

               column 3    column 2    column 1    column 0
Row 0   x      a3b0        a2b0        a1b0        a0b0
        y      0           0           0           0
        ci     c2          c1          c0          0
        out    c3, s3      c2, s2      c1, s1      c0, s0
Row 1   x      a3b1        a2b1        a1b1        a0b1
        y      c3          s3          s2          s1
        ci     c6          c5          c4          0
        out    c7, s7      c6, s6      c5, s5      c4, s4
Row 2   x      a3b2        a2b2        a1b2        a0b2
        y      c7          s7          s6          s5
        ci     c10         c9          c8          0
        out    c11, s11    c10, s10    c9, s9      c8, s8
Row 3   x      a3b3        a2b3        a1b3        a0b3
        y      c11         s11         s10         s9
        ci     c14         c13         c12         0
        out    c15, s15    c14, s14    c13, s13    c12, s12

Product (most significant bit first): c15 s15 s14 s13 s12 s8 s4 s0

At clocks 0, 4, 8 and 12, carry input should be forced to 0.

At clocks 7, 11 and 15, the adder y input should receive carry terms (c3,
c7 and c11) instead of sum terms (s4, s8 and s12).
At these clocks, the sum terms should be taken out as result bits.


Bit Serial Multiplier: Exceptions

At clocks 0, 4, 8 and 12:

Carry input should be forced to 0.
The carry FF output (which is a 1 clock delayed version of cout) should be
inserted in the 3-bit shift register.
Thus c3 (which is always 0), c7 and c11 will emerge at clocks 7, 11 and 15
respectively.
The sum terms should be taken out as result bits.

[Figure: the addition array of the previous slide, repeated.]


Bit Serial Multiplier: Implementation

With exception handling at the end of rows, the serial multiplier will work.
[Figure: full adder with inputs x, y, ci and outputs s, co. The carry output co is delayed by a flip-flop (FF). A mux controlled by the Row End signal selects either s or the delayed carry as the input of the shift register feeding y.]
Carry input is forced to 0 at row ends.
The mux normally inserts the sum into the shift register. However, at row ends,
it inserts the delayed carry output.
The sum terms at row ends can be taken out as the low bits of the
product.
One can add another shift register at the output to collect these.
The 2 more significant bits of the shift register and the last sum and carry
provide the high bits of the product at the end.
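
The behaviour described above can be checked with a small cycle-by-cycle model. This is a sketch under stated assumptions, not the circuit from the slides: the Row End signal is taken to be active at clocks 0, 4, 8 and 12 (where the carry input is forced to 0), the delay line holds m - 1 bits, and the names `bit_serial_multiply`, `delay_line` and `carry_ff` are hypothetical.

```python
def bit_serial_multiply(a, b, m=4, n=4):
    """Multiply an m bit value a by an n bit value b with one full adder."""
    delay_line = [0] * (m - 1)   # sum shift register; last entry is the oldest
    carry_ff = 0                 # carry output of the previous clock
    low_bits = []
    for ck in range(m * n):
        i, j = ck % m, ck // m
        x = ((a >> i) & 1) & ((b >> j) & 1)   # partial product bit a_i b_j
        row_end = (i == 0)                    # clocks 0, m, 2m, ...
        ci = 0 if row_end else carry_ff       # carry input forced to 0
        y = delay_line[-1]                    # entry inserted m - 1 clocks ago
        total = x + y + ci
        s, co = total & 1, total >> 1
        if row_end:
            low_bits.append(s)                # sum leaves as a product bit
            inserted = carry_ff               # mux selects the delayed carry
        else:
            inserted = s                      # mux selects the sum
        delay_line = [inserted] + delay_line[:-1]
        carry_ff = co
    # High bits: the sums still in the delay line, then the last carry.
    bits = low_bits + delay_line[::-1] + [carry_ff]
    return sum(bit << pos for pos, bit in enumerate(bits))


print(bit_serial_multiply(13, 11))  # 143
```

In this model the product bits come out in the order s0, s4, s8, s12, s13, s14, s15, c15 (least significant first), which matches the row of result bits in the exceptions figure.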


Row Serial Multipliers


We need not reduce the complexity all the way down to a single adder for
serial multipliers.
We could have a row of n adders performing additions in parallel.
Taking the example of 4 × 4 multiplication, we are trying to perform the
following operations:
                      a3    a2    a1    a0
                   ×  b3    b2    b1    b0
                      a3b0  a2b0  a1b0  a0b0
                a3b1  a2b1  a1b1  a0b1
          a3b2  a2b2  a1b2  a0b2
    a3b3  a2b3  a1b3  a0b3

We would like to perform 4 additions per clock interval.


Row Serial Multipliers

We can use n full adders arranged in n columns.


The carry of the previous addition should remain in the same column.
The sum from the previous addition in the left column is brought to this
column by a shift operation.
This sum has the same weight as the carry generated during the
previous clock in this column.
These two are added to the partial product bit for this column.


Row Serial Multipliers

The addition process using 4 adders is represented in the following figure.


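Multiplication to be performed:

                      a3    a2    a1    a0
                   ×  b3    b2    b1    b0
                      a3b0  a2b0  a1b0  a0b0
                a3b1  a2b1  a1b1  a0b1
          a3b2  a2b2  a1b2  a0b2
    a3b3  a2b3  a1b3  a0b3

Addition process (one group of rows per clock interval, one column per adder):

First clock interval
  del_cy        0         0         0         0
  Shift_sum     0         0         0         0
  PP_term       a3b0      a2b0      a1b0      a0b0
  carry, sum    c3 s3     c2 s2     c1 s1     c0 s0
Second clock interval
  del_cy        c3        c2        c1        c0
  Shift_sum     0         s3        s2        s1        (s0 shifted out)
  PP_term       a3b1      a2b1      a1b1      a0b1
  carry, sum    c7 s7     c6 s6     c5 s5     c4 s4
Third clock interval
  del_cy        c7        c6        c5        c4
  Shift_sum     0         s7        s6        s5        (s4 shifted out)
  PP_term       a3b2      a2b2      a1b2      a0b2
  carry, sum    c11 s11   c10 s10   c9 s9     c8 s8
Fourth clock interval
  del_cy        c11       c10       c9        c8
  Shift_sum     0         s11       s10       s9        (s8 shifted out)
  PP_term       a3b3      a2b3      a1b3      a0b3
  carry, sum    c15 s15   c14 s14   c13 s13   c12 s12
Next clock interval
  del_cy        c15       c14       c13       c12
  Shift_sum     0         s15       s14       s13       (s12 shifted out)
  PP_term       0         0         0         0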


Row Serial Multipliers

[Figure: the addition process diagram of the previous slide, repeated.]

Notice that the same ‘a’ term is used in a given adder.
The ‘b’ term has to be shifted right every time to generate the right partial
product bit.
Sums have to be shifted right to be added to the carry of the previous
addition in the same column.
4 additional clock cycles will be required to ripple the carry in the last addition.
During these, the partial product bits will be 0.


Row Serial Multipliers

This scheme can be implemented as follows:

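[Figure: a shift register holding b3 b2 b1 b0, with 0 shifted in, supplies the multiplier bit that is combined with a3, a2, a1 and a0 at full adders FA3 to FA0. Each sum output s passes through a flip-flop (FF) towards the adder on its right, with 0 entering at FA3. Each carry output c returns to its own adder through a flip-flop.]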
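
A cycle-by-cycle model of this arrangement is sketched below. It assumes one full adder per multiplicand bit, a sum flip-flop that passes each sum to the adder on its right, a carry flip-flop per adder, and m extra clocks with zero partial products to flush the carries. The function name `row_serial_multiply` and the variable names are hypothetical.

```python
def row_serial_multiply(a, b, m=4, n=4):
    """Multiply an m bit value a by an n bit value b with a row of m adders."""
    del_cy = [0] * m     # carry flip-flops, one per adder
    sums = [0] * m       # sum flip-flops, one per adder
    product_bits = []
    for clk in range(n + m):              # n rows, then m flushing clocks
        b_bit = (b >> clk) & 1 if clk < n else 0
        new_sums, new_cy = [0] * m, [0] * m
        for i in range(m):
            pp_term = ((a >> i) & 1) & b_bit
            # Sum of the adder on the left, shifted right by one column.
            shift_sum = sums[i + 1] if i + 1 < m else 0
            total = pp_term + shift_sum + del_cy[i]
            new_sums[i], new_cy[i] = total & 1, total >> 1
        sums, del_cy = new_sums, new_cy
        product_bits.append(sums[0])      # rightmost sum is a product bit
    return sum(bit << pos for pos, bit in enumerate(product_bits))


print(row_serial_multiply(13, 11))  # 143
```

Each clock delivers one product bit from the rightmost adder, so a 4 × 4 multiplication completes in 8 clocks in this model, compared with 16 clocks for the bit serial version.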
