Arithmetic
Dinesh Sharma
EE Department
IIT Bombay, Mumbai
Adders
1 Half and Full Adders
2 Ripple Carry Adder
3 Carry Look Ahead
    Manchester Carry Chain
4 Carry Bypass Adder
5 Carry Select Adder
    Stacking Carry Select Adders
6 Tree Adders
    Brent Kung Adder
    Tutorial: 32 bit Brent Kung Logarithmic Adder
7 Serial Adders
Half Adder
Full Adder
[Figure: ripple connection of adders with inputs A2 B2, A1 B1, A0 B0 and sum outputs S2, S1, S0.]
Cout = A · B + Cin · (A + B)
     = (A + B) · (Cin + A · B)
     = A · Cin + B · Cin + A · B

C̄out · (A + B + Cin) = Ā · B̄ · Cin + Ā · B · C̄in + A · B̄ · C̄in
CMOS Implementation
[Figure: static CMOS implementation of the full adder: a Cout stage and a Sum stage, each built between VDD and Gnd from the inputs A, B and Cin.]
Complementation Property
Both Sum and Carry show an interesting symmetry: complementing all three inputs complements the output.
Thus Sum(Ā, B̄, C̄in) is the complement of Sum(A, B, Cin).
This shows that the same hardware that produces Sum from A, B and Cin will produce the complement of Sum if the inputs are changed to Ā, B̄ and C̄in.
Complementation Property
Cout = A · B + Cin · (A + B) shows the same symmetry: Cout(Ā, B̄, C̄in) is the complement of Cout(A, B, Cin).
So the same hardware which produces Cout from A, B and Cin will produce C̄out from Ā, B̄ and C̄in.
[Figure: mirror-gate CMOS full adder: the carry stage and the sum stage share the inputs A, B and Cin.]
Cout = A · B + Cin · (A + B)
Sum = C̄out · (A + B + Cin) + A · B · Cin
These are called mirror gates because the n and p transistor networks have the same series-parallel structure.
This is highly unusual: in ordinary static CMOS gates the pull-up network is the dual of the pull-down network.
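As a quick sanity check (a minimal Python sketch, not part of the original notes), the mirror-gate expressions for Cout and Sum can be verified against the full adder truth table, together with the complementation property:

```python
from itertools import product

def cout(a, b, cin):
    # Mirror-gate carry: Cout = A·B + Cin·(A + B)
    return (a & b) | (cin & (a | b))

def total(a, b, cin):
    # Sum = C̄out·(A + B + Cin) + A·B·Cin
    return ((1 - cout(a, b, cin)) & (a | b | cin)) | (a & b & cin)

for a, b, cin in product((0, 1), repeat=3):
    assert total(a, b, cin) == (a + b + cin) & 1
    assert cout(a, b, cin) == (a + b + cin) >> 1
    # Complementation property: complementing all inputs complements both outputs
    assert total(1 - a, 1 - b, 1 - cin) == 1 - total(a, b, cin)
    assert cout(1 - a, 1 - b, 1 - cin) == 1 - cout(a, b, cin)
print("mirror-gate Sum/Cout equations verified")
```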
The worst case delay of the ripple carry adder is linear in the number of bits to be added.
To reduce the delay per stage, we can eliminate the inverter from the
carry output.
All even bit adders accept A, B and Cin as inputs. The mirror gate without the output inverter then gives C̄out as the output.
All odd bit adders accept Ā, B̄ and C̄in as inputs and thus produce Cout as output.
Outputs of all bits are now compatible with inputs of the next stage.
Static implementation of look ahead carry is not really fast if we try to look ahead by a large number of bits.
[Figure: dynamic Manchester carry cell between VDD and Gnd: the output node is precharged high while Ck is low; during evaluation it is pulled low if G = 1, or through the P pass transistor if the preceding carry node has been pulled low.]
In all other cases, the output will remain high. Thus this circuit implements the required logic.
This circuit can be concatenated for all bits and since P and G are ready
before Cin arrives, the carry quickly ripples through from bit to bit.
Notice that the nMOS logic of the cell can be interpreted as P · Cin + G.
As in the static case, there is a limit to the number of bits which can be so
connected.
If P = 1 for many successive bits, the discharge path is through the series connected pass transistors of all these gates. The discharge time for this critical path has an n² dependence.
[Figure: four Manchester cells concatenated into a carry chain, with propagate inputs P0..P3, generate inputs G0..G3, carry input Cin0, carry outputs Cout0..Cout3 and a common precharge clock Ck.]
If G = 1 for any bit, the output is brought to '0'. (Recall that it is the complement of the carry which propagates along the chain, not the carry itself.)
The carry arrival time for all subsequent bits is then set by the last bit where P = 0.
The worst case for delay occurs when P = 1 for all bits. In this case, all load capacitors are shorted together, so the load capacitance is ∝ n.
The discharge of these capacitors is through n series connected pass transistors, so the average R is ∝ n.
Thus in the worst case, the delay ∝ RC ∝ n².
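Behaviourally, the chain evaluates the usual carry recurrence from the P and G signals; a small Python model (illustrative only, it ignores the precharge/evaluate timing and the complemented polarity of the dynamic nodes):

```python
def manchester_carries(P, G, cin):
    """Behavioural model of the carry chain: carry[i+1] = G[i] + P[i]·carry[i].

    In the dynamic circuit the nodes actually hold the complemented carry,
    precharged high and conditionally discharged; the logic value computed
    is the same.
    """
    carries = [cin]
    for p, g in zip(P, G):
        carries.append(g | (p & carries[-1]))
    return carries   # carries[i] is the carry into bit i; carries[n] is Cout

# Worst case: P = 1 for all bits -- the input carry must traverse every stage,
# which is what makes the n-squared RC delay appear in the real circuit.
print(manchester_carries([1, 1, 1, 1], [0, 0, 0, 0], 1))   # [1, 1, 1, 1, 1]
```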
Carry Bypass Adder
The worst case for addition occurs when P = 1 for all bits and carry has to
ripple through all bits.
In carry bypass adder, we form groups of bits and if P = 1 for all members
of a group, we pass on the carry input to this group directly to the input of
the next group, without having to ripple through each bit.
This improves the worst case delay of the adder.
bypass = P0 · P1 · P2 · P3
[Figure: a 4-bit carry bypass group built on the Manchester chain (P0..P3, G0..G3, Cin0, Cout0..Cout3, Ck), with the bypass signal steering Cin0 directly to the group's carry output when all four propagate signals are true.]
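A behavioural sketch of one 4 bit bypass group (a Python illustration, assuming the group simply forwards Cin when all four propagate signals are true):

```python
def bypass_group(a_bits, b_bits, cin):
    """One carry-bypass group: if every bit propagates, Cin skips the ripple chain."""
    P = [a ^ b for a, b in zip(a_bits, b_bits)]
    G = [a & b for a, b in zip(a_bits, b_bits)]
    if all(P):                       # bypass = P0·P1·P2·P3
        cout = cin                   # carry skips straight to the next group
    else:
        c = cin
        for p, g in zip(P, G):       # otherwise ripple within the group
            c = g | (p & c)
        cout = c
    sums, c = [], cin
    for p, g in zip(P, G):           # the sum bits still need the internal carries
        sums.append(p ^ c)
        c = g | (p & c)
    return sums, cout

print(bypass_group([1, 1, 1, 1], [0, 0, 0, 0], 1))   # -> ([0, 0, 0, 0], 1)
```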
One can make a fast adder at the cost of some added complexity, by
implementing two adders, one assuming that Cin = 0 and the other
assuming that Cin = 1.
When the actual carry input arrives at this bit, it chooses the correct one
using a multiplexer, depending on its value.
Since Cout = G + P · Cin , the two cases are:
For Cin = 0, Cout = G = A · B
For Cin = 1, Cout = G + P = A · B + A ⊕ B = A + B
Thus the two candidates for Cout are quite easy to generate, being just the
AND/OR of A and B.
This concept can be extended to multi-bit carry select adders.
[Figure: an m-bit carry select stage: a G, P, K generation block feeds two m-bit sub-adders; a multiplexer driven by the actual Cin picks the carry output, which becomes valid at (m+2) unit delay times.]
The two m-bit sub-adders assume the carry in to be 0 and 1 respectively.
The two alternatives for the carry output are ready at (m+1) units of time.
If the actual Cin is available at n units of time, the output will be available
at (m+2) or (n+1), whichever is later.
In case of 4 bit adders, this is at 6 units of time or at Cin arrival + 1,
whichever is later.
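A minimal Python sketch of one carry select stage, assuming little-endian bit lists (this is only a behavioural model of the duplicate-and-select idea, not of the gate-level timing):

```python
def add_bits(a_bits, b_bits, cin):
    """Plain ripple addition over little-endian bit lists; returns (sum_bits, cout)."""
    sums, c = [], cin
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ c)
        c = (a & b) | (c & (a | b))
    return sums, c

def carry_select_stage(a_bits, b_bits, actual_cin):
    """Compute both alternatives in advance, then let the real carry pick one."""
    sum0, cout0 = add_bits(a_bits, b_bits, 0)   # assumes Cin = 0
    sum1, cout1 = add_bits(a_bits, b_bits, 1)   # assumes Cin = 1
    # The multiplexer step: selection costs one extra unit delay once Cin arrives.
    return (sum1, cout1) if actual_cin else (sum0, cout0)

print(carry_select_stage([1, 0, 1, 1], [1, 1, 0, 0], 1))   # -> ([1, 0, 0, 0], 1): 13 + 3 + 1 = 17
```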
The first stage of stacked Carry Select adders is different from the rest.
In this case, we do not have to wait for Cin to arrive – it is already known.
Therefore we do not have to use redundant adders – a single m bit adder
will do.
Since no multiplexing is required, the output of the first stage is ready at
(m + 1) units of time, rather than at (m + 2).
This is convenient – because the two alternatives of the second stage are
also ready at (m + 1) units of time.
Linear Stacking
Square-root Stacking
Can we speed up the adder if we don't use the same number of bits in every stage?
In linear stacking, since all adders are identical, they are ready with their
alternative outputs at the same time.
But the carry arrives later and later at each successive group of carry
select adders.
We could have used this extra time to add up more bits in the later
stages, and still be ready with the alternative results before carry arrives!
Since the carry arrives one unit of time later at each successive group,
each successive group could be longer by one bit.
Square-root Stacking
n = m0 + [ m0 + (m0 + 1) + (m0 + 2) + · · · + (m0 + s − 1) ] = m0 + s · (2·m0 + s − 1) / 2

where s is the number of stages following the first one, which is without carry select.
The total delay will be m0 + 1 for the first stage. Each subsequent stage
takes just 1 unit of time since the candidates for selection are available
just in time.
The time taken is just m0 + s + 1 units. When s ≫ m0, we have n ≈ s²/2, while the time taken is nearly s.
Thus the time taken to add n bits is ≈ √(2n).
Our sum will be ready at 11 units of time, which is faster. This gain will be much higher for wider additions.
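A small sketch of the stage sizing and the unit-delay estimate (illustrative Python; the exact numbers depend on how the first and last groups are sized):

```python
def sqrt_stacking(n, m0):
    """Group sizes for square-root stacking: a first group of m0 bits (no carry
    select), then carry select groups of m0, m0+1, m0+2, ... bits until n bits
    are covered.  Returns the group sizes and the unit-delay estimate."""
    sizes, next_size = [m0], m0
    while sum(sizes) < n:
        sizes.append(next_size)
        next_size += 1
    s = len(sizes) - 1            # number of carry select stages
    delay = m0 + 1 + s            # first group ready at m0 + 1, then 1 unit per later stage
    return sizes, delay

print(sqrt_stacking(32, 2))       # e.g. ([2, 2, 3, 4, 5, 6, 7, 8], 10)
```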
Tree Adders
Terminology
Once the highest order P and G values have been generated, the final
carry can be computed in one step from the input carry.
The final result contains all the sum bits and the final carry. So it may
appear that we do not need the intermediate carries at each bit.
However, the sum bits depend on internal carries. The sum bits are given
by:
Si = Ai ⊕ Bi ⊕ Ci = Pi ⊕ Ci
Thus we do need the internal bit-wise carries for sum generation.
The group size over which the carry can be computed directly multiplies
by two each time we use a higher order for G and P values.
On the other hand, the time to compute the required higher order G and
P values increments by one gate delay.
(time to compute A + B · C for G and A · B for P).
This results in the time needed to generate all the P and G values being logarithmic in the number of bits being added.
Logarithmic Adders
Using P and G values of different orders, we can compute the bit wise
carry and sum values.
Notice that in logarithmic adders, internal bit-wise sum and carry values
may be available after the final carry.
Thus the critical path is not the generation of the final carry, but that of
bit-wise sums.
Different architectures have been described in literature for the order of
computation of G, P, Cout and Sum bits.
All of these compute the final result in times which are logarithmic
functions of the number of bits.
For wide adders, these can be much faster than other architectures.
The figure below shows the generation of P and G values for an 8 bit
adder.
[Figure: generation of P and G values of successively higher order for an 8 bit adder, starting from the inputs a7, b7 ... a0, b0.]

In the next step, we use second order P, G values to generate P³(4i+3:4i), G³(4i+3:4i) with i = 0, 1:

G³(7:4) = G²(7:6) + P²(7:6) · G²(5:4),    P³(7:4) = P²(7:6) · P²(5:4)
G³(3:0) = G²(3:2) + P²(3:2) · G²(1:0),    P³(3:0) = P²(3:2) · P²(1:0)

Finally, using G³(4i+3:4i) and P³(4i+3:4i) (with i = 0, 1), we can compute P⁴(7:0), G⁴(7:0):

G⁴(7:0) = G³(7:4) + P³(7:4) · G³(3:0)
P⁴(7:0) = P³(7:4) · P³(3:0)
Once P and G terms of various orders are known, we can compute the values
of carry outputs which depend on these and the input carry C0 , which is
available at t = 0.
C1 = G¹(0) + P¹(0) · C0,    C2 = G²(1:0) + P²(1:0) · C0
C4 = G³(3:0) + P³(3:0) · C0,    C8 = G⁴(7:0) + P⁴(7:0) · C0
When these carry values are valid, the other carry values which depend on
these can be generated.
C7 = G¹(6) + P¹(6) · C6
With all carry values generated, the corresponding sum values can be calculated using the relation Sumᵢ = P¹(i) ⊕ Cᵢ.
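A compact Python illustration of the doubling scheme (hypothetical helper names): pairs of (G, P) spans are combined with G = Gu + Pu · Gl and P = Pu · Pl until the whole word is covered, and the carry out then follows from C = G + P · C0.

```python
def combine(upper, lower):
    """(G, P) of a concatenated span from its halves: G = Gu + Pu·Gl, P = Pu·Pl."""
    gu, pu = upper
    gl, pl = lower
    return gu | (pu & gl), pu & pl

def word_carry_out(a, b, c0, n=8):
    """Carry out of an n-bit addition via log2(n) levels of span combination.
    n must be a power of two for this simple pairwise folding."""
    bit = lambda x, i: (x >> i) & 1
    spans = [(bit(a, i) & bit(b, i), bit(a, i) ^ bit(b, i)) for i in range(n)]
    while len(spans) > 1:               # each pass doubles the span width
        spans = [combine(spans[i + 1], spans[i]) for i in range(0, len(spans), 2)]
    G, P = spans[0]                     # G and P of the whole word
    return G | (P & c0)                 # C_n = G(n-1:0) + P(n-1:0) · C0

a, b, c0 = 0xB7, 0x6A, 1
assert word_carry_out(a, b, c0) == ((a + b + c0) >> 8) & 1
print(word_carry_out(a, b, c0))
```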
Notice that G values are computed by the same logic relation as carry
outputs.
The input carry C0 is known at the start itself.
Whenever the carry into the lower range is already known, we can replace the G value of the lower range by this carry.
The computed value of G(u:l) will then be the carry output itself, rather than the G value. This value can be used for further G calculations and will directly give the carry each time.
This can reduce the computation required to generate the carry and sum
values since some of the carry values are already available.
We use a unit time model in which we assume that the logic functions AND, XOR, A + B·C as well as A·B + C·(A + B) all take the same amount of time, which defines 1 slot of time for this tutorial.
The single bit G and P values (designated as order 0) are given by
P⁰(i) = aᵢ ⊕ bᵢ,    G⁰(i) = aᵢ · bᵢ
An exception is made for the least significant bit of G because for this bit,
the input carry is known at the start.
We make use of this and effectively compute the carry output of bit 0 (c1), mapping it as if it were due to a generate signal at this position. Thus,
G⁰(0) = c1 = a0 · b0 + c0 · (a0 + b0)
All these functions can be computed in one unit of time directly from ai , bi
and input carry c0 . So these are all ready at the end of the first time slot.
Since c1 = G⁰(0), c1 is also ready at the end of the first slot.
We can define G and P functions which operate over multiple bits. Higher
order G and P values are computed as
G = Gu + Pu · Gl , P = Pu · Pl
where u and l stand for upper half range and lower half range for a range
of bit indices.
These can be computed within one time slot from the next lower order G
and P values. Thus higher orders of G and P values, (successively
covering twice the range of indices for the previous order) will be
available in each time slot.
Internal carries are computed using functions like C = G + P · Cin .
Depending on the order of G and P values, we can compute carry values
whose indices are 1, 2, 4, 8 . . . bits higher than the input carry. This
computation also takes one time slot, but can be performed only after the
needed Cin , P and G values are available.
G and P values for single bits are available at the end of first slot.
G and P values spanning groups of 2 bits are available at the end of
second slot. G and P values spanning groups of 4 bits are available at
the end of third slot. G and P values spanning groups of 8 bits are
available at the end of fourth slot. G and P values spanning groups of 16
bits are available at the end of fifth slot.
Finally, G and P values spanning the full word of 32 bits are available at
the end of sixth slot.
G and P values are available over spans of 2ⁿ bits. The start bit for these spans has a granularity of 2ⁿ bits. For example, second order values connect 0 → 4, 4 → 8 etc. We cannot connect from 1 → 5 using these in a Brent Kung adder.
The lowest index G value for any order i is automatically the carry value for bit index 2ⁱ.
At time = 5, all 16 bit P and G values (P⁴ and G⁴) have been computed. c16 = G⁴(15:0) is also available.
c7 ← c6 using G⁰(6), P⁰(6) and c6;  c9 ← c8 using G⁰(8), P⁰(8) and c8;
c10 ← c8 using G¹(9:8), P¹(9:8) and c8;
c12 ← c8 using G²(11:8), P²(11:8) and c8 are all available.
At time = 6, G⁵(31:0) is generated. This is the value of c32 = Cout. P⁵(31:0) is not required.
c11 ← c10 using G⁰(10), P⁰(10) and c10;  c13 ← c12 using G⁰(12), P⁰(12) and c12;
c14 ← c12 using G¹(13:12), P¹(13:12) and c12;
c17 ← c16 using G⁰(16), P⁰(16) and c16;
c18 ← c16 using G¹(17:16), P¹(17:16) and c16;
c20 ← c16 using G²(19:16), P²(19:16) and c16; and
c24 ← c16 using G³(23:16), P³(23:16) and c16 have all been computed.
[Figure: scheduling chart for the 32 bit Brent Kung adder: time slots 1 to 9 run down the side (slot 1 produces G0, P0; slot 2 G1, P1; ...; slot 6 G5), and the chart marks the slot in which the carry input to each bit (0 to 31) becomes available.]
P⁰(i) = aᵢ ⊕ bᵢ,    G⁰(i) = aᵢ · bᵢ
† G⁰(0) is generated as a0 · b0 + c0 · (a0 + b0)
c1 = G⁰(0) = 1
In the second slot, we generate P and G values spanning two bits each. From now on,
Pᵐ⁺¹(range) = Pᵐ(u) · Pᵐ(l),    Gᵐ⁺¹(range) = Gᵐ(u) + Pᵐ(u) · Gᵐ(l)
where u represents the upper half of the range and l represents the lower half of the range.
P¹  10 01 10 11 11 00 00 11
G¹  01 00 01 00 00 10 00 01
In the third slot, we generate P and G values spanning four bits each:
P²  0 0 0 1 1 0 0 1
G²  1 0 1 0 0 1 0 1
c4 = G²(3:0) = 1. We can also compute
c3 = G⁰(2) + P⁰(2) · c2 = 0 + 1 · 1 = 1,  s2 = P⁰(2) ⊕ c2 = 1 ⊕ 1 = 0.
In the fourth slot, we generate P and G values spanning eight bits each:
P³  0 0 0 0
G³  1 1 1 0
c8 = G³(7:0) = 0. We can also compute
c5 = G⁰(4) + P⁰(4) · c4 = 0 + 1 · 1 = 1,  c6 = G¹(5:4) + P¹(5:4) · c4 = 0 + 0 · 1 = 0,
s3 = P⁰(3) ⊕ c3 = 1 ⊕ 1 = 0,  s4 = P⁰(4) ⊕ c4 = 1 ⊕ 1 = 0.
In the fifth slot, we generate P and G values spanning sixteen bits each:
P⁴  0 0
G⁴  1 1
c16 = G⁴(15:0) = 1. We can also compute
c7 = G⁰(6) + P⁰(6) · c6 = 0 + 1 · 0 = 0,  c9 = G⁰(8) + P⁰(8) · c8 = 0 + 0 · 0 = 0,
c10 = G¹(9:8) + P¹(9:8) · c8 = 0 + 0 · 0 = 0,
c12 = G²(11:8) + P²(11:8) · c8 = 1 + 0 · 0 = 1,
s5 = P⁰(5) ⊕ c5 = 0 ⊕ 1 = 1,  s6 = P⁰(6) ⊕ c6 = 0 ⊕ 0 = 0,  s8 = P⁰(8) ⊕ c8 = 0 ⊕ 0 = 0.
In the sixth slot, we compute G⁵(31:0) = G⁴(31:16) + P⁴(31:16) · G⁴(15:0). P⁵(31:0) is not required.
This gives Cout = c32 = G⁵(31:0) = 1. We can further compute:
c11 = G⁰(10) + P⁰(10) · c10 = 0 + 0 · 0 = 0,  c13 = G⁰(12) + P⁰(12) · c12 = 0 + 1 · 1 = 1,
c14 = G¹(13:12) + P¹(13:12) · c12 = 0 + 1 · 1 = 1,
c17 = G⁰(16) + P⁰(16) · c16 = 0 + 1 · 1 = 1,
c18 = G¹(17:16) + P¹(17:16) · c16 = 1 + 1 · 1 = 1,
c20 = G²(19:16) + P²(19:16) · c16 = 1 + 1 · 1 = 1,
c24 = G³(23:16) + P³(23:16) · c16 = 0 + 1 · 1 = 1,
s7 = P⁰(7) ⊕ c7 = 1 ⊕ 0 = 1,  s9 = P⁰(9) ⊕ c9 = 0 ⊕ 0 = 0,
s10 = P⁰(10) ⊕ c10 = 0 ⊕ 0 = 0,  s12 = P⁰(12) ⊕ c12 = 1 ⊕ 1 = 0,
s16 = P⁰(16) ⊕ c16 = 1 ⊕ 1 = 0.
In the seventh slot, all the required values of P and G are already available. We can compute:
c15 = G⁰(14) + P⁰(14) · c14 = 0 + 1 · 1 = 1,  c19 = G⁰(18) + P⁰(18) · c18 = 0 + 1 · 1 = 1,
c21 = G⁰(20) + P⁰(20) · c20 = 0 + 0 · 1 = 0,  c22 = G¹(21:20) + P¹(21:20) · c20 = 1 + 0 · 0 = 1,
c25 = G⁰(24) + P⁰(24) · c24 = 0 + 1 · 1 = 1,  c26 = G¹(25:24) + P¹(25:24) · c24 = 0 + 1 · 1 = 1,
c28 = G²(27:24) + P²(27:24) · c24 = 0 + 0 · 1 = 0,
s11 = P⁰(11) ⊕ c11 = 0 ⊕ 0 = 0,  s13 = P⁰(13) ⊕ c13 = 1 ⊕ 1 = 0,
s14 = P⁰(14) ⊕ c14 = 1 ⊕ 1 = 0,  s17 = P⁰(17) ⊕ c17 = 1 ⊕ 1 = 0,
s18 = P⁰(18) ⊕ c18 = 1 ⊕ 1 = 0,  s20 = P⁰(20) ⊕ c20 = 0 ⊕ 1 = 1,
s24 = P⁰(24) ⊕ c24 = 1 ⊕ 1 = 0.
In the ninth slot, we can compute c31 = G⁰(30) + P⁰(30) · c30 = 0 + 1 · 1 = 1, and the sum values
s23 = P⁰(23) ⊕ c23 = 1 ⊕ 1 = 0,
s27 = P⁰(27) ⊕ c27 = 0 ⊕ 1 = 1,
s29 = P⁰(29) ⊕ c29 = 1 ⊕ 1 = 0,
s30 = P⁰(30) ⊕ c30 = 1 ⊕ 1 = 0.
Finally, in the tenth slot, we can evaluate s31 = P⁰(31) ⊕ c31 = 1 ⊕ 1 = 0.
Thus we have:
carries  1110 1111 1101 1111 1111 0000 0011 1111
a        1011 0111 1010 0101 0110 1000 1001 0011
b        0101 0000 0110 1010 1001 1000 0000 1100
sum      0000 1000 0001 0000 0000 0000 1010 0000
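As a quick cross check (not part of the tutorial), the operands and result above can be verified directly in Python:

```python
a  = int("1011 0111 1010 0101 0110 1000 1001 0011".replace(" ", ""), 2)
b  = int("0101 0000 0110 1010 1001 1000 0000 1100".replace(" ", ""), 2)
c0 = 1                                    # the input carry used in the tutorial
s  = int("0000 1000 0001 0000 0000 0000 1010 0000".replace(" ", ""), 2)

total = a + b + c0
assert total & 0xFFFFFFFF == s            # the 32 sum bits match the table
assert total >> 32 == 1                   # and Cout = c32 = 1, as derived in the sixth slot
print(hex(total))
```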
Serial Adders
Up to now, we have been concerned with making fast adders, even at the cost
of increased complexity and power.
In many applications, speed is not as important as low power consumption
and low cost.
Serial adders are an attractive option in such cases.
A single full adder is used.
If numbers to be added are available in parallel form, these can be serialized
using shift registers.
Serial Adders
A single full adder adds the incoming bits. Bits to be added are fed to it
serially, LSB first.
The sum bit goes to the output while carry is stored in a flip-flop.
Carry then gets added to the more significant bits which arrive next.
Output can be converted to parallel form if needed, using another shift
register.
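A behavioural Python model of the serial adder (bits presented LSB first; the stored carry plays the role of the carry flip-flop):

```python
def serial_add(a_bits, b_bits):
    """Bit-serial addition, LSB first: one full adder plus a carry flip-flop."""
    carry = 0                        # the flip-flop is cleared before a new addition
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                  # sum bit shifted out this cycle
        carry = (a & b) | (carry & (a | b))        # stored for the next, more significant, bit
    out.append(carry)                # final carry out
    return out

# 5 + 6, presented LSB first
print(serial_add([1, 0, 1, 0], [0, 1, 1, 0]))      # -> [1, 1, 0, 1, 0]  (= 11)
```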
[Figure: bit serial adder: the A and B operands are loaded into shift registers which feed the full adder one bit per clock; the carry output is latched and multiplexed back as the carry input for the next bit, and the sum bits are shifted into an output register.]
9 Barrel Shifters
Logarithmic Barrel Shifters
Combining Rotate and Shift Operations
Bidirectional Shift and Rotate Operations
[Figure: a 2n-bit word formed by concatenating B (n bits) and A (n bits); the barrel shifter selects n contiguous bits from it.]
We just have to choose B and A appropriately to implement a particular shift or rotate operation.
Barrel Shifters
The brute force barrel shifter places a heavy load on input data lines
because each input bit is a candidate for each output position.
The control logic is complex because the amount of shift is variable.
The loading on data lines and control logic complexity can be reduced if
we break up the shift process into parts.
We can carry out shifts in different stages, each stage corresponding to a
single bit of the binary representation of the shift amount.
Thus a shift by 6 (binary: 110) will be carried out by first doing a 4 bit shift
and then a 2 bit shift.
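A short Python model of an 8 bit logarithmic rotator, one stage per bit of the shift amount (illustrative; the real circuit is three rows of 2-input muxes):

```python
def log_rotate_right(bits, amount):
    """8-bit right rotate built from three stages (shift by 4, 2, 1).

    `bits` is a list of 8 bits, index 0 = LSB; each stage either passes its
    input unchanged or rotates it by 2**k, controlled by one bit of `amount`.
    """
    for k in (2, 1, 0):                      # stages for shift amounts 4, 2 and 1
        if (amount >> k) & 1:
            step = 1 << k
            bits = [bits[(i + step) % 8] for i in range(len(bits))]
    return bits

x = [0, 1, 0, 0, 1, 1, 0, 1]                 # x0 .. x7
print(log_rotate_right(x, 6))                # rotate by 6 = the 4-place stage then the 2-place stage
# -> [0, 1, 0, 1, 0, 0, 1, 1]
```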
[Figure: an 8 bit logarithmic rotator built from three rows of 2-input multiplexers, controlled by the shift amount bits n2, n1 and n0 and rotating by 4, 2 and 1 positions respectively, from inputs X7..X0 to outputs Y7..Y0.]
Each input bit drives just two muxes, each with just 2 inputs.
At each stage, the muxes select either the unshifted bit or a bit 2ⁿ places away from it.
Three stages are required for 0 to 7 bits of shift.
[Figure: the same three-stage structure used as a logarithmic shifter, with the wrapped-around inputs replaced by '0'.]
If we need a shift instead of a rotate, we feed a 0 instead of the corresponding wrapped-around bit.
This has to be done for 4 muxes in the first stage, 2 in the second stage and 1 in the last stage.
[Figure: combined shift/rotate structure: a 4-bit shift/rotate row, a 2-bit shift/rotate row and a 1-bit shift/rotate row, with extra muxes selecting between the rotated-in bit, '0', or X7 (for ASR).]
We can combine the circuits for rotate and shift functions by putting muxes where different inputs need to be presented for the two functions.
We can include the Arithmetic Shift Right function by choosing between 0 or X7 as the bit to be inserted.
[Figure: mux rows at the input and output which can reverse the bit order (X7..X0 ↔ X0..X7 and Y7..Y0 ↔ Y0..Y7), controlled by a Left signal.]
We can use the same hardware for left and right shift/rotate operations.
This can be done by adding rows of muxes at the input and output which reverse the order of bits.
We can also make use of the fact that a left rotate by m places is the same as a right rotate by 2ⁿ − m places, where 2ⁿ is the width of the operand (the data being rotated).
2ⁿ − m is just the 2's complement of m in an n bit representation.
By presenting the 2’s complement of m at the mux controls, we can
convert a right rotate to a left rotate.
This can be followed by a mask operation, if a shift operation was
required, rather than a rotate.
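A small Python check of this identity (hypothetical helper functions, 8 bit width assumed):

```python
def rotate_right(x, m, width=8):
    """Right rotate of an unsigned `width`-bit value by m places."""
    m %= width
    return ((x >> m) | (x << (width - m))) & ((1 << width) - 1)

def rotate_left(x, m, width=8):
    """Left rotate by m = right rotate by (2**n - m), i.e. the 2's complement of m."""
    return rotate_right(x, (-m) % width, width)

v = 0b1011_0010
# Compare against a direct left rotate of the 8-bit value
assert rotate_left(v, 3) == ((v << 3) | (v >> 5)) & 0xFF
print(bin(rotate_left(v, 3)))   # -> 0b10010101
```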
Multiplier Circuits
10 Shift and Add Multipliers
11 Array Multipliers
12 Speeding up Multipliers
Booth Encoding
Adding Partial Products
Wallace Multipliers
Dadda Multipliers
13 Multiply and Accumulate circuits
14 Serial Multipliers
Bit Serial Multipliers
Row Serial multipliers
Each term being added to form the product is called a partial product.
The name “partial product” is also used for individual bits of the terms
being added - so beware!
The paper-pencil procedure requires n-1 additions to a 2n bit
accumulator.
This uses a single adder, but takes long to complete the multiplication. A
32 x 32 multiplication will require 31 addition steps to a 64 bit
accumulator.
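A Python sketch of the paper-and-pencil (shift and add) procedure, with a single accumulator:

```python
def shift_and_add(a, b, n=32):
    """Paper-and-pencil multiplication: the first partial product initialises the
    2n-bit accumulator, and the remaining n-1 partial products are added to it."""
    acc = a if (b & 1) else 0                 # partial product for bit 0 of b
    for i in range(1, n):                     # the n-1 further additions
        if (b >> i) & 1:
            acc += a << i                     # shifted multiplicand
    return acc & ((1 << (2 * n)) - 1)

assert shift_and_add(0xDEADBEEF, 0x12345678) == (0xDEADBEEF * 0x12345678) & ((1 << 64) - 1)
print(hex(shift_and_add(0xDEADBEEF, 0x12345678)))
```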
Multiplication can be made faster by using multiple adders and adding
terms in a tree structure.
Array Multiplier
A = Σ_{i=0}^{n−1} 2ⁱ aᵢ,    B = Σ_{j=0}^{n−1} 2ʲ bⱼ
We can regard all bits of the partial products as an array, whose (i,j)th
element is ai · bj . Notice that each element is just the AND of ai and bj .
All elements of the array are available in parallel, within one gate delay of
arrival of A and B.
We can now use an array of full adders to produce the result. One input
of each adder is the sum from the previous row, the other is the AND of
appropriate ai and bj .
This architecture is called an array multiplier.
Array Multiplier
[Figure: 4 x 4 array multiplier: rows of full adders and half adders add the partial product bits aᵢ · bⱼ, with each row's sum outputs feeding the next row and carries moving one column to the left.]
Speeding up Multipliers
The array multiplier has a regular layout with relatively short connections.
However, it is still rather slow.
How can we speed up a multiplier?
There are two possibilities:
Somehow reduce the number of partial products to be added. For
example, could we multiply 2 bits at a time rather than 1?
Since we have to add more than two terms at a time, use an adder
architecture which is optimized for this.
Booth Encoding
The partial product generator looks at the current 2 bits and the MSB of
the previous group of 2 bits to decide its action.
Thus, we scan the multiplier 3 bits at a time, with one bit overlapping.
For the first group of 2 bits, we assume a 0 to the right of it.
After handling the previous group, the multiplicand is shifted left by 2
positions. Thus, it has already been multiplied by 4.
Therefore, adding 4 A on behalf of the previous group is equivalent to
adding 1 to the multiplier corresponding to the current group.
The following table summarizes the effective multiplier for generating the
partial product.
Current 2 bits   Multiplier for these   Previous MSBit   Pending increment   Total multiplier
00               0                      0                0                   0
01               +1                     0                0                   +1
10               -2                     0                0                   -2
11               -1                     0                0                   -1
00               0                      1                +1                  +1
01               +1                     1                +1                  +2
10               -2                     1                +1                  -1
11               -1                     1                +1                  0
Notice that a 111 in the 3 bit group being scanned requires no work at all.
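A Python sketch of the radix-4 Booth recoding described by the table (the multiplier is treated as an n bit two's complement number; digit = lower bit + pending increment − 2 × upper bit reproduces the "Total multiplier" column):

```python
def booth_digits(b, n=8):
    """Radix-4 Booth recoding: scan 3 bits at a time (2 new bits plus the MSB of
    the previous pair) and map each group to a digit in {-2, -1, 0, +1, +2}."""
    digits, prev = [], 0                       # a 0 is assumed to the right of bit 0
    for i in range(0, n, 2):
        b0 = (b >> i) & 1                      # lower bit of the current pair
        b1 = (b >> (i + 1)) & 1                # upper bit of the current pair
        digits.append(b0 + prev - 2 * b1)      # same mapping as the table above
        prev = b1
    return digits                              # one digit per pair of multiplier bits

def booth_multiply(a, b, n=8):
    """Only n/2 partial products: digit_i · A, weighted by 4**i."""
    return sum(d * a * (4 ** i) for i, d in enumerate(booth_digits(b, n)))

assert booth_multiply(57, 113) == 57 * 113
print(booth_digits(0b01110001))                # 113 -> [1, 0, -1, 2]
```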
[Figure: adding the partial products PP1 ... PPn one at a time versus in a tree. Serial addition: time = (n−1)·Tadd, adders required: 1. Tree addition: time = (log₂ n)·Tadd, adders required: n/2.]
Ordinary adders are large and complex. Also, these are slow due to
rippling of carry.
Let us consider an adder which presents its output not as one word - but
two. The actual result is the sum of these.
Obviously, an adder of this type is of no use for adding just two numbers!
But it can be useful in a multiplier where we are adding multiple terms.
For each bit column, the sum goes into one output word, while carry outs
go into the other (without being added to the next more significant
column).
Now there is no rippling of carry and the output is available in constant
time.
We need a conventional adder in the end to add these two words.
This type of adder is called a “Carry Save Adder” or CSA.
A Carry Save Adder (whose output is two words which must be added to
produce the result) is of no use for adding just two words!
However, we can construct a useful CSA for adding 4 bits in the same
column.
We make use of the fact that all partial product bits are available in
constant time after the application of inputs.
Since there are multiple bits to be added, we can feed three of them to a
full adder.
The sum and carry output of this adder is then available in constant time.
The 4 input, 2 output CSA uses two full adders as shown below:
[Figure: 4:2 CSA cell: inputs a, b, c, d of the same weight feed two full adders; outputs are sum and cy2; the intermediate carry cy1 goes to the cell of the column to the left, and the cy1 of the column to the right comes in as the second adder's carry input.]
The first full adder uses 3 bits of partial products of the same weight (bits in the same column). These are available in parallel in constant time.
The sum output of the first FA goes to the second FA.
The carry output (cy1) of the first FA goes as intermediate input to the CSA used in the column to the left of this one.
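A behavioural Python model of this column-wise 4:2 carry save reduction (illustrative; cy1 is the lateral carry between columns):

```python
def full_adder(x, y, z):
    return x ^ y ^ z, (x & y) | (z & (x | y))          # (sum, carry)

def csa_4_to_2(A, B, C, D, n=32):
    """Column-wise 4:2 carry-save reduction of four words into two words whose
    ordinary sum equals A + B + C + D.  n is chosen large enough that the top
    columns are all zero."""
    bit = lambda w, i: (w >> i) & 1
    S = Cout = 0
    cy1_in = 0                                          # lateral carry between columns
    for i in range(n):
        s1, cy1 = full_adder(bit(A, i), bit(B, i), bit(C, i))
        s2, cy2 = full_adder(s1, bit(D, i), cy1_in)     # cy1 of the previous column comes in here
        S |= s2 << i
        Cout |= cy2 << (i + 1)                          # carries belong one column to the left
        cy1_in = cy1
    return S, Cout

A, B, C, D = 0x1234, 0xBEEF, 0x0FF0, 0x7777
S, Cout = csa_4_to_2(A, B, C, D)
assert S + Cout == A + B + C + D                        # one conventional add finishes the job
print(hex(S), hex(Cout))
```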
The figure below shows how we can add 4 columns of 4 bits each.
Rows are labeled as a,b,c and d. Columns are 0,1,2 and 3.
[Figure: four 4:2 CSA cells side by side (columns 3 to 0), each taking the a, b, c, d bits of its column; cy1 carries link neighbouring columns, and each column produces one sum wire and one carry wire.]
One can see that the critical path has been broken up.
Addition of 4 words of 32 bits each will also have a critical path of the same
length.
Wallace Multipliers
Each reduction stage looks at the number of wires for each weight and if
any weight has more than 2 wires, it adds a layer of adders.
When the numbers of wires for each weight have been reduced to 2 or
less, we form one number with one of the wires at corresponding place
values and another with the other wire (if present).
These two numbers are added using a fast adder of appropriate size to
generate the final product.
The reduction procedure is carried out for all weights, starting from the least significant weights.
At the end of each layer, we count wires for each weight again, and if none has more than 2 wires, we proceed to the final addition stage.
If any weight has 3 or more wires, we add another layer, and repeat this procedure till the number of wires for all weights is reduced to 2 or less.
Now we compose one number from one of the left over wires at
corresponding weights and another from the remaining wires.
Finally, we use a conventional fast adder of appropriate size to add the
two numbers.
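A Python sketch of the reduction loop. The policy here is simplified (full adders on every group of three wires in a column, a half adder on any two leftovers) rather than the exact capacity-based policy described later, but it shows the layer-by-layer mechanics:

```python
from collections import defaultdict

def wallace_multiply(a, b, n=4):
    """Wallace-style reduction: keep a list of 1-bit wires per weight; in every
    layer feed groups of three wires to a full adder (two leftovers to a half
    adder) until no weight holds more than two wires, then add the two words."""
    cols = defaultdict(list)
    for i in range(n):                               # all partial product bits, in parallel
        for j in range(n):
            cols[i + j].append((a >> i) & (b >> j) & 1)

    while any(len(c) > 2 for c in cols.values()):
        nxt = defaultdict(list)
        for w in sorted(cols):
            wires = cols[w]
            while len(wires) >= 3:                   # full adder: 3 wires -> sum + carry
                x, y, z = wires.pop(), wires.pop(), wires.pop()
                nxt[w].append(x ^ y ^ z)
                nxt[w + 1].append((x & y) | (z & (x | y)))
            if len(wires) == 2:                      # half adder on the two leftovers
                x, y = wires.pop(), wires.pop()
                nxt[w].append(x ^ y)
                nxt[w + 1].append(x & y)
            nxt[w].extend(wires)                     # a single wire passes through
        cols = nxt

    word1 = sum((c[0] if c else 0) << w for w, c in cols.items())
    word2 = sum((c[1] if len(c) > 1 else 0) << w for w, c in cols.items())
    return word1 + word2                             # the final fast adder

assert wallace_multiply(13, 11) == 13 * 11
print(wallace_multiply(0xA, 0xF, n=4))
```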
Partial product bits where the sum of indices is the same have the same place
value and need to be added to each other.
Partial products are generated in parallel and we have the following wires:
Bit Terms Wires
0 a0b0 1
1 a0b1, a1b0 2
2 a0b2, a1b1, a2b0 3
3 a0b3, a1b2, a2b1, a3b0 4
4 a1b3, a2b2, a3b1 3
5 a2b3, a3b2 2
6 a3b3 1
The multiplier has 4 rows of partial products, which are divided in groups of 3
and 1.
The bottom row is just passed on to the next stage.
Layer 1
[Figure: layer 1 of the Wallace reduction for the 4 x 4 example, showing the full and half adders used at each bit and the wire counts before and after. Bit 5 has 1 wire within the group of three rows and is passed through.]
[Figure: wire counts after layer 1. Bit 5 now has 3 wires: the carry from bit 4 plus the 2 fed-through wires.]
Since bits 3, 4 and 5 have 3 wires each, we need another reduction layer. This will be the last reduction layer.
Bits 0, 1, and 2 have single wires which carry the final result.
After the second layer, no bit has more than 2 wires. Single wires at bits 0, 1
and 2 are fed through to the output.
[Figure: step-by-step dot diagrams of the Wallace reduction for an 8 x 8 multiplier, over bit positions 15 to 0.]
One can avoid the redundant bit by modifying the reduction scheme.
We treat all wires in a column as equivalent (there are no groups of 3 rows).
Make bunches of 3 wires and send each to a full adder.
Now we can be left with 0, 1 or 2 wires.
There is nothing to do for 0 wires left.
If one wire is left, it is passed through to next layer.
If two wires are left, we have a more complex decision.
We need to define the capacity of a reduction layer to describe the policy
for reduction of 2 wires.
The maximum number of wires for any weight in layer j+1 (counted from the end) is the integral part of (3/2)·dⱼ, where dⱼ is the maximum number of wires any weight may hold in layer j.
j = 1 for the final adder. Thus d₁ = 2.
We go up in j till we reach a number which is just greater than or equal to the largest bunch of wires in any weight.
The number of reduction layers required is then j_final − 1.
Capacities of layers starting from last layer and moving towards the top
are 2, 3, 4, 6, 9, 13, 19 . . . .
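A one-line recurrence reproduces this capacity sequence (illustrative Python):

```python
def layer_capacities(max_wires):
    """Capacities d_j of the reduction layers, counted from the final adder:
    d_1 = 2 and d_{j+1} = floor(3/2 * d_j).  The number of reduction layers
    needed is j_final - 1, where d_{j_final} first reaches max_wires."""
    caps = [2]
    while caps[-1] < max_wires:
        caps.append(caps[-1] * 3 // 2)
    return caps

# An 8 x 8 multiplier has at most 8 wires in any column, so:
print(layer_capacities(8))        # [2, 3, 4, 6, 9] -> 4 reduction layers
```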
Now we can define the policy for reduction of 2 left over wires after deploying
the maximum number of full adders.
If all columns at the right have a single wire, we reduce the two wires
using a half adder. (This helps in reducing the width of final adder).
If there is a column to the right with more than one wire, we pass through the two wires to the next layer if it can accommodate these. (That is, the total number of wires does not exceed the capacity of that layer.)
If passing through the two wires would exceed the capacity of next layer,
we reduce these with a half adder.
[Figure: dot diagrams of the modified reduction for the 8 x 8 multiplier. The wire counts at bits 14 down to 0 at successive steps are:
1 3 2 3 5 4 5 6 5 3 4 3 2 1 1
2 1 3 2 4 4 4 3 4 2 3 2 1 1 1
2 2 1 3 3 3 3 2 2 3 2 1 1 1 1 ]
Thus we have reached two wires per bit without generating a wire at b15. There are two wires at b14, and if these produce a carry, it will go to b15; there is no redundant b16.
Dadda Multipliers
Dadda multipliers are very similar to Wallace multipliers and use the same 3
stages:
1 Generate all bits of the partial products in parallel.
2 Collect all partial products bits with the same place value in bunches of
wires and reduce these in several layers of adders till each weight has no
more than two wires.
3 For all bit positions which have two wires, take one wire at corresponding
place values to form one number, and the other wire to form another
number.
Add these two numbers using a fast adder of appropriate size.
The difference is in the reduction stage.
We work back from the final adder to earlier layers till we find that we can
manage all wires generated by the partial product generator.
We know that the final adder can take no more than 2 wires for each
weight.
Let dⱼ represent the maximum number of wires for any weight in layer j, where j = 1 for the final adder. (Thus d₁ = 2.)
The maximum number of wires which can be handled in layer j+1 (counted from the end) is the integral part of (3/2)·dⱼ.
We go up in j till we reach a number which is just greater than or equal to the largest bunch of wires in any weight.
The number of reduction layers required is then j_final − 1.
Layer 1
[Figure: wire counts by weight after the first reduction layer.]
Wt. 1 has the single wire which was fed through.
Wt. 2 has 2 fed-through wires.
Wt. 4 has 3 wires: all passed through.
Wt. 8 has 3 wires: the sum of the half adder at wt. 4, and 2 passed through.
Wt. 16 has 3 wires: the carry of wt. 8, the sum of the half adder at wt. 16 and 1 passed through.
Wt. 32 has 3 wires: the carry of wt. 16 and 2 passed through.
Wt. 64 has 1 fed-through wire.
In the second layer, we should leave no more than 2 wires at any weight, as
this is the last stage.
As before, we anticipate the number of carry wires transferred from the
lower weight when planning reduction using half or full adders.
In Dadda multipliers, we use minimum hardware during reduction. So the
smallest adder which will reduce the output wires to 2 will be used.
At the lowest weights, if the number of wires is less than or equal to 2, we
just pass these through.
So the single wire at Wt. 1, and the 2 wires at Wt. 2 are just fed through.
[Figure: Dadda reduction plan for an 8 x 8 multiplier over bit positions 15 to 0, with reduction layers of capacity 6, 4, 3 and 2.]
[Figure: first Dadda reduction layer for the 8 x 8 multiplier, bit positions 15 to 0, with the layer capacities marked.]
The capacity of the next layer is 6. Since bits 0-5 have six or less wires, these are just passed through.
Bit 6 has 7 wires. To reduce to six, we place a half adder. (This gives a sum wire at bit 6 and a carry wire at bit 7.) Its outputs are shown by dots joined by a crossed line.
The remaining 5 bits are passed through.
[Figure: second Dadda reduction layer, bit positions 15 to 0.]
Bit 9 has 6 wires, which should be reduced to 4 (since two places are taken up by carries of full and half adders).
This can be done by a full adder, whose outputs are shown by dots at bits 9 and 10 joined by a line.
[Figure: third Dadda reduction layer, bit positions 15 to 0.]
The capacity of the next layer is 4.
Wires of bits 0-3 can just be passed through.
For each bit position, we reduce the number of output places available by the incoming carries of the previous bit.
[Figure: third reduction layer, continued.]
Here the wires are reduced from 4 to 2 by a FA. All other wires can be passed through.
[Figure: the next Dadda reduction layer, bit positions 15 to 0, with the layer capacities 6, 4, 3 and 2 marked.]
The reduction procedure can be repeated at each layer.
If 2 wires (or multiples of 2) are to be reduced, we place FAs till 1 or 0 wires are left to be reduced.
If 1 wire remains to be reduced, we place a half adder.
This layer requires a half adder at bit 4, FA + HA at bit 5, 2 FAs at bits 6-10 and a full adder at bit 11.
The rest of the wires are just passed through.
[Figure: the final two rows of wires after the last reduction layer.]
These two rows are added in a conventional fast adder to give the final product.
Notice that there is no extra bit!
Multiply and Accumulate circuits
A common task during data processing is the evaluation of quantities like Σᵢ cᵢ · Xᵢ.
This can be made easier if we have a dedicated hardware circuit which
can compute A × B + C. Here the size of the operand C is the same as
that of the product A × B.
This circuit is the multiply and accumulate or MAC circuit.
The MAC circuit is not much more complex compared to a multiplier. This
is because during multiplication we are anyway adding multiple bits in
every column. The accumulator just provides an additional wire at each
bit position.
This circuit is much faster than separate multiplication and addition
because the latter requires two steps of addition with rippling carry while
the MAC requires only one.
Consider for example a MAC circuit which multiplies two 8 bit operands and adds the
product to a 16 bit accumulator.
The number of wires from the partial products of the multiplier is:
Bit 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Wires 0 1 2 3 4 5 6 7 8 7 6 5 4 3 2 1
If we include a wire at each position from the accumulator, we get the wire count as:
Bit 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Wire 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2
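These wire counts can be reproduced with a short sketch (hypothetical helper, Python):

```python
def mac_wire_counts(n=8, acc_bits=16):
    """Wires to be reduced per bit position for an n x n multiplier whose product
    is added to an acc_bits-wide accumulator (one extra wire per accumulator bit)."""
    wires = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            wires[i + j] += 1                 # one partial-product bit per (i, j)
    for k in range(min(acc_bits, 2 * n)):
        wires[k] += 1                         # the accumulator contributes one wire per bit
    return wires

print(list(reversed(mac_wire_counts())))      # bit 15 ... bit 0
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3, 2], matching the table above
```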
[Figure: reduction tree for the 8 x 8 MAC: the partial product generator supplies at most 9 wires per bit; stage 2 has capacity 6, stage 3 capacity 4, stage 4 capacity 3 and stage 5 capacity 2.]
8x8 MAC
Complexity of the MAC circuit is not much higher than a Wallace tree 8x8
multiplier.
This is so for any wire reduction scheme. We could have used the Dadda
scheme and the complexity would not be much more than the plain 8x8
Dadda multiplier.
The result is produced much faster than separate multiplication and
addition.
This is because a traditional adder is used only once in the MAC.
Otherwise, the multiplier will use it once and then the addition to the
accumulator will again involve a traditional addition.
Serial Multipliers
Often, we need multipliers which have very low complexity or very low power
consumption and speed is not very important.
Row serial multipliers require only n steps, but we require m full adders rather
than just one.
[Figure: bit serial multiplier: an m bit re-circulating shift register holds the multiplicand (clocked at fclock), an n bit shift register holds the multiplier (clocked at fclock/m), and their ANDed bits drive the x input of a single full adder with inputs x, y, cin and outputs s, c.]
Each bit of the multiplier needs to be ANDed with each bit of the multiplicand.
This requires that all multiplicand bits be presented one after the other, every time a new bit from the multiplier is taken up.
This can be managed by using a re-circulating shift register for the
multiplicand, which is clocked at a rate which is m times faster than the
clock to the multiplier shift register.
The inputs y and Cin to the full adder have to be appropriately selected
and timed to generate the correct product.
Let us put the arrival time of terms in parentheses next to each term.
a3 a2 a1 a0
× b3 b2 b1 b0
a3b0(3) a2b0(2) a1b0(1) a0b0(0)
a3b1(7) a2b1(6) a1b1(5) a0b1(4)
a3b2(11) a2b2(10) a1b2(9) a0b2(8)
a3b3(15) a2b3(14) a1b3(13) a0b3(12)
It is clear that for all additions, the earlier terms have to wait for 3 clock cycles before
the later terms arrive.
We can manage this by putting a 3 bit shift register at the sum output and presenting
the delayed output at the ‘y’ input of the full adder.
The carry output can be added immediately in the next clock, since it should go to the
next column to its left.
A 3 clock delay for sum and a 1 clock delay for carry leads to the following
circuit.
[Figure: the serial adder cell: the sum output s passes through a 3 bit shift register back to the y input, while the carry output co is delayed by one flip-flop (with Reset) back to the carry input ci.]
With exception handling at the end of rows, the serial multiplier will work.
[Figure: the cell with a Row End control: the carry flip-flop is cleared, and a mux selects what enters the sum shift register.]
Carry input is forced to 0 at row ends.
The mux normally inserts the sum into the shift register. However, at row ends, it inserts the delayed carry output.
The sum terms at row ends can be taken out as the low bits of the
product.
One can add another shift register at the output to collect these.
The 2 more significant bits of the shift register and the last sum and carry
provide the high bits of the product at the end.
[Figure: step-by-step trace of the row serial multiplication: in each step the delayed carries (del_cy), the right-shifted previous sums (Shift_sum) and the partial product terms a3..a0 times the current b bit are added, producing the sums and carries s0, c0 up to s15, c15.]
Notice that the same 'a' term is used in a given adder.
The 'b' term has to be shifted right every time to generate the right partial product bit.
Sums have to be shifted right to be added to the carry of the previous addition in the same column.
4 additional clock cycles will be required to ripple the carry in the last addition. During these, the partial product bits will be 0.
[Figure: row serial multiplier hardware: a shift register holding 0, b3, b2, b1, b0 supplies the multiplier bits; fixed inputs a3, a2, a1, a0 feed the four adders, whose carries are held in flip-flops (FF) between steps.]