Module-2 Floor Planning and Routing
FLOORPLANNING AND PLACEMENT
The input to the floorplanning step is the output of system partitioning and design
entry: a netlist. Floorplanning precedes placement, but we shall cover them
together. The output of the placement step is a set of directions for the routing
tools.
At the start of floorplanning we have a netlist describing circuit blocks, the logic
cells within the blocks, and their connections. For example, Figure 16.1 shows
the Viterbi decoder example as a collection of standard cells with no room set
aside yet for routing. We can think of the standard cells as a hod of bricks to be
made into a wall. What we have to do now is set aside spaces (we call these
spaces the channels ) for interconnect, the mortar, and arrange the cells.
Figure 16.2 shows a finished wall, after the floorplanning and placement steps are
complete. We still have not completed any routing at this point (that comes later);
all we have done is place the logic cells in a fashion that we hope will minimize
the total interconnect length, for example.
FIGURE 16.1 The starting point for the floorplanning and placement steps for
the Viterbi decoder (containing only standard cells). This is the initial display of
the floorplanning and placement tool. The small boxes that look like bricks are
the outlines of the standard cells. The largest standard cells, at the bottom of the
display (labeled dfctnb) are 188 D flip-flops. The '+' symbols represent the
drawing origins of the standard cells; for the D flip-flops they are shifted to the
left and below the logic cell bottom left-hand corner. The large box surrounding
all the logic cells represents the estimated chip size. (This is a screen shot from
Cadence Cell Ensemble.)
FIGURE 16.2 The Viterbi Decoder (from Figure 16.1 ) after floorplanning and
placement. There are 18 rows of standard cells separated by 17 horizontal
channels (labeled 2–18). The channels are routed as numbered. In this example,
the I/O pads are omitted to show the cell placement more clearly. Figure 17.1
shows the same placement without the channel labels. (A screen shot from
Cadence Cell Ensemble.)
16.1 Floorplanning
Figure 16.3 shows that both interconnect delay and gate delay decrease as we
scale down feature sizes, but at different rates. This is because interconnect
capacitance tends to a limit of about 2 pF/cm for a minimum-width wire while
gate delay continues to decrease (see Section 17.4, Circuit Extraction and DRC).
Floorplanning allows us to predict this interconnect delay by estimating
interconnect length.
FIGURE 16.3 Interconnect and gate delays. As feature sizes decrease, both average
interconnect delay and average gate delay decrease, but at different rates. This is
because interconnect capacitance tends to a limit that is independent of scaling.
Interconnect delay now dominates gate delay.
Table 16.1 shows the estimated metal interconnect lengths, as a function of die
size and fanout, for a series of three-level metal gate arrays. In this case the
interconnect capacitance is about 2 pF/cm, a typical figure.
Figure 16.5 shows that, because we do not decrease chip size as we scale down
feature size, the worst-case interconnect delay increases. One way to measure the
worst-case delay uses an interconnect that completely crosses the chip, a
coast-to-coast interconnect . In certain cases the worst-case delay of a 0.25 μm
process may be worse than that of a 0.35 μm process, for example.
FIGURE 16.5 Worst-case interconnect delay. As we scale circuits, but avoid scaling
the chip size, the worst-case interconnect delay increases.
FIGURE 16.7 Congestion analysis. (a) The initial floorplan with a 2:1.5 die
aspect ratio. (b) Altering the floorplan to give a 1:1 chip aspect ratio. (c) A trial
floorplan with a congestion map. Blocks A and C have been placed so that we
know the terminal positions in the channels. Shading indicates the ratio of
channel density to the channel capacity. Dark areas show regions that cannot be
routed because the channel congestion exceeds the estimated capacity.
(d) Resizing flexible blocks A and C alleviates congestion.
FIGURE 16.8 Routing a T-junction between two channels in two-level metal.
The dots represent logic cell pins. (a) Routing channel A (the stem of the T) first
allows us to adjust the width of channel B. (b) If we route channel B first (the
top of the T), this fixes the width of channel A. We have to route the stem of a
T-junction before we route the top.
FIGURE 16.9 Defining the channel routing order for a slicing floorplan using a
slicing tree. (a) Make a cut all the way across the chip between circuit blocks.
Continue slicing until each piece contains just one circuit block. Each cut divides
a piece into two without cutting through a circuit block. (b) A sequence of cuts:
1, 2, 3, and 4 that successively slices the chip until only circuit blocks are left.
(c) The slicing tree corresponding to the sequence of cuts gives the order in
which to route the channels: 4, 3, 2, and finally 1.
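We can capture this ordering rule in a few lines of MATLAB. The sketch below is
illustrative only (the tree encoding, field names, and block names are our own
assumptions, not part of any floorplanning tool): a post-order traversal of the
slicing tree routes the channels inside each piece before the cut that separates
them.

function order = routing_order(node)
    % Post-order traversal of a slicing tree. Internal nodes are structs
    % with a numeric 'cut' label and a cell array 'children'; leaves are
    % block-name strings. A cut can be routed only after every channel
    % inside the pieces it separates has been routed.
    order = {};
    if ischar(node)                  % leaf: a circuit block, no channel here
        return;
    end
    for k = 1:numel(node.children)   % first route inside each piece
        order = [order, routing_order(node.children{k})];
    end
    order{end+1} = node.cut;         % then route the separating cut
end

For a tree in the spirit of Figure 16.9, built with
c4 = struct('cut', 4, 'children', {{'D', 'E'}}),
c3 = struct('cut', 3, 'children', {{'C', c4}}), and so on up to cut 1,
routing_order returns the cuts in the order 4, 3, 2, 1.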
Figure 16.10 shows a floorplan that is not a slicing structure. We cannot cut the
chip all the way across with a knife without chopping a circuit block in two. This
means we cannot route any of the channels in this floorplan without routing all of
the other channels first. We say there is a cyclic constraint in this floorplan. There
are two solutions to this problem. One solution is to move the blocks until we
obtain a slicing floorplan. The other solution is to allow the use of L-shaped,
rather than rectangular, channels (or areas with fixed connectors on all sides: a
switch box ). We need an area-based router rather than a channel router to route
L-shaped regions or switch boxes (see Section 17.2.6, Area-Routing Algorithms).
Figure 16.11 (a) displays the floorplan of the ASIC shown in Figure 16.7 . We
can remove the cyclic constraint by moving the blocks again, but this increases
the chip size. Figure 16.11 (b) shows an alternative solution. We merge the
flexible standard cell areas A and C. We can do this by selective flattening of the
netlist. Sometimes flattening can reduce the routing area because routing between
blocks is usually less efficient than routing inside the row-based blocks.
Figure 16.11 (b) shows the channel definition and routing order for our chip.
FIGURE 16.11 Channel definition and ordering. (a) We can eliminate the cyclic
constraint by merging the blocks A and C. (b) A slicing structure.
16.1.5 I/O and Power Planning
Every chip communicates with the outside world. Signals flow onto and off the
chip and we need to supply power. We need to consider the I/O and power
constraints early in the floorplanning process. A silicon chip or die (plural die,
dies, or dice) is mounted on a chip carrier inside a chip package . Connections are
made by bonding the chip pads to fingers on a metal lead frame that is part of the
package. The metal lead-frame fingers connect to the package pins . A die
consists of a logic core inside a pad ring . Figure 16.12 (a) shows a pad-limited
die and Figure 16.12 (b) shows a core-limited die . On a pad-limited die we use
tall, thin pad-limited pads , which maximize the number of pads we can fit
around the outside of the chip. On a core-limited die we use short, wide
core-limited pads . Figure 16.12 (c) shows how we can use both types of pad to
change the aspect ratio of a die to be different from that of the core.
FIGURE 16.12 Pad-limited and core-limited die. (a) A pad-limited die. The
number of pads determines the die size. (b) A core-limited die: The core logic
determines the die size. (c) Using both pad-limited pads and core-limited pads
for a square die.
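As a back-of-the-envelope sketch of this trade-off, the following MATLAB fragment
checks whether an assumed design is pad-limited or core-limited (the pad pitch, pad
depth, and core area are invented example values, not foundry figures):

n_pads    = 200;     % total pads around the die
pad_pitch = 0.1;     % mm, center-to-center spacing of pad-limited pads
pad_depth = 0.3;     % mm, pad-cell height (the width of the pad ring)
core_area = 16;      % mm^2, area of the logic core

pad_side  = (n_pads / 4) * pad_pitch;          % side needed to fit the pads
core_side = sqrt(core_area) + 2 * pad_depth;   % side needed to fit the core
die_side  = max(pad_side, core_side);
if pad_side > core_side
    fprintf('pad-limited die, side = %.2f mm\n', die_side);
else
    fprintf('core-limited die, side = %.2f mm\n', die_side);
end

With these numbers the pads need a 5 mm side but the core only 4.6 mm, so the die
is pad-limited, and tall, thin pads (or a larger core) would be the next step.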
Special power pads are used for the positive supply, or VDD, power buses (or
power rails ) and the ground or negative supply, VSS or GND. Usually one set of
VDD/VSS pads supplies one power ring that runs around the pad ring and
supplies power to the I/O pads only. Another set of VDD/VSS pads connects to a
second power ring that supplies the logic core. We sometimes call the I/O power
dirty power since it has to supply large transient currents to the output transistors.
We keep dirty power separate to avoid injecting noise into the internal-logic
power (the clean power ). I/O pads also contain special circuits to protect against
electrostatic discharge ( ESD ). These circuits can withstand very short
high-voltage (several kilovolt) pulses that can be generated during human or
machine handling.
Depending on the type of package and how the foundry attaches the silicon die to
the chip cavity in the chip carrier, there may be an electrical connection between
the chip carrier and the die substrate. Usually the die is cemented in the chip
cavity with a conductive epoxy, making an electrical connection between
substrate and the package cavity in the chip carrier. If we make an electrical
connection between the substrate and a chip pad, or to a package pin, it must be
to VDD ( n -type substrate) or VSS ( p -type substrate). This substrate connection
(for the whole chip) employs a down bond (or drop bond) to the carrier. We have
several options:
● We can dedicate one (or more) chip pad(s) to down bond to the chip
carrier.
● We can make a connection from a chip pad to the lead frame and down
bond from the chip pad to the chip carrier.
● We can make a connection from a chip pad to the lead frame and down
bond from the lead frame.
● We can down bond from the lead frame without using a chip pad.
Depending on the package design, the type and positioning of down bonds may
be fixed. This means we need to fix the position of the chip pad for down
bonding using a pad seed .
A double bond connects two pads to one chip-carrier finger and one package pin.
We can do this to save package pins or reduce the series inductance of bond
wires (typically a few nanohenries) by parallel connection of the pads. A
multiple-signal pad or pad group is a set of pads. For example, an oscillator pad
usually comprises a set of two adjacent pads that we connect to an external
crystal. The oscillator circuit and the two signal pads form a single logic cell.
Another common example is a clock pad . Some foundries allow a special form
of corner pad (normal pads are edge pads ) that squeezes two pads into the area at
the corners of a chip using a special two-pad corner cell , to help meet bond-wire
angle design rules (see also Figure 16.13 b and c).
To reduce the series resistive and inductive impedance of power supply networks,
it is normal to use multiple VDD and VSS pads. This is particularly important
with the simultaneously switching outputs ( SSOs ) that occur when driving buses
off-chip [ Wada, Eino, and Anami, 1990]. The output pads can easily consume
most of the power on a CMOS ASIC, because the load on a pad (usually tens of
picofarads) is much larger than typical on-chip capacitive loads. Depending on
the technology it may be necessary to provide dedicated VDD and VSS pads for
every few SSOs. Design rules set how many SSOs can be used per VDD/VSS
pad pair. These dedicated VDD/VSS pads must follow groups of output pads as
they are seeded or planned on the floorplan. With some chip packages this can
become difficult because design rules limit the location of package pins that may
be used for supplies (due to the differing series inductance of each pin).
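A sketch of this budgeting, with an invented design-rule figure of four SSOs per
VDD/VSS pair:

n_sso        = 64;   % output pads that may switch simultaneously
sso_per_pair = 4;    % assumed design rule: SSOs per VDD/VSS pad pair
supply_pairs = ceil(n_sso / sso_per_pair);   % 16 dedicated pairs
pads_used    = 2 * supply_pairs;             % 32 pad sites for supplies

These 32 supply pads must then be seeded among the output-pad groups on the
floorplan, subject to the package-pin restrictions just described.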
Using a pad mapping we translate the logical pad in a netlist to a physical pad
from a pad library . We might control pad seeding and mapping in the
floorplanner. The handling of I/O pads can become quite complex; there are
several nonobvious factors that must be considered when generating a pad ring:
● Ideally we would only need to design library pad cells for one orientation.
For example, an edge pad for the south side of the chip, and a corner pad
for the southeast corner. We could then generate other orientations by
rotation and flipping (mirroring). Some ASIC vendors will not allow
rotation or mirroring of logic cells in the mask file. To avoid these
problems we may need to have separate horizontal, vertical, left-handed,
and right-handed pad cells in the library with appropriate logical to
physical pad mappings.
● If we mix pad-limited and core-limited edge pads in the same pad ring, this
complicates the design of corner pads. Usually the two types of edge pad
cannot abut. In this case a corner pad also becomes a pad-format changer ,
or hybrid corner pad .
● In single-supply chips we have one VDD net and one VSS net, both global
power nets . It is also possible to use mixed power supplies (for example,
3.3 V and 5 V) or multiple power supplies ( digital VDD, analog VDD).
Figure 16.13 (a) and (b) are magnified views of the southeast corner of our
example chip and show the different types of I/O cells. Figure 16.13 (c) shows a
stagger-bond arrangement using two rows of I/O pads. In this case the design
rules for bond wires (the spacing and the angle at which the bond wires leave the
pads) become very important.
FIGURE 16.13 Bonding pads. (a) This chip uses both pad-limited and
core-limited pads. (b) A hybrid corner pad. (c) A chip with stagger-bonded pads.
(d) An area-bump bonded chip (or flip-chip). The chip is turned upside down
and solder bumps connect the pads to the lead frame.
FIGURE 16.14 Gate-array I/O pads. (a) Cell-based ASICs may contain pad
cells of different sizes and widths. (b) A corner of a gate-array base. (c) A
gate-array base with different I/O cell and pad pitches.
FIGURE 16.15 Power distribution. (a) Power distributed using m1 for VSS and
m2 for VDD. This helps minimize the number of vias and layer crossings needed
but causes problems in the routing channels. (b) In this floorplan m1 is run
parallel to the longest side of all channels, the channel spine. This can make
automatic routing easier but may increase the number of vias and layer
crossings. (c) An expanded view of part of a channel (interconnect is shown as
lines). If power runs on different layers along the spine of a channel, this forces
signals to change layers. (d) A closeup of VDD and VSS buses as they cross.
Changing layers requires a large number of via contacts to reduce resistance.
Figure 16.15 shows two possible power distribution schemes. The long direction
of a rectangular channel is the channel spine . Some automatic routers may
require that metal lines parallel to a channel spine use a preferred layer (either
m1, m2, or m3). Alternatively we say that a particular metal layer runs in a
preferred direction . Since we can have both horizontal and vertical channels, we
may have the situation shown in Figure 16.15 , where we have to decide whether
to use a preferred layer or the preferred direction for some channels. This may or
may not be handled automatically by the routing software.
Clock skew represents a fraction of the clock period that we cannot use for
computation. A clock skew of 500 ps with a 200 MHz clock means that we waste
500 ps of every 5 ns clock cycle, or 10 percent of performance. Latency can
cause a similar loss of performance at the system level when we need to
resynchronize our output signals with a master system clock.
Figure 16.16 (c) illustrates the construction of a clock-driver cell. The delay
through a chain of CMOS gates is minimized when the ratio between the input
capacitance C1 and the output (load) capacitance C2 is about 3 (exactly e ≈ 2.7,
an exponential ratio, if we neglect the effect of parasitics). This means that the
fastest way to drive a large load is to use a chain of buffers with their input and
output loads chosen to maintain this ratio, or taper (we use this as a noun and a
verb). This is not necessarily the smallest or lowest-power method, though.
Suppose we have an ASIC with the following specifications:
● 40,000 flip-flops
● a total clock-pin capacitance of 800 pF
● a 3.3 V supply and a 200 MHz clock
The power dissipated charging the input capacitance of the flip-flop clocks is
fCV², or

P = (200 MHz)(800 pF)(3.3 V)² ≈ 1.7 W .
All of this power is dissipated in the clock-driver cell. The worst problem,
however, is the enormous peak current in the final inverter stage. If we assume
the needed rise time is 0.1 ns (with a 200 MHz clock whose period is 5 ns), the
peak current would have to approach
I = (800 pF)(3.3 V) / (0.1 ns) ≈ 26 A . (16.4)
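The following MATLAB sketch ties these numbers together: it sizes a buffer chain
with a taper of 3 and reproduces the peak-current estimate of Eq. 16.4. The input
capacitance of the first inverter is an assumed value; the other figures are from
the text above.

C_in   = 0.05e-12;    % F, input capacitance of the first inverter (assumed)
C_load = 800e-12;     % F, total clock load from the text
taper  = 3;           % stage ratio, close to the theoretical optimum e

n_stages = ceil(log(C_load / C_in) / log(taper));   % 9 stages here
I_peak   = C_load * 3.3 / 0.1e-9;                   % Eq. 16.4, about 26 A

Each stage is taper times larger than the one before it, so the chain reaches the
800 pF load in n_stages steps while keeping every stage near the optimum ratio.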
Designing a clock tree that balances the rise and fall times at the leaf nodes has
the beneficial side-effect of minimizing the effect of hot-electron wearout . This
problem occurs when an electron gains enough energy to become hot and jump
out of the channel into the gate oxide (the problem is worse for electrons in n
-channel devices because electrons are more mobile than holes). The trapped
electrons change the threshold voltage of the device and this alters the delay of
the buffers. As the buffer delays change with time, this introduces unpredictable
skew. The problem is worst when the n -channel device is carrying maximum
current with a high voltage across the channel; this occurs during the rise- and
fall-time transitions. Balancing the rise and fall times in each buffer means that
they all wear out at the same rate, minimizing any additional skew.
A phase-locked loop ( PLL ) is an electronic flywheel that locks in frequency to
an input clock signal. The input and output frequencies may differ in phase,
however. This means that we can, for example, drive a clock network with a PLL
in such a way that the output of the clock network is locked in phase to the
incoming clock, thus eliminating the latency of the clock network . A PLL can
also help to reduce random variation of the input clock frequency, known as jitter
, which, since it is unpredictable, must also be discounted from the time available
for computation in each clock cycle. Actel was one of the first FPGA vendors to
incorporate PLLs, and Actel's online product literature explains their use in ASIC
design.
Most ASICs currently use two or three levels of metal for signal routing. With two
layers of metal, we route within the rectangular channels using the first metal layer for
horizontal routing, parallel to the channel spine, and the second metal layer for the
vertical direction (if there is a third metal layer it will normally run in the horizontal
direction again). The maximum number of horizontal interconnects that can be placed
side by side, parallel to the channel spine, is the channel capacity .
FIGURE 16.19 Gate-array interconnect. (a) A small two-level metal gate array
(about 4.6 k-gate). (b) Routing in a block. (c) Channel routing showing channel
density and channel capacity. The channel height on a gate array may only be
increased in increments of a row. If the interconnect does not use up all of the
channel, the rest of the space is wasted. The interconnect in the channel runs in m1 in
the horizontal direction with m2 in the vertical direction.
Vertical interconnect uses feedthroughs (or feedthrus in the United States) to cross the
logic cells. Here are some commonly used terms with explanations (there are no
generally accepted definitions):
● An unused vertical track (or just track ) in a logic cell is called an uncommitted
feedthrough (also built-in feedthrough , implicit feedthrough , or jumper ).
● A vertical strip of metal that runs from the top to bottom of a cell (for
double-entry cells ), but has no connections inside the cell, is also called a
feedthrough or jumper.
● Two connectors for the same physical net are electrically equivalent connectors
(or equipotential connectors ). For double-entry cells these are usually at the top
and bottom of the logic cell.
● A dedicated feedthrough cell (or crosser cell ) is an empty cell (with no logic)
that can hold one or more vertical interconnects. These are used if there are no
other feedthroughs available.
● A feedthrough pin or feedthrough terminal is an input or output that has
connections at both the top and bottom of the standard cell.
● A spacer cell (usually the same as a feedthrough cell) is used to fill space in
rows so that the ends of all rows in a flexible block may be aligned to connect
to power buses, for example.
There is no standard terminology for connectors and the terms can be very confusing.
There is a difference between connectors that are joined inside the logic cell using a
high-resistance material such as polysilicon and connectors that are joined by
low-resistance metal. The high-resistance kind are really two separate alternative
connectors (that cannot be used as a feedthrough), whereas the low-resistance kind are
electrically equivalent connectors. There may be two or more connectors to a logic
cell, which are not joined inside the cell, and which must be joined by the router (
must-join connectors ).
There are also logically equivalent connectors (or functionally equivalent connectors,
sometimes also called just equivalent connectors, which is very confusing). The two
inputs of a two-input NAND gate may be logically equivalent connectors. The
placement tool can swap these without altering the logic (but the two inputs may have
different delay properties, so it is not always a good idea to swap them). There can
also be logically equivalent connector groups . For example, in an OAI22
(OR-AND-INVERT) gate there are four inputs: A1, A2 are inputs to one OR gate
(gate A), and B1, B2 are inputs to the second OR gate (gate B). Then group A = (A1,
A2) is logically equivalent to group B = (B1, B2): if we swap one input (A1 or A2)
from gate A to gate B, we must swap the other input in the group (A2 or A1).
In the case of channeled gate arrays and FPGAs, the horizontal interconnect areas
(the channels, usually on m1) have a fixed capacity (sometimes they are called
fixed-resource ASICs for this reason). The channel capacity of CBICs and channelless
MGAs can be expanded to hold as many interconnects as are needed. Normally we
choose, as an objective, to minimize the number of interconnects that use each
channel. In the vertical interconnect direction, usually m2, FPGAs still have fixed
resources. In contrast the placement tool can always add vertical feedthroughs to a
channeled MGA, channelless MGA, or CBIC. These problems become less important
as we move to three and more levels of interconnect.
16.2.2 Placement Goals and Objectives
The goal of a placement tool is to arrange all the logic cells within the flexible blocks
on a chip. Ideally, the objectives of the placement step are to
● Guarantee the router can complete the routing step
Objectives such as these are difficult to define in a way that can be solved with an
algorithm and even harder to actually meet. Current placement tools use more specific
and achievable criteria. The most commonly used placement objectives are one or
more of the following:
● Minimize the total estimated interconnect length
The minimum rectilinear Steiner tree ( MRST ) is the shortest interconnect using a
rectangular grid. The determination of the MRST is in general an NP-complete
problem, which means it is hard to solve. For small numbers of terminals heuristic
algorithms do exist, but they are expensive to compute. Fortunately we only need to
estimate the length of the interconnect. Two approximations to the MRST are shown
in Figure 16.21 .
The complete graph has connections from each terminal to every other terminal [
Hanan, Wolff, and Agule, 1973]. The complete-graph measure adds all the
interconnect lengths of the complete-graph connection together and then divides by n
/2, where n is the number of terminals. We can justify this since, in a graph with n
terminals, (n − 1) interconnects will emanate from each terminal to join the other
(n − 1) terminals in a complete graph connection. That makes n(n − 1) interconnects
in total. However, we have then made each connection twice. So there are one-half
this many, or n(n − 1)/2, interconnects needed for a complete graph connection. Now
we actually only need (n − 1) interconnects to join n terminals, so we have n/2 times as
many interconnects as we really need. Hence we divide the total net length of the
complete graph connection by n /2 to obtain a more reasonable estimate of minimum
interconnect length. Figure 16.21 (a) shows an example of the complete-graph
measure.
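A minimal MATLAB sketch of the complete-graph measure for one net, using Manhattan
distances and invented terminal coordinates:

x = [0 2 3 1];  y = [0 0 2 3];   % terminal positions (assumed example)
n = numel(x);
total = 0;
for i = 1:n-1
    for j = i+1:n                % each of the n(n - 1)/2 pairs once
        total = total + abs(x(i) - x(j)) + abs(y(i) - y(j));
    end
end
est = total / (n / 2);           % complete-graph measure of net length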
FIGURE 16.21 Interconnect-length measures. (a) Complete-graph measure.
(b) Half-perimeter measure.
The bounding box is the smallest rectangle that encloses all the terminals (not to be
confused with a logic cell bounding box, which encloses all the layout in a logic cell).
The half-perimeter measure (or bounding-box measure) is one-half the perimeter of
the bounding box ( Figure 16.21 b) [ Schweikert, 1976]. For nets with two or three
terminals (corresponding to a fanout of one or two, which usually includes over 50
percent of all nets on a chip), the half-perimeter measure is the same as the minimum
Steiner tree. For nets with four or five terminals, the minimum Steiner tree is between
one and two times the half-perimeter measure [ Hanan, 1966]. For a circuit with m
nets, using the half-perimeter measure corresponds to minimizing the cost function,
f = (1/2) Σ_{i=1}^{m} h_i , (16.5)

where h_i is the half-perimeter measure for net i.
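A sketch of this cost function in MATLAB, with two invented nets (each net stored as
a row of x-coordinates over a row of y-coordinates):

nets = { [0 2 3; 0 1 2], [1 4; 3 0] };   % assumed example nets
f = 0;
for k = 1:numel(nets)
    x = nets{k}(1,:);  y = nets{k}(2,:);
    h = (max(x) - min(x)) + (max(y) - min(y));   % half-perimeter of net k
    f = f + h;
end
f = f / 2;                                       % Eq. 16.5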
It does not really matter if our approximations are inaccurate if there is a good
correlation between actual interconnect lengths (after routing) and our
approximations. Figure 16.22 shows that we can adjust the complete-graph and
half-perimeter measures using correction factors [ Goto and Matsuda, 1986]. Now our
wiring length approximations are functions, not just of the terminal positions, but also
of the number of terminals, and the size of the bounding box. One practical example
adjusts a Steiner-tree approximation using the number of terminals [ Chao, Nequist,
and Vuong, 1990]. This technique is used in the Cadence Gate Ensemble placement
tool, for example.
FIGURE 16.22 Correlation between total length of chip interconnect and the
half-perimeter and complete-graph measures.
One problem with the measurements we have described is that the MRST may only
approximate the interconnect that will be completed by the detailed router. Some
programs have a meander factor that specifies, on average, the ratio of the
interconnect created by the routing tool to the interconnect-length estimate used by the
placement tool. Another problem is that we have concentrated on finding estimates to
the MRST, but the MRST that minimizes total net length may not minimize net delay
(see Section 16.2.8 ).
FIGURE 16.24 Min-cut placement. (a) Divide the chip into bins using a grid.
(b) Merge all connections to the center of each bin. (c) Make a cut and swap
logic cells between bins to minimize the cost of the cut. (d) Take the cut pieces
and throw out all the edges that are not inside the piece. (e) Repeat the process
with a new cut and continue until we reach the individual bins.
Usually we divide the placement area into bins . The size of a bin can vary, from a bin
size equal to the base cell (for a gate array) to a bin size that would hold several logic
cells. We can start with a large bin size, to get a rough placement, and then reduce the
bin size to get a final placement.
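The following MATLAB fragment sketches the swapping at one level of min-cut
placement: cells are assigned to two bins and pairs are exchanged across the cut
whenever that reduces the number of crossing connections. A production tool would
use group migration (the Kernighan-Lin algorithm) rather than this simple greedy
loop; the connectivity matrix is the small example used later in Section 16.2.5.

C = [0 0 0 1; 0 0 1 1; 0 1 0 0; 1 1 0 0];    % connectivity matrix
side = [1 1 2 2];                             % initial bin assignment
cut_cost = @(s) sum(sum(C(s == 1, s == 2)));  % connections crossing the cut
improved = true;
while improved
    improved = false;
    for i = find(side == 1)
        for j = find(side == 2)
            if side(i) ~= 1 || side(j) ~= 2, continue; end
            trial = side; trial(i) = 2; trial(j) = 1;    % swap i and j
            if cut_cost(trial) < cut_cost(side)
                side = trial; improved = true;
            end
        end
    end
end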
The eigenvalue placement algorithm uses the cost matrix or weighted connectivity
matrix ( eigenvalue methods are also known as spectral methods ) [Hall, 1970]. The
measure we use is a cost function f that we shall minimize, given by
f = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} c_ij d_ij² , (16.6)

where d_ij² = (x_i − x_j)² + (y_i − y_j)² is the squared distance between the
centers of logic cells i and j. In matrix form this is

f = x^T Bx + y^T By . (16.7)
In Eq. 16.7 , B is a symmetric matrix, the disconnection matrix (also called the
Laplacian).
We may express the Laplacian B in terms of the connectivity matrix C and a
diagonal matrix D (known as the degree matrix), defined as follows:

B = D − C ; (16.8)

d_ii = Σ_{j=1}^{n} c_ij , i = 1, ... , n ; d_ij = 0, i ≠ j .
In a valid placement the vector x of logic cell coordinates
must be a permutation of the fixed positions, p . We can show that requiring the logic
cells to be in fixed positions in this way leads to a series of n equations restricting the
values of the logic cell coordinates [ Cheng and Kuh, 1984]. If we impose all of these
constraint equations the problem becomes very complex. Instead we choose just one
of the equations:
Σ_{i=1}^{n} x_i² = Σ_{i=1}^{n} p_i² . (16.11)
Simplifying the problem in this way will lead to an approximate solution to the
placement problem. We can write this single constraint on the x -coordinates in matrix
form:
x^T x = P ; P = Σ_{i=1}^{n} p_i² , (16.12)
where P is a constant. We can now summarize the formulation of the problem, with
the simplifications that we have made, for a one-dimensional solution. We must
minimize a cost function, g (analogous to the cost function f that we defined for the
two-dimensional problem in Eq. 16.7 ), where
g = x^T Bx , (16.13)

subject to the constraint x^T x = P (Eq. 16.12). Introducing a Lagrange multiplier
λ and setting the derivative of x^T Bx − λ(x^T x − P) to zero gives the condition

Bx = λx . (16.16)

This last equation is called the characteristic equation for the disconnection matrix B
and occurs frequently in matrix algebra (this λ has nothing to do with scaling). The
solutions to this equation are the eigenvectors and eigenvalues of B . Multiplying Eq.
16.16 by x^T we get:

λ x^T x = x^T Bx . (16.17)
The eigenvectors of the disconnection matrix B are the solutions to our placement
problem. It turns out that (because something called the rank of matrix B is n − 1)
there is a degenerate solution with all x -coordinates equal (λ = 0); this makes some
sense because putting all the logic cells on top of one another certainly minimizes the
interconnect. The smallest, nonzero, eigenvalue and the corresponding eigenvector
provides the solution that we want. In the two-dimensional placement problem, the x -
and y -coordinates are given by the eigenvectors corresponding to the two smallest,
nonzero, eigenvalues. (In the next section a simple example illustrates this
mathematical derivation.)
16.2.5 Eigenvalue Placement Example
Consider the following connectivity matrix C and its disconnection matrix B ,
calculated from Eq. 16.8 [ Hall, 1970]:
C =
[ 0 0 0 1
  0 0 1 1
  0 1 0 0
  1 1 0 0 ]

B =
[ 1 0 0 0     [ 0 0 0 1     [  1  0  0 −1
  0 2 0 0  −    0 0 1 1  =     0  2 −1 −1
  0 0 1 0       0 1 0 0        0 −1  1  0
  0 0 0 2 ]     1 1 0 0 ]     −1 −1  0  2 ]
(16.19)
Figure 16.25 (a) shows the corresponding network with four logic cells (1–4) and three
nets (A–C). Here is a MATLAB script to find the eigenvalues and eigenvectors of B :
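The script below is a reconstruction from Eq. 16.8 and the matrices above (the
original listing does not appear here):

C = [0 0 0 1; 0 0 1 1; 0 1 0 0; 1 1 0 0];
D = diag(sum(C, 2));   % degree matrix: d_ii is the row sum of C
B = D - C;             % disconnection matrix (Laplacian), Eq. 16.8
[X, L] = eig(B)        % columns of X: eigenvectors; diag(L): eigenvalues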
For a one-dimensional placement ( Figure 16.25 b), we use the eigenvector (0.6533,
−0.2706, −0.6533, 0.2706) corresponding to the smallest nonzero eigenvalue (which is
0.5858) to place the logic cells along the x -axis. The two-dimensional placement (
Figure 16.25 c) uses these same values for the x -coordinates and the eigenvector
(0.5, −0.5, 0.5, −0.5) that corresponds to the next largest eigenvalue (which is 2.0)
for the y -coordinates. Notice that the placement shown in Figure 16.25 (c), which shows
logic-cell outlines (the logic-cell abutment boxes), takes no account of the cell sizes,
and cells may even overlap at this stage. This is because, in Eq. 16.11 , we discarded
all but one of the constraints necessary to ensure valid solutions. Often we use the
approximate eigenvalue solution as an initial placement for one of the iterative
improvement algorithms that we shall discuss in Section 16.2.6 .
● The measurement criteria that decide whether to move the selected cells.
There are several interchange or iterative exchange methods that differ in their
selection and measurement criteria:
● pairwise interchange,
● force-directed interchange,
All of these methods usually consider only pairs of logic cells to be exchanged. A
source logic cell is picked for trial exchange with a destination logic cell. We have
already discussed the use of interchange methods applied to the system partitioning
step. The most widely used methods use group migration, especially the Kernighan
Lin algorithm. The pairwise-interchange algorithm is similar to the interchange
algorithm used for iterative improvement in the system partitioning step:
1. Select the source logic cell at random.
2. Try all the other logic cells in turn as the destination logic cell.
3. Use any of the measurement methods we have discussed to decide on whether
to accept the interchange.
4. The process repeats from step 1, selecting each logic cell in turn as a source
logic cell.
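A MATLAB sketch of these steps, using total half-perimeter wirelength as the
measurement criterion (the positions and nets are invented; a real tool would also
honor row and overlap constraints):

pos  = [0 0; 1 0; 0 1; 1 1];     % one (x, y) row per logic cell (assumed)
nets = { [1 4], [2 3], [2 4] };  % cells on each net (assumed)
hpwl = @(p) sum(cellfun(@(n) (max(p(n,1)) - min(p(n,1))) ...
                       + (max(p(n,2)) - min(p(n,2))), nets));
for src = 1:size(pos, 1)          % step 1 and step 4: each cell as source
    for dst = 1:size(pos, 1)      % step 2: try every destination
        if dst == src, continue; end
        trial = pos;
        trial([src dst], :) = pos([dst src], :);  % tentative swap
        if hpwl(trial) < hpwl(pos)                % step 3: accept if better
            pos = trial;
        end
    end
end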
Figure 16.26 (a) and (b) show how we can extend pairwise interchange to swap more
than two logic cells at a time. If we swap l logic cells at a time and find a locally
optimum solution, we say that solution is l -optimum . The neighborhood exchange
algorithm is a modification to pairwise interchange that considers only destination
logic cells in a neighborhood: cells within a certain distance, ε, of the source logic
cell. Limiting the search area for the destination logic cell to the ε-neighborhood
reduces the search time. Figure 16.26 (c) and (d) show the one- and
two-neighborhoods (based on Manhattan distance) for a logic cell.
FIGURE 16.26 Interchange. (a) Swapping the source logic cell with a destination
logic cell in pairwise interchange. (b) Sometimes we have to swap more than two
logic cells at a time to reach an optimum placement, but this is expensive in
computation time. Limiting the search to neighborhoods reduces the search time.
Logic cells within a distance ε of a logic cell form an ε-neighborhood. (c) A
one-neighborhood. (d) A two-neighborhood.
In force-directed placement the total force on a logic cell i is the vector sum of
c_ij x_ij over all other cells j. The vector component x_ij is directed from the
center of logic cell i to the center of logic cell j . The vector magnitude is
calculated as either the Euclidean or Manhattan distance between the logic cell
centers. The c_ij form the connectivity or cost matrix (the matrix element c_ij is
the number of connections between logic cell i and logic cell j ). If we want, we
can also weight the c_ij to denote critical connections.
Figure 16.27 illustrates the force-directed placement algorithm.
FIGURE 16.27 Force-directed placement. (a) A network with nine logic cells.
(b) We make a grid (one logic cell per bin). (c) Forces are calculated as if springs
were attached to the centers of each logic cell for each connection. The two nets
connecting logic cells A and I correspond to two springs. (d) The forces are
proportional to the spring extensions.
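A sketch of the force calculation, reusing the small connectivity matrix from
Section 16.2.5 with invented cell positions. Each connection behaves as a spring, so
the total force on cell i is the sum of c_ij times the vector from cell i to cell j:

C   = [0 0 0 1; 0 0 1 1; 0 1 0 0; 1 1 0 0];  % connectivity matrix
pos = [0 0; 2 0; 2 2; 0 2];                  % cell centers (assumed)
n = size(C, 1);
F = zeros(n, 2);                             % net force on each cell
for i = 1:n
    for j = 1:n
        F(i,:) = F(i,:) + C(i,j) * (pos(j,:) - pos(i,:));  % spring force
    end
end

A cell is in equilibrium where its row of F is zero; a force-directed placer moves
each cell toward (or swaps it into) its zero-force location.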
Kirkpatrick, Gelatt, and Vecchi first described the use of simulated annealing applied
to VLSI problems [ 1983]. Experience since that time has shown that simulated
annealing normally requires the use of a slow cooling schedule and this means long
CPU run times [ Sechen, 1988; Wong, Leong, and Liu, 1988]. As a general rule,
experiments show that simple min-cut based constructive placement is faster than
simulated annealing but that simulated annealing is capable of giving better results at
the expense of long computer run times. The iterative improvement methods that we
described earlier are capable of giving results as good as simulated annealing, but they
use more complex algorithms.
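A toy simulated-annealing loop over the same pairwise-swap move set, with
half-perimeter wirelength as the cost. The schedule here (starting temperature,
twenty moves per temperature, geometric cooling) is a deliberately tiny assumed
setting; the slow schedules that good results demand are exactly why the run times
above are long:

pos  = [0 0; 1 0; 0 1; 1 1];     % assumed example placement
nets = { [1 4], [2 3], [2 4] };
hpwl = @(p) sum(cellfun(@(n) (max(p(n,1)) - min(p(n,1))) ...
                       + (max(p(n,2)) - min(p(n,2))), nets));
T = 1.0;                          % initial temperature
while T > 1e-3
    for move = 1:20
        ij = randperm(size(pos, 1), 2);          % pick two cells
        trial = pos;  trial(ij, :) = pos(fliplr(ij), :);
        dE = hpwl(trial) - hpwl(pos);
        if dE < 0 || rand < exp(-dE / T)         % accept some uphill moves
            pos = trial;
        end
    end
    T = 0.9 * T;                  % geometric cooling schedule
end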
While I am making wild generalizations, I will digress to discuss benchmarks of
placement algorithms (or any CAD algorithm that is random). It is important to
remember that the results of random methods are themselves random. Suppose the
results from two random algorithms, A and B, can each vary by ±10 percent for any
chip placement, but both algorithms have the same average performance. If we
compare single chip placements by both algorithms, they could falsely show
algorithm A to be better than B by up to 20 percent or vice versa. Put another way, if
we run enough test cases we will eventually find some for which A is better than B by
20 percent: a trick that Ph.D. students and marketing managers both know well. Even
single-run evaluations over multiple chips are hardly a fair comparison. The only way
to obtain meaningful results is to compare a statistically meaningful number of runs
for a statistically meaningful number of chips for each algorithm. This same caution
applies to any VLSI algorithm that is random. There was a Design Automation
Conference panel session whose theme was "Enough of algorithms claiming
improvements of 5%."
16.2.8 Timing-Driven Placement Methods
Minimizing delay is becoming more and more important as a placement objective.
There are two main approaches: net based and path based. We know that we can use
net weights in our algorithms. The problem is to calculate the weights. One method
finds the n most critical paths (using a timing-analysis engine, possibly in the
synthesis tool). The net weights might then be the number of times each net appears in
this list. The problem with this approach is that as soon as we fix (for example) the
first 100 critical nets, suddenly another 200 become critical. This is rather like trying
to put worms in a can: as soon as we open the lid to put one in, two more pop out.
Another method to find the net weights uses the zero-slack algorithm [ Hauge et al.,
1987]. Figure 16.29 shows how this works (all times are in nanoseconds).
Figure 16.29 (a) shows a circuit with primary inputs at which we know the arrival
times (this is the original definition, some people use the term actual times ) of each
signal. We also know the required times for the primary outputs the points in time at
which we want the signals to be valid. We can work forward from the primary inputs
and backward from the primary outputs to determine arrival and required times at
each input pin for each net. The difference between the required and arrival times at
each input pin is the slack time (the time we have to spare). The zero-slack algorithm
adds delay to each net until the slacks are zero, as shown in Figure 16.29 (b). The net
delays can then be converted to weights or constraints in the placement. Notice that
we have assumed that all the gates on a net switch at the same time so that the net
delay can be placed at the output of the gate driving the net (a rather poor timing
model, but the best we can use without any routing information).
FIGURE 16.29 The zero-slack algorithm. (a) The circuit with no net delays. (b) The
zero-slack algorithm adds net delays (at the outputs of each gate, equivalent to
increasing the gate delay) to reduce the slack times to zero.
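A sketch of the slack arithmetic that drives the algorithm, for an invented two-gate
path (all times in nanoseconds):

d_gate   = [2 3];    % gate delays along the path (assumed)
t_arrive = 0;        % arrival time at the primary input
t_req    = 8;        % required time at the primary output

a1 = t_arrive + d_gate(1);   % arrival at the second gate's input: 2
a2 = a1 + d_gate(2);         % arrival at the primary output: 5
r1 = t_req - d_gate(2);      % required at the second gate's input: 5
slack = r1 - a1;             % 3 ns of slack on this net

The zero-slack algorithm would add 3 ns of net delay here (at the driving gate's
output) so that the slack becomes zero, then turn that delay into a placement weight
or constraint.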
An important point to remember is that adjusting the net weight, even for every net on
a chip, does not theoretically make the placement algorithms any more complex; we
have to deal with the numbers anyway. It does not matter whether the net weight is 1
or 6.6, for example. The practical problem, however, is getting the weight information
for each net (usually in the form of timing constraints) from a synthesis tool or timing
verifier. These files can easily be hundreds of megabytes in size (see Section 16.4 ).
With the zero-slack algorithm we simplify but overconstrain the problem. For
example, we might be able to do a better job by making some nets a little longer than
the slack indicates if we can tighten up other nets. What we would really like to do is
deal with paths such as the critical path shown in Figure 16.29 (a) and not just nets .
Path-based algorithms have been proposed to do this, but they are complex and not all
commercial tools have this capability (see, for example, [ Youssef, Lin, and
Shragowitz, 1992]).
There is still the question of how to predict path delays between gates with only
placement information. Usually we still do not compute a routing tree but use simple
approximations to the total net length (such as the half-perimeter measure) and then
use this to estimate a net delay (the same to each pin on a net). It is not until the
routing step that we can make accurate estimates of the actual interconnect delays.