Module-2 Floor Planning and Routing

Uploaded by

gadagtrupti
Last Edited by SP 14112004

Module-2: Floor planning, Placement and Routing

FLOORPLANNING AND PLACEMENT
The input to the floorplanning step is the output of system partitioning and design
entry: a netlist. Floorplanning precedes placement, but we shall cover them
together. The output of the placement step is a set of directions for the routing
tools.
At the start of floorplanning we have a netlist describing circuit blocks, the logic
cells within the blocks, and their connections. For example, Figure 16.1 shows
the Viterbi decoder example as a collection of standard cells with no room set
aside yet for routing. We can think of the standard cells as a hod of bricks to be
made into a wall. What we have to do now is set aside spaces (we call these
spaces the channels ) for interconnect, the mortar, and arrange the cells.
Figure 16.2 shows a finished wall, after the floorplanning and placement steps are
complete. We still have not completed any routing at this point (that comes later);
all we have done is place the logic cells in a fashion that we hope will minimize
the total interconnect length, for example.
FIGURE 16.1 The starting point for the floorplanning and placement steps for
the Viterbi decoder (containing only standard cells). This is the initial display of
the floorplanning and placement tool. The small boxes that look like bricks are
the outlines of the standard cells. The largest standard cells, at the bottom of the
display (labeled dfctnb) are 188 D flip-flops. The '+' symbols represent the
drawing origins of the standard cells; for the D flip-flops they are shifted to the
left and below the logic cell bottom left-hand corner. The large box surrounding
all the logic cells represents the estimated chip size. (This is a screen shot from
Cadence Cell Ensemble.)
FIGURE 16.2 The Viterbi Decoder (from Figure 16.1 ) after floorplanning and
placement. There are 18 rows of standard cells separated by 17 horizontal
channels (labeled 2–18). The channels are routed as numbered. In this example,
the I/O pads are omitted to show the cell placement more clearly. Figure 17.1
shows the same placement without the channel labels. (A screen shot from
Cadence Cell Ensemble.)
16.1 Floorplanning
Figure 16.3 shows that both interconnect delay and gate delay decrease as we
scale down feature sizes, but at different rates. This is because interconnect
capacitance tends to a limit of about 2 pF cm⁻¹ for a minimum-width wire while
gate delay continues to decrease (see Section 17.4, Circuit Extraction and DRC).
Floorplanning allows us to predict this interconnect delay by estimating
interconnect length.
FIGURE 16.3 Interconnect and gate delays. As feature sizes decrease, both average
interconnect delay and average gate delay decrease, but at different rates. This is
because interconnect capacitance tends to a limit that is independent of scaling.
Interconnect delay now dominates gate delay.

16.1.1 Floorplanning Goals and Objectives
The input to a floorplanning tool is a hierarchical netlist that describes the
interconnection of the blocks (RAM, ROM, ALU, cache controller, and so on);
the logic cells (NAND, NOR, D flip-flop, and so on) within the blocks; and the
logic cell connectors (the terms terminals , pins , or ports mean the same thing as
connectors ). The netlist is a logical description of the ASIC; the floorplan is a
physical description of an ASIC. Floorplanning is thus a mapping between the
logical description (the netlist) and the physical description (the floorplan).
The goals of floorplanning are to:
● arrange the blocks on a chip,

● decide the location of the I/O pads,

● decide the location and number of the power pads,

● decide the type of power distribution, and

● decide the location and type of clock distribution.


The objectives of floorplanning are to minimize the chip area and minimize
delay. Measuring area is straightforward, but measuring delay is more difficult
and we shall explore this next.

16.1.2 Measurement of Delay in Floorplanning
Throughout the ASIC design process we need to predict the performance of the
final layout. In floorplanning we wish to predict the interconnect delay before we
complete any routing. Imagine trying to predict how long it takes to get from
Russia to China without knowing where in Russia we are or where our
destination is in China. Actually it is worse, because in floorplanning we may
move Russia or China.
To predict delay we need to know the parasitics associated with interconnect: the
interconnect capacitance ( wiring capacitance or routing capacitance ) as well as
the interconnect resistance. At the floorplanning stage we know only the fanout (
FO ) of a net (the number of gates driven by a net) and the size of the block that
the net belongs to. We cannot predict the resistance of the various pieces of the
interconnect path since we do not yet know the shape of the interconnect for a
net. However, we can estimate the total length of the interconnect and thus
estimate the total capacitance. We estimate interconnect length by collecting
statistics from previously routed chips and analyzing the results. From these
statistics we create tables that predict the interconnect capacitance as a function
of net fanout and block size. A floorplanning tool can then use these
predicted-capacitance tables (also known as interconnect-load tables or wire-load
tables ). Figure 16.4 shows how we derive and use wire-load tables and illustrates
the following facts:
FIGURE 16.4 Predicted capacitance. (a) Interconnect lengths as a function of
fanout (FO) and circuit-block size. (b) Wire-load table. There is only one
capacitance value for each fanout (typically the average value). (c) The
wire-load table predicts the capacitance and delay of a net (with a considerable
error). Net A and net B both have a fanout of 1, both have the same predicted net
delay, but net B in fact has a much greater delay than net A in the actual layout
(of course we shall not know what the actual layout is until much later in the
design process).
● Typically between 60 and 70 percent of nets have a FO = 1.
● The distribution for a FO = 1 has a very long tail, stretching to
interconnects that run from corner to corner of the chip.
● The distribution for a FO = 1 often has two peaks, corresponding to a
distribution for close neighbors in subgroups within a block, superimposed
on a distribution corresponding to routing between subgroups.
● We often see a twin-peaked distribution at the chip level also,
corresponding to separate distributions for intrablock routing (inside
blocks) and interblock routing (between blocks).
● The distributions for FO > 1 are more symmetrical and flatter than for FO
= 1.
● The wire-load tables can only contain one number, for example the average
net capacitance, for any one distribution. Many tools take a worst-case
approach and use the 80- or 90-percentile point instead of the average.
Thus a tool may use a predicted capacitance for which we know 90 percent
of the nets will have less than the estimated capacitance.
● We need to repeat the statistical analysis for blocks with different sizes.
For example, a net with a FO = 1 in a 25 k-gate block will have a different
(larger) average length than if the net were in a 5 k-gate block.
● The statistics depend on the shape (aspect ratio) of the block (usually the
statistics are only calculated for square blocks).
● The statistics will also depend on the type of netlist. For example, the
distributions will be different for a netlist generated by setting a constraint
for minimum logic delay during synthesis (which tends to generate large
numbers of two-input NAND gates) than for netlists generated using
minimum-area constraints.
There are no standards for the wire-load tables themselves, but there are some
standards for their use and for presenting the extracted loads (see Section 16.4 ).
Wire-load tables often present loads in terms of a standard load that is usually the
input capacitance of a two-input NAND gate with a 1X (default) drive strength.
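The reduction of a length distribution to a single table entry can be sketched in a few lines of Python. This is only an illustration: the sample lengths are invented, the function names are hypothetical, and the nearest-rank percentile is one of several possible definitions (the text does not say which one any particular tool uses).

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct percent
    of all samples at or below it (pct is an integer from 1 to 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def build_wire_load_table(samples, pct=None):
    """samples maps fanout -> list of routed net lengths (mm) for one block size.
    Returns one representative length per fanout: the average if pct is None,
    otherwise the requested percentile (a worst-case style estimate)."""
    table = {}
    for fanout, lengths in samples.items():
        table[fanout] = mean(lengths) if pct is None else percentile(lengths, pct)
    return table

# Invented net lengths (mm); note the long tail for FO = 1.
samples = {
    1: [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.8, 1.5, 4.0],
    2: [0.6, 0.8, 0.9, 1.0, 1.2],
}
average_table = build_wire_load_table(samples)
worst_case_table = build_wire_load_table(samples, pct=90)
```

With the long-tailed FO = 1 samples above, the 90-percentile entry is noticeably larger than the average, which is why the two choices can lead to quite different delay predictions.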
TABLE 16.1 A wire-load table showing average interconnect lengths (mm). ¹
                                                     Fanout
Array (available gates)    Chip size (mm)      1       2       4
3 k                        3.45                0.56    0.85    1.46
11 k                       5.11                0.84    1.34    2.25
105 k                      12.50               1.75    2.70    4.92

Table 16.1 shows the estimated metal interconnect lengths, as a function of die
size and fanout, for a series of three-level metal gate arrays. In this case the
interconnect capacitance is about 2 pF cm⁻¹, a typical figure.
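As a rough illustration of how a tool might use such a table, the following Python sketch converts the average lengths of Table 16.1 into predicted net capacitances using the 2 pF cm⁻¹ figure quoted above. The dictionary layout and the function name are assumptions made for this example only.

```python
# Average interconnect lengths (mm) from Table 16.1: array size -> fanout -> length.
WIRE_LENGTH_MM = {
    "3k":   {1: 0.56, 2: 0.85, 4: 1.46},
    "11k":  {1: 0.84, 2: 1.34, 4: 2.25},
    "105k": {1: 1.75, 2: 2.70, 4: 4.92},
}
CAP_PF_PER_CM = 2.0  # interconnect capacitance quoted in the text

def predicted_capacitance_pf(array, fanout):
    """Predicted net capacitance (pF) = average length (cm) x capacitance per cm."""
    length_cm = WIRE_LENGTH_MM[array][fanout] / 10.0
    return length_cm * CAP_PF_PER_CM

small = predicted_capacitance_pf("3k", 1)    # 0.056 cm x 2 pF/cm, about 0.11 pF
large = predicted_capacitance_pf("105k", 4)  # 0.492 cm x 2 pF/cm, about 0.98 pF
```

The table holds only fanouts of 1, 2, and 4; a real tool would need some rule for the fanouts in between, which this sketch deliberately leaves out.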
Figure 16.5 shows that, because we do not decrease chip size as we scale down
feature size, the worst-case interconnect delay increases. One way to measure the
worst-case delay uses an interconnect that completely crosses the chip, a
coast-to-coast interconnect. In certain cases the worst-case delay of a 0.25 µm
process may be worse than that of a 0.35 µm process, for example.
FIGURE 16.5 Worst-case interconnect delay. As we scale circuits, but avoid scaling
the chip size, the worst-case interconnect delay increases.

16.1.3 Floorplanning Tools


Figure 16.6 (a) shows an initial random floorplan generated by a floorplanning
tool. Two of the blocks, A and C in this example, are standard-cell areas (the chip
shown in Figure 16.1 is one large standard-cell area). These are flexible blocks
(or variable blocks ) because, although their total area is fixed, their shape (aspect
ratio) and connector locations may be adjusted during the placement step. The
dimensions and connector locations of the other fixed blocks (perhaps RAM,
ROM, compiled cells, or megacells) can only be modified when they are created.
We may force logic cells to be in selected flexible blocks by seeding . We choose
seed cells by name. For example, ram_control* would select all logic cells whose
names started with ram_control to be placed in one flexible block. The special
symbol, usually ' * ', is a wildcard symbol . Seeding may be hard or soft. A hard
seed is fixed and not allowed to move during the remaining floorplanning and
placement steps. A soft seed is an initial suggestion only and can be altered if
necessary by the floorplanner. We may also use seed connectors within flexible
blocks, forcing certain nets to appear in a specified order or location at the
boundary of a flexible block.
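Selecting seed cells by name with a wildcard can be sketched with Python's standard fnmatch module, whose '*' has the same meaning as the wildcard described above. The cell names and the helper function are invented for illustration; an actual floorplanner will have its own pattern syntax.

```python
from fnmatch import fnmatchcase

def select_seed_cells(cell_names, pattern):
    """Return the cells whose names match the wildcard pattern (case-sensitive)."""
    return [name for name in cell_names if fnmatchcase(name, pattern)]

# Hypothetical logic-cell instance names from a netlist.
cells = ["ram_control_1", "ram_control_dec", "alu_add_0", "ram_data_3"]
seeded = select_seed_cells(cells, "ram_control*")
```

Here seeded contains only the two ram_control cells, which would then be assigned to one flexible block as either a hard or a soft seed.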
FIGURE 16.6 Floorplanning a cell-based ASIC. (a) Initial floorplan generated
by the floorplanning tool. Two of the blocks are flexible (A and C) and contain
rows of standard cells (unplaced). A pop-up window shows the status of block
A. (b) An estimated placement for flexible blocks A and C. The connector
positions are known and a rat's nest display shows the heavy congestion below
block B. (c) Moving blocks to improve the floorplan. (d) The updated display
shows the reduced congestion after the changes.

The floorplanner can complete an estimated placement to determine the positions
of connectors at the boundaries of the flexible blocks. Figure 16.6 (b) illustrates a
rat's nest display of the connections between blocks. Connections are shown as
bundles between the centers of blocks or as flight lines between connectors.
Figure 16.6 (c) and (d) show how we can move the blocks in a floorplanning tool
to minimize routing congestion .
We need to control the aspect ratio of our floorplan because we have to fit our
chip into the die cavity (a fixed-size hole, usually square) inside a package.
Figure 16.7 (a)–(c) show how we can rearrange our chip to achieve a square
aspect ratio. Figure 16.7 (c) also shows a congestion map , another form of
routability display. There is no standard measure of routability. Generally the
interconnect channels (or wiring channels; I shall call them channels from now
on) have a certain channel capacity; that is, they can handle only a fixed number
of interconnects. One measure of congestion is the difference between the
number of interconnects that we actually need, called the channel density , and
the channel capacity. Another measure, shown in Figure 16.7 (c), uses the ratio of
channel density to the channel capacity. With practice, we can create a good
initial placement by floorplanning and a pictorial display. This is one area where
the human ability to recognize patterns and spatial relations is currently superior
to a computer program's ability.
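The two congestion measures just described (the difference and the ratio between channel density and channel capacity) are simple to compute. The Python sketch below uses made-up channel names and numbers, and the helper name is hypothetical.

```python
def congestion_measures(density, capacity):
    """density: interconnects the channel needs; capacity: interconnects it can hold.
    Returns (overflow, ratio): the difference and the ratio of the two."""
    overflow = density - capacity
    ratio = density / capacity
    return overflow, ratio

# Hypothetical channels: name -> (channel density, channel capacity).
channels = {"ch1": (12, 20), "ch2": (25, 20)}
report = {name: congestion_measures(d, c) for name, (d, c) in channels.items()}

# A channel whose density exceeds its capacity cannot be routed as it stands.
unroutable = [name for name, (overflow, _) in report.items() if overflow > 0]
```

A congestion map like the one in Figure 16.7 (c) is essentially a shaded plot of the ratio for every channel.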

FIGURE 16.7 Congestion analysis. (a) The initial floorplan with a 2:1.5 die
aspect ratio. (b) Altering the floorplan to give a 1:1 chip aspect ratio. (c) A trial
floorplan with a congestion map. Blocks A and C have been placed so that we
know the terminal positions in the channels. Shading indicates the ratio of
channel density to the channel capacity. Dark areas show regions that cannot be
routed because the channel congestion exceeds the estimated capacity.
(d) Resizing flexible blocks A and C alleviates congestion.
FIGURE 16.8 Routing a T-junction between two channels in two-level metal.
The dots represent logic cell pins. (a) Routing channel A (the stem of the T) first
allows us to adjust the width of channel B. (b) If we route channel B first (the
top of the T), this fixes the width of channel A. We have to route the stem of a
T-junction before we route the top.

16.1.4 Channel Definition


During the floorplanning step we assign the areas between blocks that are to be
used for interconnect. This process is known as channel definition or channel
allocation . Figure 16.8 shows a T-shaped junction between two rectangular
channels and illustrates why we must route the stem (vertical) of the T before the
bar. The general problem of choosing the order of rectangular channels to route is
channel ordering .

FIGURE 16.9 Defining the channel routing order for a slicing floorplan using a
slicing tree. (a) Make a cut all the way across the chip between circuit blocks.
Continue slicing until each piece contains just one circuit block. Each cut divides
a piece into two without cutting through a circuit block. (b) A sequence of cuts:
1, 2, 3, and 4 that successively slices the chip until only circuit blocks are left.
(c) The slicing tree corresponding to the sequence of cuts gives the order in
which to route the channels: 4, 3, 2, and finally 1.

Figure 16.9 shows a floorplan of a chip containing several blocks. Suppose we
cut along the block boundaries, slicing the chip into two pieces (Figure 16.9 a).
Then suppose we can slice each of these pieces into two. If we can continue in
this fashion until all the blocks are separated, then we have a slicing floorplan (
Figure 16.9 b). Figure 16.9 (c) shows how the sequence we use to slice the chip
defines a hierarchy of the blocks. Reversing the slicing order ensures that we
route the stems of all the channel T-junctions first.
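One way to picture the reversed slicing order is as a post-order walk of the slicing tree: every cut is listed only after the cuts made inside its two pieces. The Python sketch below assumes a hypothetical tree shape (each cut separates one block from the rest); it is not the tree of Figure 16.9, whose exact shape is not given in the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: str                        # cut number for a cut, block name for a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def routing_order(node):
    """Post-order walk of a slicing tree: the channels created by later cuts
    are listed before the channel of the cut that contains them."""
    if node.left is None and node.right is None:
        return []                     # a leaf is a circuit block, not a channel
    return routing_order(node.left) + routing_order(node.right) + [node.label]

# Hypothetical slicing tree: cut 1 is made first, cut 4 last.
tree = Node("1", Node("A"),
            Node("2", Node("B"),
                 Node("3", Node("C"),
                      Node("4", Node("D"), Node("E")))))
order = routing_order(tree)
```

For this tree the walk gives the order 4, 3, 2, 1, matching the reversed sequence of cuts.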
FIGURE 16.10 Cyclic constraints. (a) A nonslicing floorplan with a cyclic
constraint that prevents channel routing. (b) In this case it is difficult to find a
slicing floorplan without increasing the chip area. (c) This floorplan may be
sliced (with initial cuts 1 or 2) and has no cyclic constraints, but it is inefficient
in area use and will be very difficult to route.

Figure 16.10 shows a floorplan that is not a slicing structure. We cannot cut the
chip all the way across with a knife without chopping a circuit block in two. This
means we cannot route any of the channels in this floorplan without routing all of
the other channels first. We say there is a cyclic constraint in this floorplan. There
are two solutions to this problem. One solution is to move the blocks until we
obtain a slicing floorplan. The other solution is to allow the use of L-shaped,
rather than rectangular, channels (or areas with fixed connectors on all sides: a
switch box). We need an area-based router rather than a channel router to route
L-shaped regions or switch boxes (see Section 17.2.6, Area-Routing Algorithms).
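A cyclic constraint can be detected mechanically if we record, for each channel, which channels must be routed before it. The Python sketch below models this as a dependency graph and uses the standard graphlib module (Python 3.9 or later); the channel names and the dependency sets are invented for illustration.

```python
from graphlib import TopologicalSorter, CycleError

def channel_order(must_route_first):
    """must_route_first maps each channel to the set of channels that have to be
    routed before it. Returns a valid routing order, or None if the constraints
    are cyclic and no channel-by-channel order exists."""
    try:
        return list(TopologicalSorter(must_route_first).static_order())
    except CycleError:
        return None

# A single T-junction: the stem must be routed before the top.
t_junction = {"top": {"stem"}, "stem": set()}

# Hypothetical nonslicing floorplan: four channels, each waiting on the next.
pinwheel = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": {"a"}}

ok_order = channel_order(t_junction)
no_order = channel_order(pinwheel)
```

For the T-junction the function returns the stem first; for the pinwheel it returns None, which corresponds to the cyclic constraint described above.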
Figure 16.11 (a) displays the floorplan of the ASIC shown in Figure 16.7 . We
can remove the cyclic constraint by moving the blocks again, but this increases
the chip size. Figure 16.11 (b) shows an alternative solution. We merge the
flexible standard cell areas A and C. We can do this by selective flattening of the
netlist. Sometimes flattening can reduce the routing area because routing between
blocks is usually less efficient than routing inside the row-based blocks.
Figure 16.11 (b) shows the channel definition and routing order for our chip.

FIGURE 16.11 Channel definition and ordering. (a) We can eliminate the cyclic
constraint by merging the blocks A and C. (b) A slicing structure.
16.1.5 I/O and Power Planning
Every chip communicates with the outside world. Signals flow onto and off the
chip and we need to supply power. We need to consider the I/O and power
constraints early in the floorplanning process. A silicon chip or die (plural die,
dies, or dice) is mounted on a chip carrier inside a chip package . Connections are
made by bonding the chip pads to fingers on a metal lead frame that is part of the
package. The metal lead-frame fingers connect to the package pins . A die
consists of a logic core inside a pad ring . Figure 16.12 (a) shows a pad-limited
die and Figure 16.12 (b) shows a core-limited die . On a pad-limited die we use
tall, thin pad-limited pads , which maximize the number of pads we can fit
around the outside of the chip. On a core-limited die we use short, wide
core-limited pads . Figure 16.12 (c) shows how we can use both types of pad to
change the aspect ratio of a die to be different from that of the core.

FIGURE 16.12 Pad-limited and core-limited die. (a) A pad-limited die. The
number of pads determines the die size. (b) A core-limited die: The core logic
determines the die size. (c) Using both pad-limited pads and core-limited pads
for a square die.

Special power pads are used for the positive supply, or VDD, power buses (or
power rails ) and the ground or negative supply, VSS or GND. Usually one set of
VDD/VSS pads supplies one power ring that runs around the pad ring and
supplies power to the I/O pads only. Another set of VDD/VSS pads connects to a
second power ring that supplies the logic core. We sometimes call the I/O power
dirty power since it has to supply large transient currents to the output transistors.
We keep dirty power separate to avoid injecting noise into the internal-logic
power (the clean power ). I/O pads also contain special circuits to protect against
electrostatic discharge ( ESD ). These circuits can withstand very short
high-voltage (several kilovolt) pulses that can be generated during human or
machine handling.
Depending on the type of package and how the foundry attaches the silicon die to
the chip cavity in the chip carrier, there may be an electrical connection between
the chip carrier and the die substrate. Usually the die is cemented in the chip
cavity with a conductive epoxy, making an electrical connection between
substrate and the package cavity in the chip carrier. If we make an electrical
connection between the substrate and a chip pad, or to a package pin, it must be
to VDD (n-type substrate) or VSS (p-type substrate). This substrate connection
(for the whole chip) employs a down bond (or drop bond) to the carrier. We have
several options:
● We can dedicate one (or more) chip pad(s) to down bond to the chip
carrier.
● We can make a connection from a chip pad to the lead frame and down
bond from the chip pad to the chip carrier.
● We can make a connection from a chip pad to the lead frame and down
bond from the lead frame.
● We can down bond from the lead frame without using a chip pad.

● We can leave the substrate and/or chip carrier unconnected.

Depending on the package design, the type and positioning of down bonds may
be fixed. This means we need to fix the position of the chip pad for down
bonding using a pad seed .
A double bond connects two pads to one chip-carrier finger and one package pin.
We can do this to save package pins or reduce the series inductance of bond
wires (typically a few nanohenries) by parallel connection of the pads. A
multiple-signal pad or pad group is a set of pads. For example, an oscillator pad
usually comprises a set of two adjacent pads that we connect to an external
crystal. The oscillator circuit and the two signal pads form a single logic cell.
Another common example is a clock pad . Some foundries allow a special form
of corner pad (normal pads are edge pads ) that squeezes two pads into the area at
the corners of a chip using a special two-pad corner cell , to help meet bond-wire
angle design rules (see also Figure 16.13 b and c).

To reduce the series resistive and inductive impedance of power supply networks,
it is normal to use multiple VDD and VSS pads. This is particularly important
with the simultaneously switching outputs ( SSOs ) that occur when driving buses
off-chip [ Wada, Eino, and Anami, 1990]. The output pads can easily consume
most of the power on a CMOS ASIC, because the load on a pad (usually tens of
picofarads) is much larger than typical on-chip capacitive loads. Depending on
the technology it may be necessary to provide dedicated VDD and VSS pads for
every few SSOs. Design rules set how many SSOs can be used per VDD/VSS
pad pair. These dedicated VDD/VSS pads must follow groups of output pads as
they are seeded or planned on the floorplan. With some chip packages this can
become difficult because design rules limit the location of package pins that may
be used for supplies (due to the differing series inductance of each pin).
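The rule that dedicated supply pads must follow groups of output pads can be illustrated with a small Python sketch. The limit of two SSOs per VDD/VSS pair and the pad names are purely hypothetical; real limits come from the vendor's design rules.

```python
def plan_output_pads(outputs, ssos_per_pair):
    """Insert a dedicated VDD/VSS pad pair after every group of at most
    ssos_per_pair simultaneously switching outputs."""
    ring = []
    for start in range(0, len(outputs), ssos_per_pair):
        ring.extend(outputs[start:start + ssos_per_pair])
        ring.extend(["VDD", "VSS"])   # supply pair follows each output group
    return ring

# Hypothetical 5-bit output bus with an assumed limit of 2 SSOs per pad pair.
bus = ["d0", "d1", "d2", "d3", "d4"]
pad_sequence = plan_output_pads(bus, 2)
```

In this example five outputs need three supply pairs, so the bus occupies eleven pad positions rather than five.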
Using a pad mapping we translate the logical pad in a netlist to a physical pad
from a pad library . We might control pad seeding and mapping in the
floorplanner. The handling of I/O pads can become quite complex; there are
several nonobvious factors that must be considered when generating a pad ring:
● Ideally we would only need to design library pad cells for one orientation.
For example, an edge pad for the south side of the chip, and a corner pad
for the southeast corner. We could then generate other orientations by
rotation and flipping (mirroring). Some ASIC vendors will not allow
rotation or mirroring of logic cells in the mask file. To avoid these
problems we may need to have separate horizontal, vertical, left-handed,
and right-handed pad cells in the library with appropriate logical to
physical pad mappings.
● If we mix pad-limited and core-limited edge pads in the same pad ring, this
complicates the design of corner pads. Usually the two types of edge pad
cannot abut. In this case a corner pad also becomes a pad-format changer ,
or hybrid corner pad .
● In single-supply chips we have one VDD net and one VSS net, both global
power nets . It is also possible to use mixed power supplies (for example,
3.3 V and 5 V) or multiple power supplies ( digital VDD, analog VDD).
Figure 16.13 (a) and (b) are magnified views of the southeast corner of our
example chip and show the different types of I/O cells. Figure 16.13 (c) shows a
stagger-bond arrangement using two rows of I/O pads. In this case the design
rules for bond wires (the spacing and the angle at which the bond wires leave the
pads) become very important.
FIGURE 16.13 Bonding pads. (a) This chip uses both pad-limited and
core-limited pads. (b) A hybrid corner pad. (c) A chip with stagger-bonded pads.
(d) An area-bump bonded chip (or flip-chip). The chip is turned upside down
and solder bumps connect the pads to the lead frame.

Figure 16.13 (d) shows an area-bump bonding arrangement (also known as
flip-chip, solder-bump, or C4, terms coined by IBM, who developed this
technology [Masleid, 1991]) used, for example, with ball-grid array (BGA)
packages. Even though the bonding pads are located in the center of the chip, the
I/O circuits are still often located at the edges of the chip because of difficulties
in power supply distribution and integrating I/O circuits together with logic in the
center of the die.
In an MGA the pad spacing and I/O-cell spacing are fixed: each pad occupies a
fixed pad slot (or pad site ). This means that the properties of the pad I/O are also
fixed but, if we need to, we can parallel adjacent output cells to increase the
drive. To increase flexibility further the I/O cells can use a separation, the
I/O-cell pitch , that is smaller than the pad pitch . For example, three 4 mA driver
cells can occupy two pad slots. Then we can use two 4 mA output cells in parallel
to drive one pad, forming an 8 mA output pad as shown in Figure 16.14 . This
arrangement also means the I/O pad cells can be changed without changing the
base array. This is useful as bonding techniques improve and the pads can be
moved closer together.

FIGURE 16.14 Gate-array I/O pads. (a) Cell-based ASICs may contain pad
cells of different sizes and widths. (b) A corner of a gate-array base. (c) A
gate-array base with different I/O cell and pad pitches.
FIGURE 16.15 Power distribution. (a) Power distributed using m1 for VSS and
m2 for VDD. This helps minimize the number of vias and layer crossings needed
but causes problems in the routing channels. (b) In this floorplan m1 is run
parallel to the longest side of all channels, the channel spine. This can make
automatic routing easier but may increase the number of vias and layer
crossings. (c) An expanded view of part of a channel (interconnect is shown as
lines). If power runs on different layers along the spine of a channel, this forces
signals to change layers. (d) A closeup of VDD and VSS buses as they cross.
Changing layers requires a large number of via contacts to reduce resistance.

Figure 16.15 shows two possible power distribution schemes. The long direction
of a rectangular channel is the channel spine . Some automatic routers may
require that metal lines parallel to a channel spine use a preferred layer (either
m1, m2, or m3). Alternatively we say that a particular metal layer runs in a
preferred direction . Since we can have both horizontal and vertical channels, we
may have the situation shown in Figure 16.15 , where we have to decide whether
to use a preferred layer or the preferred direction for some channels. This may or
may not be handled automatically by the routing software.

16.1.6 Clock Planning


Figure 16.16 (a) shows a clock spine (not to be confused with a channel spine)
routing scheme with all clock pins driven directly from the clock driver. MGAs
and FPGAs often use this fish bone type of clock distribution scheme.
Figure 16.16 (b) shows a clock spine for a cell-based ASIC. Figure 16.16 (c)
shows the clock-driver cell, often part of a special clock-pad cell. Figure 16.16
(d) illustrates clock skew and clock latency . Since all clocked elements are
driven from one net with a clock spine, skew is caused by differing interconnect
lengths and loads. If the clock-driver delay is much larger than the interconnect
delays, a clock spine achieves minimum skew but with long latency.
FIGURE 16.16 Clock distribution. (a) A clock spine for a gate array. (b) A
clock spine for a cell-based ASIC (typical chips have thousands of clock nets).
(c) A clock spine is usually driven from one or more clock-driver cells. Delay in
the driver cell is a function of the number of stages and the ratio of output to
input capacitance for each stage (taper). (d) Clock latency and clock skew. We
would like to minimize both latency and skew.

Clock skew represents a fraction of the clock period that we cannot use for
computation. A clock skew of 500 ps with a 200 MHz clock means that we waste
500 ps of every 5 ns clock cycle, or 10 percent of performance. Latency can
cause a similar loss of performance at the system level when we need to
resynchronize our output signals with a master system clock.
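The skew arithmetic above is just the skew divided by the clock period. A minimal Python check, with a hypothetical function name:

```python
def skew_fraction(skew_seconds, clock_hz):
    """Fraction of each clock period lost to skew (skew divided by period)."""
    period = 1.0 / clock_hz
    return skew_seconds / period

lost = skew_fraction(500e-12, 200e6)   # 500 ps of a 5 ns period, about 0.10
```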
Figure 16.16 (c) illustrates the construction of a clock-driver cell. The delay
through a chain of CMOS gates is minimized when the ratio between the input
capacitance C1 and the output (load) capacitance C2 is about 3 (exactly e ≈ 2.7,
an exponential ratio, if we neglect the effect of parasitics). This means that the
fastest way to drive a large load is to use a chain of buffers with their input and
output loads chosen to maintain this ratio, or taper (we use this as a noun and a
verb). This is not necessarily the smallest or lowest-power method, though.
Suppose we have an ASIC with the following specifications:
● 40,000 flip-flops

● Input capacitance of the clock input to each flip-flop is 0.025 pF


● Clock frequency is 200 MHz
● VDD = 3.3 V
● Chip size is 20 mm on a side
● Clock spine consists of 200 lines across the chip
● Interconnect capacitance is 2 pF cm⁻¹
In this case the clock-spine capacitance C_L = 200 × 2 cm × 2 pF cm⁻¹ = 800 pF.
If we drive the clock spine with a chain of buffers with taper equal to e ≈ 2.7, and
with a first-stage input capacitance of 0.025 pF (a reasonable value for a 0.5 µm
process), we will need
\[ \ln\left(\frac{800 \times 10^{-12}}{0.025 \times 10^{-12}}\right) \approx 10.4, \text{ or 11 stages.} \tag{16.1} \]

The power dissipated charging the input capacitance of the flip-flop clock inputs is fCV², or
\[ P_1 = (4 \times 10^{4})\,(200\ \mathrm{MHz})\,(0.025\ \mathrm{pF})\,(3.3\ \mathrm{V})^{2} = 2.178\ \mathrm{W}, \tag{16.2} \]
or approximately 2 W. This is only a little larger than the power dissipated
driving the 800 pF clock-spine interconnect, which we can calculate as follows:
\[ P_2 = (200)\,(200\ \mathrm{MHz})\,(20\ \mathrm{mm})\,(2\ \mathrm{pF\,cm^{-1}})\,(3.3\ \mathrm{V})^{2} = 1.7424\ \mathrm{W}. \tag{16.3} \]

All of this power is dissipated in the clock-driver cell. The worst problem,
however, is the enormous peak current in the final inverter stage. If we assume
the needed rise time is 0.1 ns (with a 200 MHz clock whose period is 5 ns), the
peak current would have to approach
\[ I = \frac{(800\ \mathrm{pF})\,(3.3\ \mathrm{V})}{0.1\ \mathrm{ns}} \approx 25\ \mathrm{A}. \tag{16.4} \]

Clearly such a current is not possible without extraordinary design techniques.
Clock spines are used to drive loads of 100–200 pF but, as is apparent from the
power dissipation problems of this example, it would be better to find a way to
spread the power dissipation more evenly across the chip.
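The numbers in this clock-spine example, Eqs. (16.1) to (16.4), can be reproduced with a short Python sketch. The variable names are chosen for this illustration, and the stage count assumes, as the text does, a taper of exactly e with parasitics neglected.

```python
import math

# Specifications from the example (SI units).
N_FLIP_FLOPS = 40_000
C_FF = 0.025e-12          # clock-input capacitance per flip-flop (F)
F_CLK = 200e6             # clock frequency (Hz)
VDD = 3.3                 # supply voltage (V)
N_LINES = 200             # clock-spine lines across the chip
CHIP_SIDE_CM = 2.0        # 20 mm on a side
C_WIRE_PER_CM = 2e-12     # interconnect capacitance (F per cm)
T_RISE = 0.1e-9           # required rise time (s)

# Clock-spine interconnect capacitance: 200 x 2 cm x 2 pF/cm = 800 pF.
c_spine = N_LINES * CHIP_SIDE_CM * C_WIRE_PER_CM

# Eq. (16.1): buffer stages for a taper of e, starting from a 0.025 pF input.
stages = math.ceil(math.log(c_spine / C_FF))

# Eq. (16.2): power charging the flip-flop clock inputs, f C V^2.
p_flip_flops = N_FLIP_FLOPS * F_CLK * C_FF * VDD ** 2

# Eq. (16.3): power charging the clock-spine interconnect.
p_spine = F_CLK * c_spine * VDD ** 2

# Eq. (16.4): peak current, C V / t. This evaluates to about 26.4 A,
# which the text rounds to roughly 25 A.
i_peak = c_spine * VDD / T_RISE
```

Changing T_RISE or N_LINES in the sketch shows quickly how sensitive the peak current is to the rise-time requirement and to the amount of clock wiring.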
We can design a tree of clock buffers so that the taper of each stage is e ≈ 2.7 by
using a fanout of three at each node, as shown in Figure 16.17 (a) and (b). The
clock tree, shown in Figure 16.17 (c), uses the same number of stages as a clock
spine, but with a lower peak current for the inverter buffers. Figure 16.17 (c)
illustrates that we now have another problem: we need to balance the delays
through the tree carefully to minimize clock skew (see Section 17.3.1, Clock
Routing).
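As a rough check on the size of such a tree, the Python sketch below counts how many levels of fanout-of-three buffering are needed before the number of branches reaches the number of flip-flops. It is a simplification made for illustration: it counts only fanout and ignores the wiring capacitance that the stage count of Eq. (16.1) includes, so the two figures need not agree exactly.

```python
def tree_levels(num_leaves, fanout=3):
    """Smallest number of levels such that fanout ** levels >= num_leaves."""
    levels, reach = 0, 1
    while reach < num_leaves:
        reach *= fanout
        levels += 1
    return levels

levels_needed = tree_levels(40_000)   # 3**9 = 19,683 < 40,000 <= 3**10 = 59,049
```

For the 40,000 flip-flops of the example this gives 10 levels, of the same order as the 11 stages of the clock-spine driver.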
FIGURE 16.17 A clock tree. (a) Minimum delay is achieved when the taper of
successive stages is about 3. (b) Using a fanout of three at successive nodes.
(c) A clock tree for the cell-based ASIC of Figure 16.16 b. We have to balance
the clock arrival times at all of the leaf nodes to minimize clock skew.

Designing a clock tree that balances the rise and fall times at the leaf nodes has
the beneficial side-effect of minimizing the effect of hot-electron wearout. This
problem occurs when an electron gains enough energy to become hot and jump
out of the channel into the gate oxide (the problem is worse for electrons in
n-channel devices because electrons are more mobile than holes). The trapped
electrons change the threshold voltage of the device and this alters the delay of
the buffers. As the buffer delays change with time, this introduces unpredictable
skew. The problem is worst when the n-channel device is carrying maximum
current with a high voltage across the channel; this occurs during the rise- and
fall-time transitions. Balancing the rise and fall times in each buffer means that
they all wear out at the same rate, minimizing any additional skew.
A phase-locked loop (PLL) is an electronic flywheel that locks in frequency to
an input clock signal. The input and output frequencies may differ in phase,
however. This means that we can, for example, drive a clock network with a PLL
in such a way that the output of the clock network is locked in phase to the
incoming clock, thus eliminating the latency of the clock network. A PLL can
also help to reduce random variation of the input clock frequency, known as jitter,
which, since it is unpredictable, must also be discounted from the time available
for computation in each clock cycle. Actel was one of the first FPGA vendors to
incorporate PLLs, and Actel's online product literature explains their use in ASIC
design.

1. Interconnect lengths are derived from interconnect capacitance data.
Interconnect capacitance is 2 pF cm⁻¹.
16.2 Placement
After completing a floorplan we can begin placement of the logic cells within the
flexible blocks. Placement is much more suited to automation than floorplanning.
Thus we shall need measurement techniques and algorithms. After we complete
floorplanning and placement, we can predict both intrablock and interblock
capacitances. This allows us to return to logic synthesis with more accurate estimates
of the capacitive loads that each logic cell must drive.

16.2.1 Placement Terms and Definitions


CBIC, MGA, and FPGA architectures all have rows of logic cells separated by the
interconnect; these are row-based ASICs. Figure 16.18 shows an example of the
interconnect structure for a CBIC. Interconnect runs in horizontal and vertical
directions in the channels and in the vertical direction by crossing through the logic
cells. Figure 16.18 (c) illustrates the fact that it is possible to use over-the-cell routing
(OTC routing) in areas that are not blocked. However, OTC routing is complicated by
the fact that the logic cells themselves may contain metal on the routing layers. We
shall return to this topic in Section 17.2.7, Multilevel Routing. Figure 16.19 shows
the interconnect structure of a two-level metal MGA.
FIGURE 16.18 Interconnect structure. (a) The two-level metal CBIC floorplan
shown in Figure 16.11 b. (b) A channel from the flexible block A. This channel has a
channel height equal to the maximum channel density of 7 (there is room for seven
interconnects to run horizontally in m1). (c) A channel that uses OTC (over-the-cell)
routing in m2.

Most ASICs currently use two or three levels of metal for signal routing. With two
layers of metal, we route within the rectangular channels using the first metal layer for
horizontal routing, parallel to the channel spine, and the second metal layer for the
vertical direction (if there is a third metal layer it will normally run in the horizontal
direction again). The maximum number of horizontal interconnects that can be placed
side by side, parallel to the channel spine, is the channel capacity .

FIGURE 16.19 Gate-array interconnect. (a) A small two-level metal gate array
(about 4.6 k-gate). (b) Routing in a block. (c) Channel routing showing channel
density and channel capacity. The channel height on a gate array may only be
increased in increments of a row. If the interconnect does not use up all of the
channel, the rest of the space is wasted. The interconnect in the channel runs in m1 in
the horizontal direction with m2 in the vertical direction.

Vertical interconnect uses feedthroughs (or feedthrus in the United States) to cross the
logic cells. Here are some commonly used terms with explanations (there are no
generally accepted definitions):
● An unused vertical track (or just track ) in a logic cell is called an uncommitted
feedthrough (also built-in feedthrough , implicit feedthrough , or jumper ).
● A vertical strip of metal that runs from the top to bottom of a cell (for
double-entry cells ), but has no connections inside the cell, is also called a
feedthrough or jumper.
● Two connectors for the same physical net are electrically equivalent connectors
(or equipotential connectors ). For double-entry cells these are usually at the top
and bottom of the logic cell.
● A dedicated feedthrough cell (or crosser cell ) is an empty cell (with no logic)
that can hold one or more vertical interconnects. These are used if there are no
other feedthroughs available.
● A feedthrough pin or feedthrough terminal is an input or output that has
connections at both the top and bottom of the standard cell.
● A spacer cell (usually the same as a feedthrough cell) is used to fill space in
rows so that the ends of all rows in a flexible block may be aligned to connect
to power buses, for example.
There is no standard terminology for connectors and the terms can be very confusing.
There is a difference between connectors that are joined inside the logic cell using a
high-resistance material such as polysilicon and connectors that are joined by
low-resistance metal. The high-resistance kind are really two separate alternative
connectors (that cannot be used as a feedthrough), whereas the low-resistance kind are
electrically equivalent connectors. There may be two or more connectors to a logic
cell, which are not joined inside the cell, and which must be joined by the router (
must-join connectors ).
There are also logically equivalent connectors (or functionally equivalent connectors,
sometimes also called just equivalent connectors, which is very confusing). The two
inputs of a two-input NAND gate may be logically equivalent connectors. The
placement tool can swap these without altering the logic (but the two inputs may have
different delay properties, so it is not always a good idea to swap them). There can
also be logically equivalent connector groups. For example, in an OAI22
(OR-AND-INVERT) gate there are four inputs: A1, A2 are inputs to one OR gate
(gate A), and B1, B2 are inputs to the second OR gate (gate B). Then group A = (A1,
A2) is logically equivalent to group B = (B1, B2); if we swap one input (A1 or A2)
from gate A to gate B, we must swap the other input in the group (A2 or A1).
In the case of channeled gate arrays and FPGAs, the horizontal interconnect areas (the
channels, usually on m1) have a fixed capacity (sometimes they are called
fixed-resource ASICs for this reason). The channel capacity of CBICs and channelless
MGAs can be expanded to hold as many interconnects as are needed. Normally we
choose, as an objective, to minimize the number of interconnects that use each
channel. In the vertical interconnect direction, usually m2, FPGAs still have fixed
resources. In contrast the placement tool can always add vertical feedthroughs to a
channeled MGA, channelless MGA, or CBIC. These problems become less important
as we move to three and more levels of interconnect.
16.2.2 Placement Goals and Objectives
The goal of a placement tool is to arrange all the logic cells within the flexible blocks
on a chip. Ideally, the objectives of the placement step are to
● Guarantee the router can complete the routing step

● Minimize all the critical net delays

● Make the chip as dense as possible

We may also have the following additional objectives:


● Minimize power dissipation

● Minimize cross talk between signals

Objectives such as these are difficult to define in a way that can be solved with an
algorithm and even harder to actually meet. Current placement tools use more specific
and achievable criteria. The most commonly used placement objectives are one or
more of the following:
● Minimize the total estimated interconnect length

● Meet the timing requirements for critical nets

● Minimize the interconnect congestion

Each of these objectives in some way represents a compromise.

16.2.3 Measurement of Placement Goals and Objectives
In order to determine the quality of a placement, we need to be able to measure it. We
need an approximate measure of interconnect length, closely correlated with the final
interconnect length, that is easy to calculate.
The graph structures that correspond to making all the connections for a net are
known as trees on graphs (or just trees). Special classes of trees, Steiner trees,
minimize the total length of interconnect and they are central to ASIC routing
algorithms. Figure 16.20 shows a minimum Steiner tree. This type of tree uses
diagonal connections; we want to solve a restricted version of this problem, using
interconnects on a rectangular grid. This is called rectilinear routing or Manhattan
routing (because of the east–west and north–south grid of streets in Manhattan). We say
that the Euclidean distance between two points is the straight-line distance (as the
crow flies). The Manhattan distance (or rectangular distance) between two points is
the distance we would have to walk in New York.
FIGURE 16.20 Placement using trees on graphs. (a) The floorplan from Figure 16.11
b. (b) An expanded view of the flexible block A showing four rows of standard cells
for placement (typical blocks may contain thousands or tens of thousands of logic
cells). We want to find the length of the net shown with four terminals, W through Z,
given the placement of four logic cells (labeled: A.211, A.19, A.43, A.25). (c) The
problem for net (W, X, Y, Z) drawn as a graph. The shortest connection is the
minimum Steiner tree. (d) The minimum rectilinear Steiner tree using Manhattan
routing. The rectangular (Manhattan) interconnect-length measures are shown for
each tree.

The minimum rectilinear Steiner tree (MRST) is the shortest interconnect using a
rectangular grid. The determination of the MRST is in general an NP-complete
problem, which means it is hard to solve. For small numbers of terminals heuristic
algorithms do exist, but they are expensive to compute. Fortunately we only need to
estimate the length of the interconnect. Two approximations to the MRST are shown
in Figure 16.21 .

The complete graph has connections from each terminal to every other terminal
[Hanan, Wolff, and Agule, 1973]. The complete-graph measure adds all the
interconnect lengths of the complete-graph connection together and then divides by
n/2, where n is the number of terminals. We can justify this since, in a graph with n
terminals, (n − 1) interconnects will emanate from each terminal to join the other
(n − 1) terminals in a complete graph connection. That makes n(n − 1) interconnects in
total. However, we have then made each connection twice. So there are one-half this
many, or n(n − 1)/2, interconnects needed for a complete graph connection. Now we
actually only need (n − 1) interconnects to join n terminals, so we have n/2 times as
many interconnects as we really need. Hence we divide the total net length of the
complete graph connection by n/2 to obtain a more reasonable estimate of minimum
interconnect length. Figure 16.21 (a) shows an example of the complete-graph
measure.
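The complete-graph measure can be sketched in a few lines of Python. This is only an illustration: the function names and the terminal coordinates are hypothetical (they are not taken from Figure 16.21), and distances are assumed to be Manhattan distances.

```python
from itertools import combinations

def manhattan(p, q):
    """Rectilinear (Manhattan) distance between two (x, y) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def complete_graph_measure(terminals):
    """Sum of all pairwise distances in the complete graph, divided by n/2."""
    n = len(terminals)
    if n < 2:
        return 0.0
    total = sum(manhattan(p, q) for p, q in combinations(terminals, 2))
    return total / (n / 2)

# Hypothetical four-terminal net (coordinates chosen for illustration only).
net = [(0, 0), (4, 0), (4, 3), (1, 3)]
print(complete_graph_measure(net))
```

For a two-terminal net n/2 = 1, so the measure reduces to the distance between the two terminals.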
FIGURE 16.21 Interconnect-length
measures. (a) Complete-graph
measure. (b) Half-perimeter measure.

The bounding box is the smallest rectangle that encloses all the terminals (not to be
confused with a logic cell bounding box, which encloses all the layout in a logic cell).
The half-perimeter measure (or bounding-box measure) is one-half the perimeter of
the bounding box ( Figure 16.21 b) [ Schweikert, 1976]. For nets with two or three
terminals (corresponding to a fanout of one or two, which usually includes over 50
percent of all nets on a chip), the half-perimeter measure is the same as the minimum
Steiner tree. For nets with four or five terminals, the minimum Steiner tree is between
one and two times the half-perimeter measure [ Hanan, 1966]. For a circuit with m
nets, using the half-perimeter measure corresponds to minimizing the cost function,
$$ f = \frac{1}{2} \sum_{i=1}^{m} h_i , \tag{16.5} $$

where h i is the half-perimeter measure for net i .
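As a rough sketch of how cheap the half-perimeter measure is to evaluate, the following Python computes it for a few nets and sums the results. The function name and the net coordinates are assumptions made for illustration.

```python
def half_perimeter(terminals):
    """Half the perimeter of the smallest rectangle enclosing all (x, y) terminals."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

# Hypothetical nets, each a list of terminal coordinates.
nets = [
    [(0, 0), (4, 3)],                  # two terminals
    [(0, 0), (2, 5), (3, 1)],          # three terminals
    [(0, 0), (4, 0), (4, 3), (1, 3)],  # four terminals
]
per_net = [half_perimeter(net) for net in nets]
total = sum(per_net)  # the quantity a placement tool would try to reduce
print(per_net, total)
```

Each net needs only the minimum and maximum of its terminal coordinates, which is why this measure suits inner loops that evaluate many trial placements.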

It does not really matter if our approximations are inaccurate if there is a good
correlation between actual interconnect lengths (after routing) and our
approximations. Figure 16.22 shows that we can adjust the complete-graph and
half-perimeter measures using correction factors [ Goto and Matsuda, 1986]. Now our
wiring length approximations are functions, not just of the terminal positions, but also
of the number of terminals, and the size of the bounding box. One practical example
adjusts a Steiner-tree approximation using the number of terminals [ Chao, Nequist,
and Vuong, 1990]. This technique is used in the Cadence Gate Ensemble placement
tool, for example.
FIGURE 16.22 Correlation between total length
of chip interconnect and the half-perimeter and
complete-graph measures.

One problem with the measurements we have described is that the MRST may only
approximate the interconnect that will be completed by the detailed router. Some
programs have a meander factor that specifies, on average, the ratio of the
interconnect created by the routing tool to the interconnect-length estimate used by the
placement tool. Another problem is that we have concentrated on finding estimates to
the MRST, but the MRST that minimizes total net length may not minimize net delay
(see Section 16.2.8 ).

There is no point in minimizing the interconnect length if we create a placement that
is too congested to route. If we use minimum interconnect congestion as an additional
placement objective, we need some way of measuring it. What we are trying to
measure is interconnect density. Unfortunately we always use the term density to
mean channel density (which we shall discuss in Section 17.2.2, Measurement of
Channel Density). In this chapter, while we are discussing placement, we shall try to
use the term congestion , instead of density, to avoid any confusion.
One measure of interconnect congestion uses the maximum cut line . Imagine a
horizontal or vertical line drawn anywhere across a chip or block, as shown in
Figure 16.23 . The number of interconnects that must cross this line is the cut size (the
number of interconnects we cut). The maximum cut line has the highest cut size.
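A minimal sketch of the cut-size measurement follows. It makes a simplifying assumption that is not in the text: a net is counted as crossing a line once whenever its terminals lie on both sides of the line, so the count is of nets rather than of routed segments. The names and coordinates are hypothetical.

```python
def cut_size(nets, axis, position):
    """Count nets with terminals on both sides of a cut line.

    axis = 0 means a vertical line x = position; axis = 1 a horizontal line y = position.
    """
    count = 0
    for terminals in nets:
        coords = [t[axis] for t in terminals]
        if min(coords) < position < max(coords):
            count += 1
    return count

def maximum_cut_line(nets, axis, positions):
    """Return the candidate line position with the highest cut size (first one on ties)."""
    return max(positions, key=lambda pos: cut_size(nets, axis, pos))

# Hypothetical nets given as lists of (x, y) terminals.
nets = [
    [(0, 0), (3, 0)],
    [(1, 1), (2, 4)],
    [(2, 2), (5, 2)],
    [(4, 0), (4, 3)],
]
candidates = [0.5, 1.5, 2.5, 3.5, 4.5]
worst = maximum_cut_line(nets, 0, candidates)
print(worst, cut_size(nets, 0, worst))
```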
FIGURE 16.23 Interconnect congestion for the cell-based ASIC from Figure 16.11
(b). (a) Measurement of congestion. (b) An expanded view of flexible block A shows
a maximum cut line.

Many placement tools minimize estimated interconnect length or interconnect
congestion as objectives. The problem with this approach is that a logic cell may be
placed a long way from another logic cell to which it has just one connection. This
logic cell with one connection is less important as far as the total wire length is
concerned than other logic cells, to which there are many connections. However, the
one long connection may be critical as far as timing delay is concerned. As technology
is scaled, interconnection delays become larger relative to circuit delays and this
problem gets worse.
In timing-driven placement we must estimate delay for every net for every trial
placement, possibly for hundreds of thousands of gates. We cannot afford to use
anything other than the very simplest estimates of net delay. Unfortunately, the
minimum-length Steiner tree does not necessarily correspond to the interconnect path
that minimizes delay. To construct a minimum-delay path we may have to route with
non-Steiner trees. In the placement phase typically we take a simple
interconnect-length approximation to this minimum-delay path (typically the
half-perimeter measure). Even when we can estimate the length of the interconnect,
we do not yet have information on which layers and how many vias the interconnect
will use or how wide it will be. Some tools allow us to include estimates for these
parameters. Often we can specify metal usage, the percentage of routing on the
different layers to expect from the router. This allows the placement tool to estimate
RC values and delays, and thus minimize delay.

16.2.4 Placement Algorithms


There are two classes of placement algorithms commonly used in commercial CAD
tools: constructive placement and iterative placement improvement. A constructive
placement method uses a set of rules to arrive at a constructed placement. The most
commonly used methods are variations on the min-cut algorithm . The other
commonly used constructive placement algorithm is the eigenvalue method. As in
system partitioning, placement usually starts with a constructed solution and then
improves it using an iterative algorithm. In most tools we can specify the locations
and relative placements of certain critical logic cells as seed placements .
The min-cut placement method uses successive application of partitioning [ Breuer,
1977]. The following steps are shown in Figure 16.24 :
1. Cut the placement area into two pieces.
2. Swap the logic cells to minimize the cut cost.
3. Repeat the process from step 1, cutting smaller pieces until all the logic cells are
placed.

FIGURE 16.24 Min-cut placement. (a) Divide the chip into bins using a grid.
(b) Merge all connections to the center of each bin. (c) Make a cut and swap
logic cells between bins to minimize the cost of the cut. (d) Take the cut pieces
and throw out all the edges that are not inside the piece. (e) Repeat the process
with a new cut and continue until we reach the individual bins.

Usually we divide the placement area into bins . The size of a bin can vary, from a bin
size equal to the base cell (for a gate array) to a bin size that would hold several logic
cells. We can start with a large bin size, to get a rough placement, and then reduce the
bin size to get a final placement.
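The three min-cut steps can be sketched as a recursive bisection. This is a toy version under stated assumptions, not the algorithm of any particular tool: the number of cells is a power of two and equals the number of bins, each bin holds one cell, the cut is improved by a simple greedy pair swap (a stand-in for a real partitioning algorithm), and all names are hypothetical.

```python
def cut_cost(nets, left, right):
    """Number of nets with cells on both sides; cells outside the region are ignored."""
    return sum(1 for net in nets if net & left and net & right)

def bisect(cells, nets):
    """Split cells in half, then swap pairs across the cut while the cut cost drops."""
    half = len(cells) // 2
    left, right = set(cells[:half]), set(cells[half:])
    best = cut_cost(nets, left, right)
    improved = True
    while improved:
        improved = False
        for a in sorted(left):
            for b in sorted(right):
                trial_l = (left - {a}) | {b}
                trial_r = (right - {b}) | {a}
                cost = cut_cost(nets, trial_l, trial_r)
                if cost < best:
                    left, right, best, improved = trial_l, trial_r, cost, True
                    break
            if improved:
                break
    return sorted(left), sorted(right)

def min_cut_place(cells, nets, x0, y0, w, h, placement):
    """Recursively cut the w-by-h region of bins until each bin holds one cell."""
    if not cells:
        return
    if len(cells) == 1:
        placement[cells[0]] = (x0, y0)
        return
    left, right = bisect(cells, nets)
    if w >= h:  # cut the longer side: vertical cut line
        min_cut_place(left, nets, x0, y0, w // 2, h, placement)
        min_cut_place(right, nets, x0 + w // 2, y0, w - w // 2, h, placement)
    else:       # horizontal cut line
        min_cut_place(left, nets, x0, y0, w, h // 2, placement)
        min_cut_place(right, nets, x0, y0 + h // 2, w, h - h // 2, placement)

# Hypothetical netlist: eight cells, seven two-terminal nets.
cells = list("abcdefgh")
nets = [{"a", "e"}, {"e", "c"}, {"c", "g"}, {"b", "d"},
        {"d", "f"}, {"f", "h"}, {"g", "b"}]
placement = {}
min_cut_place(cells, nets, 0, 0, 4, 2, placement)
print(placement)
```

Because `cut_cost` only looks at cells inside the current piece, nets leaving the piece are dropped automatically, which mirrors step (d) of Figure 16.24.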
The eigenvalue placement algorithm uses the cost matrix or weighted connectivity
matrix ( eigenvalue methods are also known as spectral methods ) [Hall, 1970]. The
measure we use is a cost function f that we shall minimize, given by
$$ f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij}\, d_{ij}^{2} , \tag{16.6} $$

where C = [c_ij] is the (possibly weighted) connectivity matrix, and d_ij is the
Euclidean distance between the centers of logic cell i and logic cell j. Since we are
going to minimize a cost function that is the square of the distance between logic
cells, these methods are also known as quadratic placement methods. This type of cost
function leads to a simple mathematical solution. We can rewrite the cost function f in
matrix form:
$$ f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} \left[ (x_i - x_j)^{2} + (y_i - y_j)^{2} \right] = x^{T} B x + y^{T} B y . \tag{16.7} $$

In Eq. 16.7 , B is a symmetric matrix, the disconnection matrix (also called the
Laplacian).
We may express the Laplacian B in terms of the connectivity matrix C ; and D , a
diagonal matrix (known as the degree matrix), defined as follows:
$$ B = D - C ; \qquad d_{ii} = \sum_{j=1}^{n} c_{ij} , \quad i = 1, \ldots, n ; \qquad d_{ij} = 0 , \quad i \neq j . \tag{16.8} $$
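To make the matrix form concrete, the sketch below builds B = D − C in plain Python and checks numerically that the quadratic form x^T B x equals one-half the sum, over all ordered pairs of cells, of c_ij (x_i − x_j)^2. The connectivity matrix is the one used in the example of Section 16.2.5; the function names and the trial coordinates x are assumptions for illustration.

```python
def disconnection_matrix(C):
    """B = D - C, where D is diagonal with d_ii equal to the sum of row i of C."""
    n = len(C)
    return [[(sum(C[i]) if i == j else 0) - C[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(B, x):
    """Evaluate x^T B x."""
    n = len(x)
    return sum(x[i] * B[i][j] * x[j] for i in range(n) for j in range(n))

def pairwise_cost(C, x):
    """One-half the sum over ordered pairs (i, j) of c_ij * (x_i - x_j)^2."""
    n = len(x)
    return 0.5 * sum(C[i][j] * (x[i] - x[j]) ** 2
                     for i in range(n) for j in range(n))

C = [[0, 0, 0, 1],
     [0, 0, 1, 1],
     [0, 1, 0, 0],
     [1, 1, 0, 0]]
B = disconnection_matrix(C)
x = [0.0, 2.0, 3.0, 1.0]  # hypothetical one-dimensional cell positions
print(B)
print(quadratic_form(B, x), pairwise_cost(C, x))
```

Both expressions give the same value for any x, which is what allows the placement problem to be restated as an eigenvalue problem on B.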

We can simplify the problem by noticing that it is symmetric in the x- and
y-coordinates. Let us solve the simpler problem of minimizing the cost function for the
placement of logic cells along just the x-axis first. We can then apply this solution to
the more general two-dimensional placement problem. Before we solve this simpler
problem, we introduce a constraint that the coordinates of the logic cells must
correspond to valid positions (the cells do not overlap and they are placed on-grid).
We make another simplifying assumption that all logic cells are the same size and we
must place them in fixed positions. We can define a vector p consisting of the valid
positions:
$$ p = [\, p_1 , \ldots , p_n \,] . \tag{16.9} $$

For a valid placement the x-coordinates of the logic cells,
$$ x = [\, x_1 , \ldots , x_n \,] , \tag{16.10} $$

must be a permutation of the fixed positions, p . We can show that requiring the logic
cells to be in fixed positions in this way leads to a series of n equations restricting the
values of the logic cell coordinates [ Cheng and Kuh, 1984]. If we impose all of these
constraint equations the problem becomes very complex. Instead we choose just one
of the equations:
$$ \sum_{i=1}^{n} x_i^{2} = \sum_{i=1}^{n} p_i^{2} . \tag{16.11} $$

Simplifying the problem in this way will lead to an approximate solution to the
placement problem. We can write this single constraint on the x -coordinates in matrix
form:
$$ x^{T} x = P ; \qquad P = \sum_{i=1}^{n} p_i^{2} , \tag{16.12} $$

where P is a constant. We can now summarize the formulation of the problem, with
the simplifications that we have made, for a one-dimensional solution. We must
minimize a cost function, g (analogous to the cost function f that we defined for the
two-dimensional problem in Eq. 16.7), where
$$ g = x^{T} B x , \tag{16.13} $$

subject to the constraint:
$$ x^{T} x = P . \tag{16.14} $$

This is a standard problem that we can solve using a Lagrangian multiplier:
$$ L = x^{T} B x - \lambda \left[ x^{T} x - P \right] . \tag{16.15} $$

To find the value of x that minimizes g we differentiate L partially with respect to x
and set the result equal to zero. We get the following equation:
$$ [\, B - \lambda I \,]\, x = 0 . \tag{16.16} $$

This last equation is called the characteristic equation for the disconnection matrix B
and occurs frequently in matrix algebra (this λ has nothing to do with scaling). The
solutions to this equation are the eigenvectors and eigenvalues of B. Multiplying
Eq. 16.16 by x^T we get:
$$ \lambda\, x^{T} x = x^{T} B x . \tag{16.17} $$

However, since we imposed the constraint x^T x = P and x^T B x = g, then
$$ \lambda = g / P . \tag{16.18} $$

The eigenvectors of the disconnection matrix B are the solutions to our placement
problem. It turns out that (because something called the rank of matrix B is n − 1) there
is a degenerate solution with all x-coordinates equal (λ = 0); this makes some sense
because putting all the logic cells on top of one another certainly minimizes the
interconnect. The smallest, nonzero, eigenvalue and the corresponding eigenvector
provide the solution that we want. In the two-dimensional placement problem, the x-
and y-coordinates are given by the eigenvectors corresponding to the two smallest,
nonzero, eigenvalues. (In the next section a simple example illustrates this
mathematical derivation.)
16.2.5 Eigenvalue Placement Example
Consider the following connectivity matrix C and its disconnection matrix B ,
calculated from Eq. 16.8 [ Hall, 1970]:
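$$ C = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix} $$

$$ B = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 2 \end{bmatrix} - \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & -1 \\ 0 & 2 & -1 & -1 \\ 0 & -1 & 1 & 0 \\ -1 & -1 & 0 & 2 \end{bmatrix} \tag{16.19} $$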

Figure 16.25 (a) shows the corresponding network with four logic cells (1–4) and three
nets (A–C). Here is a MATLAB script to find the eigenvalues and eigenvectors of B:

FIGURE 16.25 Eigenvalue placement. (a) An example network. (b) The
one-dimensional placement. The small black squares represent the centers of the logic
cells. (c) The two-dimensional placement. The eigenvalue method takes no account
of the logic cell sizes or actual location of logic cell connectors. (d) A complete
layout. We snap the logic cells to valid locations, leaving room for the routing in the
channel.
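C = [0 0 0 1; 0 0 1 1; 0 1 0 0; 1 1 0 0];  % connectivity matrix
D = [1 0 0 0; 0 2 0 0; 0 0 1 0; 0 0 0 2];  % degree matrix (row sums of C)
B = D - C                                  % disconnection matrix (Laplacian)
[X, Lambda] = eig(B)                       % columns of X: eigenvectors; diag(Lambda): eigenvalues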
Running this script, we find the eigenvalues of B are 0.5858, 0.0, 2.0, and 3.4142. The
corresponding eigenvectors of B are
$$ \begin{bmatrix} 0.6533 & 0.5000 & 0.5000 & 0.2706 \\ -0.2706 & 0.5000 & -0.5000 & 0.6533 \\ -0.6533 & 0.5000 & 0.5000 & -0.2706 \\ 0.2706 & 0.5000 & -0.5000 & -0.6533 \end{bmatrix} \tag{16.20} $$

For a one-dimensional placement (Figure 16.25 b), we use the eigenvector (0.6533,
−0.2706, −0.6533, 0.2706) corresponding to the smallest nonzero eigenvalue (which is
0.5858) to place the logic cells along the x-axis. The two-dimensional placement
(Figure 16.25 c) uses these same values for the x-coordinates and the eigenvector (0.5,
−0.5, 0.5, −0.5) that corresponds to the next largest eigenvalue (which is 2.0) for the
y-coordinates. Notice that the placement shown in Figure 16.25 (c), which shows
logic-cell outlines (the logic-cell abutment boxes), takes no account of the cell sizes,
and cells may even overlap at this stage. This is because, in Eq. 16.11 , we discarded
all but one of the constraints necessary to ensure valid solutions. Often we use the
approximate eigenvalue solution as an initial placement for one of the iterative
improvement algorithms that we shall discuss in Section 16.2.6 .
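The example can be checked without a matrix library by confirming that B x = λ x holds for the two eigenvectors used for placement. In this sketch the eigenvector signs are an assumption: they are the ones that satisfy B x = λ x for the B of Eq. 16.19 (the overall sign of an eigenvector is arbitrary), and the helper names are hypothetical.

```python
import math

B = [[1, 0, 0, -1],
     [0, 2, -1, -1],
     [0, -1, 1, 0],
     [-1, -1, 0, 2]]

def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def is_eigenpair(M, lam, v, tol=1e-3):
    """True if M v equals lam * v to within tol in every component."""
    return all(abs(a - lam * b) <= tol for a, b in zip(matvec(M, v), v))

lam_x = 2 - math.sqrt(2)                # smallest nonzero eigenvalue, about 0.5858
x = [0.6533, -0.2706, -0.6533, 0.2706]  # x-coordinates of cells 1..4
lam_y = 2.0                             # next largest eigenvalue
y = [0.5, -0.5, 0.5, -0.5]              # y-coordinates of cells 1..4

print(is_eigenpair(B, lam_x, x), is_eigenpair(B, lam_y, y))

# Order of the logic cells along the x-axis in the one-dimensional placement.
order = sorted(range(1, 5), key=lambda cell: x[cell - 1])
print(order)
```

Sorting by x places connected cells next to one another (3, 2, 4, 1), which matches the chain of connections in the connectivity matrix C.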

16.2.6 Iterative Placement Improvement


An iterative placement improvement algorithm takes an existing placement and tries
to improve it by moving the logic cells. There are two parts to the algorithm:
● The selection criteria that decide which logic cells to try moving.

● The measurement criteria that decide whether to move the selected cells.

There are several interchange or iterative exchange methods that differ in their
selection and measurement criteria:
● pairwise interchange,

● force-directed interchange,

● force-directed relaxation, and

● force-directed pairwise relaxation.

All of these methods usually consider only pairs of logic cells to be exchanged. A
source logic cell is picked for trial exchange with a destination logic cell. We have
already discussed the use of interchange methods applied to the system partitioning
step. The most widely used methods use group migration, especially the
Kernighan–Lin algorithm. The pairwise-interchange algorithm is similar to the interchange
algorithm used for iterative improvement in the system partitioning step:
1. Select the source logic cell at random.
2. Try all the other logic cells in turn as the destination logic cell.
3. Use any of the measurement methods we have discussed to decide on whether
to accept the interchange.
4. The process repeats from step 1, selecting each logic cell in turn as a source
logic cell.
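The four steps above can be sketched as follows. The sketch assumes the half-perimeter measure as the acceptance test and takes the source cells in a shuffled order, so that each cell serves as the source once; the function names, the seed, and the small netlist are hypothetical.

```python
import random

def wire_length(placement, nets):
    """Sum of half-perimeter measures over all nets."""
    total = 0
    for net in nets:
        xs = [placement[c][0] for c in net]
        ys = [placement[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def pairwise_interchange(placement, nets, seed=0):
    """One pass: each cell in turn is the source; try every other cell as destination."""
    rng = random.Random(seed)
    placement = dict(placement)  # work on a copy
    cells = sorted(placement)
    sources = cells[:]
    rng.shuffle(sources)         # random source order
    for src in sources:
        for dst in cells:
            if dst == src:
                continue
            before = wire_length(placement, nets)
            placement[src], placement[dst] = placement[dst], placement[src]
            if wire_length(placement, nets) >= before:
                # No improvement: undo the trial interchange.
                placement[src], placement[dst] = placement[dst], placement[src]
    return placement

# Hypothetical starting placement (one row of four bins) and netlist.
start = {"a": (0, 0), "b": (1, 0), "c": (2, 0), "d": (3, 0)}
nets = [["a", "d"], ["b", "d"], ["a", "c"]]
improved = pairwise_interchange(start, nets)
print(wire_length(start, nets), wire_length(improved, nets))
```

Recomputing the full wire length for every trial is wasteful; a practical tool would update only the nets attached to the two swapped cells.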
Figure 16.26 (a) and (b) show how we can extend pairwise interchange to swap more
than two logic cells at a time. If we swap λ logic cells at a time and find a locally
optimum solution, we say that solution is λ-optimum. The neighborhood exchange
algorithm is a modification to pairwise interchange that considers only destination
logic cells in a neighborhood: cells within a certain distance, ε, of the source logic
cell. Limiting the search area for the destination logic cell to the ε-neighborhood
reduces the search time. Figure 16.26 (c) and (d) show the one- and
two-neighborhoods (based on Manhattan distance) for a logic cell.

FIGURE 16.26 Interchange. (a) Swapping the source logic cell with a destination
logic cell in pairwise interchange. (b) Sometimes we have to swap more than two
logic cells at a time to reach an optimum placement, but this is expensive in
computation time. Limiting the search to neighborhoods reduces the search time.
Logic cells within a distance ε of a logic cell form an ε-neighborhood. (c) A
one-neighborhood. (d) A two-neighborhood.

Neighborhoods are also used in some of the force-directed placement methods.
Imagine identical springs connecting all the logic cells we wish to place. The number
of springs is equal to the number of connections between logic cells. The effect of the
springs is to pull connected logic cells together. The more highly connected the logic
cells, the stronger the pull of the springs. The force on a logic cell i due to logic cell j
is given by Hooke's law, which says the force of a spring is proportional to its
extension:
$$ F_{ij} = c_{ij}\, x_{ij} . \tag{16.21} $$

The vector component x ij is directed from the center of logic cell i to the center of
logic cell j . The vector magnitude is calculated as either the Euclidean or Manhattan
distance between the logic cell centers. The c ij form the connectivity or cost matrix
(the matrix element c ij is the number of connections between logic cell i and logic
cell j ). If we want, we can also weight the c ij to denote critical connections.
Figure 16.27 illustrates the force-directed placement algorithm.
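A small sketch of the spring model: the net force on a logic cell is the sum of c_ij times the vector from cell i to cell j. The sketch also computes the point at which that net force would vanish, the connection-weighted centroid of the connected cells; using that point as a target location is an assumption of this example rather than something stated above. All names and data are hypothetical.

```python
def force_on_cell(i, positions, C):
    """Net spring force (fx, fy) on cell i: sum over j of c_ij * (r_j - r_i)."""
    xi, yi = positions[i]
    fx = sum(C[i][j] * (xj - xi) for j, (xj, _) in enumerate(positions) if j != i)
    fy = sum(C[i][j] * (yj - yi) for j, (_, yj) in enumerate(positions) if j != i)
    return fx, fy

def zero_force_target(i, positions, C):
    """Location where the net force on cell i is zero (weighted centroid)."""
    weight = sum(C[i][j] for j in range(len(positions)) if j != i)
    if weight == 0:
        return positions[i]  # unconnected cell: no force, stays put
    x = sum(C[i][j] * positions[j][0] for j in range(len(positions)) if j != i) / weight
    y = sum(C[i][j] * positions[j][1] for j in range(len(positions)) if j != i) / weight
    return x, y

# Hypothetical three-cell example: cell 0 has two connections to cell 1, one to cell 2.
C = [[0, 2, 1],
     [2, 0, 0],
     [1, 0, 0]]
positions = [(0, 0), (3, 0), (0, 3)]
print(force_on_cell(0, positions, C))
print(zero_force_target(0, positions, C))
```

Weighting the entries of C, as suggested above for critical connections, simply stiffens the corresponding springs.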

FIGURE 16.27 Force-directed placement. (a) A network with nine logic cells.
(b) We make a grid (one logic cell per bin). (c) Forces are calculated as if springs
were attached to the centers of each logic cell for each connection. The two nets
connecting logic cells A and I correspond to two springs. (d) The forces are
proportional to the spring extensions.

In the definition of connectivity (Section 15.7.1, Measuring Connectivity) it was
pointed out that the network graph does not accurately model connections for nets
with more than two terminals. Nets such as clock nets, power nets, and global reset
lines have a huge number of terminals. The force-directed placement algorithms
usually make special allowances for these situations to prevent the largest nets from
snapping all the logic cells together. In fact, without external forces to counteract the
pull of the springs between logic cells, the network will collapse to a single point as it
settles. An important part of force-directed placement is fixing some of the logic cells
in position. Normally ASIC designers use the I/O pads or other external connections
to act as anchor points or fixed seeds.
Figure 16.28 illustrates the different kinds of force-directed placement algorithms.
The force-directed interchange algorithm uses the force vector to select a pair of logic
cells to swap. In force-directed relaxation a chain of logic cells is moved. The
force-directed pairwise relaxation algorithm swaps one pair of logic cells at a time.

FIGURE 16.28 Force-directed iterative placement improvement. (a) Force-directed
interchange. (b) Force-directed relaxation. (c) Force-directed pairwise relaxation.

We reach a force-directed solution when we minimize the energy of the system,
corresponding to minimizing the sum of the squares of the distances separating logic
cells. Force-directed placement algorithms thus also use a quadratic cost function.

16.2.7 Placement Using Simulated Annealing
The principles of simulated annealing were explained in Section 15.7.8, Simulated
Annealing. Because simulated annealing requires so many iterations, it is critical that
the placement objectives be easy and fast to calculate. The optimum connection
pattern, the MRST, is difficult to calculate. Using the half-perimeter measure (
Section 16.2.3 ) corresponds to minimizing the total interconnect length. Applying
simulated annealing to placement, the algorithm is as follows:
1. Select logic cells for a trial interchange, usually at random.
2. Evaluate the objective function E for the new placement.
3. If ΔE is negative or zero, then exchange the logic cells. If ΔE is positive, then
exchange the logic cells with a probability of exp(−ΔE/T).
4. Go back to step 1 for a fixed number of times, and then lower the temperature T
according to a cooling schedule: T_{n+1} = 0.9 T_n, for example.
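The annealing loop above might be sketched as follows, using the half-perimeter measure as the objective function E and the example cooling schedule T_{n+1} = 0.9 T_n. The starting temperature, stopping temperature, number of moves per temperature, and random seed are arbitrary assumptions, and remembering the best placement seen so far is an addition for convenience rather than part of the four steps.

```python
import math
import random

def total_length(placement, nets):
    """Objective E: sum of half-perimeter measures over all nets."""
    total = 0
    for net in nets:
        xs = [placement[c][0] for c in net]
        ys = [placement[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(placement, nets, t_start=10.0, t_end=0.01, alpha=0.9,
           moves_per_t=50, seed=1):
    rng = random.Random(seed)
    current = dict(placement)
    cells = sorted(current)
    energy = total_length(current, nets)
    best, best_energy = dict(current), energy
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            a, b = rng.sample(cells, 2)                      # step 1: random pair
            current[a], current[b] = current[b], current[a]
            new_energy = total_length(current, nets)         # step 2: evaluate E
            delta = new_energy - energy
            # Step 3: always accept downhill moves; accept uphill with exp(-dE/T).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                energy = new_energy
                if energy < best_energy:
                    best, best_energy = dict(current), energy
            else:
                current[a], current[b] = current[b], current[a]  # reject: undo
        t *= alpha                                           # step 4: cool
    return best, best_energy

# Hypothetical 2 x 3 grid of bins and a small netlist.
start = {"a": (0, 0), "b": (1, 0), "c": (2, 0),
         "d": (0, 1), "e": (1, 1), "f": (2, 1)}
nets = [["a", "f"], ["c", "d"], ["a", "c"], ["b", "e", "f"]]
best, best_energy = anneal(start, nets)
print(total_length(start, nets), best_energy)
```

Because the moves are random, different seeds can give different final placements.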

Kirkpatrick, Gelatt, and Vecchi first described the use of simulated annealing applied
to VLSI problems [ 1983]. Experience since that time has shown that simulated
annealing normally requires the use of a slow cooling schedule and this means long
CPU run times [ Sechen, 1988; Wong, Leong, and Liu, 1988]. As a general rule,
experiments show that simple min-cut based constructive placement is faster than
simulated annealing but that simulated annealing is capable of giving better results at
the expense of long computer run times. The iterative improvement methods that we
described earlier are capable of giving results as good as simulated annealing, but they
use more complex algorithms.
While I am making wild generalizations, I will digress to discuss benchmarks of
placement algorithms (or any CAD algorithm that is random). It is important to
remember that the results of random methods are themselves random. Suppose the
results from two random algorithms, A and B, can each vary by ±10 percent for any
chip placement, but both algorithms have the same average performance. If we
compare single chip placements by both algorithms, they could falsely show
algorithm A to be better than B by up to 20 percent or vice versa. Put another way, if
we run enough test cases we will eventually find some for which A is better than B by
20 percent, a trick that Ph.D. students and marketing managers both know well. Even
single-run evaluations over multiple chips are hardly a fair comparison. The only way
to obtain meaningful results is to compare a statistically meaningful number of runs
for a statistically meaningful number of chips for each algorithm. This same caution
applies to any VLSI algorithm that is random. There was a Design Automation
Conference panel session whose theme was "Enough of algorithms claiming
improvements of 5%."
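
The point about random results can be checked with a small simulation. The sketch below is purely illustrative: it assumes each algorithm's result is drawn uniformly within ±10 percent of the same mean (the text does not specify a distribution), and the function one_run and all the numbers are hypothetical.

```python
import random
import statistics

def one_run(rng, mean=100.0, spread=0.10):
    """Result of one run: uniform within +/- spread of the same mean."""
    return mean * (1.0 + rng.uniform(-spread, spread))

rng = random.Random(1)

# Compare algorithms A and B on single runs. Both draw from the same
# distribution, so any gap between them is pure noise.
single_gaps = []
for _ in range(1000):
    a, b = one_run(rng), one_run(rng)
    single_gaps.append(abs(a - b) / b)
worst_single_gap = max(single_gaps)

# Compare the averages over many runs instead.
mean_a = statistics.mean(one_run(rng) for _ in range(1000))
mean_b = statistics.mean(one_run(rng) for _ in range(1000))
mean_gap = abs(mean_a - mean_b) / mean_b

print(f"worst single-run gap: {worst_single_gap:.1%}")
print(f"gap between averages: {mean_gap:.2%}")
```

With enough single-run comparisons the worst gap approaches the full 20 percent range even though the two algorithms are identical, while the gap between the averages shrinks toward zero.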
16.2.8 Timing-Driven Placement Methods
Minimizing delay is becoming more and more important as a placement objective.
There are two main approaches: net based and path based. We know that we can use
net weights in our algorithms. The problem is to calculate the weights. One method
finds the n most critical paths (using a timing-analysis engine, possibly in the
synthesis tool). The net weights might then be the number of times each net appears in
this list. The problem with this approach is that as soon as we fix (for example) the
first 100 critical nets, suddenly another 200 become critical. This is rather like trying
to put worms in a can: as soon as we open the lid to put one in, two more pop out.
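
As a minimal sketch of the first method, the weight of a net can be taken as the number of critical paths it appears on. The path list below is a hypothetical stand-in for the output of a timing-analysis engine, and the function name net_weights is an assumption made for illustration.

```python
from collections import Counter

def net_weights(critical_paths):
    """Weight of each net = number of critical paths that contain it."""
    # set(path) guards against counting a net twice within one path
    return Counter(net for path in critical_paths for net in set(path))

# Hypothetical list of the n = 3 most critical paths, each given as the
# sequence of net names along the path.
critical_paths = [
    ["n1", "n4", "n7"],
    ["n2", "n4", "n7"],
    ["n3", "n5", "n7"],
]
weights = net_weights(critical_paths)
print(weights["n7"], weights["n4"], weights["n1"])
```

Nets that appear on none of the listed paths get no entry (a Counter returns 0 for them), so a real flow would probably add a base weight to every net.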
Another method to find the net weights uses the zero-slack algorithm [Hauge et al.,
1987]. Figure 16.29 shows how this works (all times are in nanoseconds).
Figure 16.29(a) shows a circuit with primary inputs at which we know the arrival
times (this is the original definition; some people use the term actual times) of each
signal. We also know the required times for the primary outputs, the points in time at
which we want the signals to be valid. We can work forward from the primary inputs
and backward from the primary outputs to determine arrival and required times at
each input pin for each net. The difference between the required and arrival times at
each input pin is the slack time (the time we have to spare). The zero-slack algorithm
adds delay to each net until the slacks are zero, as shown in Figure 16.29(b). The net
delays can then be converted to weights or constraints in the placement. Notice that
we have assumed that all the gates on a net switch at the same time so that the net
delay can be placed at the output of the gate driving the net. This is a rather poor
timing model, but it is the best we can use without any routing information.
FIGURE 16.29 The zero-slack algorithm. (a) The circuit with no net delays. (b) The
zero-slack algorithm adds net delays (at the outputs of each gate, equivalent to
increasing the gate delay) to reduce the slack times to zero.
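
The forward and backward passes that produce the slack times can be sketched as follows. This covers only the slack calculation that the zero-slack algorithm starts from, not the step that distributes delay onto the nets. It also simplifies the model by keeping one arrival and one required time per gate output rather than per input pin. The small circuit, its delays, and the function name slacks are hypothetical and are not the circuit of Figure 16.29.

```python
def slacks(fanin, delay, arrival_in, required_out):
    """Slack at every node; fanin must list gates in topological order."""
    # Forward pass: arrival time at a gate output is the latest input
    # arrival plus the gate delay.
    arrival = dict(arrival_in)
    for gate, inputs in fanin.items():
        arrival[gate] = max(arrival[i] for i in inputs) + delay[gate]

    # Backward pass: required time at a node is the earliest time that
    # any gate it drives needs its input.
    required = dict(required_out)
    for gate in reversed(list(fanin)):
        need = required[gate] - delay[gate]
        for i in fanin[gate]:
            required[i] = min(required.get(i, need), need)

    return {node: required[node] - arrival[node] for node in arrival}

# Hypothetical circuit (times in nanoseconds): primary inputs A and B,
# gates g1(A, B), g2(B), g3(g1, g2); g3 is the only primary output.
fanin = {"g1": ["A", "B"], "g2": ["B"], "g3": ["g1", "g2"]}
delay = {"g1": 2, "g2": 1, "g3": 3}
result = slacks(fanin, delay, arrival_in={"A": 0, "B": 1},
                required_out={"g3": 8})
print(result)  # {'A': 3, 'B': 2, 'g1': 2, 'g2': 3, 'g3': 2}
```

In this example the path B, g1, g3 has the smallest slack (2 ns), so it is the critical path; the zero-slack algorithm would hand out that spare time as extra allowed net delay along the path.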

An important point to remember is that adjusting the net weight, even for every net on
a chip, does not theoretically make the placement algorithms any more complex; we
have to deal with the numbers anyway. It does not matter whether the net weight is 1
or 6.6, for example. The practical problem, however, is getting the weight information
for each net (usually in the form of timing constraints) from a synthesis tool or timing
verifier. These files can easily be hundreds of megabytes in size (see Section 16.4).

With the zero-slack algorithm we simplify but overconstrain the problem. For
example, we might be able to do a better job by making some nets a little longer than
the slack indicates if we can tighten up other nets. What we would really like to do is
deal with paths such as the critical path shown in Figure 16.29(a) and not just nets.
Path-based algorithms have been proposed to do this, but they are complex and not all
commercial tools have this capability (see, for example, [Youssef, Lin, and
Shragowitz, 1992]).
There is still the question of how to predict path delays between gates with only
placement information. Usually we still do not compute a routing tree but use simple
approximations to the total net length (such as the half-perimeter measure) and then
use this to estimate a net delay (the same to each pin on a net). It is not until the
routing step that we can make accurate estimates of the actual interconnect delays.
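
A minimal sketch of this kind of estimate is shown below. The half-perimeter measure itself is standard, but the conversion to delay is an assumption made purely for illustration: a single hypothetical constant, NS_PER_UNIT, multiplies the estimated length, and the same delay is assigned to every pin on the net. The pin coordinates are also invented.

```python
import math

NS_PER_UNIT = 0.05  # hypothetical delay per unit of estimated wire length

def half_perimeter(pins):
    """Half the perimeter of the smallest rectangle enclosing all pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def estimated_net_delay(pins, ns_per_unit=NS_PER_UNIT):
    """One delay value for the whole net, the same to every pin."""
    return ns_per_unit * half_perimeter(pins)

# Hypothetical three-pin net: the bounding box is 3 wide and 4 high.
pins = [(0, 0), (3, 1), (1, 4)]
length = half_perimeter(pins)          # 3 + 4 = 7
net_delay = estimated_net_delay(pins)  # about 0.35 ns
print(length, round(net_delay, 3))
```

A delay that is linear in length is the crudest possible model; it is used here only to show where the length estimate enters the calculation.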

16.2.9 A Simple Placement Example


Figure 16.30 shows an example network and placements to illustrate the measures for
interconnect length and interconnect congestion. Figure 16.30(b) and (c) illustrate the
meaning of total routing length, the maximum cut line in the x-direction, the
maximum cut line in the y-direction, and the maximum density. In this example we
have assumed that the logic cells are all the same size, connections can be made to
terminals on any side, and the routing channels between each adjacent logic cell have
a capacity of 2. Figure 16.30 (d) shows what the completed layout might look like.

FIGURE 16.30 Placement example. (a) An example network. (b) In this placement, the
bin size is equal to the logic cell size and all the logic cells are assumed equal size.
(c) An alternative placement with a lower total routing length. (d) A layout that might
result from the placement shown in (b). The channel densities correspond to the
cut-line sizes. Notice that the logic cells are not all the same size (which means there
are errors in the interconnect-length estimates we made during placement).
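
The cut-line measure can be sketched in code as follows. This is an illustrative sketch only: the 2 x 2 placement and the netlist are hypothetical (they are not the network of Figure 16.30), a net is assumed to cross a cut line whenever it has cells on both sides, and the function name cut_counts is invented.

```python
def cut_counts(nets, pos, axis):
    """Nets crossing each cut line between adjacent rows or columns.

    axis = 0 counts cuts between neighboring x positions,
    axis = 1 counts cuts between neighboring y positions.
    """
    coords = sorted({p[axis] for p in pos.values()})
    counts = {}
    for lo, hi in zip(coords, coords[1:]):
        counts[(lo, hi)] = sum(
            1 for net in nets
            if min(pos[c][axis] for c in net) <= lo
            and max(pos[c][axis] for c in net) >= hi
        )
    return counts

# Hypothetical placement of four equal-sized cells on a 2 x 2 grid.
pos = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (1, 1)}
nets = [("A", "B"), ("A", "D"), ("C", "D"), ("B", "D")]

x_cuts = cut_counts(nets, pos, axis=0)
y_cuts = cut_counts(nets, pos, axis=1)
max_x_cut = max(x_cuts.values())
max_y_cut = max(y_cuts.values())
print(max_x_cut, max_y_cut)  # 3 2
```

With a channel capacity of 2, as assumed in the example above, the cut of 3 between the two columns would indicate congestion in this hypothetical placement, while the cut of 2 between the rows just fits.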
16.3 Physical Design Flow
Historically placement was included with routing as a single tool (the term P&R
is often used for place and route). Because interconnect delay now dominates
gate delay, the trend is to include placement within a floorplanning tool and use a
separate router. Figure 16.31 shows a design flow using synthesis and a
floorplanning tool that includes placement. This flow consists of the following
steps:
1. Design entry. The input is a logical description with no physical
information.
2. Synthesis. The initial synthesis contains little or no information on any
interconnect loading. The output of the synthesis tool (typically an EDIF
netlist) is the input to the floorplanner.
3. Initial floorplan. From the initial floorplan interblock capacitances are
input to the synthesis tool as load constraints and intrablock capacitances
are input as wire-load tables.
4. Synthesis with load constraints. At this point the synthesis tool is able to
resynthesize the logic based on estimates of the interconnect capacitance
each gate is driving. The synthesis tool produces a forward annotation file
to constrain path delays in the placement step.
5. Timing-driven placement. After placement using constraints from the
synthesis tool, the location of every logic cell on the chip is fixed and
accurate estimates of interconnect delay can be passed back to the
synthesis tool.
6. Synthesis with in-place optimization (IPO). The synthesis tool changes
the drive strength of gates based on the accurate interconnect delay
estimates from the floorplanner without altering the netlist structure.
7. Detailed placement. The placement information is ready to be input to the
routing step.
FIGURE 16.31 Timing-driven floorplanning and placement design flow.
Compare with Figure 15.1 on p. 806.
In Figure 16.31 we iterate between floorplanning and synthesis, continuously
improving our estimate for the interconnect delay as we do so.
16.5 Summary
Floorplanning follows the system partitioning step and is the first step in
arranging circuit blocks on an ASIC. There are many factors to be considered
during floorplanning: minimizing connection length and signal delay between
blocks; arranging fixed blocks and reshaping flexible blocks to occupy the
minimum die area; organizing the interconnect areas between blocks; planning
the power, clock, and I/O distribution. The handling of some of these factors may
be automated using CAD tools, but many still need to be dealt with by hand.
Placement follows the floorplanning step and is more automated. It consists of
organizing an array of logic cells within a flexible block. The criterion for
optimization may be minimum interconnect area, minimum total interconnect
length, or performance. There are two main types of placement algorithms: those
based on min-cut methods and those based on eigenvector methods. Because
interconnect delay in a submicron
CMOS process dominates logic-cell delay, planning of interconnect will become
more and more important. Instead of completing synthesis before starting
floorplanning and placement, we will have to use synthesis and
floorplanning/placement tools together to achieve an accurate estimate of timing.
The key points of this chapter are:
● Interconnect delay now dominates gate delay.
● Floorplanning is a mapping between logical and physical design.
● Floorplanning is the center of ASIC design operations for all types of ASIC.
● Timing-driven floorplanning is becoming an essential ASIC design tool.
● Placement is now an automated function.