Abstract—Placement is an important step in modern very-large-scale integrated (VLSI) designs. Detailed placement is a placement refining procedure intensively called throughout the design flow, so its efficiency has a vital impact on design closure. However, since most detailed placement techniques are inherently greedy and sequential, they are generally difficult to parallelize. In this work, we present a concurrent detailed placement framework, ABCDPlace, exploiting multithreading and GPU acceleration. We propose batch-based concurrent algorithms for widely-adopted sequential detailed placement techniques, such as independent set matching, global swap, and local reordering. Experimental results demonstrate that ABCDPlace can achieve 2x-5x faster runtime than sequential implementations with multi-threaded CPUs and over 10x with GPUs on ISPD 2005 contest benchmarks without quality degradation. On larger industrial benchmarks, we show more than 16x speedup with GPU over the state-of-the-art sequential detailed placer. ABCDPlace finishes the detailed placement of a 10-million-cell industrial design in one minute.

Y. Lin is with the Center for Energy-Efficient Computing and Applications, School of Electronics Engineering and Computer Science, Peking University, Beijing, China.
W. Li is with Xilinx Inc., CA, USA.
J. Gu and D. Z. Pan are with the Department of Electrical and Computer Engineering, The University of Texas at Austin, TX, USA.
H. Ren and B. Khailany are with NVIDIA Corporation, Austin, TX, USA.
This work was supported in part by NVIDIA.

I. INTRODUCTION

Placement is a critical stage in the VLSI design flow. It determines the physical locations of logic gates (cells) in the circuit layout, and its solution has a huge impact on the subsequent routing and post-routing closure. Placement usually consists of three stages: global placement, legalization, and detailed placement. Global placement provides rough locations of standard cells. Legalization then removes overlaps and design rule violations based on the global placement solution. In the end, detailed placement incrementally improves the solution quality. In VLSI design iterations, detailed placement may be invoked many times to recover the solution quality from post-placement perturbations, such as buffer insertion, routability optimization, and timing optimization. Therefore, the efficiency and quality of detailed placement play an important role in speeding up the design iterations.

Detailed placement widely involves combinatorial optimization, graph algorithms, and greedy heuristics. Various effective algorithms have been proposed with the strategy of extracting a subset of cells and exploring the corresponding solution space iteratively [1]-[11], including independent set matching, global swap, local reordering, row-based refinement, etc. These algorithms are usually designed for sequential execution on single-threaded machines and are very difficult to parallelize.

With the increasing design scale and complexity, sequential algorithms are encountering efficiency challenges under tight time-to-market budgets. They are becoming bottlenecks that hinder the turn-around time. As most of the recent research efforts for detailed placement have switched to incorporating new objectives and constraints, such as routability, mixed-cell-height designs, and manufacturing constraints [12]-[17], there is little progress on the core detailed placement problem for wirelength minimization. Figure 1 roughly sketches the runtime scaling trends for recent placers. While the comparison may not be fair due to different objectives and constraints during optimization, it still shows that the runtime of placement engines has not improved even with more and more powerful CPUs. NTUplace3 [6], proposed in 2008, is still competitive in efficiency for wirelength optimization and is widely used in the most recent placement research [18]-[21].

With modern computing platforms like multi-core processors and graphics processing units (GPUs), massively parallel computing has the potential to accelerate placement optimization. So far, a majority of the literature has explored global placement acceleration or simulated annealing-based approaches [22]-[27]. The recent study from DREAMPlace [28] demonstrated that after accelerating global placement, detailed placement becomes the runtime bottleneck, taking more than 75% of the entire placement time for a benchmark with 2M cells. Thus, accelerated detailed placement engines are urgently desired to further speed up the flow. However, the greedy and iterative nature of existing detailed placement techniques raises the bar of effective parallelization. There is limited prior work investigating the potential of massive parallelization for detailed placement techniques. Only recently, Dhar et al. [29] explored multithreading and GPU acceleration for a row-based interleaving algorithm in FPGA placement. This is far from enough, as effective detailed placement engines usually consist of multiple techniques and typically involve graph algorithms and greedy heuristics.

With increasingly powerful multi-core processors and GPUs, parallel computing has demonstrated its efficiency in solving large graph problems [30]-[32]. As detailed placement heavily involves graph traversal and analytics, there is a high potential to accelerate detailed placement algorithms with massive parallelization as well. Therefore, in this work, we present ABCDPlace, an open-source GPU-accelerated detailed placement engine leveraging batch-based concurrency. We redesign the widely-adopted detailed placement techniques and propose parallel versions for multi-threaded CPUs and GPUs. The main contributions of the paper are summarized as follows.
• We propose an open-source batch-based concurrent detailed placement framework, ABCDPlace, with multithreading and GPU acceleration.
• We propose parallel detailed placement algorithms for widely-adopted sequential techniques, such as independent set matching, global swap, and local reordering, leveraging batch execution.
• Experimental results demonstrate that, compared with the highly efficient sequential detailed placer NTUplace3 [6], ABCDPlace is able to achieve over 10x and 16x speedup with GPUs on ISPD 2005 contest benchmarks and industrial benchmarks, respectively, without quality degradation. The multi-threaded CPU version also achieves around 2x-5x speedup. Experiments on ISPD 2015 contest benchmarks also indicate that our placer does not degrade the global routing congestion.
Fig. 1: Rough runtime scaling for the recent development of detailed placement engines [6], [12], [13], [17], [33]. The runtime values are collected from the papers with the CPU frequencies shown in the legend, except for NTUplace3 [6], where we ran the experiments with the binary release. (Plot: runtime in seconds versus #cells in thousands, for NTUplace3 (TCAD08, 2.1GHz), Chow+ (ISPD14, 3.4GHz), Gang+ (TCAD15, 2.4GHz), MrDP (TCAD17, 3.4GHz), and Chen+ (ICCAD18, 3.4GHz).)

Fig. 2: Independent set matching. (a) A set of independent cells within a window. (b) Construction of a bipartite graph to find the best assignment.

Fig. 3: Example of local reordering with a sliding window. (a) Sliding window at one step; (b) next step.
ABCDPlace has been integrated into DREAMPlace 2.0 as the default detailed placement engine released on GitHub¹. Each algorithm is implemented as an operator that can be invoked with the Python API in DREAMPlace. The rest of the paper is organized as follows. Section II introduces the background of detailed placement algorithms and the problem formulation. Section III explains the parallel algorithms in detail. Section VI validates the algorithms with experimental results. Section VII concludes the paper.

II. PRELIMINARIES

Detailed placement typically assumes that a legal initial placement is given and performs incremental refinement. The main objective is usually half-perimeter wirelength (HPWL), which is computed from the bounding box of each net as follows,

HPWL = \sum_{e \in E} \left( \max_{i,j \in e} |x_i - x_j| + \max_{i,j \in e} |y_i - y_j| \right),   (1)

where e represents a net (hyperedge) in the set of nets E, and i, j represent any two cells connected by e. The output of detailed placement is a legal placement solution with optimized wirelength.
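To make Eq. (1) concrete, the sketch below evaluates the HPWL of a netlist stored in a flat compressed-sparse-row (CSR) form. The array names (net2pin_start, pin2cell) and the use of cell centers as pin locations are simplifying assumptions for illustration, not ABCDPlace's actual data layout.

```cpp
#include <vector>
#include <algorithm>
#include <limits>

// Sum over all nets of the half-perimeter of the net's bounding box, Eq. (1).
double computeHPWL(const std::vector<int>& net2pin_start,   // size #nets+1 (CSR offsets)
                   const std::vector<int>& pin2cell,        // cell index of each pin
                   const std::vector<double>& x,            // cell x coordinates
                   const std::vector<double>& y) {          // cell y coordinates
  double hpwl = 0.0;
  int num_nets = static_cast<int>(net2pin_start.size()) - 1;
  for (int e = 0; e < num_nets; ++e) {
    if (net2pin_start[e + 1] - net2pin_start[e] < 2) continue;  // skip degenerate nets
    double xmin = std::numeric_limits<double>::max(), xmax = -xmin;
    double ymin = xmin, ymax = xmax;
    for (int p = net2pin_start[e]; p < net2pin_start[e + 1]; ++p) {
      int c = pin2cell[p];
      xmin = std::min(xmin, x[c]); xmax = std::max(xmax, x[c]);
      ymin = std::min(ymin, y[c]); ymax = std::max(ymax, y[c]);
    }
    hpwl += (xmax - xmin) + (ymax - ymin);  // half-perimeter of the bounding box
  }
  return hpwl;
}
```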
In general, a detailed placer often runs several key strategies to explore different solution spaces iteratively. For example, FastPlace serializes iterations of global swap and local reordering until no significant wirelength improvement occurs [2], [11], [34]. NTUplace serializes iterations of local reordering (branch-and-bound cell swap [6]) and independent set matching [6], [35]. Each of these strategies extracts a small set of cells and perturbs them to find a better solution. In this work, we focus on the parallelization of the following three strategies: (1) independent set matching; (2) global swap; (3) local reordering.

A. Independent Set Matching

As illustrated in Figure 2, independent set matching extracts a set of mutually unconnected cells within a window and assigns them to candidate locations by solving a bipartite matching problem. Sometimes one cell cannot be assigned to the location of another cell due to lack of space. In this case, no edge is added between that cell and that location in the bipartite graph. The independent set is created by searching for unconnected cells greedily in a window. Due to the connectivity between cells, a sequential implementation is again a natural choice.

B. Global Swap

A general description of global swap is to repeat the following process: pick a cell i; find another cell or space j in a search region that maximally improves the wirelength after the swap; then swap the two. There are various heuristics to determine the search region for a cell. It can be the bin in which the cell is located or the optimal region of the cell [2]. Considering that a subsequent cell movement depends on the previous cell movements due to connectivity and potential overlap issues, this process is usually performed sequentially.

C. Local Reordering

Local reordering shuffles a sequence of consecutive cells to find the best permutation [2], [6], as shown in Figure 3. It is a window-based strategy that works on k cells at each step and repeats by sliding the window from left to right. As the number of permutations for a sequence of k cells is k!, it is only affordable to use a small value of k, e.g., 3 or 4. When k goes down to 2, local reordering becomes a special case of global swap. As the sliding windows have to overlap for a large enough solution space and cells are connected, most existing implementations are sequential.
Fig. 4: (a) Comparison between CPU and GPU architectures. (b) Computation units and memory hierarchy on a GPU.

E. GPU Architecture and Programming

GPU programming is quite different from CPU programming due to the discrepancy in architectures and programming models. Figure 4 compares the architectures of a CPU and a GPU. A CPU is roughly composed of a control unit, computation units (ALUs), cache, and memory (DRAM). A GPU also has such components, but with very different scale and performance. It consists of a grid of computation units with simple control units and small caches, which indicates that a GPU prefers parallel execution of small tasks with simple control flows. Each computation unit on a GPU may not be as powerful as that on a CPU, but due to massive parallelization, the GPU can potentially be faster.

Unlike the relatively mature programming models on CPUs, GPU programming still requires careful design of both algorithms and implementations. Performance can be much slower than on a CPU even for a fully parallelizable task if the implementation is poor. The reason mainly comes from the flexible configuration of the computation units at runtime. The computation units on a GPU can be viewed as a grid of blocks. Each block consists of many threads (at most 1024 for most GPUs). A block can be assigned a piece of shared memory that can be accessed by its threads more efficiently than the global memory. However, there is an upper limit on the total amount of shared memory for all blocks, e.g., 48-96KB depending on the GPU device. Moreover, thread synchronization within a block is much cheaper than device-level synchronization. To perform computation on a GPU, a program on the host CPU needs to launch a kernel function call with the configuration of blocks, threads, and shared memory. Such a function call has an overhead of around several microseconds. In other words, frequent kernel calls are not preferred in GPU programming.

With all these differences in hardware architectures and programming models, straightforward parallelization schemes for CPUs often do not work on GPUs. In other words, GPUs require specially designed algorithms and threading schemes to demonstrate the power of massive parallelization.
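As a minimal illustration of this launch model (a toy example, not taken from ABCDPlace), the kernel below is invoked with an explicit grid, block, and shared-memory configuration; each thread handles one array element.

```cpp
__global__ void scaleKernel(float* data, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n) data[i] *= alpha;                    // one element per thread
}

void launchScale(float* d_data, float alpha, int n) {
  int threads_per_block = 256;                            // at most 1024 on most GPUs
  int num_blocks = (n + threads_per_block - 1) / threads_per_block;
  size_t shared_bytes = 0;                                // no shared memory needed here
  scaleKernel<<<num_blocks, threads_per_block, shared_bytes>>>(d_data, alpha, n);
}
```

Each such launch costs a few microseconds of host-side overhead, which is why the batch-based algorithms later in the paper try to keep long-running loops inside a single kernel.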
III. THE ABCDPLACE ALGORITHMS

This section explains the details of the concurrent versions of independent set matching, global swap, and local reordering.

Algorithm 1 Sequential Independent Set Matching
Require: A circuit netlist G = (V, E), locations of cells, and the maximum size of an independent set L;
Ensure: Minimize wirelength by independent set matching;
1: for each cell v ∈ V do
2:   Search independent cells with the same sizes as v in the neighborhood and form an independent set I, s.t., |I| ≤ L;
3:   Compute the costs of permuting cell locations in I;
4:   Solve the LAP with the weights;
5:   Apply the assignment solution;

A. Concurrent Independent Set Matching

Algorithm 1 sketches a rough procedure for independent set matching according to [6]. The word independent describes cells not connected to each other. Thus, the movement cost for a cell in an independent set can be computed without considering the locations of other cells in the set. With an independent set, we can construct a bipartite graph and solve the linear assignment problem (LAP) for the best locations of the cells in the set. The algorithm follows four steps iteratively: 1) extract an independent set; 2) compute the permutation costs, i.e., the cost of moving one cell to the location of another in the independent set I; 3) solve the LAP; 4) move the cells according to the LAP solution. The algorithm is very difficult to parallelize following the same procedure, because the maximum size of an independent set L is usually limited to around 100. Even though the cost computation step for an L × L matrix can be parallelized, the bulk of the runtime actually comes from independent set extraction and LAP solving, which are typically sequential.

To improve the parallelization, we design a concurrent independent set matching algorithm. Although the algorithm still runs in iterations, it converges much faster than iterating through all cells as in Algorithm 1. The major advantage lies in the fully parallelizable internal steps. Figure 5 provides an intuitive explanation of the steps. We first extract a maximal independent set² with given seeds, which is likely to contain tens of thousands of cells. The set is then partitioned into many small subsets such that physically close cells with the same sizes are in the same subsets. Next, we can solve the LAP instances for all the subsets in the batch independently after computing the costs of cell permutation. In the end, the solutions are applied in parallel as well. The rest of the section covers the non-trivial parallel implementation of these steps, including parallel maximal independent set extraction, parallel partitioning, and batch LAP solving.

² A maximal independent set is different from a maximum independent set. The former can be found greedily, while the latter is NP-complete [36].

1) Parallel Maximal Independent Set Extraction: A sequential maximal independent set algorithm follows this procedure: for each node in the graph, if it is not yet covered by the set, add it to the set and remove all its neighbors from the graph. This algorithm has a time complexity of O(|V|) if |E| is of the same magnitude as |V|, as it needs to traverse the entire graph once. The algorithm becomes slow with large graph sizes. There exist parallel maximal independent set algorithms that can achieve O(log² |V|) time complexity, e.g., Blelloch's algorithm [37].

Algorithm 2 describes a parallel maximal independent set algorithm based on Blelloch's algorithm [37] that is suitable for our application. The algorithm is general enough to handle hypergraphs as well.

Algorithm 2 Parallel Maximal Independent Set Algorithm
Require: A graph G = (V, E), a random order R, s.t., |R| = |V|;
Ensure: A maximal independent set I containing v_{argmin R} ∈ V;
1: I ← ∅;
2: while V is not empty do
3:   for each v ∈ V do                             ▷ Parallel kernel
4:     if R(v) < R(w), ∀(v, w) ∈ E then
5:       I ← I ∪ {v};                              ▷ Add to I
6:       G ← G \ v;                                ▷ Remove v from G
7:       G ← G \ w, ∀(v, w) ∈ E;                   ▷ Remove v's neighbors

Lines 4 to 7 ensure that only a vertex v with the lowest order R(v) among its neighbors joins I. As the vertex with the globally smallest R value always joins I, the algorithm is guaranteed to make progress in each round (lines 3 to 7). It takes at most O(log² |V|) rounds. Meanwhile, the task within each round is fully parallelizable, as each vertex can be processed independently within the for loop. In practice, early exit is possible once enough vertices are collected for solving LAPs. The random order sequence R can also be generated efficiently with a parallel random shuffling of the sequence 0, 1, ..., |V|.
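A possible CUDA realization of one round of Algorithm 2 is sketched below, split into a selection kernel and a commit kernel so that each round is race-free. The CSR adjacency arrays, the precomputed random ranks (assumed distinct), and the status encoding (0 = undecided, 1 = in I, 2 = removed) are assumptions of this sketch rather than the exact ABCDPlace implementation; the host re-launches both kernels, clearing `selected` and `changed` in between, until no vertex changes or enough vertices have been collected.

```cpp
// Phase 1: a vertex proposes itself if its rank is the smallest among its
// still-undecided neighbors (line 4 of Algorithm 2).
__global__ void misSelect(const int* adj_start, const int* adj, const int* rank,
                          const int* status, int* selected, int num_vertices) {
  int v = blockIdx.x * blockDim.x + threadIdx.x;
  if (v >= num_vertices || status[v] != 0) return;
  bool smallest = true;
  for (int k = adj_start[v]; k < adj_start[v + 1]; ++k) {
    int w = adj[k];
    if (status[w] == 0 && rank[w] < rank[v]) { smallest = false; break; }
  }
  selected[v] = smallest ? 1 : 0;
}

// Phase 2: selected vertices join I (line 5); their undecided neighbors are
// removed from the graph (lines 6-7).
__global__ void misCommit(const int* adj_start, const int* adj, const int* selected,
                          int* status, int num_vertices, int* changed) {
  int v = blockIdx.x * blockDim.x + threadIdx.x;
  if (v >= num_vertices || status[v] != 0) return;
  if (selected[v]) {
    status[v] = 1;            // v joins the maximal independent set
    *changed = 1;
  } else {
    for (int k = adj_start[v]; k < adj_start[v + 1]; ++k) {
      if (selected[adj[k]]) { status[v] = 2; break; }  // covered by a neighbor in I
    }
  }
}
```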
Fig. 5: Steps for batch-based concurrent independent set matching. (a) Maximal independent set extraction. (b) Independent set partitioning. (c) Linear assignment solving for a batch of bipartite graphs.

Fig. 6: Runtime comparison of batch solving for LAPs on CPU (Hungarian, network simplex, and auction; runtime in ms versus batch size). "20T" stands for 20 threads on CPU.

2) Parallel Partitioning with Balanced K-Means Clustering: The maximal independent set is too large for LAP algorithms. Observing that most of the cell movement happens locally, it is good to partition the independent set into a batch of small subsets such that cells in the same subset are physically close to each other. A sequential implementation might distribute the cells into bins and perform a spiral walk to greedily add nearby cells to subsets. This is not runtime-efficient on GPUs because of its sequential nature and the irregular memory access patterns of the spiral search, while on CPUs the runtime is acceptable and not the bottleneck. Therefore, we adopt K-Means clustering for the GPU in this step, leveraging parallel reduction to find the closest centroids for clusters. The conventional objective for K-Means clustering is to find K centers,

\min \sum_{i=1}^{K} \sum_{x \in S_i} \| x - \mu_i \|^2,   (2)

where \mu denotes the centers. However, such an objective may result in imbalanced clusters. This is not preferred for parallel solving of LAP instances, especially for GPU acceleration. To achieve balanced clustering results, we consider a weighted objective,

\min \sum_{i=1}^{K} \sum_{x \in S_i} w_i \| x - \mu_i \|^2,   (3)

where the weight w_i for each cluster is adjusted in each iteration to penalize large clusters. Given a target cluster size s_t, the weights are initialized to 1 and empirically updated at the k-th iteration as follows,

w_i^{k+1} \gets w_i^k \times (1 + 0.5 \cdot \log(\max\{1, |S_i| / s_t\})).   (4)

Intuitively, the weight of a cluster increases if its size goes beyond s_t. This is an empirically determined function; other functions with similar trends may also work. In the experiment, the number of clusters K is computed as the ratio of the number of nodes in the maximal independent set over the expected cluster size s_t, where s_t = 128. We observe that 2 iterations of K-Means already achieve reasonable partitioning results with a rather balanced distribution of subsets.
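The host-side sketch below shows one iteration of this weighted clustering: cells are assigned to the centroid minimizing the weighted distance of Eq. (3), centroids are recomputed, and the cluster weights are updated with Eq. (4). The array names and the single-threaded loops are illustrative; in the GPU version the nearest-centroid search is done with parallel reductions, as described above.

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
#include <limits>

// One iteration of balanced K-Means (Eqs. (3)-(4)). cx/cy: cell centers,
// mux/muy: centroids, w: per-cluster weights, assign: output cluster ids.
void balancedKMeansIteration(const std::vector<float>& cx, const std::vector<float>& cy,
                             std::vector<float>& mux, std::vector<float>& muy,
                             std::vector<float>& w, std::vector<int>& assign,
                             int target_size /* s_t, e.g., 128 */) {
  int n = static_cast<int>(cx.size()), K = static_cast<int>(mux.size());
  std::vector<int> size(K, 0);
  std::vector<float> sumx(K, 0.f), sumy(K, 0.f);
  for (int i = 0; i < n; ++i) {
    int best = 0;
    float best_cost = std::numeric_limits<float>::max();
    for (int k = 0; k < K; ++k) {
      float dx = cx[i] - mux[k], dy = cy[i] - muy[k];
      float cost = w[k] * (dx * dx + dy * dy);        // weighted distance, Eq. (3)
      if (cost < best_cost) { best_cost = cost; best = k; }
    }
    assign[i] = best;
    ++size[best]; sumx[best] += cx[i]; sumy[best] += cy[i];
  }
  for (int k = 0; k < K; ++k) {
    if (size[k] > 0) { mux[k] = sumx[k] / size[k]; muy[k] = sumy[k] / size[k]; }
    // Penalize clusters that grow beyond the target size, Eq. (4).
    w[k] *= 1.f + 0.5f * std::log(std::max(1.f, static_cast<float>(size[k]) / target_size));
  }
}
```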
3) Batch Solving for Linear Assignment Problems: The LAP can be solved with many algorithms, such as the Hungarian algorithm, network flow algorithms, the auction algorithm, etc. The mathematical formulation can be written as,

\max \sum_{i,j} a_{ij} x_{ij},
s.t. \sum_{i=1}^{N} x_{ij} = 1, \quad j = 1, 2, \ldots, N,
     \sum_{j=1}^{N} x_{ij} = 1, \quad i = 1, 2, \ldots, N,        (5)
     x_{ij} \geq 0, \quad i, j = 1, 2, \ldots, N,

where x_{ij} = 1 indicates assigning i to j, and a_{ij} is the weight of such an assignment. The Hungarian and network flow algorithms are widely adopted as sequential solvers [6], [33], while auction algorithms [38] are generally the choice for distributed computing platforms due to their easy-to-parallelize nature [32], [39].

Figure 6 shows a comparison of solving a batch of LAP instances with different algorithms on a multi-threaded CPU. Hungarian adopts the implementation from [40], and network simplex (a highly efficient network flow algorithm in practice [33]) uses the solver from Lemon [41]. The auction algorithm adopts our own implementation described in Algorithm 3. Each LAP instance is solved with a single thread. We can see that network simplex is much faster than the Hungarian algorithm, and our auction algorithm achieves a further 2x speedup over network simplex. While the solvers are mostly treated as black boxes in placement algorithms, it is sometimes necessary to study their details for acceleration targeting specific hardware platforms.

Auction algorithms consider the problem in which N persons (cells in this work) bid for N items (locations in this work) with weights a_{ij} (the negative wirelength cost of assigning a cell to a location, as the algorithm maximizes the objective). A typical auction algorithm repeats a bidding phase and an assignment phase that are fully parallelizable, as shown in Algorithm 3.
Algorithm 3 Auction Algorithm for One LAP Instance
Require: An N × N weight matrix A for the LAP and auction ε;
Ensure: Find the assignment solution with maximum objective;
1: Define price as a length-N array, initialized to 0;
2: Define bid as an N × N matrix, sbid as a length-N array;
3: while not all items assigned do
4:   bid_{ij} ← 0, ∀i, j;  sbid_j ← 0, ∀j;
5:   for each person i do                            ▷ Parallel bidding kernel
6:     v_{ij*} ← max_j (a_{ij} − price_j);           ▷ Largest
7:     w_{ij*} ← max_{j≠j*} (a_{ij} − price_j);      ▷ Second largest
8:     bid_{ij*} ← v_{ij*} − w_{ij*} + ε;
9:     sbid_{j*} ← 1;                                ▷ Atomic
10:  for each item j do                              ▷ Parallel assignment kernel
11:    if sbid_j then
12:      b_{i*} ← max_i bid_{ij};
13:      if item j has been assigned then
14:        Unassign j;
15:      price_j ← price_j + b_{i*};
16:      Assign item j to person i*;

In the bidding phase from line 5 to line 9, each person finds the item j* with the largest a_{ij} − price_j value, recorded in the temporary variable v_{ij*}, and the second largest a_{ij} − price_j value, recorded in the temporary variable w_{ij*}. Then the price increment for bidding is computed as bid_{ij*}, and item j* is marked in sbid_{j*}. In the assignment phase from line 10 to line 16, each bid item looks for the bidder i* with the highest bidding price increment b_{i*}, assigns itself to person i*, and raises its price by b_{i*}. Looping between these two phases leads to an optimal assignment solution with the maximum objective. The auction epsilon ε, which indicates the price augmentation step, controls the numerical precision of convergence to the optimal solution. Parallelization of Algorithm 3 can be realized by parallelizing the bidding and assignment phases. For example, one thread can be allocated to work on each person independently in the bidding phase (lines 5 to 9); we can also have one thread work on each item independently in the assignment phase.
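The following host-side sketch mirrors Algorithm 3 for one N × N instance with a row-major weight matrix; it is a straightforward sequential rendering for illustration, not the tuned ABCDPlace solver. On the GPU, the two inner loops become the bidding and assignment kernels with atomic updates on the bid arrays.

```cpp
#include <vector>

// Auction algorithm for one LAP instance (cf. Algorithm 3). A[i*N+j] is the
// benefit of assigning person (cell) i to item (location) j.
void auctionLAP(const std::vector<float>& A, int N, float eps,
                std::vector<int>& person2item) {
  std::vector<float> price(N, 0.f);
  std::vector<int> item2person(N, -1);
  person2item.assign(N, -1);
  int num_assigned = 0;
  while (num_assigned < N) {
    // Bidding phase: each unassigned person bids for its most profitable item.
    std::vector<float> best_bid(N, -1.f);
    std::vector<int> best_bidder(N, -1);
    for (int i = 0; i < N; ++i) {
      if (person2item[i] >= 0) continue;
      int jstar = -1; float v = -1e38f, w = -1e38f;   // largest / second largest profit
      for (int j = 0; j < N; ++j) {
        float profit = A[i * N + j] - price[j];
        if (profit > v)      { w = v; v = profit; jstar = j; }
        else if (profit > w) { w = profit; }
      }
      float bid = v - w + eps;                        // price increment (line 8)
      if (bid > best_bid[jstar]) { best_bid[jstar] = bid; best_bidder[jstar] = i; }
    }
    // Assignment phase: each item takes its highest bidder and raises its price.
    for (int j = 0; j < N; ++j) {
      if (best_bidder[j] < 0) continue;
      int i = best_bidder[j];
      if (item2person[j] >= 0) { person2item[item2person[j]] = -1; --num_assigned; }
      item2person[j] = i; person2item[i] = j; ++num_assigned;
      price[j] += best_bid[j];
    }
  }
}
```

On the CPU side, a batch of such instances can simply be distributed across threads, e.g., with an OpenMP parallel for over the instances, which corresponds to the one-thread-per-LAP strategy used for the multi-threaded comparison in Figure 6.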
Efforts have been spent on accelerating LAP algorithms with large N (> 1000). However, in this problem, cells only need local movements, so each LAP instance is small, with N around 100, and there are many such small instances [6]. Our simple experiment on an LAP with N = 128 shows that the GPU implementation requires around 5ms, while a single-threaded CPU implementation takes around 17ms, 2ms, and 1ms using the Hungarian algorithm [40], network simplex [41], and the auction algorithm (equivalent to the results of batch size 1 in Figure 6), respectively, making the GPU unable to compete with CPUs.

Therefore, a batch-based auction algorithm is required to solve multiple LAP instances with the same problem size simultaneously with massive parallelization on GPUs. A naive way for batch execution is to adopt CUDA's multi-stream scheme by assigning each LAP to one CUDA stream. Unfortunately, multi-stream is usually inefficient for thousands of streams due to crowded kernel launches, and we even observe longer runtime with it. Hence, we propose a GPU implementation specifically optimized for batch solving of small LAP instances, as shown in Figure 7 and Algorithm 4. To mitigate the expensive communication and synchronization overhead between CPUs and GPUs, the batch-based auction algorithm integrates the entire while loop into a CUDA kernel and assigns each LAP instance to one thread block. By doing so, expensive device-wide synchronization is minimized and the number of kernel launches is reduced from O(BK), empirically around 1 million, down to one, where B is the batch size and K is the largest number of iterations for one LAP instance in the batch, usually larger than 1000. This implementation is specifically designed for small LAP instances because the maximum number of threads in one block is typically limited to 1024, which is a constraint from GPUs. Although device-wide synchronization is no longer required, block-wise synchronization is still needed, as shown in lines 8 and 10 of Algorithm 4, and it can be invoked within a kernel. Threads within a thread block need to wait for all their tasks to finish before moving forward. Furthermore, we introduce an annealing scheme for ε to speed up convergence by solving each LAP instance from a large ε_max down to a small ε_min and using the prices of the previous solve as the starting point for the next, as described in lines 5 and 13 of Algorithm 4. In the experiment, we set ε_max = 10, ε_min = 1, γ = 0.1. We improve the runtime on GPU by more than 100x over the multi-stream implementation for a typical batch size of 1024 and N = 128.

Fig. 7: Parallelization scheme for the batch-based auction algorithm. One thread block is used to solve one LAP, with the while loop inside the kernel.

Algorithm 4 Batch-Based Auction Algorithm for LAP Instances
Require: A B × N × N weight tensor, ε_max, ε_min, γ;
Ensure: Find a B × N assignment matrix with maximum objectives;
1: Define price as a B × N matrix, initialized to 0;
2: Define bid as a B × N × N tensor, sbid as a B × N matrix;
3: ε ← ε_max;                                        ▷ Parallel kernel begins
4: b ← blockIdx, tid ← threadIdx;
5: while ε ≥ ε_min do
6:   while not all items assigned for LAP instance b do
7:     Initialize bid and sbid;
8:     Synchronize threads;
9:     Bidding phase for person tid with ε;
10:    Synchronize threads;
11:    Assignment phase for item tid;
12:    Synchronize threads;
13:  ε ← γε;                                          ▷ Parallel kernel ends
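The skeleton below illustrates how the structure of Algorithm 4 maps onto CUDA: one thread block per LAP instance, the annealed ε loop and the auction loop living entirely inside the kernel, and block-level synchronization between the bidding and assignment phases. The phase bodies and the convergence check are intentionally elided (marked as placeholders), so this is a structural sketch under those assumptions rather than a complete solver.

```cpp
__global__ void batchAuctionKernel(const float* A /* B x N x N weights */, int N,
                                   float eps_max, float eps_min, float gamma,
                                   int* person2item /* B x N solution */) {
  extern __shared__ float price[];           // one price array per block (length N)
  int b = blockIdx.x;                        // LAP instance handled by this block
  int tid = threadIdx.x;                     // acts as person tid and item tid
  const float* Ab = A + static_cast<size_t>(b) * N * N;
  int* sol = person2item + static_cast<size_t>(b) * N;
  if (tid < N) price[tid] = 0.f;             // prices persist across eps stages (warm start)
  __syncthreads();
  for (float eps = eps_max; eps >= eps_min; eps *= gamma) {     // lines 3, 5, 13
    if (tid < N) sol[tid] = -1;              // restart assignments for this eps stage
    __syncthreads();
    bool all_assigned = false;
    while (!all_assigned) {                  // line 6
      // Bidding phase for person tid using Ab and price (elided).       // line 9
      __syncthreads();                                                   // line 10
      // Assignment phase for item tid, updating price and sol (elided). // line 11
      __syncthreads();                                                   // line 12
      all_assigned = true;  // placeholder for the real convergence check
    }
  }
  (void)Ab;  // silence unused-variable warnings in this elided sketch
}
```

A launch such as batchAuctionKernel<<<B, N, N * sizeof(float)>>>(...) then solves the entire batch with a single kernel call.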
B. Concurrent Global Swap

Global swap is another widely-adopted placement technique [2], [4]. Without loss of generality, a typical procedure of global swap is shown in Algorithm 5. It consists of five major steps and iteratively runs for each cell. There are various heuristics to compute the search region, such as directly using the bin in which a cell is located or its optimal region [2]. Moreover, during the sequential search for the best swap candidates, we can start from potentially good regions and move to bad ones, so that early exit is possible once a candidate has been found.

Algorithm 5 Sequential Global Swap Algorithm
Require: A circuit netlist G = (V, E), locations of cells;
Ensure: Minimize wirelength by swapping cells;
1: for each cell v ∈ V do
2:   Compute the search region R(v);                  ▷ CalcSearchBox
3:   Collect swap candidates within R(v);             ▷ CollectCands
4:   Compute the swap cost of each candidate;         ▷ CalcSwapCosts
5:   Find candidate c* with minimum cost;             ▷ FindBestCand
6:   Apply the swap with the candidate c*;            ▷ ApplyCand

Fig. 8: (a) Runtime breakdown of the sequential global swap implementation on CPU for bigblue4 with 2M cells. (b) CalcSwapCosts speedup with naive multithreading. "T" stands for threads on CPU.

Figure 8a shows the runtime breakdown of a sequential implementation of Algorithm 5 running on CPUs. The runtime portion for ApplyCand is not shown as it is too small. The plot indicates that CalcSwapCosts takes the majority of the runtime. Naive parallelization of CalcSwapCosts does not lead to much speedup because, for each cell, we do not look for a large enough number of candidate cells to swap with, and the threading overhead is not affordable. As shown in Figure 8b, our experiments on bigblue4 show that the speedup of CalcSwapCosts quickly saturates at 1.4x even with 10 threads enabled when we use OpenMP to parallelize the for loop for swap cost computation. Especially considering that GPUs have low single-threaded performance but thousands of threads available, creating enough parallel tasks for CalcSwapCosts to hide latency is essential for good speedups with GPU acceleration.

Fig. 9: Batch-based concurrent global swap. Cells in the batch and the regions to search for candidates are highlighted.
To apply task decomposition, we develop a batch-based concurrent global swap to improve the performance of parallelization. Figure 9 explains the intuition. Instead of processing one cell at a time as in Algorithm 5, processing a batch of cells exposes more parallelism for CalcSwapCosts and thus is potentially beneficial. The overall algorithm of the concurrent global swap is presented in Algorithm 6. We precompute the search regions for all cells in parallel in lines 1 and 2. In lines 3-7, we fetch one batch of cells at a time and perform concurrent global swapping. Suppose that there are B cells in a batch and on average we check 100 swap candidates for each cell; then there are 100 × B concurrent tasks, which is enough for relatively high occupancy of GPU resources. This batch-based concurrency applies to both CollectCands and CalcSwapCosts. We discuss the aforementioned functions one by one in the rest of the section.

Algorithm 6 Concurrent Global Swap Algorithm
Require: A circuit netlist G = (V, E), locations of cells, batch size B;
Ensure: Minimize wirelength by swapping cells;
1: for each cell v ∈ V do                             ▷ Parallel
2:   R(v) ← CalcSearchBox(v);
3: for each batch of cells B_v ∈ V do
4:   B_C ← CollectCands(B_v, R);                      ▷ Parallel
5:   CalcSwapCosts(B_C);                              ▷ Parallel
6:   B_{c*} ← FindBestCand(B_C);                      ▷ Parallel Reduction
7:   ApplyCand(B_{c*});                               ▷ Sequential

Fig. 10: Parallelization scheme for CollectCands.

Figure 10 shows the parallel implementation of CollectCands conceptually. We allocate a fixed number of thread blocks for each cell in the batch. Each thread collects one candidate cell for the candidate array. The candidate array is pre-allocated with a fixed maximum number of candidates. The runtime of this step can be significantly improved since the memory access pattern is rather regular. Besides, the cost computation for all candidates can be done in parallel as well. Due to the large number of candidates, i.e., batch size × maximum number of candidates per cell, the workload can be distributed to many threads for speedup. The details on batch sizes and the maximum number of candidates per cell are discussed later in this section.
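A minimal CUDA rendering of this mapping is shown below: block b handles the b-th cell of the batch and thread t copies the t-th cell of that cell's search bin into the pre-allocated B × K candidate array, padding unused slots with -1. The bin-to-cell CSR arrays are assumed auxiliary structures for this sketch.

```cpp
__global__ void collectCandsKernel(const int* batch_cells,     // B cells in the batch
                                   const int* cell2bin,        // bin id of each cell
                                   const int* bin2cell_start,  // CSR offsets per bin
                                   const int* bin2cell,        // cells in each bin
                                   int* candidates,            // B x K output array
                                   int K) {
  int b = blockIdx.x, t = threadIdx.x;
  if (t >= K) return;
  int bin = cell2bin[batch_cells[b]];
  int begin = bin2cell_start[bin], end = bin2cell_start[bin + 1];
  // Thread t grabs the t-th candidate of this bin; out-of-range slots are padded.
  candidates[b * K + t] = (begin + t < end) ? bin2cell[begin + t] : -1;
}
```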
After parallelizing the candidate collection and cost computation, the swap costs are stored in a B × K matrix-like structure, with B as the batch size and K as the maximum number of candidates for each cell. The best candidates with minimum costs can be selected using parallel reduction operations [42]. Given P threads, the time complexity is O(BK/P + log K) [43], where B is around 32 to 256 and K is around 512 to 1024 in the experiments, and P is in the thousands for GPUs.
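A standard shared-memory reduction realizes FindBestCand: one thread block scans the K costs of one batched cell with a strided loop and then reduces (value, index) pairs in a tree to the minimum. The sketch below assumes the B × K cost layout described above and a power-of-two block size.

```cpp
__global__ void findBestCandKernel(const float* costs /* B x K */, int K,
                                   int* best_index /* length B */) {
  extern __shared__ unsigned char smem_raw[];
  float* sval = reinterpret_cast<float*>(smem_raw);          // blockDim.x values
  int*   sidx = reinterpret_cast<int*>(sval + blockDim.x);   // blockDim.x indices
  int b = blockIdx.x, tid = threadIdx.x;
  float best = 1e30f; int idx = -1;
  for (int k = tid; k < K; k += blockDim.x) {                // strided scan of one row
    float c = costs[b * K + k];
    if (c < best) { best = c; idx = k; }
  }
  sval[tid] = best; sidx[tid] = idx;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {             // tree reduction
    if (tid < s && sval[tid + s] < sval[tid]) {
      sval[tid] = sval[tid + s]; sidx[tid] = sidx[tid + s];
    }
    __syncthreads();
  }
  if (tid == 0) best_index[b] = sidx[0];                     // best candidate of cell b
}
```

A launch such as findBestCandKernel<<<B, 256, 256 * (sizeof(float) + sizeof(int))>>>(costs, K, best) then yields one winning candidate index per batched cell.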
The last step is to apply the swaps for the best candidates, shown as function ApplyCand. As the cells and candidates in the previous steps may have dependencies on each other, this is the step where we resolve such dependency issues. For a given batch, there are at most B candidates that need to be applied, and there might be conflicts between them. For example, two cells in the batch may tend to swap with the same cell, which would result in incorrect costs if both swaps were applied. To resolve such data races, sequential execution is adopted. We give up candidates whose cells to swap with have been moved, or for which other cells connected to the two cells have been moved, in this batch. Note that this step is not the runtime bottleneck even with sequential execution.
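The sequential filtering can be written as a simple host loop like the one below, which skips a swap if either endpoint, or any cell adjacent to them (sharing a net), has already been touched in this batch. The adjacency CSR arrays and the per-batch `moved` flags are assumptions of this sketch.

```cpp
#include <vector>
#include <utility>

void applyCandFilter(const std::vector<int>& cells,        // cells of this batch
                     const std::vector<int>& cands,        // best candidate per cell (-1 if none)
                     const std::vector<int>& adj_start,    // CSR adjacency over cells
                     const std::vector<int>& adj,
                     std::vector<char>& moved,             // per-cell flags, cleared per batch
                     std::vector<std::pair<int,int>>& accepted_swaps) {
  auto touched = [&](int c) {
    if (moved[c]) return true;
    for (int k = adj_start[c]; k < adj_start[c + 1]; ++k)
      if (moved[adj[k]]) return true;                      // a connected cell already moved
    return false;
  };
  for (size_t i = 0; i < cells.size(); ++i) {
    int a = cells[i], b = cands[i];
    if (b < 0 || touched(a) || touched(b)) continue;       // give up conflicting candidates
    accepted_swaps.emplace_back(a, b);                     // the actual swap is applied here
    moved[a] = 1; moved[b] = 1;
  }
}
```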
In the experiment, the search region for one cell is set to one bin, whose width and height are around 3 row heights. We can control the batch size to trade off efficiency and resource usage. With larger batch sizes, the efficiency may improve, but more GPU memory is required, as all the storage needs to be pre-allocated. We observe that the speedup starts to saturate when the batch size is around 256, so we adopt this value.
Fig. 11: Exploring parallelization in local reordering. (a) Parallel sliding windows within a row. (b) Groups of independent rows ({row 1, row 4}, {row 2}, {row 3}).

C. Concurrent Local Reordering

Different from global swap, in which a cell may search for candidates quite far away from its own location, local reordering works on a very small sliding window with k cells in a row. The parameter k is small due to the k! possible permutations to enumerate.

1) Parallel Enumeration: One straightforward multithreading scheme is parallel enumeration of the k! permutations. There are two issues with this scheme: i) for multi-threaded CPUs, the task of enumerating one permutation is too small to compensate for the threading overhead; ii) for GPU acceleration, permutation counts like 3! = 6 and 4! = 24 are not enough to fill thousands of GPU threads. Therefore, parallel enumeration by itself is not enough to boost the efficiency. We need to find more parallel tasks by exploring multiple dimensions of parallelism.

2) Parallel Sliding Windows: The original sequential algorithm slides a window from left to right and solves one window at a time. We consider the potential of solving multiple windows at the same time, as shown in Figure 11a, where windows 1, 2, and 3 can be solved in parallel. However, one may argue that the connectivity of cells between windows is likely to cause suboptimality, e.g., cell 6 connects to cell 2 and their locations are undecided during the solving. Actually, this does not affect the optimality of each permutation problem within a window, because the relative ordering of cells between different windows is fixed. More specifically, cell 6 knows cell 2 is always at its left side, and so does cell 2. To this end, when computing the HPWL of a net incident to both cells 2 and 6, cell 6 can assume cell 2 is located at the left boundary of cell 6's window. Similarly, cell 2 can assume cell 6 is located at the right boundary of cell 2's window. This trick is widely used in ordered row placement [2], and the optimality within each window still holds.

Solving the parallel sliding windows once cannot cover as much solution space as sequential sliding. The parallel solving needs to be conducted multiple times, as shown in Figure 11a. After step 1, step 2 shifts all the parallel windows by an offset and performs another round of solving. We can control the step size of the offset for a reasonable number of rounds. In the experiment, we set the step size to k/2.
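The two-pass window partition can be generated with a few lines of index arithmetic, as in the sketch below: pass 0 cuts a row of n cells into disjoint windows of size k, and pass 1 shifts every window by the step size k/2 used in our experiments. All windows within one pass are mutually independent tasks.

```cpp
#include <vector>
#include <utility>
#include <algorithm>

// Returns [begin, end) cell-index ranges of the windows for one pass (0 or 1).
std::vector<std::pair<int, int>> slidingWindows(int n, int k, int pass) {
  std::vector<std::pair<int, int>> windows;
  int offset = (pass == 0) ? 0 : k / 2;          // pass 1 is shifted by k/2
  for (int begin = offset; begin < n - 1; begin += k) {
    int end = std::min(begin + k, n);
    if (end - begin >= 2) windows.emplace_back(begin, end);  // need >= 2 cells to reorder
  }
  return windows;
}
```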
3) Independent Rows: We further investigate possible parallel tasks by extracting independent rows, as shown in Figure 11b. Independent rows refer to a group of rows in which no two cells share a common net. In the figure, we can find three such groups: {row 1, row 4}, {row 2}, and {row 3}. This approach is valid under the assumption that most connections are local in the detailed placement problem.

With the aforementioned three dimensions of parallelism, a significant number of parallel tasks can be performed. Figure 12 shows the distribution of group sizes at the row group level and the window group level on bigblue4. Row group refers to the grouping of independent rows. Window group refers to the combination of independent rows and parallel sliding windows. All the instances within a group can be solved in parallel. Hence, we can see the effectiveness in finding independent tasks.
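One simple way to form such row groups is a greedy first-fit pass over the row conflict graph, as sketched below: a row joins the first existing group with which it shares no net, and opens a new group otherwise. The per-row net sets are assumed to be precomputed; this illustrates the idea and is not necessarily how ABCDPlace builds its groups.

```cpp
#include <vector>
#include <set>
#include <algorithm>

std::vector<std::vector<int>> groupIndependentRows(
    const std::vector<std::set<int>>& nets_per_row) {    // net ids touching each row
  std::vector<std::vector<int>> groups;
  std::vector<std::set<int>> group_nets;                 // union of nets already in each group
  for (int r = 0; r < static_cast<int>(nets_per_row.size()); ++r) {
    bool placed = false;
    for (size_t g = 0; g < groups.size() && !placed; ++g) {
      bool conflict = std::any_of(nets_per_row[r].begin(), nets_per_row[r].end(),
                                  [&](int e) { return group_nets[g].count(e) > 0; });
      if (!conflict) {                                    // row r is independent of group g
        groups[g].push_back(r);
        group_nets[g].insert(nets_per_row[r].begin(), nets_per_row[r].end());
        placed = true;
      }
    }
    if (!placed) { groups.push_back({r}); group_nets.push_back(nets_per_row[r]); }
  }
  return groups;
}
```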
IV. SUMMARY OF PARALLEL CPU AND GPU IMPLEMENTATIONS

To achieve high efficiency on both multi-threaded CPUs and GPUs, we optimize the implementations separately with slightly different threading strategies for the three algorithms. We summarize the major differences in Table I and in this section.

1) Concurrent Independent Set Matching: Both the multi-threaded CPU and GPU versions implement the same parallel maximal independent set algorithm, but the single-threaded CPU version adopts the greedy sequential algorithm mentioned at the beginning of Section III-A1, because the sequential algorithm can finish the extraction in one iteration, while the parallel algorithm requires multiple iterations. Another main difference lies in the partitioning step, where the CPU version adopts sequential spiral search and the GPU version adopts K-Means clustering, as spiral search is too expensive on GPUs and K-Means clustering is too expensive on CPUs. In the LAP solving step, each CPU thread solves one LAP instance, while the GPU version adopts the batch implementation discussed in Section III-A3. When applying the solutions, we use one CPU thread per independent set, while each GPU thread works on only one cell in an independent set.

2) Concurrent Global Swap: For the CPU version, we allocate one thread per cell during the batch execution of candidate collection, cost computation, and best candidate finding. As the typical batch size is 256 or larger, there are enough tasks for each CPU thread. For the GPU version, we allocate threads at a finer granularity. In candidate collection, each thread collects one candidate for a cell, as mentioned in Figure 10; in cost computation, each thread computes the cost of one candidate; in finding the best candidates, parallel reduction is used.

3) Concurrent Local Reordering: For the CPU version, the parallelization is implemented at the level of independent rows. That is, within a group of independent rows, each thread solves the enumeration problems sequentially by sliding windows along one row. For the GPU version, we allocate one thread for each independent window and permutation. Then parallel reduction is performed to find the best permutation for each window.
V. POSSIBLE EXTENSIONS

ABCDPlace aims at accelerating the fundamental wirelength optimization techniques in detailed placement. In a modern design flow, detailed placement sometimes also needs to consider other objectives such as routability and timing. The parallelization strategies developed in this work can be extended to handle these objectives.

For routability optimization, a typical extension is to add an overflow penalty to the objective along with the wirelength cost. One example is the NTUplace4 series [5], [44]. As the routing or density overflow map can be precomputed, the overflow penalty for the movement of an individual cell can be calculated incrementally. Then, a weighted sum of the overflow penalty and the wirelength cost can guide the detailed placement engine to optimize routing congestion. For timing optimization, typical techniques include net weighting and the incorporation of an extra timing cost into the objective [45], [46]. An external timing analysis engine can help achieve this goal.

Therefore, extending these detailed placement techniques for routability and timing optimization in general involves integrating other penalty terms into the wirelength cost, while the skeletons of the algorithms remain similar. This work can provide practical insights for developing such algorithms. We leave the incorporation of these objectives to future work.

Meanwhile, multi-row-height cells are becoming common in modern designs. Our current implementations only work on single-row-height cells and fix the multi-row-height ones. Independent set matching and global swap can be extended to handle multi-row-height cells if we work on cells with the same sizes, and we plan to incorporate this feature in the future. For local reordering, the extension may not be straightforward, as we have to work on both multi-row-height cells and single-row-height cells together, making the enumeration of all permutations complicated.

VI. EXPERIMENTAL RESULTS

ABCDPlace was implemented with C++/CUDA for GPU and C++/OpenMP for multi-threaded CPU, respectively. The framework was validated on the ISPD 2005 and 2015 contest benchmarks [47], [48] and a set of benchmarks from industry. We adopt the sequential detailed placement engines NTUplace3 [6] and NTUplace4dr [44] for comparison. The runtime environments for the three sets of benchmarks are slightly different and are shown at the bottom of Table II, Table III, and Table IV, respectively. While the proposed parallel placement algorithms can be arbitrarily combined according to the real applications, we fixed the detailed placement flow in the experiments as the following sequence: local reordering, independent set matching, global swap, and local reordering, to search different solution spaces. All the runtime values reported in this section are wall-time for detailed placement excluding the file IO time, as in practice all the data is already in memory when running in the entire backend flow.

A. HPWL and Runtime Evaluation

We compare our parallel algorithms on both CPU and GPU with NTUplace3 [6] in terms of HPWL and runtime. The legalized global placement solutions are generated by the open-source placer DREAMPlace [28]. As Figure 1 indicates, NTUplace3 is very competitive in efficiency. In Table II and Table III, we show the HPWL and runtime for the single-threaded, 10-thread, 20-thread, and GPU configurations on the ISPD 2005 contest benchmarks and the industrial benchmarks. The file IO time is shown in separate columns for reference. As the file IO of ABCDPlace has a sequential implementation, we only show the time for a single thread to save space; multithreading has similar values. The GPU version has longer file IO time than the CPU versions, as we need to first read data from disk to CPU memory and then copy it to GPU global memory. If we run a full placement flow, including global placement, legalization, and detailed placement, we can initialize all data in GPU global memory, so this is not a mandatory overhead. As mentioned in Section IV, we choose different algorithms for the maximal independent set and partitioning steps in independent set matching to maximize efficiency, so the wirelength results are slightly different, but they are almost the same on average.

On the ISPD 2005 benchmarks, with a single thread, ABCDPlace demonstrates competitive runtime compared with NTUplace3, while ABCDPlace can achieve more than 2x speedup with 20 threads and more than 10x speedup with GPU. On the large industrial benchmarks, the speedup from multithreading is more than 4x and that from GPU is more than 16x on average. The difference in speedup mainly comes from the different experimental environments for the two benchmark sets and from the design sizes. Figure 13 plots the speedup over our single-thread implementation versus design sizes from 200K to 10M cells. Generally speaking, the speedup increases with the design size and saturates at 1M to 2M cells, especially on GPU. For million-cell designs, the speedup values stay above 15x for GPU, while the CPU speedup varies between 2x and 5x.

Another observation from the tables is that the speedup values are close between 10 and 20 threads, i.e., much less than the number of threads, indicating that the benefits from CPU parallelization saturate. With the current implementations in the experiments, GPU acceleration provides more speedup than CPU multithreading, while the CPU parallelization may be further improved in the future.

B. Routability Evaluation

Although ABCDPlace does not explicitly consider routability so far, we perform experiments on the ISPD 2015 contest benchmarks [48] to see whether it leads to significant overhead in congestion. The original objective of the contest is detailed-routability-driven placement, and NTUplace4dr was the winner [44]. We obtained the binary from the NTUplace4dr team and conducted experiments with the legalized global placement solutions dumped by NTUplace4dr as the input. The results are shown in Table IV. As the original industrial evaluation platform for detailed routability is no longer available, we report the "top5 overflow", i.e., the average global routing overflow in the top 5% congested routing grids, evaluated with the NCTUgr global router [49] integrated in NTUplace4dr. It can be seen that, besides improving the HPWL, ABCDPlace even slightly reduces the global routing overflow by 2.9%, which is better than NTUplace4dr. This indicates that our algorithms do not harm routability much on these benchmarks. However, it also needs to be noted that although our global routing overflow is lower than NTUplace4dr's, it does not mean we can achieve better detailed routability. As NTUplace4dr spends a lot of effort optimizing detailed routing congestion issues, e.g., DRC errors, we expect ABCDPlace to have more DRC errors if detailed routing is performed.

For the runtime comparison, we report the detailed placement runtime along with the file IO time for reference. As NTUplace4dr runs the entire placement flow and reports the runtime of each stage, the file IO time is not retrieved and only the core detailed placement time is reported.
TABLE I: Summary of Major Differences in CPU and GPU Implementations

Concurrent Independent Set Matching
  - Partitioning: Multi-threaded CPU - sequential spiral search; GPU - parallel K-Means clustering.
  - Batch LAP: Multi-threaded CPU - one thread for each LAP instance; GPU - one thread block for each LAP instance.
Concurrent Global Swap
  - CollectCands: Multi-threaded CPU - one thread for the candidate collection of one cell; GPU - one thread block for the candidate collection of one cell at one bin.
  - CalcSwapCosts: Multi-threaded CPU - one thread for the swap candidates of one cell; GPU - one thread for each swap candidate.
  - FindBestCand: Multi-threaded CPU - one thread for the swap candidates of one cell; GPU - 2D parallel reduction over the matrix of candidates (batch size × max candidates per cell).
Concurrent Local Reordering
  - Parallelization granularity: Multi-threaded CPU - one thread for each row in each independent row group; GPU - one thread for one permutation of one sliding window in a row in each independent row group.
Meanwhile, due to the detailed-routability optimization in NTUplace4dr, it is slower than ABCDPlace even with a single thread. It is not very meaningful to compare placers with different objectives, so we focus on the speedup of multithreading and GPU over the single-threaded CPU. The average speedup is around 5x with 20 threads and 6.8x with GPU over all designs. However, for large designs, e.g., the superblue series, the speedup from GPU over a single thread can reach 25x and beyond. In Figure 13, the speedup curve for GPU climbs quickly with the increase of design sizes.
C. Runtime Breakdown

Figure 14 examines the runtime breakdown of NTUplace3 (representing a sequential implementation) and of ABCDPlace with 20 threads and with GPU on the ISPD 2005 benchmarks. NTUplace3 runs the local reordering and independent set matching steps twice, and the runtime breakdown is around half and half. ABCDPlace also runs four steps, i.e., local reordering twice, and independent set matching and global swap once each. The runtime breakdown maps for CPU and GPU are similar. On CPU, independent set matching takes the largest portion, while on GPU, local reordering is the most time-consuming. On the ISPD 2015 benchmarks, the runtime distribution is different, as shown in Figure 15, where independent set matching takes the largest portion of the runtime for both CPU and GPU.

We also plot the speedup of each individual step with multithreading and GPUs over single-threaded execution, as shown in Figure 14d. Here we use the single-threaded version of the proposed concurrent algorithms as the baseline for a fair comparison. With 10 and 20 threads, we achieve around 5x speedup for local reordering and global swap, as well as 3x speedup for independent set matching. With GPU, the speedup can reach over 32x for local reordering, 22x for global swap, and 19x for independent set matching.

Figures 16, 17, and 18 draw the runtime breakdown of each internal step for concurrent independent set matching, global swap, and local reordering, respectively.
• Concurrent independent set matching. The breakdown maps have different flavors between the multi-threaded CPU and GPU implementations. Most of the effort is spent on partitioning and maximal independent set extraction for CPU, while for GPU, LAP solving and partitioning take the largest portions.
• Concurrent global swap. The breakdown maps for the multi-threaded CPU and GPU are similar. One may note that ApplyCand takes quite some portion of the runtime, because it is the only step that has to run sequentially.
• Concurrent local reordering. There are only two steps in this algorithm: an initialization step to compute the independent rows and an iterative enumeration step. The initialization step is done sequentially on CPU, while the enumeration step runs for two iterations in Figure 18. With the multi-threaded CPU, enumeration takes over 99% of the runtime, while the GPU implementation significantly accelerates this part, such that the portion of initialization becomes non-negligible.
These breakdown maps for the concurrent algorithms show the effectiveness of our acceleration techniques in speeding up the critical portions of the computation and achieving a more balanced runtime distribution across steps.

D. Clarification on the Combination of Placement Techniques

In all the experiments, we apply the placement techniques in the following sequence: local reordering, independent set matching, global swap, and local reordering. It needs to be clarified that this combination is empirically determined according to the experiments, and these techniques can be arbitrarily combined. Users are encouraged to customize the combination according to their benchmarks. Intuitively, it is better to interleave different techniques, as they explore different solution spaces. Thus, we go through local reordering, independent set matching, and global swap once each. Then, we find that the wirelength improvement mostly saturates after applying another round of local reordering, so we choose the current combination for the experiments.

VII. CONCLUSION

We present ABCDPlace, an open-source batch-based acceleration of detailed placement on multi-threaded CPUs and GPUs. We propose efficient parallel algorithms for classic sequential detailed placement techniques based on batch execution. The placer achieves over 10x and 16x speedup with GPUs on ISPD 2005 contest benchmarks and industrial benchmarks, respectively, without quality degradation. The multi-threaded CPU version also achieves around 2x-5x speedup. For a 10-million-cell design, our placer is able to finish within one minute on GPU, while a sequential implementation like NTUplace3 takes almost half an hour. Experiments on the ISPD 2015 benchmarks also show that the placer has minimal overhead in global routing congestion, even though routability is not explicitly considered. We believe the parallelization strategies can shed light on accelerating other sequential design automation algorithms for fast design closure.

In the future, there are many directions to further improve ABCDPlace.
• Incorporate holistic optimization objectives such as routability, timing, and multi-row-height cells.
• Better parallelization. The current speedup for CPU is still limited and there is room to improve. We also plan to explore other parallelization strategies such as diagonal partitioning.
• Incorporate more detailed placement techniques such as row-based placement algorithms.
As an open-source project, ABCDPlace can provide an initial development platform for efficient and effective placement engines.
TABLE II: Comparison of runtime (in seconds) and HPWL with NTUplace3 [6] on ISPD 2005 contest benchmarks. “1T”, “10T”, and “20T”
denote single, 10, and 20 threads, respectively. “RT” denotes the core detailed placement runtime and “IO” denotes the file IO time.
Columns: Design | #cells | #nets | Initial HPWL | NTUplace3 single thread: HPWL, RT, IO | ABCDPlace 1T: HPWL, RT, IO | ABCDPlace 10T: HPWL, RT | ABCDPlace 20T: HPWL, RT | ABCDPlace GPU: HPWL, RT, IO
adaptec1 211K 221K 74.35 73.28 25 3 73.23 38 6 73.21 11 73.21 10 73.21 3 13
adaptec2 254K 266K 83.22 82.14 32 4 82.03 44 7 82.03 15 82.03 18 82.05 4 14
adaptec3 451K 467K 199.01 193.98 59 7 193.32 96 12 193.21 24 193.21 22 193.47 7 21
adaptec4 495K 516K 178.28 174.40 68 7 174.41 94 13 174.41 25 174.41 25 174.49 6 23
bigblue1 278K 284K 90.18 89.44 35 4 89.46 47 8 89.45 15 89.45 14 89.43 4 15
bigblue2 535K 577K 139.46 136.76 95 8 136.92 97 15 136.92 24 136.92 23 137.00 6 25
bigblue3 1093K 1123K 310.91 303.98 148 15 304.14 196 28 304.17 59 304.17 57 304.46 10 43
bigblue4 2169K 2230K 751.08 743.75 354 34 744.46 369 61 744.42 91 744.42 80 744.35 16 88
ratio 1.017 1.000 1.000 1.000 1.330 1.000 0.380 1.000 0.369 1.000 0.091
The CPU results for the ISPD 2005 benchmarks were collected from a Linux server with a 20-core Intel Xeon E5-2650 v3 @ 2.3GHz. The GPU results
were collected from a Linux server with a 15-core Intel Xeon Silver 4110 CPU @ 2.1GHz and an NVIDIA Tesla V100 GPU.
TABLE III: Comparison of runtime (in seconds) and HPWL with NTUplace3 [6] on industrial benchmarks. “RT” denotes the core detailed
placement runtime and “IO” denotes the file IO time.
Columns: Design | #cells | #nets | Initial HPWL | NTUplace3 single thread: HPWL, RT, IO | ABCDPlace 1T: HPWL, RT, IO | ABCDPlace 10T: HPWL, RT | ABCDPlace 20T: HPWL, RT | ABCDPlace GPU: HPWL, RT, IO
design1 1329K 1389K 346.23 340.96 194 35 341.04 236 40 341.03 59 341.03 42 341.00 14 42
design2 1300K 1355K 281.45 275.79 203 33 275.63 232 41 275.64 56 275.64 41 275.56 13 43
design3 2246K 2276K 531.93 523.06 332 58 522.93 384 66 522.97 95 522.97 78 522.98 19 69
design4 1512K 1528K 459.49 454.14 233 41 453.97 292 47 453.99 65 453.99 51 453.91 15 49
design5 1306K 1364K 294.05 288.38 203 34 288.47 236 39 288.49 55 288.49 41 288.45 13 43
design6 10504K 10747K 2348.81 2346.33 1565 331 2348.64 1391 331 2348.65 442 2348.65 349 2348.34 58 350
ratio 1.014 1.000 1.000 1.000 1.137 1.000 0.283 1.000 0.215 1.000 0.059
The results for the industrial benchmarks were collected from a Linux server with a 20-core Intel E5-2698 v4 CPU @ 2.2GHz and an NVIDIA Tesla V100 GPU.
TABLE IV: Comparison of runtime (in seconds), HPWL, and congestion with NTUplace4dr [44] on ISPD 2015 contest benchmarks. “1T”and
“20T” denote single and 20 threads, respectively. “RT” denotes the core detailed placement runtime and “IO” denotes the file IO time.
Design | #cells | #nets | Initial (Top5 Overflow, HPWL) | NTUplace4dr (Top5 Overflow, HPWL, RT) | ABCDPlace 1T (Top5 Overflow, HPWL, RT, IO) | ABCDPlace 20T (Top5 Overflow, HPWL, RT) | ABCDPlace GPU (Top5 Overflow, HPWL, RT, IO)
mgc des perf 1 113K 113K 1.22 64.97 1.19 63.89 76 1.16 63.17 27 3 1.16 63.32 6 1.16 62.58 5 9
mgc des perf a 108K 115K 2.47 74.45 2.44 73.14 91 2.37 71.53 45 3 2.37 71.54 9 2.37 71.59 5 9
mgc des perf b 113K 113K 2.02 72.44 2.00 71.71 80 1.95 70.68 42 3 1.95 70.22 9 1.95 70.51 5 10
mgc edit dist a 127K 134K 5.03 93.71 4.91 92.95 80 4.60 92.12 61 3 4.60 91.97 12 4.62 92.13 6 10
mgc fft 1 32K 33K 0.46 61.59 0.46 61.87 22 0.44 59.08 3 1 0.44 59.39 1 0.44 59.36 2 8
mgc fft 2 32K 33K 0.48 55.20 0.48 55.10 28 0.48 54.76 2 1 0.48 55.00 1 0.48 54.78 1 6
mgc fft a 31K 32K 0.65 36.91 0.64 36.07 34 0.63 35.53 3 1 0.63 35.54 2 0.63 35.59 2 8
mgc fft b 31K 32K 0.86 53.84 0.85 53.19 26 0.84 52.87 3 1 0.84 52.80 1 0.84 52.86 1 7
mgc matrix mult 1 155K 159K 2.30 73.71 2.27 73.43 61 2.21 72.85 27 4 2.21 72.53 6 2.21 72.77 3 10
mgc matrix mult 2 155K 159K 2.28 73.87 2.25 73.44 65 2.21 72.66 26 4 2.21 72.24 7 2.21 72.37 3 10
mgc matrix mult a 150K 154K 3.37 49.04 3.33 48.62 87 3.32 48.49 24 4 3.32 48.48 5 3.32 48.51 3 11
mgc matrix mult b 146K 152K 3.67 49.80 3.63 49.29 69 3.61 48.81 30 4 3.61 48.79 7 3.61 48.88 5 11
mgc matrix mult c 146K 152K 3.51 48.84 3.47 48.22 69 3.46 48.09 27 4 3.46 48.03 6 3.46 48.11 5 11
mgc pci bridge32 a 30K 34K 0.47 41.74 0.47 41.52 25 0.47 40.49 5 1 0.47 40.75 2 0.46 40.28 4 7
mgc pci bridge32 b 29K 33K 0.70 35.90 0.69 35.38 23 0.67 34.40 7 1 0.67 34.48 3 0.68 34.49 5 7
mgc superblue11 a 926K 936K 39.67 58.69 38.99 57.64 4297 38.41 57.04 874 30 38.41 57.08 102 38.42 57.05 31 40
mgc superblue12 1287K 1293K 35.52 79.28 34.76 78.84 1805 34.09 78.08 700 41 34.10 78.12 90 34.10 78.08 24 46
mgc superblue14 612K 620K 25.41 70.31 24.85 67.76 751 24.19 66.30 933 20 24.20 66.52 115 24.22 66.53 35 26
mgc superblue16 a 680K 697K 31.24 97.45 30.34 94.24 746 29.17 90.24 831 22 29.16 90.20 86 29.25 90.55 30 27
mgc superblue19 506K 512K 17.45 65.08 17.13 64.55 967 16.97 63.80 287 16 16.97 63.88 41 16.99 64.01 14 22
ratio 1.015 1.013 1.000 1.000 1.000 0.979 0.984 0.412 0.979 0.985 0.084 0.979 0.984 0.061
The CPU results for the ISPD 2015 benchmarks were collected from a Linux server with a 20-core Intel Xeon E5-2650 v3 @ 2.3GHz. The GPU results were collected from a Linux server with a 14-core
Intel Xeon E5-2690 v4 @ 2.6GHz and an NVIDIA Tesla V100 GPU.
ACKNOWLEDGMENT
The authors would like to thank Dr. Chau-Chin Huang at Synopsys and Prof. Yao-Wen Chang at National Taiwan University for preparing the binary of NTUplace4dr [44], helpful comments on the experimental setups, and verifying the results.
(Plot: x-axis #Cells (K); y-axis speedup over 1T; series ISPD2005-20T, ISPD2005-GPU, Industrial-20T, Industrial-GPU, ISPD2015-20T, ISPD2015-GPU.)
Fig. 13: The trend of speedup over single thread with design sizes. "T" stands for threads on CPU.
(Bar charts: runtime in seconds and speedup of K-Reorder, Matching, and Global Swap on the ISPD 2005 benchmarks (adaptec1-4, bigblue1-4) and the ISPD 2015 superblue benchmarks under 1T, 10T, 20T, and GPU.)
(Breakdown charts for concurrent global swap: CollectCands, CalcSearchBox, CalcSwapCosts, FindBestCand, and ApplyCand for (a) 20 threads and (b) GPU.)
Fig. 16: Runtime breakdown for independent set matching on bigblue4: (a) 20 threads; (b) GPU. (Steps: MaximalIndependentSet, Partitioning, CostMatrices, LAP.)
Fig. 18: Runtime breakdown for local reordering on bigblue4: (a) 20 threads; (b) GPU. (Steps: IndependentRows, Enumeration.)
REFERENCES
[1] K. Shahookar and P. Mazumder, “VLSI cell placement techniques,” ACM Computing Surveys (CSUR), vol. 23, no. 2, pp. 143–220, 1991.
[2] M. Pan, N. Viswanathan, and C. Chu, “An efficient and effective detailed placement algorithm,” in Proc. ICCAD, 2005, pp. 48–55.
[3] Z.-W. Jiang, H.-C. Chen, T.-C. Chen, and Y.-W. Chang, “Challenges and solutions in modern VLSI placement,” in 2007 International Symposium on VLSI Design, Automation and Test (VLSI-DAT). IEEE, 2007, pp. 1–5.
[4] S. Popovych, H.-H. Lai, C.-M. Wang, Y.-L. Li, W.-H. Liu, and T.-C. Wang, “Density-aware detailed placement with instant legalization,” in Proc. DAC, 2014, pp. 122:1–122:6.
[5] M.-K. Hsu, Y.-F. Chen, C.-C. Huang, S. Chou, T.-H. Lin, T.-C. Chen, and Y.-W. Chang, “NTUplace4h: A novel routability-driven placement algorithm for hierarchical mixed-size circuit designs,” IEEE TCAD, vol. 33, no. 12, pp. 1914–1927, 2014.
[6] T.-C. Chen, Z.-W. Jiang, T.-C. Hsu, H.-C. Chen, and Y.-W. Chang, “NTUplace3: An analytical placer for large-scale mixed-size designs with preplaced blocks and density constraints,” IEEE TCAD, vol. 27, no. 7, pp. 1228–1240, 2008.
[7] W.-K. Chow, J. Kuang, X. He, W. Cai, and E. F. Young, “Cell density-driven detailed placement with displacement constraint,” in Proceedings of the 2014 International Symposium on Physical Design. ACM, 2014, pp. 3–10.
[8] A. B. Kahng, I. L. Markov, and S. Reda, “On legalization of row-based placements,” in Proc. GLSVLSI, 2004, pp. 214–219.
[9] J. Chen, P. Yang, X. Li, W. Zhu, and Y.-W. Chang, “Mixed-cell-height placement with complex minimum-implant-area constraints,” in Proceedings of the International Conference on Computer-Aided Design. ACM, 2018, p. 66.
[10] I. L. Markov, J. Hu, and M. Kim, “Progress and challenges in VLSI placement research,” Proceedings of the IEEE, vol. 103, no. 11, pp. 1985–2003, 2015. [Online]. Available: https://doi.org/10.1109/JPROC.2015.2478963
[11] N. Viswanathan and C.-N. Chu, “FastPlace: efficient analytical placement using cell shifting, iterative local refinement, and a hybrid net model,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 5, pp. 722–733, 2005.
[12] W.-K. Chow, J. Kuang, X. He, W. Cai, and E. F. Y. Young, “Cell density-driven detailed placement with displacement constraint,” in Proc. ISPD, 2014, pp. 3–10.
[13] G. Wu and C. Chu, “Detailed placement algorithm for VLSI design with double-row height standard cells,” IEEE TCAD, vol. 35, no. 9, pp. 1569–1573, 2016.
[14] K. Han, A. B. Kahng, and H. Lee, “Scalable detailed placement legalization for complex sub-14nm constraints,” in Proc. ICCAD, 2015, pp. 867–873.
[15] T. Lin and C. Chu, “TPL-aware displacement-driven detailed placement refinement with coloring constraints,” in Proc. ISPD, 2015, pp. 75–80.
[16] Y. Lin, B. Yu, B. Xu, and D. Z. Pan, “Triple patterning aware detailed placement toward zero cross-row middle-of-line conflict,” IEEE TCAD, vol. 36, no. 7, pp. 1140–1152, 2017.
[17] J. Chen, P. Yang, X. Li, W. Zhu, and Y.-W. Chang, “Mixed-cell-height placement with complex minimum-implant-area constraints,” in Proceedings of the International Conference on Computer-Aided Design. ACM, 2018, p. 66.
[18] C.-K. Cheng, A. B. Kahng, I. Kang, and L. Wang, “RePlAce: Advancing solution quality and routability validation in global placement,” IEEE TCAD, 2018.
[19] Z. Zhu, J. Chen, Z. Peng, W. Zhu, and Y.-W. Chang, “Generalized augmented Lagrangian and its applications to VLSI global placement,” in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). IEEE, 2018, pp. 1–6.
[20] W. Zhu, Z. Huang, J. Chen, and Y.-W. Chang, “Analytical solution of Poisson's equation and its application to VLSI global placement,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2018, pp. 1–8.
[21] F.-K. Sun and Y.-W. Chang, “BiG: A bivariate gradient-based wirelength model for analytical circuit placement,” in Proceedings of the 56th Annual Design Automation Conference 2019. ACM, 2019, p. 118.
[22] J. A. Chandy and P. Banerjee, “Parallel simulated annealing strategies for VLSI cell placement,” in Proceedings of 9th International Conference on VLSI Design, Jan 1996, pp. 37–42.
[23] A. Choong, R. Beidas, and J. Zhu, “Parallelizing simulated annealing-based placement using GPGPU,” in 2010 International Conference on Field Programmable Logic and Applications. IEEE, 2010, pp. 31–34.
[24] C. Fobel, G. Grewal, and D. Stacey, “A scalable, serially-equivalent, high-quality parallel placement methodology suitable for modern multicore and GPU architectures,” in 2014 24th International Conference on Field Programmable Logic and Applications (FPL), Sep. 2014, pp. 1–8.
[25] J. Cong and Y. Zou, “Parallel multi-level analytical global placement on graphics processing units,” in Proc. ICCAD. ACM, 2009, pp. 681–688.
[26] T. Lin, C. Chu, and G. Wu, “POLAR 3.0: An ultrafast global placement engine,” in Proc. ICCAD. IEEE, 2015, pp. 520–527.
[27] W. Li, M. Li, J. Wang, and D. Z. Pan, “UTPlaceF 3.0: A parallelization framework for modern FPGA global placement,” in Proc. ICCAD. IEEE, 2017, pp. 922–928.
[28] Y. Lin, S. Dhar, W. Li, H. Ren, B. Khailany, and D. Pan, “DREAMPlace: Deep learning toolkit-enabled GPU acceleration for modern VLSI placement,” in Proc. DAC, 2019.
[29] S. Dhar and D. Z. Pan, “GDP: GPU accelerated detailed placement,” in Proc. HPEC, Sept 2018.
[30] C.-X. Lin, T.-W. Huang, G. Guo, and M. D. Wong, “Cpp-taskflow: Fast parallel programming with task dependency graphs,” Proc. IPDPS, 2019.
[31] Y.-S. Lu and K. Pingali, Can Parallel Programming Revolutionize EDA Tools? Cham: Springer International Publishing, 2018, pp. 21–41. [Online]. Available: https://doi.org/10.1007/978-3-319-67295-3_2
[32] Y. Wang, Y. Pan, A. Davidson, Y. Wu, C. Yang, L. Wang, M. Osama, C. Yuan, W. Liu, A. T. Riffel et al., “Gunrock: GPU graph analytics,” ACM Transactions on Parallel Computing (TOPC), vol. 4, no. 1, p. 3, 2017.
[33] Y. Lin, B. Yu, X. Xu, J.-R. Gao, N. Viswanathan, W.-H. Liu, Z. Li, C. J. Alpert, and D. Z. Pan, “MrDP: Multiple-row detailed placement of heterogeneous-sized cells for advanced nodes,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 6, pp. 1237–1250, 2017.
[34] N. Viswanathan, M. Pan, and C. Chu, “FastPlace 3.0: A fast multilevel quadratic placement algorithm with placement congestion control,” in Proc. ASPDAC, 2007, pp. 135–140.
[35] T.-C. Chen, T.-C. Hsu, Z.-W. Jiang, and Y.-W. Chang, “NTUplace: a ratio partitioning based placement algorithm for large-scale mixed-size designs,” in Proc. ISPD, 2005, pp. 236–238.
[36] F. Gavril, “Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph,” SIAM Journal on Computing (SICOMP), vol. 1, no. 2, pp. 180–187, 1972.
[37] G. E. Blelloch, J. T. Fineman, and J. Shun, “Greedy sequential maximal independent set and matching are parallel on average,” CoRR, vol. abs/1202.3205, 2012. [Online]. Available: http://arxiv.org/abs/1202.3205
[38] D. P. Bertsekas, “A new algorithm for the assignment problem,” Mathematical Programming, vol. 21, no. 1, pp. 152–171, 1981.
[39] M. M. Zavlanos, L. Spesivtsev, and G. J. Pappas, “A distributed auction algorithm for the assignment problem,” in 2008 47th IEEE Conference on Decision and Control. IEEE, 2008, pp. 1212–1217.
[40] “Munkres-CPP,” https://github.com/saebyn/munkres-cpp.
[41] “LEMON,” http://lemon.cs.elte.hu/trac/lemon.
[42] V. Kumar, Introduction to Parallel Computing. Addison-Wesley Longman Publishing Co., Inc., 2002.
[43] J. JaJa, An Introduction to Parallel Algorithms. Addison-Wesley Longman Publishing Co., Inc., 1992.
[44] C.-C. Huang, H.-Y. Lee, B.-Q. Lin, S.-W. Yang, C.-H. Chang, S.-T. Chen, Y.-W. Chang, T.-C. Chen, and I. Bustany, “NTUplace4dr: A detailed-routing-driven placer for mixed-size circuit designs with technology and region constraints,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 3, pp. 669–681, 2017.
[45] J. Jung, G.-J. Nam, L. N. Reddy, I. H.-R. Jiang, and Y. Shin, “OWARU: Free space-aware timing-driven incremental placement with critical path smoothing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 9, pp. 1825–1838, 2017.
[46] J. A. S. Jesuthasan, “Incremental timing-driven placement with displacement constraint,” Master's thesis, University of Waterloo, 2015.
[47] G.-J. Nam, C. J. Alpert, P. Villarrubia, B. Winter, and M. Yildiz, “The ISPD2005 placement contest and benchmark suite,” in Proc. ISPD. ACM, 2005, pp. 216–220.
[48] I. S. Bustany, D. Chinnery, J. R. Shinnerl, and V. Yutsis, “ISPD 2015 benchmarks with fence regions and routing blockages for detailed-routing-driven placement,” in Proc. ISPD, 2015, pp. 157–164.
[49] K.-R. Dai, W.-H. Liu, and Y.-L. Li, “NCTU-GR: Efficient simulated evolution-based rerouting and congestion-relaxed layer assignment on 3-D global routing,” IEEE TVLSI, vol. 20, no. 3, pp. 459–472, 2012.
Yibo Lin (S'16–M'19) received the B.S. degree in microelectronics from Shanghai Jiaotong University in 2013, and the Ph.D. degree from the Electrical and Computer Engineering Department of the University of Texas at Austin in 2018. He is currently an assistant professor in the Computer Science Department associated with the Center for Energy-Efficient Computing and Applications at Peking University, China. His research interests include physical design, machine learning applications, GPU acceleration, and hardware security. He has received 3 Best Paper Awards at premier venues (DAC 2019, VLSI Integration 2018, and SPIE 2016). He has also served on the Technical Program Committees of many major conferences, including ICCAD, ICCD, ISPD, and DAC.
Brucek Khailany (M'00–SM'13) received the Ph.D. degree from Stanford University, Stanford, CA, in 2003 and the B.S.E. degree from the University of Michigan, Ann Arbor, MI, in 1997, both in electrical engineering. He joined NVIDIA in 2009 and is currently the Director of the ASIC and VLSI Research group. He leads research into innovative design methodologies for integrated circuit (IC) development, machine learning (ML) and GPU-assisted electronic design automation (EDA) algorithms, and energy-efficient ML accelerators. Over 10 years at NVIDIA, he has contributed to many projects in research and product groups spanning computer architecture and VLSI design. Previously, from 2004 to 2009, he was a Co-Founder and Principal Architect at Stream Processors, Inc (SPI), where he led research and development activities related to parallel processor architectures.