Parallel Algorithms For VLSI Layout Verication
Computer and Systems Research Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main St., Urbana, IL 61801. Tel: (217) 333-6564, Fax: (217) 333-1910, E-mail: banerjee@crhc.uiuc.edu
Abstract

Layout verification determines whether the polygons that represent different mask layers in the chip conform to the technology specifications. Commercial layout verification programs can take tens of hours to run on the flattened representations of large designs. It is therefore desirable to run the DRC problem in parallel to reduce the runtimes. Also, the memory requirements of large chips are such that the entire chip description may not fit in the memory of a single workstation; parallel processing therefore allows one to distribute the memory requirements of the problem across multiple processors. In this paper, we present a parallel implementation of a design-rule checking program called ProperDRC, which is implemented on top of the ProperCAD environment. ProperDRC makes two novel contributions over previous work. First, it is portable across a large number of multiprocessor platforms, including shared-memory multiprocessors, message-passing distributed-memory multiprocessors, and hybrid architectures comprising uni- and multiprocessor workstations connected by a network. Second, ProperDRC is able to exploit multiple levels of parallelism. It can utilize data parallelism, task parallelism, or a simultaneous combination of the two to perform DRC operations concurrently on a multiprocessor architecture. This paper presents specifics of the implementation of ProperDRC, provides an analysis of the methods used to obtain parallelism, addresses load balancing issues, and reports experimental results on various benchmark circuits.
This research was supported in part by the Advanced Research Projects Agency under contract DAAH04-94-G-0273 administered by the Army Research Office.
1 Introduction
Layout verification determines whether the polygons that represent different mask layers in a VLSI chip conform to the technology specifications. One aspect of layout verification is design rule checking (DRC), which detects violations of rules such as width, spacing, and overlap rules that govern the technology in which the chip is to be fabricated. The computational complexity of layout verification programs is due not to the intrinsic complexity of each operation but to the large number of parts in the layout, which can consist of tens of millions of rectangles for large designs. The most sophisticated commercial layout verification programs, such as DRACULA and VAMPIRE from Cadence Design Systems, and CHECKMATE and PARADE from Mentor Graphics, can take tens of hours to run on the flattened representations of large designs. It is therefore desirable to run the layout verification problem in parallel to reduce the runtimes. Also, the memory requirements of large chips are such that the entire chip description may not fit in the memory of a single workstation; parallel processing therefore allows one to distribute the memory requirements of the problem across multiple processors.

In this paper, we present a parallel implementation of a design-rule checking program called ProperDRC, which is implemented on top of the ProperCAD environment. ProperDRC makes two novel contributions over previous work. First, it is portable across a large number of multiprocessor platforms, including shared-memory multiprocessors, message-passing distributed-memory multiprocessors, and networks of workstations. Second, ProperDRC is able to exploit multiple levels of parallelism. It can utilize data parallelism, task parallelism, or a simultaneous combination of the two to perform DRC operations concurrently on a multiprocessor architecture. ProperDRC currently works on Manhattan geometries only (where the edges of rectangles are parallel to the X and Y axes), but conceptually the parallel approaches can be extended to handle non-Manhattan geometries as well, since the algorithms for layout operations are all based on scanline algorithms.

The objectives of the ProperCAD project are to develop efficient parallel algorithms for VLSI CAD tasks that can utilize the computing power of a wide range of parallel platforms in order to reduce the design turnaround time of complex chips [1, 2, 3]. We have developed a PoRtable Object-oriented Parallel EnviRonment for CAD algorithms (ProperCAD II), which is a C++ object library targeted at medium-grain parallelism and MIMD parallel architectures (shared memory and message passing). Parallel CAD algorithms developed on this library run unchanged, and efficiently, on both shared-memory and message-passing architectures. The difference between all the previous work on portable parallel programming and our
ProperCAD effort is that we have avoided defining a new language for writing parallel programs. We have instead used an existing, established object-oriented language (C++) as the base language for exploiting the object-oriented nature of programming, and augmented it with an efficient C++ class library to help write portable parallel programs. The ProperCAD II framework runs on shared-memory multiprocessors such as the SUN 4/600MP, the SUN Sparcserver 1000, the Encore Multimax, and the Silicon Graphics Challenge; on distributed-memory message-passing multicomputers such as the Intel iPSC/860 hypercube, the Intel Paragon, the Thinking Machines CM-5, and the IBM SP-2; and also on a network of SUN workstations. We are investigating parallel algorithms for various VLSI CAD applications on top of the ProperCAD II framework. The applications include cell placement [4, 5], global and detailed routing, circuit extraction [6], logic synthesis [7, 8], test generation [9, 10], fault simulation [11], circuit, logic, and behavioral simulation, and high level synthesis.

In this paper, we describe parallel algorithms for layout verification of flattened VLSI layouts using the ProperCAD framework. While some layout verification tools exploit the hierarchical information available in VLSI chip designs while designers are interactively designing a chip, many companies perform a complete flattened chip design rule check just prior to tape-out, to avoid the economic penalties of possibly sending an incorrect layout out for costly fabrication [12, 13]. The runtimes of these flattened layout verification tools can run into tens to hundreds of hours for large commercial designs having tens of millions of rectangles. This is true for commercial tools such as CHECKMATE and PARADE from Mentor Graphics, and DRACULA and VAMPIRE from Cadence Design Systems. It is therefore important to investigate parallel algorithms for layout verification.

Another problem of flattened design rule checking is its tremendous memory requirement. One can assume that each transistor in a VLSI design translates to about 10-20 rectangles on a mask layout [31]. To represent a mask layout, one needs to store the X and Y location, plus additional information about the mask layer, orientation, etc., which requires about 20-40 bytes per rectangle [31]. For a 10 million transistor circuit, representative of current microprocessors, the memory requirement is 8 Gbytes using this simple analysis. The above analysis is for simply representing the layout; during the layout verification tasks, additional data structures and temporary data storage are used. Clearly, these memory requirements are too large to fit in the memory of a conventional workstation. Using data partitioning, one can partition the memory requirements of the layout among the various processors of a parallel machine and enable the execution of these large problems. We will show in the results section of this paper examples of large layouts that cannot run on one processor of a CM-5 multiprocessor due to memory limitations, but can run on 64 processors using data parallelism.
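As a worked check of the memory estimate above, taking the upper ends of the quoted ranges (20 rectangles per transistor, 40 bytes per rectangle):

$$10^{7}\ \text{transistors}\times 20\ \tfrac{\text{rectangles}}{\text{transistor}}\times 40\ \tfrac{\text{bytes}}{\text{rectangle}} = 8\times 10^{9}\ \text{bytes}\approx 8\ \text{Gbytes}.$$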
Figure 1: Sample design rules (width, spacing, enclosure, extension, and square test)
This paper is organized as follows. Section 2 describes the details of the serial algorithm for design rule checking. Section 3 discusses related work in parallel design rule checking algorithms. Details of the parallel DRC algorithm and of the ProperDRC implementation are described in Section 4. Performance results for ProperDRC are presented in Section 5. Section 6 contains an analysis of the performance of the parallel DRC. Section 7 summarizes the work in parallel DRC performed in this research.
opaqueness or transparency of the areas above and below each horizontal edge. More details of the algorithms are presented in [14].

While a naive DRC algorithm would check all possible interactions of the N^2 pairs of rectangles in a design consisting of N rectangles, a data structure called the scanline is useful for performing operations on a geometry that uses an edge representation. The basic idea of a scanline algorithm is to sweep a vertical line across the edges that constitute a mask layer. Each horizontal location the scanline encounters is called a scanline stop. Only the edges that intersect the scanline are considered at a given time. Scanline algorithms can be implemented in a space-efficient manner: an edge in the circuit area is brought in from the global data structure containing N rectangles to an intermediate data structure containing on the average O(√N) edges. An edge is included in the scanline data structure when its left endpoint touches the scanline and is removed from it when its right endpoint touches the scanline [15]. Figure 2 illustrates the basic scanline operation. The scanline stops can be restricted to locations on the circuit area that correspond to the left or right endpoint of an edge.
Figure 2: A scanline moving left to right. At the current position of the scanline, rectangles C, B, E, D and G are included in the data structure, and various operations regarding layout violations are performed only on these rectangles and not on the remaining rectangles. When the scanline moves to the next position (stop), the rectangle C is deleted from the scanline structure.
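The following is a minimal C++ sketch of the scanline sweep just described, assuming Manhattan rectangles; the Rect type, the scanlineSweep name, and the checkActiveSet callback are illustrative inventions, not ProperDRC's actual interfaces.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical Manhattan rectangle: [x1,x2] x [y1,y2].
struct Rect { int x1, y1, x2, y2; };

// Sweep a vertical scanline left to right across the layout.  A rectangle is
// inserted when the scanline reaches its left edge and removed once the
// scanline has passed its right edge, so only about O(sqrt(N)) rectangles are
// active at any stop on average.
template <typename CheckFn>
void scanlineSweep(std::vector<Rect> rects, CheckFn checkActiveSet) {
  // Scanline stops are restricted to left/right rectangle boundaries.
  std::vector<int> stops;
  for (const Rect& r : rects) { stops.push_back(r.x1); stops.push_back(r.x2); }
  std::sort(stops.begin(), stops.end());
  stops.erase(std::unique(stops.begin(), stops.end()), stops.end());

  // Sort rectangles by left edge so insertion is a single forward pass.
  std::sort(rects.begin(), rects.end(),
            [](const Rect& a, const Rect& b) { return a.x1 < b.x1; });

  std::vector<Rect> active;   // rectangles currently cut by the scanline
  std::size_t next = 0;
  for (int x : stops) {
    // Insert rectangles whose left edge has reached the scanline.
    while (next < rects.size() && rects[next].x1 <= x) active.push_back(rects[next++]);
    // Remove rectangles whose right edge lies strictly behind the scanline.
    active.erase(std::remove_if(active.begin(), active.end(),
                                [x](const Rect& r) { return r.x2 < x; }),
                 active.end());
    // Width/spacing/Boolean checks are applied only to the active set.
    checkActiveSet(x, active);
  }
}
```

A width or spacing test would supply a callback that examines only the rectangles in the active set at each stop, which is what keeps the per-stop work near O(√N) on average rather than N.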
eration, the Grow operation, and Width/Spacing testing. The majority of the rules equate to a single elementary task, as shown in Figure 3. The circles in the task graph represent the input layers, and the squares represent layers generated by the operation listed, which will contain all of the geometry edges that fail to pass the corresponding design rule. The Enclosure and Extension rules require a series of three and ten elementary tasks, respectively, to test for rule violations. The task graphs corresponding to these two types of design rules are shown in Figure 4.
Figure 3: Task graphs for design rules that map to a single elementary operation (e.g., two-layer spacing via a Square Test, overlap via AND_NOT, no-overlap via AND)

Figure 4: Task graphs for the Enclosure and Extension rules, built from sequences of AND, AND_NOT, Grow, and Square Test operations
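To make the task-graph decomposition concrete, the sketch below shows one plausible way a three-task Enclosure rule could be expressed as elementary operations; the Task structure, the layer naming, and the particular operand ordering are assumptions for illustration and need not match ProperDRC's exact graphs.

```cpp
#include <string>
#include <vector>

// Elementary DRC operations named in the text.
enum class Op { AND, AND_NOT, GROW, SQUARE_TEST, WIDTH_SPACING };

// One node of a rule's task graph: an operation applied to one or two input
// layers, producing a new (possibly intermediate) layer.
struct Task {
  Op op;
  std::vector<std::string> inputs;   // layer names consumed
  std::string output;                // layer name produced
  int growAmount = 0;                // used only by GROW
};

// Illustrative 3-task graph for an enclosure rule "outer must enclose inner
// by d": grow the inner layer by d, intersect with the outer layer, and
// subtract that covered part to leave only the geometry that violates the rule.
std::vector<Task> enclosureRule(const std::string& inner,
                                const std::string& outer, int d) {
  return {
    {Op::GROW,    {inner},                                inner + "_grown", d},
    {Op::AND,     {inner + "_grown", outer},              inner + "_covered"},
    {Op::AND_NOT, {inner + "_grown", inner + "_covered"}, inner + "_violations"},
  };
}
```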
It should be noted that the task parallel approach is the easiest to incorporate into a large piece of layout verification software, since one can simply partition the rules among different DRC runs on different processors. This is the approach used in a commercial version of a parallel DRC, DRACULA from Cadence Design Systems, which runs on networks of workstations and on shared-memory multiprocessors such as the SPARCServer 1000. We will show in the results in Section 5 that pure task parallelism produces limited speedups, since there is not enough task parallelism in real design rule sets; such an approach is therefore appropriate only for a small number of processors, e.g., 4 to 8, and is not scalable. This method of parallelization also suffers from the same memory scalability problem as the previous approach, in that each processor must have enough memory to perform operations on the entire circuit area.
in parallel. The drawback associated with this method is that, because the entire circuit is never in the memory of a single processor at one time, no global quantification of the distribution of geometry features within the circuit area is possible. For this reason, circuit partitioning must be based purely on circuit area rather than on dividing the geometry features themselves equally among processors. To balance the load between processors, additional decomposition is performed to further subdivide the chip area. A grainsize is specified by the user to limit the amount of additional decomposition performed. All areas that contain more geometry features than the specified grainsize are subdivided. Circuit geometry regions are then remapped to processors in such a way as to provide the best load balance.

Choosing the optimal grainsize is a hard problem since, in general, the runtime of a parallel DRC tool as a function of grainsize has a bath-tub characteristic. If the grainsize is too large, we get an unequal load balance, and hence the runtimes of a parallel DRC program will be large for layouts containing irregular distributions of rectangles. If the grainsize is too small, we create a large number of tasks, but each task generates some redundant work in the form of extra checks that must be performed at the boundaries of the partitions (see Section 6.2 for a detailed analysis); again, the runtimes of the parallel DRC tool will be large. For an optimum grainsize, the runtime of the parallel tool is minimized. Since the distribution of rectangles of a circuit is not known a priori, it is impossible to determine the optimal grainsize for all layouts. We will discuss experimental data on the choice of the grainsize for some example layouts in Section 5. A good heuristic is to choose a grainsize of around N/(σP) rectangles, where N is the number of rectangles, P is the number of processors, and σ is the variance of the distribution of rectangles per unit area of the chip. We assume a typical value of σ to be 2 for real designs.

Figure 5(a) shows the initial circuit partitioning on four processors for a sample circuit area, in which the X's represent geometry features. The dashed lines show the initial division of the circuit into equal-area subregions, which are assigned one per processor. Figure 5(b) shows the same circuit after the load balancing algorithm is applied with a specified grainsize of 5. The region initially assigned to Processor 2 has been divided into two parts. Processor 3 will perform the DRC checks on the subregion on the right, to maintain better overall load balance. Note that if a grainsize of 10 had been specified by the user, no further subdivision of the circuit beyond the initial partitioning shown in Figure 5(a) would have been performed.
Figure 5: Data parallel load balancing on four processors

The extent to which load balancing takes place is therefore completely under the user's control, which gives the user the flexibility to customize the performance of the algorithm to take full advantage of the multiprocessor architecture by selecting an appropriate grainsize.

The geometry layers are distributed in the original polygon representation used by the CIF input file. The conversion from polygon to edge-based representation takes place at the clusters that will perform the DRC tests on the area. Delaying the conversion until after the partitioning has two advantages: the messages are smaller, because one rectangle expands to two edges, and the conversion work is distributed, reducing the time required. Some overlap is necessary between the areas assigned to the various processors to ensure that no pairs of neighboring edges are overlooked. Each processor receives all rectangles that lie within its area extended on all sides by the maximum design rule interaction distance for the technology. Figure 6 shows the partitioned circuit area from Figure 5(b) with the addition of the overlap areas. Rectangles that are present in more than one processor area are duplicated and trimmed to the respective processor areas. Trimming the rectangles can easily introduce geometry features that do not pass the design rules, so the DRC routine must be careful not to report erroneous results introduced by the circuit partitioning. Therefore, upon completion of the design rule tests, infractions that fall within the maximum design rule interaction distance boundary surrounding the processor's area of the circuit are disregarded rather than reported.
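Below is a minimal sketch of the grainsize-driven area subdivision and the overlap (halo) collection described above, assuming axis-aligned rectangles and recursive bisection of oversized regions; the function names and the exact splitting strategy are hypothetical, not ProperDRC's implementation.

```cpp
#include <vector>

struct Rect   { int x1, y1, x2, y2; };
struct Region { int x1, y1, x2, y2; };

// Count the rectangles intersecting a region (used to test against the grainsize).
static int countIn(const std::vector<Rect>& rects, const Region& a) {
  int n = 0;
  for (const Rect& r : rects)
    if (r.x1 < a.x2 && r.x2 > a.x1 && r.y1 < a.y2 && r.y2 > a.y1) ++n;
  return n;
}

// Split any region holding more than 'grainsize' rectangles into equal-area
// halves, recursively, so the resulting pieces can be remapped to processors.
void subdivide(const std::vector<Rect>& rects, Region a, int grainsize,
               std::vector<Region>& out) {
  // Stop when the region is small enough, or degenerate (guarantees termination).
  if (countIn(rects, a) <= grainsize ||
      (a.x2 - a.x1 <= 1 && a.y2 - a.y1 <= 1)) { out.push_back(a); return; }
  Region left = a, right = a;
  if (a.x2 - a.x1 >= a.y2 - a.y1) left.x2 = right.x1 = (a.x1 + a.x2) / 2;  // vertical cut
  else                            left.y2 = right.y1 = (a.y1 + a.y2) / 2;  // horizontal cut
  subdivide(rects, left, grainsize, out);
  subdivide(rects, right, grainsize, out);
}

// Gather the rectangles a processor needs for its region: everything within the
// region extended on all sides by the maximum design-rule interaction distance.
// Violations found inside that halo are later disregarded rather than reported.
std::vector<Rect> rectsForRegion(const std::vector<Rect>& rects, Region a, int halo) {
  Region h{a.x1 - halo, a.y1 - halo, a.x2 + halo, a.y2 + halo};
  std::vector<Rect> mine;
  for (const Rect& r : rects)
    if (r.x1 < h.x2 && r.x2 > h.x1 && r.y1 < h.y2 && r.y2 > h.y1) mine.push_back(r);
  return mine;
}
```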
Figure 6: The partitioned circuit area from Figure 5(b) with overlap areas added around each processor's region
Figure 7: Task scheduling example. Elementary operations (Square Test, AND, AND_NOT, Grow, Width/Spacing) on the Via, Met2, and Poly layers are tagged with priorities (40-70) and mapped onto a cluster of two processors.
Tests described in the previous section may have two input layers. These two layers will be generated by other tasks, which precede the current task in the task graph. It is conceivable (maybe even desirable, from a performance standpoint) that the two parent tasks run on separate processors in the cluster. The layers generated by both these parent tasks must be sent to a single processor so that the subsequent task can be completed. Therefore, it is necessary that the destination processor for the generated layers, and thus the child task itself, be determined a priori. Other solutions to the dependency problem, such as broadcasting the EdgeSets or having the child task explicitly request the layer from its parents, introduce too much communication overhead to be effective.

The mapping of tasks onto processors is obtained by levelizing the task graph. Priorities are assigned to tasks based on the number of levels of subsequent tasks that depend on their output layers. The levelized task graph is filled by arranging tasks in prioritized order. Figure 7(b) shows how the tasks can be tagged with a priority and mapped to a cluster that contains two processors. A task can begin as soon as its input layers arrive at the destination processor. There is no need for explicit synchronization between all of the processors of the cluster at the task graph level boundaries, so the penalty for load imbalance is not as severe as with
the traditional barrier-synchronized implementation of a levelized task graph. Furthermore, because the complexities of the various algorithms used to implement the elementary tasks can be used to estimate the time required to perform the operations for a given problem size, the potential exists for intelligent scheduling methods to minimize the imbalance between processors.

Because the number of tasks necessary to perform the design rule checks is fixed for a given technology, and is independent of the problem size, there is an upper bound on the performance that can be achieved by parallelizing these tasks, no matter how effective the load balancing strategies are. This suggests that task parallelism alone is not sufficient to obtain the best performance on a multiprocessor architecture with more than a few processors; a combination of data and task parallelism must be used.
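A small sketch of the levelization and priority assignment described above, where a task's priority is the number of levels of tasks that depend on its output; the TaskGraph representation is a hypothetical stand-in for the internal data structures.

```cpp
#include <algorithm>
#include <vector>

// Task graph in adjacency form: children[t] lists the tasks that consume the
// layer produced by task t.  The rule graph is assumed to be acyclic.
struct TaskGraph { std::vector<std::vector<int>> children; };

// Priority of a task = number of levels of dependent tasks below it, so tasks
// feeding the longest chains are scheduled first.
static int priorityOf(const TaskGraph& g, int t, std::vector<int>& memo) {
  if (memo[t] >= 0) return memo[t];
  int p = 0;
  for (int c : g.children[t]) p = std::max(p, 1 + priorityOf(g, c, memo));
  return memo[t] = p;
}

// Return task ids ordered by decreasing priority.  A task is dispatched to a
// processor of the cluster as soon as its input layers have arrived, so no
// barrier is needed at level boundaries.
std::vector<int> prioritizedOrder(const TaskGraph& g) {
  int n = static_cast<int>(g.children.size());
  std::vector<int> memo(n, -1), order(n);
  for (int t = 0; t < n; ++t) priorityOf(g, t, memo);
  for (int t = 0; t < n; ++t) order[t] = t;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return memo[a] > memo[b]; });
  return order;
}
```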
four homogeneous clusters, as depicted in Figure 8(b), where the circles represent individual processors, and the lines connecting them show the structure of the architecture. Figure 8(c) shows the processor-to-cluster mapping after application of the load balancing scheme. Cluster 3 has essentially borrowed an extra processor from Cluster 2 to compensate for the larger number of geometry features in its region of the circuit area.
5 Results
ProperDRC was used to test for violations of the MOSIS Scalable CMOS design rules [32]. A total of 32 design rules were specified, which resulted in the generation of 64 intermediate layers to perform all of the necessary tests. The following platforms were used to generate performance measurements: a Sun Sparcserver 1000 shared-memory multiprocessor, a network of six Sun Sparcstations, and the CM-5 message-passing distributed-memory multiprocessor. The benchmarks used to test ProperDRC include plapart, a programmable logic array with 25,000 rectangles; kovariks, a multiplier array with 64,000 rectangles; and haab1 and haab2, static RAMs containing 128,000 and 253,000 rectangles, respectively. An artificial benchmark, superhaab, was also created, which consists of the haab2 benchmark replicated four times, in an array of two cells by two cells, with 10 λ spacing between cells. Superhaab contains 1,014,000 rectangles.

Tables 1 through 3 show the performance data for purely data parallel decomposition of the DRC. All execution times are measured in seconds. Dashes in the tables indicate that the processor configuration had insufficient memory to perform the DRC on the given circuit. The fact that the CM-5 was unable to operate on the haab1, haab2, and superhaab circuits with fewer than 8, 16, and 64 processors, respectively, illustrates the memory scalability of the ProperDRC algorithm. The results of the network of SUN workstations for very large circuits could not be reported since our ProperCAD library implementation on the network is unreliable for very large message sizes. (In other related work, we are working on a reliable port of the ProperCAD environment to a network of workstations.) It is also interesting to note that every platform appears to exhibit superlinear speedups as the number of processors increases from one to two. This is especially apparent for the larger benchmarks running on the Sun Sparcserver 1000, which run six to seven times faster on two processors than on a uniprocessor. This effect is most likely due to caching, where the smaller working-space requirement of the two-processor implementation results in
Figure 8: Remapping processors to obtain balanced load between clusters; panel (c) shows the balanced cluster mapping
significantly fewer expensive memory operations.

Table 1: Data parallel performance on a network of Sun Sparcstation 5 machines

                Processors
Circuit       1        2        4
plapart       175.1    59.5     28.9
kovariks      324.7    131.4    67.6
haab1         --       --       --
haab2         --       --       --
superhaab     --       --       --
Table 2: Data parallel performance on Sun Sparcserver 1000 shared-memory multiprocessor

                Processors
Circuit       1         2        4        8
plapart       130.3     43.6     21.4     9.4
kovariks      244.7     94.3     38.1     24.2
haab1         843.1     114.5    64.1     40.7
haab2         1221.3    275.8    176.2    100.9
superhaab     --        --       --       --

The performance results for the purely task parallel implementation of ProperDRC are given in Tables 4 through 6. A small number of processors provides good performance, but the effectiveness of adding additional processors diminishes quickly, for any problem size. This is because the amount of task parallelism available depends only on the size of the set of technology rules being used, and not on the size of the input file, as discussed in Section 4.2. It should also be noted that task parallel layout verification cannot handle large problem sizes, since each processor has to replicate the entire mask layout, which becomes too much for a single processor's memory.

Tables 7 and 8 show the performance results using a combination of data and task parallelism. It is important to notice that there are cases in which a combination of data and task parallelism provides better performance than either type of parallelism individually. Compared to Table 3, the results of the 128-processor runs show that the combined task and data parallel approach gives better runtime performance than the purely data parallel approach. A detailed analysis of these results is presented in the following section.
Table 3: Data parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

                Processors
Circuit       1        2        4       8        16       32       64       128
plapart       410.7    77.6     39.3    19.8     9.7      5.6      5.2      3.3
kovariks      --       288.2    89.9    45.0     24.6     12.0     6.9      6.2
haab1         --       --       --      261.5    100.3    44.7     34.8     29.5
haab2         --       --       --      --       159.8    128.9    59.2     69.3
superhaab     --       --       --      --       --       --       621.2    440.5
Table 4: Task parallel performance on a network of Sun Sparcstation 5 machines

                Processors
Circuit       1        2        3        4        5        6
plapart       175.1    104.4    100.0    90.4     94.5     76.2
kovariks      324.7    267.8    227.6    217.2    174.0    200.2
haab1         --       --       --       --       --       --
haab2         --       --       --       --       --       --
superhaab     --       --       --       --       --       --
Table 5: Task parallel performance on Sun Sparcserver 1000 shared-memory multiprocessor

                Processors
Circuit       1         2        3        4        5        6        7        8
plapart       130.3     76.1     65.7     56.8     53.5     53.5     44.7     42.1
kovariks      244.7     137.2    100.2    100.1    85.0     77.1     69.6     64.7
haab1         843.1     479.4    339.1    338.0    354.3    324.4    310.2    294.5
haab2         1221.3    683.5    575.6    502.8    476.9    457.6    419.1    422.8
superhaab     --        --       --       --       --       --       --       --
Table 6: Task parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

                Processors
Circuit       1        2        3        4        5        6        7        8
plapart       854.5    442.6    350.6    311.7    306.2    280.5    233.1    308.7
kovariks      --       --       --       644.7    536.9    470.8    466.6    392.8
haab1         --       --       --       --       --       --       --       --
haab2         --       --       --       --       --       --       --       --
superhaab     --       --       --       --       --       --       --       --

Table 7: Combined data and task parallel performance on Sun Sparcserver 1000 shared-memory multiprocessor

                Processors/Clusters
Circuit       4/2      6/2      8/2      8/4
plapart       28.9     25.5     21.9     14.3
kovariks      54.7     44.1     41.9     23.7
haab1         178.9    124.8    137.0    72.1
haab2         513.5    379.9    391.9    264.2
superhaab     --       --       --       --

The user-specified grainsize controls the extent to which load balancing takes place in the purely data parallel decomposition of the DRC problem. Any region of the circuit having a number of geometry features greater than the grainsize is subdivided into equal-area regions, which may later be reassigned to different processors as necessary to facilitate load balancing. As discussed earlier in Section 4, choosing the optimal grainsize is a hard problem. If the grainsize is too large, we get an unequal load balance. If the grainsize is too small, we create a large number of tasks, but each task generates some redundant work in the form of extra checks that must be performed at the boundaries of the partitions. A good heuristic is to choose a grainsize of around N/(σP) rectangles, where N is the number of rectangles, P is the number of processors, and σ is the variance of the distribution of rectangles per unit area of the chip. We assume a typical value of σ to be 2 for real designs.
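As a worked instance of the heuristic, for haab1 (N = 128,000 rectangles) on P = 16 processors with the assumed σ = 2:

$$\text{grainsize} \approx \frac{N}{\sigma P} = \frac{128{,}000}{2\times 16} = 4{,}000\ \text{rectangles},$$

which is close to the grainsize of 5,000 that performs best for this configuration in Table 9.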
Table 8: Combined data and task parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

                Processors/Clusters
Circuit       4/2      8/4      16/8     32/8     64/16    64/32    128/64
plapart       157.3    59.6     26.0     17.2     9.8      8.7      10.7
kovariks      320.7    124.8    70.7     49.8     21.9     15.6     13.4
haab1         --       373.3    77.8     50.4     42.5     20.7     21.9
haab2         --       --       325.4    203.6    67.8     65.5     48.0
superhaab     --       --       --       --       --       326.7    301.2

The effect of varying the grainsize in the purely data parallel decomposition is shown in Table 9 for the CM-5. The purely area-based circuit partitioning may be considered a degenerate case of the data partitioning strategy presented in this paper, in which the grainsize is an infinite value, since the grainsize-based partitioning is never invoked. For the haab1 circuit, consisting of 128,000 rectangles, we show results for grainsizes of 5,000 and 2,000 rectangles. For example, for the 16-processor run, the results are best for a grainsize of 5,000 rectangles (our heuristic picks 4,000 rectangles). Similarly, for the haab2 circuit consisting of 256,000 rectangles on 16 processors, the optimal grainsize is 10,000 rectangles (our heuristic picks 8,000 rectangles). We have obtained similar results on the SUN Sparcserver 1000 and the network of workstations.

The concept of using task priorities to determine the order of execution for a set of DRC tasks was introduced in Section 4.2. Tasks are assigned higher priorities based on the number of levels of subsequent tasks that rely on their output. Using an arbitrary ordering for tasks could result in a task schedule that produces more traffic and requires more waiting time than the prioritized schedule. Table 10 shows the effect of using priorities to schedule tasks, as opposed to using a random ordering. The performance figures are reported for the network of Sun workstations, but we obtained similar results on the Sun Sparcserver and the CM-5. The choice of a task ordering heuristic has no effect on the uniprocessor performance, as expected, because network traffic and processor idle time are not relevant concerns for uniprocessor execution. With two or more processors, the performance data illustrate that the prioritized task queue provides better performance.

A cluster remapping strategy was presented in Section 4.3 as a means of balancing the load between clusters when a combination of data and task parallelism is used.
Table 9: Effect of grainsize on data parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

                            Processors
Circuit       Grainsize     8        16       32       64       128
haab1         ∞             395.3    239.9    83.1     36.3     29.5
              5000          --       100.3    79.2     34.8     30.4
              2000          261.5    104.2    44.7     45.7     30.3
haab2         ∞             --       426.1    297.4    100.5    71.4
              10000         --       139.7    146.3    97.7     71.3
              5000          --       159.8    128.9    59.2     69.3
superhaab     ∞             --       --       --       638.7    445.8
              50000         --       --       --       642.5    440.7
              20000         --       --       --       621.2    440.5
Table 10: Effect of task parallel scheduling on a network of Sun Sparcstations

                              Processors
Circuit       Task Ordering   1        2        3        4        5
plapart       random          175.1    142.4    116.4    109.7    92.2
              prioritized     176.0    104.4    100.0    90.4     94.5
kovariks      random          324.7    321.5    264.8    227.9    212.6
              prioritized     326.3    267.8    227.6    217.2    174.0
Ideally, the number of processors assigned to a given cluster is proportional to the number of geometry features inside that cluster's region of the circuit area. However, because the total number of processors is fixed, and sometimes small, the fraction of the available processors assigned to a cluster cannot always equal the exact fraction of the total number of geometry features that lie within the circuit area owned by the cluster. Having a larger number of processors available allows the fraction of processors assigned to a cluster to more closely approximate the fraction of geometry features in the cluster area and, therefore, facilitates more effective load balancing. Table 11 shows the effectiveness of the cluster remapping strategy. The load balancing strategy was most effective with a large number of processors on the CM-5, where the processor-to-cluster mapping has the most flexibility.

Table 11: Effect of variable cluster size on Thinking Machines CM-5 message-passing distributed-memory multiprocessor for data and task parallel decomposition

                             Processors/Clusters
Circuit       Cluster Size   16/8     24/8     32/8     32/16    48/16    64/16    64/32    128/64
haab1         fixed          78.2     58.4     56.1     72.6     50.7     52.4     25.8     22.1
              variable       77.8     58.4     50.4     56.2     46.8     42.5     20.7     21.9
haab2         fixed          388.9    260.8    237.5    123.7    92.5     85.3     87.5     62.9
              variable       322.8    241.5    203.6    102.5    76.5     67.8     65.5     48.0
superhaab     fixed          --       --       --       --       --       --       335.2    362.4
              variable       --       --       --       --       --       --       326.7    301.2
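The sketch below shows one simple way to realize the proportional processor-to-cluster assignment discussed above, using largest-remainder rounding with a minimum of one processor per cluster; it is an illustrative scheme, assuming P is at least the number of clusters, not ProperDRC's actual remapping code.

```cpp
#include <algorithm>
#include <numeric>
#include <utility>
#include <vector>

// Given the number of geometry features owned by each cluster, distribute the
// P processors so each cluster's share approximates its fraction of the total
// geometry (assumes P >= number of clusters).
std::vector<int> processorsPerCluster(const std::vector<long>& features, int P) {
  long total = std::accumulate(features.begin(), features.end(), 0L);
  int n = static_cast<int>(features.size());
  std::vector<int> alloc(n, 1);                  // every cluster keeps one processor
  int remaining = P - n;
  std::vector<std::pair<double, int>> remainder; // fractional share, cluster id
  for (int i = 0; i < n; ++i) {
    double ideal = static_cast<double>(P) * features[i] / total;
    int extra = std::max(0, static_cast<int>(ideal) - 1);
    extra = std::min(extra, remaining);
    alloc[i] += extra;
    remaining -= extra;
    remainder.push_back({ideal - static_cast<int>(ideal), i});
  }
  // Hand out any processors left over to the clusters with the largest
  // fractional shortfall.
  std::sort(remainder.rbegin(), remainder.rend());
  for (int k = 0; k < remaining; ++k) alloc[remainder[k].second] += 1;
  return alloc;
}
```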
6 Analysis of Approaches
To analyze the performance results of ProperDRC, we will first examine the performance of the serial algorithms used to implement the various DRC operations. We will then examine the performance issues introduced by the parallelization of the DRC process.
the width/spacing clearance checking algorithm presented in this paper. Table 12 provides a summary of the complexities of the various algorithms used to perform the DRC operations. Considering that the overall performance of the DRC is bounded by the performance of the most complex algorithms it uses, the overall complexity of the DRC is O(N log N).

Table 12: Complexities of the DRC operations

Operation            Execution Time
Boolean Operation    O(N log N)
Sort                 O(N log N) or O(log N)
Grow                 O(N log N)
Width                O(N)
Spacing              O(N)
Square Test          O(N)
Overall DRC          O(N log N)
Table 13: Comparison of parallelization methods on the CM-5

                                  Processors
Circuit       Procs per Cluster   16       32       64       128
haab1         1                   100.3    44.7     34.8     29.5
              2                   77.8     56.2     20.7     21.9
haab2         1                   159.8    128.9    59.2     69.3
              2                   325.4    102.5    65.5     48.0

To illustrate the effect of the complexity of the DRC algorithms on the task parallel performance, consider a simplified case in which two processors are to be applied to perform a design rule check on a circuit with 2X geometry features. Assume perfect load balancing, whether data or task parallelism is used. No matter which type of parallelism is used, the combination of the two processors must have the memory capacity to hold the entire circuit. Using the complexity of the slowest algorithms in the DRC procedure, the time necessary to perform the DRC on a circuit of problem size N is O(N log N), and the minimum amount of working space required by the DRC algorithms is O(√N). We use the term working space to distinguish between the amount of storage required by a single processor to perform the various DRC operations and the amount of storage required by the entire set of processors to hold the whole circuit geometry, which is fixed at O(N) for the entire set of processors.

If data parallelism were used to divide the circuit's geometry features equally between the processors, neglecting the overlapping processor areas for the moment, each of the processors would perform a DRC on a subregion of the circuit with X geometry features. The total run time would be O(X log X), because this amount of time is necessary for each processor to perform local design rule checking simultaneously. The demand for working space at each of the processors is O(√X).

In the task parallel implementation, the various DRC tasks would be divided equally between the two processors. Both processors would be working on a problem of size 2X. The time required for the DRC would be O(1/2 · 2X log(2X)) = O(X log(2X)). The working space requirement for each processor would be O(√(2X)). Therefore, the task parallel version of the DRC requires slightly more time and more working space. These penalties are O(constant), but nonetheless indicate that the purely data parallel implementation provides
the better performance when overlapping processor areas are disregarded.

Now, let us take the effect of the overlapping processor areas into consideration. Using two-dimensional partitioning, a square circuit with area A divided between P processors results in a square circuit area measuring √(A/P) units on a side being assigned to each of the processors. Taking the overlap area of c units on every side of the processor area into consideration, the total area assigned to the processor is (√(A/P) + 2c)(√(A/P) + 2c), or A/P + 4c√(A/P) + 4c² units. This area formula can be generalized to A/P + kc√(A/P) + 4c² for decompositions of circuit areas that result in processor areas that are not perfectly square, where k is a constant. Note that the actual area assigned to processors whose regions are on the outside boundaries of the circuit is slightly lower. These slight area discrepancies can be safely ignored, because a smaller fraction of the processors lies on the boundary as the total number of processors increases, and the performance of the algorithm on the circuit as a whole is bounded by the processors with the largest areas. Disregarding these area discrepancies, the total area operated on by the set of P processors is A + kc√(AP) + 4c²P.

Returning to our fictitious example of dividing a circuit with 2X geometry features between a pair of processors, the actual amount of area assigned to each of the two processors is X + kc√(2X) + 8c². The actual amount of time consumed by the slowest of the DRC algorithms is now O((X + kc√(2X) + 8c²) log(X + kc√(2X) + 8c²)), and the working space requirement is O(√(X + kc√(2X) + 8c²)). Note that these penalties are dependent on the number of processors.

In addition to the penalty for the complexity of the DRC algorithms associated with task parallelism, there is also the overhead of interprocessor communication, whereas the purely data parallel decomposition of the problem requires no communication while the DRC checks are being performed, although some communication is necessary during the data partitioning phase to perform the load balancing between processors. The experimental results also show that load balance is more difficult to attain for the task parallel implementation than for the data parallel implementation. Consider that the task parallel DRC on the plapart benchmark on the network of Sun workstations took longer with five processors than with either four or six. The actual distribution of the geometries between the different mask layers is much more critical for the task parallel implementation than for the data parallel version. The particular distribution in the plapart benchmark apparently presented a load balancing problem for the particular order in which the layer operations were divided between five processors.
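Restating the overlap-area bookkeeping above as equations (for the perfectly square case, k = 4):

$$\left(\sqrt{\tfrac{A}{P}}+2c\right)^{2}=\frac{A}{P}+4c\sqrt{\frac{A}{P}}+4c^{2},\qquad P\left(\frac{A}{P}+4c\sqrt{\frac{A}{P}}+4c^{2}\right)=A+4c\sqrt{AP}+4c^{2}P.$$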
7 Conclusion
In this paper we have applied the concept of integrating task and data parallelism to an irregular application, namely VLSI layout verification, in a tool called ProperDRC. ProperDRC is able to exploit multiple levels of parallelism: it can utilize data parallelism, task parallelism, or a simultaneous combination of the two to perform design-rule checking (DRC) operations concurrently on a multiprocessor architecture. Another contribution of the parallel application is that it is portable across a large number of parallel platforms, including shared-memory multiprocessors, message-passing distributed-memory multiprocessors, and networks of workstations.

A number of areas in parallel design rule checking should be explored in the future. Ideally, a DRC tool should be able to exploit the hierarchy of large designs; performing DRC on a flattened layout representation may result in much redundant work if individual cells in the design are instantiated a large number of times, as is often the case with library cell-based designs. ProperDRC should also be expanded to handle non-Manhattan layout geometries; many of the algorithms used in ProperDRC would require some additional work to be capable of operating on non-Manhattan designs. When such increased capabilities are included in ProperDRC, we can perform an effective comparison of the runtimes of ProperDRC against commercial layout verification tools such as DRACULA and VAMPIRE from Cadence Design Systems, and CHECKMATE and PARADE from Mentor Graphics. Conceptually, the approach of combined task and data parallelism should be applicable to any commercial layout verification tool, but the exact nature of the performance gains will depend on the actual implementation. We are in the process of interacting with developers at Cadence to transfer the parallel algorithms in ProperDRC into practice [13].
References
[1] B. Ramkumar and P. Banerjee, "ProperCAD: A portable object-oriented parallel environment for VLSI CAD," IEEE Transactions on Computer-Aided Design, vol. 13, pp. 829-842, July 1994.
[2] S. Parkes, J. A. Chandy, and P. Banerjee, "ProperCAD II: A run-time library for portable, parallel, object-oriented programming with applications to VLSI CAD," Tech. Rep. CRHC-93-22/UILU-ENG-93-2250, Center for Reliable and High-Performance Computing, University of Illinois, Urbana, Illinois, Dec. 1993.
[3] S. Parkes, J. A. Chandy, and P. Banerjee, "A library-based approach to portable, parallel, object-oriented programming: Interface, implementation, and application," in Supercomputing '94, (Washington, DC), pp. 69-78, Nov. 1994.
[4] S. Kim, J. A. Chandy, S. Parkes, B. Ramkumar, and P. Banerjee, "ProperPLACE: A portable parallel algorithm for cell placement," in Proceedings of the International Parallel Processing Symposium, (Cancun, Mexico), pp. 932-941, Apr. 1994.
[5] J. A. Chandy and P. Banerjee, "Parallel simulated annealing strategies for VLSI cell placement," in Proceedings of the International Conference on VLSI Design, (Bangalore, India), Jan. 1996. To appear.
[6] B. Ramkumar and P. Banerjee, "ProperEXT: A portable parallel algorithm for VLSI circuit extraction," in Proceedings of the International Parallel Processing Symposium, (Newport Beach, CA), pp. 434-438, Apr. 1993.
[7] K. De, B. Ramkumar, and P. Banerjee, "ProperSYN: A portable parallel algorithm for logic synthesis," in Digest of Papers, International Conference on Computer-Aided Design, (Santa Clara, CA), pp. 412-416, Nov. 1992.
[8] K. De, J. A. Chandy, S. Roy, S. Parkes, and P. Banerjee, "Portable parallel algorithms for logic synthesis using the MIS approach," in Proceedings of the International Parallel Processing Symposium, (Santa Barbara, CA), pp. 579-585, Apr. 1995.
[9] B. Ramkumar and P. Banerjee, "Portable parallel test generation for sequential circuits," in Digest of Papers, International Conference on Computer-Aided Design, (Santa Clara, CA), pp. 220-223, Nov. 1992.
[10] S. Parkes, P. Banerjee, and J. H. Patel, "ProperHITEC: A portable, parallel, object-oriented approach to sequential test generation," in Proceedings of the Design Automation Conference, (San Diego, CA), pp. 717-721, June 1994.
[11] S. Parkes, P. Banerjee, and J. Patel, "A parallel algorithm for fault simulation based on PROOFS," in Proceedings of the International Conference on Computer Design, (Austin, TX), Oct. 1995. To appear.
[12] S. Kim, LSI Logic Corporation, personal communication, 1995.
[13] E. Petrus, Cadence Design Systems, personal communication, 1996.
[14] K. MacPherson, "Parallel algorithms for layout verification," Master's thesis, University of Illinois at Urbana-Champaign, Aug. 1995.
[15] U. Lauther, "An O(N log N) algorithm for Boolean mask operations," in Proc. 18th Design Automation Conf., pp. 555-562, June 1981.
[16] T. Szymanski and C. J. Van Wyk, "Goalie: A space efficient system for VLSI artwork analysis," IEEE Design & Test of Computers, vol. 2, pp. 64-72, June 1985.
[17] P. Banerjee, Parallel Algorithms for VLSI Computer-Aided Design Applications. Englewood Cliffs, NJ: Prentice Hall, 1994.
[18] G. E. Bier and A. R. Pleszkun, "An algorithm for design rule checking on a multiprocessor," in Proc. Design Automation Conf., pp. 299-303, June 1985.
[19] F. Gregoretti and Z. Segall, "Analysis and evaluation of VLSI design rule checking implementation in a multiprocessor," in Proc. Int. Conf. Parallel Processing, pp. 7-14, Aug. 1984.
[20] J. Marantz, "Exploiting parallelism in VLSI CAD," in Proc. Int. Conf. Computer Design, Oct. 1986.
[21] E. Carlson and R. Rutenbar, "Design and performance evaluation of new massively parallel VLSI mask verification algorithms in JIGSAW," in Proc. 27th Design Automation Conf., pp. 253-259, June 1990.
[22] E. Carlson and R. Rutenbar, "Mask verification on the Connection Machine," in Proc. Design Automation Conf., pp. 134-140, June 1988.
[23] S. H. Bokhari, "Partitioning problems in parallel, pipelined and distributed computing," IEEE Trans. Comput., vol. C-37, pp. 48-57, Jan. 1988.
[24] J. K. Salmon, Parallel Hierarchical N-body Methods. PhD thesis, California Institute of Technology, Dec. 1990.
[25] J. E. Barnes and P. Hut, "A hierarchical O(N log N) force calculation algorithm," Nature, vol. 324, pp. 446-449, 1986.
[26] J. P. Singh et al., "Load balancing and data locality in adaptive hierarchical N-body methods: Barnes-Hut, fast multipole, and radiosity," J. Parallel & Distrib. Comput., vol. 27, pp. 118-141, June 1995.
[27] G. Cybenko, "Dynamic load balancing for distributed memory multiprocessors," J. Parallel & Distrib. Comput., vol. 7, pp. 279-301, July 1989.
[28] K. P. Belkhale and P. Banerjee, "Recursive partitions on multiprocessors," in Proc. 5th Distributed Memory Computing Conf., Apr. 1990.
[29] K. P. Belkhale and P. Banerjee, "Parallel algorithms for VLSI circuit extraction," IEEE Transactions on Computer-Aided Design, vol. 10, pp. 604-618, May 1991.
[30] K. P. Belkhale, Parallel Algorithms for CAD with Applications to Circuit Extraction. PhD thesis, University of Illinois at Urbana-Champaign, Nov. 1990. Tech. Rep. CRHC-90-15/UILU-ENG-90-2251.
[31] C. Mead and L. Conway, Introduction to VLSI Systems. Philippines: Addison-Wesley, 1980.
[32] J.-I. Pi, MOSIS Scalable CMOS Design Rules. Information Sciences Institute, University of Southern California, Marina del Rey, CA.