Parallel Algorithms For VLSI Layout Verication
Computer and Systems Research Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main St., Urbana, IL 61801. Tel: (217) 333-6564, Fax: (217) 333-1910, E-mail: banerjee@crhc.uiuc.edu
Abstract

Layout verification determines whether the polygons that represent different mask layers in the chip conform to the technology specifications. Commercial layout verification programs can take tens of hours to run on the flattened representations of large designs. It is therefore desirable to run the DRC problem in parallel to reduce the runtimes. Also, the memory requirements of large chips are such that the entire chip description may not fit in the memory of a single workstation; parallel processing therefore allows one to distribute the memory requirements of the problem across multiple processors. In this paper, we present a parallel implementation of a design-rule checking program called ProperDRC, which is implemented on top of the ProperCAD environment. ProperDRC makes two novel contributions over previous work. First, it is portable across a large number of multiprocessor platforms, including shared-memory multiprocessors, message-passing distributed-memory multiprocessors, and hybrid architectures comprising uni- and multiprocessor workstations connected by a network. Second, ProperDRC is able to exploit multiple levels of parallelism. It can utilize data parallelism, task parallelism, or a simultaneous combination of the two to perform DRC operations concurrently on a multiprocessor architecture. This paper presents specifics of the implementation of ProperDRC, provides an analysis of the methods used to obtain parallelism, addresses load balancing issues, and reports experimental results on various benchmark circuits.
This research was supported in part by the Advanced Research Projects Agency under contract DAAH04-94-G-0273 administered by the Army Research Office.
1 Introduction
Layout verification determines whether the polygons that represent different mask layers in a VLSI chip conform to the technology specifications. One aspect of layout verification is design rule checking (DRC), which detects violations of rules such as width, spacing, and overlap rules that govern the technology in which the chip is to be fabricated. The computational complexity of layout verification programs is due not to the intrinsic complexity of each operation but to the large number of parts in the layout, which can consist of tens of millions of rectangles for large designs. The most sophisticated commercial layout verification programs, such as DRACULA and VAMPIRE from Cadence Design Systems, and CHECKMATE and PARADE from Mentor Graphics, can take tens of hours to run on the flattened representations of large designs. It is therefore desirable to run the layout verification problem in parallel to reduce the runtimes. Also, the memory requirements of large chips are such that the entire chip description may not fit in the memory of a single workstation; parallel processing therefore allows one to distribute the memory requirements of the problem across multiple processors.

In this paper, we present a parallel implementation of a design-rule checking program called ProperDRC, which is implemented on top of the ProperCAD environment. ProperDRC makes two novel contributions over previous work. First, it is portable across a large number of multiprocessor platforms, including shared-memory multiprocessors, message-passing distributed-memory multiprocessors, and networks of workstations. Second, ProperDRC is able to exploit multiple levels of parallelism. It can utilize data parallelism, task parallelism, or a simultaneous combination of the two to perform DRC operations concurrently on a multiprocessor architecture. ProperDRC currently works on Manhattan geometries only (where the edges of rectangles are parallel to the X and Y axes), but conceptually the parallel approaches can be extended to handle non-Manhattan geometries as well, since the algorithms for layout operations are all based on scanline algorithms.

The objectives of the ProperCAD project are to develop efficient parallel algorithms for VLSI CAD tasks that can utilize the computing power of a wide range of parallel platforms in order to reduce the design turnaround time of complex chips [1, 2, 3]. We have developed a PoRtable Object-oriented Parallel EnviRonment for CAD algorithms (ProperCAD II), which is a C++ object library targeted at medium-grain parallelism and MIMD parallel architectures (shared memory and message passing). Parallel CAD algorithms developed on this library run unchanged, and efficiently, on both shared-memory and message-passing architectures. The difference between all the previous work on portable parallel programming and our
ProperCAD effort is that we have avoided defining a new language for writing parallel programs. We have instead used an existing, established object-oriented language (C++) as the base language for exploiting the object-oriented nature of programming, and augmented it with an efficient C++ class library to help write portable parallel programs. The ProperCAD II framework runs on shared-memory multiprocessors such as the SUN 4/600MP, the SUN Sparcserver 1000, the Encore Multimax, and the Silicon Graphics Challenge; on distributed-memory message-passing multicomputers such as the Intel iPSC/860 hypercube, the Intel Paragon, the Thinking Machines CM-5, and the IBM SP-2; and also on a network of SUN workstations. We are investigating parallel algorithms for various VLSI CAD applications on top of the ProperCAD II framework. The applications include cell placement [4, 5], global and detailed routing, circuit extraction [6], logic synthesis [7, 8], test generation [9, 10], fault simulation [11], circuit, logic, and behavioral simulation, and high level synthesis.

In this paper, we describe parallel algorithms for layout verification of flattened VLSI layouts using the ProperCAD framework. While some layout verification tools exploit the hierarchical information available in VLSI chip designs while designers are interactively designing a chip, many companies perform a complete flattened chip design rule check just prior to tape-out, to avoid the economic penalties of possibly sending an incorrect layout out for costly fabrication [12, 13]. The runtimes of these flattened layout verification tools can run into tens to hundreds of hours for large commercial designs having tens of millions of rectangles. This is true for commercial tools such as CHECKMATE and PARADE from Mentor Graphics, and DRACULA and VAMPIRE from Cadence Design Systems. It is therefore important to investigate parallel algorithms for layout verification.

Another problem of flattened design rule checking is its tremendous memory requirement. One can assume that each transistor in a VLSI design translates to about 10-20 rectangles on a mask layout [31]. To represent a mask layout, one needs to store the X and Y location, plus additional information about the mask layer, orientation, etc., which requires about 20-40 bytes per rectangle [31]. For a 10 million transistor circuit, representative of current microprocessors, the memory requirement is 8 Gbytes using this simple analysis. The above analysis is for simply representing the layout; during the layout verification tasks, additional data structures and temporary data storage are used. Clearly, these memory requirements are too large to fit in the memory of a conventional workstation. Using data partitioning, one can partition the memory requirements of the layout among the various processors of a parallel machine and enable the execution of these large problems. We will show in the results section of this paper examples of large layouts that cannot run on one processor of a CM-5 multiprocessor due to memory limitations, but can run on 64 processors using data parallelism.
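As a worked check of the memory estimate above, taking the upper ends of the quoted ranges (20 rectangles per transistor, 40 bytes per rectangle):

$$10^{7}\ \text{transistors}\times 20\ \tfrac{\text{rectangles}}{\text{transistor}}\times 40\ \tfrac{\text{bytes}}{\text{rectangle}} = 8\times 10^{9}\ \text{bytes}\approx 8\ \text{Gbytes}.$$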
Figure 1: Sample design rules (width, spacing, enclosure, extension, and square test)
This paper is organized as follows. Section 2 describes the details of the serial algorithm for design rule checking. Section 3 discusses related work in parallel design rule checking algorithms. Details of the parallel DRC algorithm and of the ProperDRC implementation are described in Section 4. Performance results for ProperDRC are presented in Section 5. Section 6 contains an analysis of the performance of the parallel DRC. Section 7 summarizes the work in parallel DRC performed in this research.
opaqueness or transparency of the areas above and below each horizontal edge. More details of the algorithms are presented in [14].

While a naive DRC algorithm would check all possible interactions of the N^2 pairs of rectangles in a design consisting of N rectangles, a data structure called the scanline is useful for performing operations on a geometry that uses an edge representation. The basic idea of a scanline algorithm is to sweep a vertical line across the edges that constitute a mask layer. Each horizontal location the scanline encounters is called a scanline stop. Only the edges that intersect the scanline are considered at a given time. Scanline algorithms can be implemented in a space-efficient manner: an edge in the circuit area is brought in from the global data structure containing N rectangles to an intermediate data structure containing on the average O(√N) edges. An edge is included in the scanline data structure when its left endpoint touches the scanline and is removed from it when its right endpoint touches the scanline [15]. Figure 2 illustrates the basic scanline operation. The scanline stops can be restricted to locations on the circuit area that correspond to the left or right endpoint of an edge.
Figure 2: A scanline moving left to right. At the current position of the scanline, rectangles C, B, E, D and G are included in the data structure, and various operations regarding layout violations are performed only on these rectangles and not on the remaining rectangles. When the scanline moves to the next position (stop), the rectangle C is deleted from the scanline structure.
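The following is a minimal C++ sketch of the scanline sweep just described, assuming Manhattan rectangles; the Rect type, the scanlineSweep name, and the checkActiveSet callback are illustrative inventions, not ProperDRC's actual interfaces.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical Manhattan rectangle: [x1,x2] x [y1,y2].
struct Rect { int x1, y1, x2, y2; };

// Sweep a vertical scanline left to right across the layout.  A rectangle is
// inserted when the scanline reaches its left edge and removed once the
// scanline has passed its right edge, so only about O(sqrt(N)) rectangles are
// active at any stop on average.
template <typename CheckFn>
void scanlineSweep(std::vector<Rect> rects, CheckFn checkActiveSet) {
  // Scanline stops are restricted to left/right rectangle boundaries.
  std::vector<int> stops;
  for (const Rect& r : rects) { stops.push_back(r.x1); stops.push_back(r.x2); }
  std::sort(stops.begin(), stops.end());
  stops.erase(std::unique(stops.begin(), stops.end()), stops.end());

  // Sort rectangles by left edge so insertion is a single forward pass.
  std::sort(rects.begin(), rects.end(),
            [](const Rect& a, const Rect& b) { return a.x1 < b.x1; });

  std::vector<Rect> active;   // rectangles currently cut by the scanline
  std::size_t next = 0;
  for (int x : stops) {
    // Insert rectangles whose left edge has reached the scanline.
    while (next < rects.size() && rects[next].x1 <= x) active.push_back(rects[next++]);
    // Remove rectangles whose right edge lies strictly behind the scanline.
    active.erase(std::remove_if(active.begin(), active.end(),
                                [x](const Rect& r) { return r.x2 < x; }),
                 active.end());
    // Width/spacing/Boolean checks are applied only to the active set.
    checkActiveSet(x, active);
  }
}
```

A width or spacing test would supply a callback that examines only the rectangles in the active set at each stop, which is what keeps the per-stop work near O(√N) on average rather than N.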
eration, the Grow operation, and Width/Spacing testing. The majority of the rules equate to a single elementary task, as shown in Figure 3. The circles in the task graph represent the input layers, and the squares represent layers generated by the operation listed, which will contain all of the geometry edges that fail to pass the corresponding design rule. The Enclosure and Extension rules require a series of three and ten elementary tasks, respectively, to test for rule violations. The task graphs corresponding to these two types of design rules are shown in Figure 4.
Figure 3: Task graphs for design rules that map to a single elementary operation (e.g., two-layer spacing via a Square Test, overlap via AND_NOT, no-overlap via AND)

Figure 4: Task graphs for the Enclosure and Extension rules, built from sequences of AND, AND_NOT, Grow, and Square Test operations
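To make the task-graph decomposition concrete, the sketch below shows one plausible way a three-task Enclosure rule could be expressed as elementary operations; the Task structure, the layer naming, and the particular operand ordering are assumptions for illustration and need not match ProperDRC's exact graphs.

```cpp
#include <string>
#include <vector>

// Elementary DRC operations named in the text.
enum class Op { AND, AND_NOT, GROW, SQUARE_TEST, WIDTH_SPACING };

// One node of a rule's task graph: an operation applied to one or two input
// layers, producing a new (possibly intermediate) layer.
struct Task {
  Op op;
  std::vector<std::string> inputs;   // layer names consumed
  std::string output;                // layer name produced
  int growAmount = 0;                // used only by GROW
};

// Illustrative 3-task graph for an enclosure rule "outer must enclose inner
// by d": grow the inner layer by d, intersect with the outer layer, and
// subtract that covered part to leave only the geometry that violates the rule.
std::vector<Task> enclosureRule(const std::string& inner,
                                const std::string& outer, int d) {
  return {
    {Op::GROW,    {inner},                                inner + "_grown", d},
    {Op::AND,     {inner + "_grown", outer},              inner + "_covered"},
    {Op::AND_NOT, {inner + "_grown", inner + "_covered"}, inner + "_violations"},
  };
}
```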
It should be noted that the task parallel approach is the easiest to incorporate into a large piece of layout verification software, since one can simply partition the rules among different DRC runs on different processors. This is the approach used in a commercial version of a parallel DRC, DRACULA from Cadence Design Systems, which runs on networks of workstations and on shared-memory multiprocessors such as the SPARCServer 1000. We will show in the results in Section 5 that pure task parallelism produces limited speedups, since there is not enough task parallelism in real design rule sets; such an approach is therefore appropriate only for a small number of processors, e.g., 4 to 8, and is not scalable. This method of parallelization also suffers from the same memory scalability problem as the previous approach, in that each processor must have enough memory to perform operations on the entire circuit area.
in parallel. The drawback associated with this method is that, because the entire circuit is never in the memory of a single processor at one time, no global quantification of the distribution of geometry features within the circuit area is possible. For this reason, circuit partitioning must be based purely on circuit area rather than on dividing the geometry features themselves equally among processors. To balance the load between processors, additional decomposition is performed to further subdivide the chip area. A grainsize is specified by the user to limit the amount of additional decomposition performed. All areas that contain more geometry features than the specified grainsize are subdivided. Circuit geometry regions are then remapped to processors in such a way as to provide the best load balance.

Choosing the optimal grainsize is a hard problem since, in general, the runtime of a parallel DRC tool as a function of grainsize has a bath-tub characteristic. If the grainsize is too large, we get an unequal load balance, and hence the runtimes of a parallel DRC program will be large for layouts containing irregular distributions of rectangles. If the grainsize is too small, we create a large number of tasks, but each task generates some redundant work in the form of extra checks that must be performed at the boundaries of the partitions (see Section 6.2 for a detailed analysis); again, the runtimes of the parallel DRC tool will be large. For an optimum grainsize, the runtime of the parallel tool is minimized. Since the distribution of rectangles of a circuit is not known a priori, it is impossible to determine the optimal grainsize for all layouts. We will discuss experimental data on the choice of the grainsize for some example layouts in Section 5. A good heuristic is to choose a grainsize of around N/(σP) rectangles, where N is the number of rectangles, P is the number of processors, and σ is the variance of the distribution of rectangles per unit area of the chip. We assume a typical value of σ to be 2 for real designs.

Figure 5(a) shows the initial circuit partitioning on four processors for a sample circuit area, in which the X's represent geometry features. The dashed lines show the initial division of the circuit into equal-area subregions, which are assigned one per processor. Figure 5(b) shows the same circuit after the load balancing algorithm is applied with a specified grainsize of 5. The region initially assigned to Processor 2 has been divided into two parts. Processor 3 will perform the DRC checks on the subregion on the right, to maintain better overall load balance. Note that if a grainsize of 10 had been specified by the user, no further subdivision of the circuit beyond the initial partitioning shown in Figure 5(a) would have been performed.
Figure 5: Data parallel load balancing on four processors

The extent to which load balancing takes place is therefore completely under the user's control, which gives the user the flexibility to customize the performance of the algorithm to take full advantage of the multiprocessor architecture by selecting an appropriate grainsize.

The geometry layers are distributed in the original polygon representation used by the CIF input file. The conversion from polygon to edge-based representation takes place at the clusters that will perform the DRC tests on the area. Delaying the conversion until after the partitioning has two advantages: the messages are smaller, because one rectangle expands to two edges, and the conversion work is distributed, reducing the time required. Some overlap is necessary between the areas assigned to the various processors to ensure that no pairs of neighboring edges are overlooked. Each processor receives all rectangles that lie within its area extended on all sides by the maximum design rule interaction distance for the technology. Figure 6 shows the partitioned circuit area from Figure 5(b) with the addition of the overlap areas. Rectangles that are present in more than one processor area are duplicated and trimmed to the respective processor areas. Trimming the rectangles can easily introduce geometry features that do not pass the design rules, so the DRC routine must be careful not to report erroneous results introduced by the circuit partitioning. Therefore, upon completion of the design rule tests, infractions that fall within the maximum design rule interaction distance boundary surrounding the processor's area of the circuit are disregarded rather than reported.
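Below is a minimal sketch of the grainsize-driven area subdivision and the overlap (halo) collection described above, assuming axis-aligned rectangles and recursive bisection of oversized regions; the function names and the exact splitting strategy are hypothetical, not ProperDRC's implementation.

```cpp
#include <vector>

struct Rect   { int x1, y1, x2, y2; };
struct Region { int x1, y1, x2, y2; };

// Count the rectangles intersecting a region (used to test against the grainsize).
static int countIn(const std::vector<Rect>& rects, const Region& a) {
  int n = 0;
  for (const Rect& r : rects)
    if (r.x1 < a.x2 && r.x2 > a.x1 && r.y1 < a.y2 && r.y2 > a.y1) ++n;
  return n;
}

// Split any region holding more than 'grainsize' rectangles into equal-area
// halves, recursively, so the resulting pieces can be remapped to processors.
void subdivide(const std::vector<Rect>& rects, Region a, int grainsize,
               std::vector<Region>& out) {
  // Stop when the region is small enough, or degenerate (guarantees termination).
  if (countIn(rects, a) <= grainsize ||
      (a.x2 - a.x1 <= 1 && a.y2 - a.y1 <= 1)) { out.push_back(a); return; }
  Region left = a, right = a;
  if (a.x2 - a.x1 >= a.y2 - a.y1) left.x2 = right.x1 = (a.x1 + a.x2) / 2;  // vertical cut
  else                            left.y2 = right.y1 = (a.y1 + a.y2) / 2;  // horizontal cut
  subdivide(rects, left, grainsize, out);
  subdivide(rects, right, grainsize, out);
}

// Gather the rectangles a processor needs for its region: everything within the
// region extended on all sides by the maximum design-rule interaction distance.
// Violations found inside that halo are later disregarded rather than reported.
std::vector<Rect> rectsForRegion(const std::vector<Rect>& rects, Region a, int halo) {
  Region h{a.x1 - halo, a.y1 - halo, a.x2 + halo, a.y2 + halo};
  std::vector<Rect> mine;
  for (const Rect& r : rects)
    if (r.x1 < h.x2 && r.x2 > h.x1 && r.y1 < h.y2 && r.y2 > h.y1) mine.push_back(r);
  return mine;
}
```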
Figure 6: The partitioned circuit area from Figure 5(b) with overlap areas added around each processor's region
Figure 7: Task scheduling example. Elementary operations (Square Test, AND, AND_NOT, Grow, Width/Spacing) on the Via, Met2, and Poly layers are tagged with priorities (40-70) and mapped onto a cluster of two processors.
Tests described in the previous section may have two input layers. These two layers will be generated by other tasks, which precede the current task in the task graph. It is conceivable (maybe even desirable, from a performance standpoint) that the two parent tasks run on separate processors in the cluster. The layers generated by both these parent tasks must be sent to a single processor so that the subsequent task can be completed. Therefore, it is necessary that the destination processor for the generated layers, and thus the child task itself, be determined a priori. Other solutions to the dependency problem, such as broadcasting the EdgeSets or having the child task explicitly request the layer from its parents, introduce too much communication overhead to be effective.

The mapping of tasks onto processors is obtained by levelizing the task graph. Priorities are assigned to tasks based on the number of levels of subsequent tasks that depend on their output layers. The levelized task graph is filled by arranging tasks in prioritized order. Figure 7(b) shows how the tasks can be tagged with a priority and mapped to a cluster that contains two processors. A task can begin as soon as its input layers arrive at the destination processor. There is no need for explicit synchronization between all of the processors of the cluster at the task graph level boundaries, so the penalty for load imbalance is not as severe as with
the traditional barrier-synchronized implementation of a levelized task graph. Furthermore, because the complexities of the various algorithms used to implement the elementary tasks can be used to estimate the time required to perform the operations for a given problem size, the potential exists for intelligent scheduling methods to minimize the imbalance between processors.

Because the number of tasks necessary to perform the design rule checks is fixed for a given technology, and is independent of the problem size, there is an upper bound on the performance that can be achieved by parallelizing these tasks, no matter how effective the load balancing strategies are. This suggests that task parallelism alone is not sufficient to obtain the best performance on a multiprocessor architecture with more than a few processors; a combination of data and task parallelism must be used.
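A small sketch of the levelization and priority assignment described above, where a task's priority is the number of levels of tasks that depend on its output; the TaskGraph representation is a hypothetical stand-in for the internal data structures.

```cpp
#include <algorithm>
#include <vector>

// Task graph in adjacency form: children[t] lists the tasks that consume the
// layer produced by task t.  The rule graph is assumed to be acyclic.
struct TaskGraph { std::vector<std::vector<int>> children; };

// Priority of a task = number of levels of dependent tasks below it, so tasks
// feeding the longest chains are scheduled first.
static int priorityOf(const TaskGraph& g, int t, std::vector<int>& memo) {
  if (memo[t] >= 0) return memo[t];
  int p = 0;
  for (int c : g.children[t]) p = std::max(p, 1 + priorityOf(g, c, memo));
  return memo[t] = p;
}

// Return task ids ordered by decreasing priority.  A task is dispatched to a
// processor of the cluster as soon as its input layers have arrived, so no
// barrier is needed at level boundaries.
std::vector<int> prioritizedOrder(const TaskGraph& g) {
  int n = static_cast<int>(g.children.size());
  std::vector<int> memo(n, -1), order(n);
  for (int t = 0; t < n; ++t) priorityOf(g, t, memo);
  for (int t = 0; t < n; ++t) order[t] = t;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return memo[a] > memo[b]; });
  return order;
}
```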
four homogeneous clusters, as depicted in Figure 8(b), where the circles represent individual processors, and the lines connecting them show the structure of the architecture. Figure 8(c) shows the processor-to-cluster mapping after application of the load balancing scheme. Cluster 3 has essentially borrowed an extra processor from Cluster 2 to compensate for the larger number of geometry features in its region of the circuit area.
5 Results
ProperDRC was used to test for violations of the MOSIS Scalable CMOS design rules [32]. A total of 32 design rules were specified, which resulted in the generation of 64 intermediate layers to perform all of the necessary tests. The following platforms were used to generate performance measurements: a Sun Sparcserver 1000 shared-memory multiprocessor, a network of six Sun Sparcstations, and the CM-5 message-passing distributed-memory multiprocessor. The benchmarks used to test ProperDRC include plapart, a programmable logic array with 25,000 rectangles; kovariks, a multiplier array with 64,000 rectangles; and haab1 and haab2, static RAMs containing 128,000 and 253,000 rectangles, respectively. An artificial benchmark, superhaab, was also created, which consists of the haab2 benchmark replicated four times, in an array of two cells by two cells, with 10 λ spacing between cells. Superhaab contains 1,014,000 rectangles.

Tables 1 through 3 show the performance data for purely data parallel decomposition of the DRC. All execution times are measured in seconds. Dashes in the tables indicate that the processor configuration had insufficient memory to perform the DRC on the given circuit. The fact that the CM-5 was unable to operate on the haab1, haab2, and superhaab circuits with fewer than 8, 16, and 64 processors, respectively, illustrates the memory scalability of the ProperDRC algorithm. The results of the network of SUN workstations for very large circuits could not be reported since our ProperCAD library implementation on the network is unreliable for very large message sizes. (In other related work, we are working on a reliable port of the ProperCAD environment to a network of workstations.) It is also interesting to note that every platform appears to exhibit superlinear speedups as the number of processors increases from one to two. This is especially apparent for the larger benchmarks running on the Sun Sparcserver 1000, which run six to seven times faster on two processors than on a uniprocessor. This effect is most likely due to caching, where the smaller working-space requirement of the two-processor implementation results in
Figure 8: Remapping processors to obtain balanced load between clusters; panel (c) shows the balanced cluster mapping
significantly fewer expensive memory operations.

Table 1: Data parallel performance on a network of Sun Sparcstation 5 machines

                Processors
Circuit       1        2        4
plapart       175.1    59.5     28.9
kovariks      324.7    131.4    67.6
haab1         --       --       --
haab2         --       --       --
superhaab     --       --       --
Table 2: Data parallel performance on Sun Sparcserver 1000 shared-memory multiprocessor

                Processors
Circuit       1         2        4        8
plapart       130.3     43.6     21.4     9.4
kovariks      244.7     94.3     38.1     24.2
haab1         843.1     114.5    64.1     40.7
haab2         1221.3    275.8    176.2    100.9
superhaab     --        --       --       --

The performance results for the purely task parallel implementation of ProperDRC are given in Tables 4 through 6. A small number of processors provides good performance, but the effectiveness of adding additional processors diminishes quickly, for any problem size. This is because the amount of task parallelism available depends only on the size of the set of technology rules being used, and not on the size of the input file, as discussed in Section 4.2. It should also be noted that task parallel layout verification cannot handle large problem sizes, since each processor has to replicate the entire mask layout, which becomes too much for a single processor's memory.

Tables 7 and 8 show the performance results using a combination of data and task parallelism. It is important to notice that there are cases in which a combination of data and task parallelism provides better performance than either type of parallelism individually. Compared to Table 3, the results of the 128-processor runs show that the combined task and data parallel approach gives better runtime performance than the purely data parallel approach. A detailed analysis of these results is presented in the following section.
Table 3: Data parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

                Processors
Circuit       1        2        4       8        16       32       64       128
plapart       410.7    77.6     39.3    19.8     9.7      5.6      5.2      3.3
kovariks      --       288.2    89.9    45.0     24.6     12.0     6.9      6.2
haab1         --       --       --      261.5    100.3    44.7     34.8     29.5
haab2         --       --       --      --       159.8    128.9    59.2     69.3
superhaab     --       --       --      --       --       --       621.2    440.5
Table 4: Task parallel performance on a network of Sun Sparcstation 5 machines

                Processors
Circuit       1        2        3        4        5        6
plapart       175.1    104.4    100.0    90.4     94.5     76.2
kovariks      324.7    267.8    227.6    217.2    174.0    200.2
haab1         --       --       --       --       --       --
haab2         --       --       --       --       --       --
superhaab     --       --       --       --       --       --
Table 5: Task parallel performance on Sun Sparcserver 1000 shared-memory multiprocessor

                Processors
Circuit       1         2        3        4        5        6        7        8
plapart       130.3     76.1     65.7     56.8     53.5     53.5     44.7     42.1
kovariks      244.7     137.2    100.2    100.1    85.0     77.1     69.6     64.7
haab1         843.1     479.4    339.1    338.0    354.3    324.4    310.2    294.5
haab2         1221.3    683.5    575.6    502.8    476.9    457.6    419.1    422.8
superhaab     --        --       --       --       --       --       --       --
Table 6: Task parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

                Processors
Circuit       1        2        3        4        5        6        7        8
plapart       854.5    442.6    350.6    311.7    306.2    280.5    233.1    308.7
kovariks      --       --       --       644.7    536.9    470.8    466.6    392.8
haab1         --       --       --       --       --       --       --       --
haab2         --       --       --       --       --       --       --       --
superhaab     --       --       --       --       --       --       --       --

Table 7: Combined data and task parallel performance on Sun Sparcserver 1000 shared-memory multiprocessor

                Processors/Clusters
Circuit       4/2      6/2      8/2      8/4
plapart       28.9     25.5     21.9     14.3
kovariks      54.7     44.1     41.9     23.7
haab1         178.9    124.8    137.0    72.1
haab2         513.5    379.9    391.9    264.2
superhaab     --       --       --       --

The user-specified grainsize controls the extent to which load balancing takes place in the purely data parallel decomposition of the DRC problem. Any region of the circuit having a number of geometry features greater than the grainsize is subdivided into equal-area regions, which may later be reassigned to different processors as necessary to facilitate load balancing. As discussed earlier in Section 4, choosing the optimal grainsize is a hard problem. If the grainsize is too large, we get an unequal load balance. If the grainsize is too small, we create a large number of tasks, but each task generates some redundant work in the form of extra checks that must be performed at the boundaries of the partitions. A good heuristic is to choose a grainsize of around N/(σP) rectangles, where N is the number of rectangles, P is the number of processors, and σ is the variance of the distribution of rectangles per unit area of the chip. We assume a typical value of σ to be 2 for real designs.
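As a worked instance of the heuristic, for haab1 (N = 128,000 rectangles) on P = 16 processors with the assumed σ = 2:

$$\text{grainsize} \approx \frac{N}{\sigma P} = \frac{128{,}000}{2\times 16} = 4{,}000\ \text{rectangles},$$

which is close to the grainsize of 5,000 that performs best for this configuration in Table 9.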
Table 8: Combined data and task parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

                Processors/Clusters
Circuit       4/2      8/4      16/8     32/8     64/16    64/32    128/64
plapart       157.3    59.6     26.0     17.2     9.8      8.7      10.7
kovariks      320.7    124.8    70.7     49.8     21.9     15.6     13.4
haab1         --       373.3    77.8     50.4     42.5     20.7     21.9
haab2         --       --       325.4    203.6    67.8     65.5     48.0
superhaab     --       --       --       --       --       326.7    301.2

The effect of varying the grainsize in the purely data parallel decomposition is shown in Table 9 for the CM-5. The purely area-based circuit partitioning may be considered a degenerate case of the data partitioning strategy presented in this paper, in which the grainsize is an infinite value, since the grainsize-based partitioning is never invoked. For the haab1 circuit, consisting of 128,000 rectangles, we show results for grainsizes of 5,000 and 2,000 rectangles. For example, for the 16-processor run, the results are best for a grainsize of 5,000 rectangles (our heuristic picks 4,000 rectangles). Similarly, for the haab2 circuit consisting of 256,000 rectangles on 16 processors, the optimal grainsize is 10,000 rectangles (our heuristic picks 8,000 rectangles). We have obtained similar results on the SUN Sparcserver 1000 and the network of workstations.

The concept of using task priorities to determine the order of execution for a set of DRC tasks was introduced in Section 4.2. Tasks are assigned higher priorities based on the number of levels of subsequent tasks that rely on their output. Using an arbitrary ordering for tasks could result in a task schedule that produces more traffic and requires more waiting time than the prioritized schedule. Table 10 shows the effect of using priorities to schedule tasks, as opposed to using a random ordering. The performance figures are reported for the network of Sun workstations, but we obtained similar results on the Sun Sparcserver and the CM-5. The choice of a task ordering heuristic has no effect on the uniprocessor performance, as expected, because network traffic and processor idle time are not relevant concerns for uniprocessor execution. With two or more processors, the performance data illustrate that the prioritized task queue provides better performance.

A cluster remapping strategy was presented in Section 4.3 as a means of balancing the load between clusters when a combination of data and task parallelism is used.
Table 9: Effect of grainsize on data parallel performance on Thinking Machines CM-5 message-passing distributed-memory multiprocessor

                            Processors
Circuit       Grainsize     8        16       32       64       128
haab1         ∞             395.3    239.9    83.1     36.3     29.5
              5000          --       100.3    79.2     34.8     30.4
              2000          261.5    104.2    44.7     45.7     30.3
haab2         ∞             --       426.1    297.4    100.5    71.4
              10000         --       139.7    146.3    97.7     71.3
              5000          --       159.8    128.9    59.2     69.3
superhaab     ∞             --       --       --       638.7    445.8
              50000         --       --       --       642.5    440.7
              20000         --       --       --       621.2    440.5
Table 10: Effect of task parallel scheduling on a network of Sun Sparcstations

                              Processors
Circuit       Task Ordering   1        2        3        4        5
plapart       random          175.1    142.4    116.4    109.7    92.2
              prioritized     176.0    104.4    100.0    90.4     94.5
kovariks      random          324.7    321.5    264.8    227.9    212.6
              prioritized     326.3    267.8    227.6    217.2    174.0
Ideally, the number of processors assigned to a given cluster is proportional to the number of geometry features inside that cluster's region of the circuit area. However, because the total number of processors is fixed, and sometimes small, the fraction of the available processors assigned to a cluster cannot always equal the exact fraction of the total number of geometry features that lie within the circuit area owned by the cluster. Having a larger number of processors available allows the fraction of processors assigned to a cluster to more closely approximate the fraction of geometry features in the cluster area and, therefore, facilitates more effective load balancing. Table 11 shows the effectiveness of the cluster remapping strategy. The load balancing strategy was most effective with a large number of processors on the CM-5, where the processor-to-cluster mapping has the most flexibility.

Table 11: Effect of variable cluster size on Thinking Machines CM-5 message-passing distributed-memory multiprocessor for data and task parallel decomposition

                             Processors/Clusters
Circuit       Cluster Size   16/8     24/8     32/8     32/16    48/16    64/16    64/32    128/64
haab1         fixed          78.2     58.4     56.1     72.6     50.7     52.4     25.8     22.1
              variable       77.8     58.4     50.4     56.2     46.8     42.5     20.7     21.9
haab2         fixed          388.9    260.8    237.5    123.7    92.5     85.3     87.5     62.9
              variable       322.8    241.5    203.6    102.5    76.5     67.8     65.5     48.0
superhaab     fixed          --       --       --       --       --       --       335.2    362.4
              variable       --       --       --       --       --       --       326.7    301.2
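The sketch below shows one simple way to realize the proportional processor-to-cluster assignment discussed above, using largest-remainder rounding with a minimum of one processor per cluster; it is an illustrative scheme, assuming P is at least the number of clusters, not ProperDRC's actual remapping code.

```cpp
#include <algorithm>
#include <numeric>
#include <utility>
#include <vector>

// Given the number of geometry features owned by each cluster, distribute the
// P processors so each cluster's share approximates its fraction of the total
// geometry (assumes P >= number of clusters).
std::vector<int> processorsPerCluster(const std::vector<long>& features, int P) {
  long total = std::accumulate(features.begin(), features.end(), 0L);
  int n = static_cast<int>(features.size());
  std::vector<int> alloc(n, 1);                  // every cluster keeps one processor
  int remaining = P - n;
  std::vector<std::pair<double, int>> remainder; // fractional share, cluster id
  for (int i = 0; i < n; ++i) {
    double ideal = static_cast<double>(P) * features[i] / total;
    int extra = std::max(0, static_cast<int>(ideal) - 1);
    extra = std::min(extra, remaining);
    alloc[i] += extra;
    remaining -= extra;
    remainder.push_back({ideal - static_cast<int>(ideal), i});
  }
  // Hand out any processors left over to the clusters with the largest
  // fractional shortfall.
  std::sort(remainder.rbegin(), remainder.rend());
  for (int k = 0; k < remaining; ++k) alloc[remainder[k].second] += 1;
  return alloc;
}
```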
6 Analysis of Approaches
To analyze the performance results of ProperDRC, we will first examine the performance of the serial algorithms used to implement the various DRC operations. We will then examine the performance issues introduced by the parallelization of the DRC process.
the width/spacing clearance checking algorithm presented in this paper. Table 12 provides a summary of the complexities of the various algorithms used to perform the DRC operations. Considering that the overall performance of the DRC is bounded by the performance of the most complex algorithms it uses, the overall complexity of the DRC is O(N log N).

Table 12: Complexities of the DRC operations

Operation            Execution Time
Boolean Operation    O(N log N)
Sort                 O(N log N) or O(log N)
Grow                 O(N log N)
Width                O(N)
Spacing              O(N)
Square Test          O(N)
Overall DRC          O(N log N)
Table 13: Comparison of parallelization methods on the CM-5

                                  Processors
Circuit       Procs per Cluster   16       32       64       128
haab1         1                   100.3    44.7     34.8     29.5
              2                   77.8     56.2     20.7     21.9
haab2         1                   159.8    128.9    59.2     69.3
              2                   325.4    102.5    65.5     48.0

To illustrate the effect of the complexity of the DRC algorithms on the task parallel performance, consider a simplified case in which two processors are to be applied to perform a design rule check on a circuit with 2X geometry features. Assume perfect load balancing, whether data or task parallelism is used. No matter which type of parallelism is used, the combination of the two processors must have the memory capacity to hold the entire circuit. Using the complexity of the slowest algorithms in the DRC procedure, the time necessary to perform the DRC on a circuit of problem size N is O(N log N), and the minimum amount of working space required by the DRC algorithms is O(√N). We use the term working space to distinguish between the amount of storage required by a single processor to perform the various DRC operations and the amount of storage required by the entire set of processors to hold the whole circuit geometry, which is fixed at O(N) for the entire set of processors.

If data parallelism were used to divide the circuit's geometry features equally between the processors, neglecting the overlapping processor areas for the moment, each of the processors would perform a DRC on a subregion of the circuit with X geometry features. The total run time would be O(X log X), because this amount of time is necessary for each processor to perform local design rule checking simultaneously. The demand for working space at each of the processors is O(√X).

In the task parallel implementation, the various DRC tasks would be divided equally between the two processors. Both processors would be working on a problem of size 2X. The time required for the DRC would be O(1/2 · 2X log(2X)) = O(X log(2X)). The working space requirement for each processor would be O(√(2X)). Therefore, the task parallel version of the DRC requires slightly more time and more working space. These penalties are O(constant), but nonetheless indicate that the purely data parallel implementation provides
the better performance when overlapping processor areas are disregarded.

Now, let us take the effect of the overlapping processor areas into consideration. Using two-dimensional partitioning, a square circuit with area A divided between P processors results in a square circuit area measuring √(A/P) units on a side being assigned to each of the processors. Taking the overlap area of c units on every side of the processor area into consideration, the total area assigned to the processor is (√(A/P) + 2c)(√(A/P) + 2c), or A/P + 4c√(A/P) + 4c² units. This area formula can be generalized to A/P + kc√(A/P) + 4c² for decompositions of circuit areas that result in processor areas that are not perfectly square, where k is a constant. Note that the actual area assigned to processors whose regions are on the outside boundaries of the circuit is slightly lower. These slight area discrepancies can be safely ignored, because a smaller fraction of the processors lies on the boundary as the total number of processors increases, and the performance of the algorithm on the circuit as a whole is bounded by the processors with the largest areas. Disregarding these area discrepancies, the total area operated on by the set of P processors is A + kc√(AP) + 4c²P.

Returning to our fictitious example of dividing a circuit with 2X geometry features between a pair of processors, the actual amount of area assigned to each of the two processors is X + kc√(2X) + 8c². The actual amount of time consumed by the slowest of the DRC algorithms is now O((X + kc√(2X) + 8c²) log(X + kc√(2X) + 8c²)), and the working space requirement is O(√(X + kc√(2X) + 8c²)). Note that these penalties are dependent on the number of processors.

In addition to the penalty for the complexity of the DRC algorithms associated with task parallelism, there is also the overhead of interprocessor communication, whereas the purely data parallel decomposition of the problem requires no communication while the DRC checks are being performed, although some communication is necessary during the data partitioning phase to perform the load balancing between processors. The experimental results also show that load balance is more difficult to attain for the task parallel implementation than for the data parallel implementation. Consider that the task parallel DRC on the plapart benchmark on the network of Sun workstations took longer with five processors than with either four or six. The actual distribution of the geometries between the different mask layers is much more critical for the task parallel implementation than for the data parallel version. The particular distribution in the plapart benchmark apparently presented a load balancing problem for the particular order in which the layer operations were divided between five processors.
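Restating the overlap-area bookkeeping above as equations (for the perfectly square case, k = 4):

$$\left(\sqrt{\tfrac{A}{P}}+2c\right)^{2}=\frac{A}{P}+4c\sqrt{\frac{A}{P}}+4c^{2},\qquad P\left(\frac{A}{P}+4c\sqrt{\frac{A}{P}}+4c^{2}\right)=A+4c\sqrt{AP}+4c^{2}P.$$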
7 Conclusion
In this paper we have applied the concept of integrating task and data parallelism to an irregular application, namely VLSI layout verification, in a tool called ProperDRC. ProperDRC is able to exploit multiple levels of parallelism: it can utilize data parallelism, task parallelism, or a simultaneous combination of the two to perform design-rule checking (DRC) operations concurrently on a multiprocessor architecture. Another contribution of the parallel application is that it is portable across a large number of parallel platforms, including shared-memory multiprocessors, message-passing distributed-memory multiprocessors, and networks of workstations.

A number of areas in parallel design rule checking should be explored in the future. Ideally, a DRC tool should be able to exploit the hierarchy of large designs; performing DRC on a flattened layout representation may result in much redundant work if individual cells in the design are instantiated a large number of times, as is often the case with library cell-based designs. ProperDRC should also be expanded to handle non-Manhattan layout geometries; many of the algorithms used in ProperDRC would require some additional work to be capable of operating on non-Manhattan designs. When such increased capabilities are included in ProperDRC, we can perform an effective comparison of the runtimes of ProperDRC against commercial layout verification tools such as DRACULA and VAMPIRE from Cadence Design Systems, and CHECKMATE and PARADE from Mentor Graphics. Conceptually, the approach of combined task and data parallelism should be applicable to any commercial layout verification tool, but the exact nature of the performance gains will depend on the actual implementation. We are in the process of interacting with developers at Cadence to transfer the parallel algorithms in ProperDRC into practice [13].
References
[1] B. Ramkumar and P. Banerjee, "ProperCAD: A portable object-oriented parallel environment for VLSI CAD," IEEE Transactions on Computer-Aided Design, vol. 13, pp. 829-842, July 1994.
[2] S. Parkes, J. A. Chandy, and P. Banerjee, "ProperCAD II: A run-time library for portable, parallel, object-oriented programming with applications to VLSI CAD," Tech. Rep. CRHC-93-22/UILU-ENG-93-2250, Center for Reliable and High-Performance Computing, University of Illinois, Urbana, Illinois, Dec. 1993.
[3] S. Parkes, J. A. Chandy, and P. Banerjee, "A library-based approach to portable, parallel, object-oriented programming: Interface, implementation, and application," in Supercomputing '94, (Washington, DC), pp. 69-78, Nov. 1994.
[4] S. Kim, J. A. Chandy, S. Parkes, B. Ramkumar, and P. Banerjee, "ProperPLACE: A portable parallel algorithm for cell placement," in Proceedings of the International Parallel Processing Symposium, (Cancun, Mexico), pp. 932-941, Apr. 1994.
[5] J. A. Chandy and P. Banerjee, "Parallel simulated annealing strategies for VLSI cell placement," in Proceedings of the International Conference on VLSI Design, (Bangalore, India), Jan. 1996. To appear.
[6] B. Ramkumar and P. Banerjee, "ProperEXT: A portable parallel algorithm for VLSI circuit extraction," in Proceedings of the International Parallel Processing Symposium, (Newport Beach, CA), pp. 434-438, Apr. 1993.
[7] K. De, B. Ramkumar, and P. Banerjee, "ProperSYN: A portable parallel algorithm for logic synthesis," in Digest of Papers, International Conference on Computer-Aided Design, (Santa Clara, CA), pp. 412-416, Nov. 1992.
[8] K. De, J. A. Chandy, S. Roy, S. Parkes, and P. Banerjee, "Portable parallel algorithms for logic synthesis using the MIS approach," in Proceedings of the International Parallel Processing Symposium, (Santa Barbara, CA), pp. 579-585, Apr. 1995.
[9] B. Ramkumar and P. Banerjee, "Portable parallel test generation for sequential circuits," in Digest of Papers, International Conference on Computer-Aided Design, (Santa Clara, CA), pp. 220-223, Nov. 1992.
[10] S. Parkes, P. Banerjee, and J. H. Patel, "ProperHITEC: A portable, parallel, object-oriented approach to sequential test generation," in Proceedings of the Design Automation Conference, (San Diego, CA), pp. 717-721, June 1994.
[11] S. Parkes, P. Banerjee, and J. Patel, "A parallel algorithm for fault simulation based on PROOFS," in Proceedings of the International Conference on Computer Design, (Austin, TX), Oct. 1995. To appear.
[12] S. Kim, LSI Logic Corporation, personal communication, 1995.
[13] E. Petrus, Cadence Design Systems, personal communication, 1996.
[14] K. MacPherson, "Parallel algorithms for layout verification," Master's thesis, University of Illinois at Urbana-Champaign, Aug. 1995.
[15] U. Lauther, "An O(N log N) algorithm for Boolean mask operations," in Proc. 18th Design Automation Conf., pp. 555-562, June 1981.
[16] T. Szymanski and C. J. Van Wyk, "Goalie: A space efficient system for VLSI artwork analysis," IEEE Design & Test of Computers, vol. 2, pp. 64-72, June 1985.
[17] P. Banerjee, Parallel Algorithms for VLSI Computer-Aided Design Applications. Englewood Cliffs, NJ: Prentice Hall, 1994.
[18] G. E. Bier and A. R. Pleszkun, "An algorithm for design rule checking on a multiprocessor," in Proc. Design Automation Conf., pp. 299-303, June 1985.
[19] F. Gregoretti and Z. Segall, "Analysis and evaluation of VLSI design rule checking implementation in a multiprocessor," in Proc. Int. Conf. Parallel Processing, pp. 7-14, Aug. 1984.
[20] J. Marantz, "Exploiting parallelism in VLSI CAD," in Proc. Int. Conf. Computer Design, Oct. 1986.
[21] E. Carlson and R. Rutenbar, "Design and performance evaluation of new massively parallel VLSI mask verification algorithms in JIGSAW," in Proc. 27th Design Automation Conf., pp. 253-259, June 1990.
[22] E. Carlson and R. Rutenbar, "Mask verification on the Connection Machine," in Proc. Design Automation Conf., pp. 134-140, June 1988.
[23] S. H. Bokhari, "Partitioning problems in parallel, pipelined and distributed computing," IEEE Trans. Comput., vol. C-37, pp. 48-57, Jan. 1988.
[24] J. K. Salmon, Parallel Hierarchical N-body Methods. PhD thesis, California Institute of Technology, Dec. 1990.
[25] J. E. Barnes and P. Hut, "A hierarchical O(N log N) force calculation algorithm," Nature, vol. 324, pp. 446-449, 1986.
[26] J. P. Singh et al., "Load balancing and data locality in adaptive hierarchical N-body methods: Barnes-Hut, fast multipole, and radiosity," J. Parallel & Distrib. Comput., vol. 27, pp. 118-141, June 1995.
[27] G. Cybenko, "Dynamic load balancing for distributed memory multiprocessors," J. Parallel & Distrib. Comput., vol. 7, pp. 279-301, July 1989.
[28] K. P. Belkhale and P. Banerjee, "Recursive partitions on multiprocessors," in Proc. 5th Distributed Memory Computing Conf., Apr. 1990.
[29] K. P. Belkhale and P. Banerjee, "Parallel algorithms for VLSI circuit extraction," IEEE Transactions on Computer-Aided Design, vol. 10, pp. 604-618, May 1991.
[30] K. P. Belkhale, Parallel Algorithms for CAD with Applications to Circuit Extraction. PhD thesis, University of Illinois at Urbana-Champaign, Nov. 1990. Tech. Rep. CRHC-90-15/UILU-ENG-90-2251.
[31] C. Mead and L. Conway, Introduction to VLSI Systems. Philippines: Addison-Wesley, 1980.
[32] J.-I. Pi, MOSIS Scalable CMOS Design Rules. Information Sciences Institute, University of Southern California, Marina del Rey, CA.