Die-stacking Architecture
Synthesis Lectures on
Computer Architecture
Editor
Margaret Martonosi, Princeton University
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance, and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.
Die-stacking Architecture
Yuan Xie and Jishen Zhao
2015
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale
Machines, Second edition
Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle
2013
Shared-Memory Synchronization
Michael L. Scott
2013
Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013
Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012
On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009
The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009
Transactional Memory
James R. Larus and Ravi Rajwar
2006
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
Die-stacking Architecture
Yuan Xie and Jishen Zhao
www.morganclaypool.com
DOI 10.2200/S00644ED1V01Y201505CAC031
Lecture #31
Series Editor: Margaret Martonosi, Princeton University
Series ISSN
Print 1935-3235 Electronic 1935-3243
Die-stacking Architecture
Yuan Xie
University of California, Santa Barbara
Jishen Zhao
University of California, Santa Cruz
Morgan & Claypool Publishers
ABSTRACT
The emerging three-dimensional (3D) chip architectures, with their intrinsic capability of reduc-
ing the wire length, promise attractive solutions to reduce the delay of interconnects in future
microprocessors. 3D memory stacking enables much higher memory bandwidth for future chip-
multiprocessor design, mitigating the “memory wall” problem. In addition, heterogeneous integra-
tion enabled by 3D technology can also result in innovative designs for future microprocessors.
This book first provides a brief introduction to this emerging technology, and then presents a
variety of approaches to designing future 3D microprocessor systems, by leveraging the benefits
of low latency, high bandwidth, and heterogeneous integration capability which are offered by
3D technology.
KEYWORDS
emerging technology, die-stacking, 3D integrated circuits, memory architecture,
heterogeneous integration
Contents
Preface
Acknowledgments
1  3D Integration Technology
   1.1  3D Integrated Circuits vs. 3D Packaging
   1.2  Different Process Technologies for 3D ICs
   1.3  The Impact of 3D Technology on 3D Microprocessor Partitioning
2  Benefits of 3D Integration
   2.1  Wire Length Reduction
   2.2  Memory Bandwidth Improvement
   2.3  Heterogeneous Integration
   2.4  Cost-effective Architecture
5  3D GPU Architecture
   5.1  3D-stacked GPU Memory
   5.2  3D-stacked GPU Processor
6  3D Network-on-Chip
   6.1  3D NoC Router Design
   6.2  3D NoC Topology Design
   6.3  3D Optical NoC Design
   6.4  Impact of 3D Technology on NoC Designs
9  Conclusion
Bibliography
Preface
Three-dimensional (3D) integration is an emerging technology in which two or more layers of active
devices (e.g., CMOS transistors) are integrated both vertically and horizontally in a single cir-
cuit. With continuous technology scaling, 3D integration is becoming an increasingly attractive
technology in implementing microprocessor systems by offering much lower power consump-
tion, lower interconnect latency, and higher interconnect bandwidth compared to traditional two-
dimensional (2D) circuit integration.
In particular, 3D integration technologies promise at least four major benefits toward future
microprocessor design.
• Enabling smaller form factor. 3D integration enables a much smaller form factor compared
to traditional 2D integration technologies. Adding a third dimension to the conventional
2D layout yields a higher packing density and a smaller footprint. This potentially leads to
processor designs with lower cost.
Both academia and the semiconductor industry are actively pursuing this technology by de-
veloping efficient architectures in a variety of forms. From the industry perspective, 3D integrated
memory is envisioned to become pervasive in the near future. Intel’s Xeon Phi processors will
ship with 3D integrated DRAM in 2016 [2]. NVIDIA announced that 3D integrated memory
will be adopted in their new GPU products in 2016 [3]. AMD plans to ship high-bandwidth
memory (HBM) with their GPU products and heterogeneous system architecture (HSA)-based
CPUs in 2015 [4]. From the academic perspective, comprehensive studies have been performed
across all aspects of microprocessor architecture design by employing 3D integration technolo-
gies, such as 3D stacked processor core and cache architectures, 3D integrated memory, and 3D
network-on-chip. Furthermore, a large body of research has studied critical issues and opportuni-
ties raised by adopting 3D integration technologies, such as thermal issues which are imposed by
dense integration of active electronic devices, cost issues which are incurred by extra process and
increased die area, and the opportunity in designing cost-effective microprocessor architectures.
This book provides a detailed introduction to architecture design with 3D integration tech-
nologies. The book starts by presenting the background of 3D integration technologies
(Chapter 1), followed by a detailed analysis of the benefits offered by these technologies, includ-
ing low latency, high bandwidth, heterogeneous integration capability, and cost efficiency (Chap-
ter 2). Then, it reviews various approaches to designing future 3D integrated microprocessors
by leveraging the benefits of 3D integration (Chapter 3 through Chapter 6). These approaches
cover all levels of microprocessor systems, including processor cores, caches, main memory, and
on-chip network. Furthermore, this book discusses thermal issues raised by 3D integration and
presents recently proposed thermal-aware architecture designs (Chapter 7). Finally, this book
presents a comprehensive cost model which is built based on detailed cost analysis for fabricating
3D integrated microprocessors (Chapter 8). By utilizing the cost model, the book presents and
compares cost-effective microprocessor design strategies.
While this book mostly focuses on designing high-performance processors, the concepts
and techniques can also be applied to other market segments such as embedded processors and
exascale high-performance computing (HPC) systems.
The target audiences for this book are students, researchers, and engineers in IC design and
computer architecture, who are interested in leveraging the benefits of 3D integration for their
designs and research.
Acknowledgments
Much of the work and ideas presented in this book have evolved over years in working with
our colleagues and graduate students at Pennsylvania State University (in particular Professor
Vijaykrishnan Narayanan, Professor Mary Jane Irwin, Professor Chita Das), and our industry
collaborators including Dr. Gabriel Loh, Dr. Bryan Black, Dr. Norm Jouppi, and Mr. Kerry
Bernstein.
We also thank Prof. Niraj Jha, Prof. Margaret Martonosi, and other reviewers for the com-
ments and feedback to improve the draft.
CHAPTER 1
3D Integration Technology
A 3D integrated circuit (3D IC) has two or more active device layers (i.e., CMOS transistor
layers) integrated vertically as a single chip, using various integration methods. This chapter will
give a brief introduction to different 3D integration technologies, including monolithic 3D ICs
and through-silicon-via (TSV)-based 3D ICs.
(Figure: package-on-package (PoP) memory stacked on a processor.)
• Monolithic approach. This approach involves a sequential device process. The front-end pro-
cessing (to build the device layer) is repeated on a single wafer to build multiple active device
layers before the back-end processing builds interconnects among devices.
Several 3D stacking technologies have been explored recently, including wire-bonded, mi-
crobump, contactless (capacitive or inductive), and through-silicon via (TSV) vertical intercon-
nects. Among all these integration approaches, TSV-based 3D integration has the potential to
offer the greatest vertical interconnect density, and therefore is the most promising one among
all the vertical interconnect technologies. Figure 1.3 shows a conceptual 2-layer 3D integrated
circuit with TSV and microbump.
3D stacking can be implemented using two major techniques [7]: (1) Face-to-Face (F2F)
bonding: two wafers (dies) are stacked so that the very top metal layers are connected. Note that
the die-to-die interconnects in face-to-face wafer bonding do not go through a thick buried
silicon layer and can be fabricated as microbumps. The connections to C4 I/O pads¹ (which are chip
pads used to mount the chip to external circuitry) are formed as TSVs; (2) Face-to-Back (F2B)
bonding: multiple device layers are stacked together with the top metal layer of one die bonded
together with the substrate of the other die, and direct vertical interconnects (which are called
through-silicon vias (TSV)) tunneling through the substrate. In such F2B bonding, TSVs are
used for both between-layer connections and I/O connections. Figure 1.3 shows a conceptual 2-
layer 3D IC with F2F or F2B bonding, with both TSV connections and microbump connections
between layers.
All TSV-based 3D stacking approaches share the following three common process
steps [7]:
• TSV formation. This is the step that fabricates the through-silicon vias on a wafer. It can
be done in three different ways: (1) Via-first: the TSVs are built before any CMOS transistors
are fabricated on the wafer; (2) Via-middle: the TSVs are built after the CMOS transistors
are fabricated but before the metal layers are fabricated; (3) Via-last: the TSVs are built
after both CMOS transistors and metal connections among transistors are fabricated. Via-
middle can only be done in a foundry, while the other two approaches can be done outside
a foundry. Currently there are two types of materials: highly conductive copper TSVs, or
smaller tungsten TSVs.
• Wafer thinning. Wafer thinning is used to reduce the overheads of TSVs. Since TSVs need
to maintain a certain aspect ratio for reliability/manufacturability purposes, it is a critical step
to thin the wafer so that small and short TSVs can be built between layers. The thinner
the wafer, the smaller (and shorter) the TSV (under the same aspect ratio constraint) [7].
The wafer thickness can be in the range of 10 µm to 100 µm, and the TSV size is in the
range of 0.2 µm to 10 µm [8] (see the sketch after this list).
• Aligned wafer and die bonding, which could be wafer-to-wafer (W2W) bonding or die-to-
wafer (D2W) bonding.
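As a rough illustration of the aspect-ratio constraint mentioned above, the following minimal Python sketch estimates the smallest TSV diameter a thinned wafer allows. The 10:1 aspect-ratio limit is an assumed, illustrative value, not a process parameter from the text.

    def min_tsv_diameter_um(wafer_thickness_um, max_aspect_ratio=10.0):
        """Smallest TSV diameter that still satisfies the aspect-ratio limit.

        Aspect ratio here is TSV depth (equal to the thinned wafer thickness)
        divided by TSV diameter; the 10:1 limit is an assumed example."""
        return wafer_thickness_um / max_aspect_ratio

    # Thinning the wafer from 100 um to 10 um shrinks the minimum TSV
    # diameter proportionally (10 um -> 1 um under a 10:1 aspect ratio).
    for t in (100.0, 50.0, 10.0):
        print(t, "um wafer ->", min_tsv_diameter_um(t), "um TSV")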
In TSV-based 3D stacking, the dimension of the TSVs is not expected to scale
at the same rate as feature size because alignment tolerance during bonding poses limitations on
the scaling of the vias. The TSV size, length, and pitch density, as well as the bonding method
(face-to-face or face-to-back bonding), can have a significant impact on the 3D IC design. For
example, a relatively large TSV size can hinder partitioning a design at very fine granularity
across multiple device layers, making true 3D component design less feasible. On the
other hand, monolithic 3D integration provides more flexibility in vertical 3D connection because
the vertical 3D via can potentially scale down with feature size due to the use of local wires for
connections. Availability of such technologies makes it possible to partition the design at a very
fine granularity. Furthermore, face-to-face bonding or SOI-based 3D integration may have a
smaller via pitch size and higher via density than face-to-back bonding or bulk-CMOS-based
integration. Such influence of the 3D technology parameters on the microprocessor design must
be thoroughly studied before an appropriate partition strategy is adopted.
With TSV-based 3D stacking, the partitioning strategy is determined by the TSV pitch
size and the via diameter. As shown in Figure 1.3, the TSV goes through the substrate and incurs
area overhead; therefore, the larger the via diameter, the higher the area overhead. For example,
Figure 1.4 shows the number of connections for different partitioning granularities, and Figure 1.5
shows the area overhead for different 3D via diameters.² They show that fine-granularity
partitioning requires a large number of connections, and with a relatively large via diameter, the
area overhead would be very high for fine-granularity partitioning. Consequently, for most existing
3D process technologies, with via diameters usually larger than 1 µm, it makes more sense to perform
the 3D partitioning at the unit level or core level, rather than at the gate level, which can result in a
large area overhead.

    Via diameter (µm):            0.5    1     5     10    20
    Area overhead (macro level):  0.12   0.5   12    50    200

Figure 1.5: Area overhead (in relative ratio, via diameter in units of µm) at different partitioning
granularity [9].
²These two tables are based on IBM 65nm technology for high-performance microprocessor design.
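The granularity trend above can be approximated with a back-of-the-envelope estimate: the total TSV area is roughly the number of vertical connections times the per-via footprint. The Python sketch below is only illustrative; the connection counts, via-pitch factor, and block area are assumed values, not data from [9].

    def tsv_area_overhead(num_connections, via_diameter_um, block_area_mm2,
                          pitch_factor=2.0):
        """Relative area overhead of TSVs for one partitioned block.

        pitch_factor models keep-out/landing-pad spacing around each via
        (an assumed value); the overhead is TSV area divided by block area."""
        via_pitch_um = pitch_factor * via_diameter_um
        tsv_area_um2 = num_connections * via_pitch_um ** 2
        block_area_um2 = block_area_mm2 * 1e6
        return tsv_area_um2 / block_area_um2

    # Finer-granularity partitioning needs far more vertical connections,
    # so the same via diameter costs much more relative area.
    for name, conns in (("core-level", 3_000), ("unit-level", 30_000),
                        ("gate-level", 300_000)):
        print(name, round(tsv_area_overhead(conns, 5.0, 10.0), 3))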
CHAPTER 2
Benefits of 3D Integration
The following subsections will discuss various architecture design approaches that leverage dif-
ferent benefits that 3D integration technology can offer, namely, wire length reduction, high
memory bandwidth, heterogeneous integration, and cost reduction. It will also briefly review 3D
network-on-chip architecture designs.
Latency Improvement. Latency improvement can be achieved due to the reduction of average
interconnect length and the critical path length.
Early work on fine-granularity 3D partitioning of processor components shows that the la-
tency of a 3D component can be reduced. For example, interconnects dominate the delay
of cache accesses, which often determine the critical path of a microprocessor, and the regular struc-
ture and long wires of a cache make it one of the best candidates for 3D design; 3D cache design
is therefore one of the early examples of fine-granularity 3D partitioning [5]. Wordline partitioning
and bitline partitioning approaches divide a cache bank into multiple layers and reduce the global
interconnects, resulting in a fast cache access time. Depending on the design constraints, the
3DCacti tool [11] automatically explores the design space for a cache design, and identifies the
optimal partitioning strategy; the latency reduction can be up to 25% for a two-layer 3D cache.
3D arithmetic-component designs also show latency benefits. For example, various designs [12–
15] have shown that the 3D arithmetic unit design can achieve around 6%–30% delay reduction
due to wire length reduction. Such fine-granularity 3D partitioning was also demonstrated by In-
tel [16], showing that by targeting the heavily pipelined wires, the pipeline modifications resulted
in approximately 15% improved performance when the Intel Pentium 4 processor was folded
onto a 2-layer 3D implementation.
Note that such fine-granularity design of 3D processor components increases the design
complexity, and the latency improvement varies depending on the partitioning strategies and
the underlying 3D process technologies. For example, for the same Kogge-Stone adder design,
a partitioning based on logic level [12] demonstrates that the delay improvement diminishes
as the number of 3D layers increases; a bit-slicing partitioning [14] strategy would have better
scalability as the bit-width or the number of layers increases. Furthermore, the delay improvement
for such bit-slicing 3D arithmetic units is about 6% when using a bulk-CMOS-based 180nm 3D
process [15], while the improvement could be as much as 20% when using an SOI-based 180nm
3D process technology [14], because the SOI-based process has much smaller and shorter TSVs
(and therefore much smaller TSV delay) compared to the bulk-CMOS-based process.
Power Reduction. Interconnect power consumption becomes a large portion of the total power
consumption as technology scales. The reduction in wire length translates into power sav-
ings in 3D IC design. For example, power reductions of 7% to 46% for 3D arithmetic units were
demonstrated in [14]. In the 3D Intel Pentium-4 implementation [16], because of the reduction
in long global interconnects, the number of repeaters and repeating latches in the implementa-
tion is reduced by 50%, and the 3D clock network has 50% less metal RC than the 2D design,
resulting in better skew and jitter, and lower power. Such a 3D stacked redesign of the Intel Pentium 4
processor improves performance by 15% and reduces power by 15%, with a temperature increase
of 14 degrees. After using voltage scaling to lower the peak temperature to be the same as that of the
baseline 2D design, the 3D Pentium 4 processor still showed a performance improvement of
8%.
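A first-order way to see why wire-length reduction helps both delay and power is that an RC wire's delay and switching energy both scale with its length. The Python sketch below applies that scaling to a hypothetical global wire folded across layers; all electrical constants are assumed, illustrative values rather than figures from the studies above.

    def folded_wire_metrics(length_mm, n_layers, cap_per_mm_ff=200.0,
                            res_per_mm_ohm=100.0, vdd=1.0, activity=0.1,
                            freq_ghz=2.0):
        """Estimate RC delay and dynamic power of a global wire after
        folding it across n_layers (length roughly divides by n_layers).

        All electrical constants here are assumed, illustrative values."""
        l = length_mm / n_layers
        r = res_per_mm_ohm * l
        c = cap_per_mm_ff * l * 1e-15
        delay_ps = 0.5 * r * c * 1e12             # Elmore-style RC estimate
        power_uw = activity * c * vdd ** 2 * freq_ghz * 1e9 * 1e6
        return delay_ps, power_uw

    for layers in (1, 2, 4):
        d, p = folded_wire_metrics(2.0, layers)
        print(f"{layers} layer(s): delay ~{d:.1f} ps, dynamic power ~{p:.2f} uW")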
sign, by providing improved memory bandwidth for such multi-core/many-core microprocessors.
In addition, such approaches of memory stacking on top of core layers do not have the design
complexity problem as demonstrated by the fine-granularity design approaches, which require
re-designing all processor components for wire length reduction (as discussed in Sec. 2.1).
Intel [16] explored the memory bandwidth benefits using a baseline Intel Core2 Duo
processor, which contains two cores. By having memory stacking, the on-die cache capacity is
increased, and the performance is improved by capturing larger working sets, reducing off-chip
memory bandwidth requirements. For example, one option is to stack an additional 8MB L2
cache on top of the baseline 2D processor (which contains a 4MB L2 cache), and the other op-
tion is to replace the SRAM L2 cache with a denser stacked DRAM L2 cache. Their study
demonstrated that a 32MB 3D stacked DRAM cache can reduce the cycles per memory access
by 13% on average and as much as 55% with negligible temperature increases.
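The kind of benefit reported above can be reasoned about with a simple average-access-time model: a stacked DRAM cache converts some slow off-chip accesses into on-stack hits. The sketch below uses hypothetical hit rates and latencies, not the numbers from Intel's study.

    def cycles_per_memory_access(l2_hit_rate, dram_cache_hit_rate,
                                 l2_cycles=20, dram_cache_cycles=60,
                                 off_chip_cycles=300):
        """Average cycles per L1-miss memory access with an optional
        stacked DRAM cache behind the L2 (all numbers are illustrative)."""
        miss_l2 = 1.0 - l2_hit_rate
        return (l2_hit_rate * l2_cycles
                + miss_l2 * dram_cache_hit_rate * dram_cache_cycles
                + miss_l2 * (1.0 - dram_cache_hit_rate) * off_chip_cycles)

    baseline = cycles_per_memory_access(0.6, 0.0)   # no stacked DRAM cache
    stacked = cycles_per_memory_access(0.6, 0.7)    # stacked DRAM cache added
    print("baseline:", baseline, "stacked:", stacked,
          "reduction: %.0f%%" % (100 * (1 - stacked / baseline)))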
The PicoServer project [20] follows a similar approach to stack DRAM on top of multi-core
processors. Instead of using stacked memory as a larger L2 cache (as shown by Intel’s work [16]),
the fast on-chip 3D stacked DRAM main memory enables wide low-latency buses to the proces-
sor cores and eliminates the need for an L2 cache, whose silicon area is allocated to accommodate
more cores. Increasing the number of cores by removing the L2 cache can help improve the com-
putation throughput, while each core can run at a much lower frequency, resulting
in an energy-efficient many-core design. For example, the PicoServer design can achieve a 14%
performance improvement and 55% power reduction over a baseline multi-core architecture.
As the number of cores on a single die increases, such memory stacking becomes more
important to provide enough memory bandwidth for processor cores. Recently, Intel [21] demon-
strated an 80-tile terascale chip with network-on-chip. Each core has a local 256KB SRAM
memory (for data and instruction storage) stacked on top of it. TSVs provide a bandwidth of
12GB/second for each core, with a total of about 1TB/second bandwidth for teraflop computa-
tion. In this chip, the thin memory die is placed on top of the CPU die, and the power and I/O
signals go through the memory die to the CPU.
Since DRAM is stacked on top of the processor cores, the memory organization should
also be optimized to take full advantage of the benefits that TSVs offer [17, 22]. For example,
the numbers of ranks and memory controllers are increased, in order to leverage the memory
bandwidth benefits. A multiple-entry row buffer cache is implemented to further improve the
performance of the 3D main memory. Comprehensive evaluation shows that a 1.75x speedup
over commodity DRAM organization is achieved [17]. In addition, MSHR design was modified
to provide a scalable L2 miss handling before accessing the 3D stacked main memory. A data
structure called the Vector Bloom Filter with dynamic MSHR capacity tuning is proposed. Such
structure provides an additional 17.8% performance improvement. If stacked DRAM is used as
the last-level cache (LLC) in chip multiprocessors (CMPs), the DRAM cache sets are or-
ganized into multiple queues [22]. A replacement policy is proposed for the queue-based cache
to provide performance isolation between cores and reduce the lifetimes of dead cache lines. Ap-
proaches are also proposed to dynamically adapt the queue size and the policy of advancing data
between queues.
The latency improvement due to 3D technology can also be demonstrated by such mem-
ory stacking design. For example, Li et al. [23] proposed a 3D chip multiprocessor design using
network-in-memory topology. In this design, instead of partitioning each processor core or mem-
ory bank into multiple layers (as shown in [5, 11]), each core or cache bank remains a 2D
design. Communication among cores and cache banks is via the network-on-chip (NoC) topol-
ogy. The core layer and the L2 cache layer are connected with a TSV-based bus. Because of the
short distance between layers, TSVs provide fast access from one layer to another and effectively
reduce the cache access time.
cache on top of a multi-core processor can improve performance by 4.91% and reduce power by
73.5% compared to a conventional SRAM L2 cache with a similar area.
Figure 2.1: An illustration of 3D heterogeneous architecture with non-volatile memory stacking and
optical die stacking.
Optical Device Layer Stacking. Even though 3D memory stacking can help mitigate the mem-
ory bandwidth problem, when it comes to off-chip communication, the pin limitations, the en-
ergy cost of electrical signaling, and the non-scalability of chip-length global wires are still sig-
nificant bandwidth impediments. Recent developments in silicon nanophotonic technology have
the potential to meet the off-chip communication bandwidth requirements at acceptable power
levels. With the heterogeneous integration capability that 3D technology offers, one can inte-
grate optical dies together with CMOS processor dies. For example, HP Labs proposed the Corona
architecture [27], a 3D many-core architecture that uses nanophotonic communication
for both inter-core communication and off-stack communication to memory or I/O devices. A
photonic crossbar fully interconnects its 256 low-power multithreaded cores at 20 terabytes per
second of bandwidth, with much lower power consumption.
Figure 2.1 illustrates such a 3D heterogeneous processor architecture, which integrates non-
volatile memories and optical die together through 3D integration technology.
CHAPTER 3
Fine-granularity 3D Processor
Design
As 3D integration technology emerges, 3D stacking provides great opportunities for improve-
ments in the microarchitecture. In this chapter, we introduce some recent 3D research at the
architecture level. These techniques leverage the advantages of 3D to help improve perfor-
mance, reduce power consumption, and so on.
As 3D integration can reduce the wire length, it is straightforward to partition the struc-
tures of a planar processor and stack them to improve performance. There are two different
methods: (1) coarse-granularity stacking, also known as the “memory+logic” strategy, in which some
on-chip memories are separated from and stacked on the part containing logic components [29–
33], and (2) fine-granularity stacking, in which various functional units of the processor are sep-
arated and stacked, and some components are internally partitioned and implemented with 3D
integration [12, 34–37].
3D stacking enables a denser form factor and more cost-efficient integration of heteroge-
neous process technologies. It is possible to stack DRAM or other emerging memories on-chip.
Consequently, the advantages of different memory technologies are leveraged. For example, more
data can be stored on-chip and the static power consumption can be reduced. At the same time,
the high bandwidth provided by 3D integration can be exploited by the large-capacity on-chip
memory to further improve the performance [25, 30–33, 38].
In this chapter, we first describe how to partition and stack the on-chip memory (SRAM
arrays), as they have regular structures. Then, we discuss the benefits and issues of partitioning
the logic components.
the sense amplifiers can either be duplicated across different device layers or shared among the
partitioned sub-arrays in different layers. The former approach is more suitable for reducing ac-
cess time, while the latter is preferred for reducing the number of transistors and leakage. In the latter
approach, the sharing increases the complexity of bitline multiplexing and reduces performance
compared to the former. Similar to 3DWL, the lengths of the global lines are reduced in this
scheme.
Figure 3.1: Cache with 3D divided wordline partitioning and mapped into two active device layers [40].
Figure 3.2: Cache with 3D divided bitline partitioning and mapped into two active device layers [5].
Figure 3.3 shows an example of how the configuration parameters used in 3DCacti affect
the cache structure. The cell-level partitioning approach (using MLBS) is implicitly simulated
using a different cell width and height within Cacti. A toy enumeration over these parameters is
sketched after Table 3.1.
Table 3.1: Design parameters for 3DCacti and their impact on cache design

Ndwl — the number of cuts on a cache to divide bitlines. Effects: (1) the bitline length in each
sub-array; (2) the number of sense amplifiers; (3) the size of the wordline driver; (4) the decoder
complexity; (5) the multiplexor complexity in the data output path.

Ndbl — the number of cuts on a cache to divide wordlines. Effects: (1) the wordline length in each
sub-array; (2) the number of wordline drivers; (3) the decoder complexity.

Nspd — the number of sets connected to a wordline. Effects: (1) the wordline length in each
sub-array; (2) the size of the wordline drivers; (3) the multiplexor complexity in the data output path.

Nx — the number of 3D partitions created by dividing wordlines. Effects: (1) the wordline length
in each sub-array; (2) the size of the wordline driver.

Ny — the number of 3D partitions created by dividing bitlines. Effects: (1) the bitline length in
each sub-array; (2) the multiplexor complexity in the data output path.
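The parameters above define a search space that a tool such as 3DCacti explores automatically. The sketch below is not 3DCacti itself; it is only an illustration of such an enumeration, using a toy wire-length-based cost and assumed parameter ranges.

    from itertools import product

    CACHE_BITS = 1 * 1024 * 1024 * 8          # a 1MB cache, in bit cells
    MAX_LAYERS = 4

    def toy_delay(ndwl, ndbl, nspd, nx, ny):
        """Illustrative cost: the longer of the per-sub-array wordline and
        bitline (in bit-cell pitches) plus a small TSV penalty per extra
        layer; the constants are assumptions, not 3DCacti's actual model."""
        side = CACHE_BITS ** 0.5
        wordline = side * nspd / (ndwl * nx)
        bitline = side / (ndbl * nspd * ny)
        return max(wordline, bitline) + 5.0 * (nx * ny - 1)

    space = [cfg for cfg in product((1, 2, 4, 8, 16, 32), (1, 2, 4),
                                    (1, 2, 4, 8, 16), (1, 2, 4), (1, 2, 4))
             if cfg[3] * cfg[4] <= MAX_LAYERS]
    best = min(space, key=lambda cfg: toy_delay(*cfg))
    print("best (Ndwl, Ndbl, Nspd, Nx, Ny):", best,
          "toy delay:", round(toy_delay(*best), 1))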
Figure 3.3: An example showing how each configuration parameter affects a cache structure. Each
box is a sub-array associated with an independent decoder [40].
We observe that delay reduces as the number of layers increases. From Fig. 3.6, we observe that the
reduction in global wiring length of the decoder is the main reason for the delay reduction benefit.
We also observe that for the 2-layer case, the partitioning of a single cell using MLBS provides
delay reduction benefits similar to the best intra-subarray partitioning technique as compared to
the 2D design.
Figure 3.4: Access time for different partitionings [40]. Data for caches of associativity 4 are shown.
(Plot: delay in ns versus the 3D partitioning (Nx, Ny) — 1x1 (2D), MLBS, and 2-, 4-, and 8-layer
configurations — for 64KB to 1MB caches; markers denote the optimal (Ndwl, Ndbl, Nspd) values.)
Another general trend observed for all cache sizes is that partitioning more aggressively
using 3DWL results in a faster access time. For example, in the 4-layer case, the 4x1 configuration
(folding the wordline into four layers) has an access time that is 16.3% less than that of the 1x4
configuration (folding the bitline into four layers) for a 1MB cache.

Figure 3.5: Energy for different partitionings when setting the weightage of delay higher [40]. Data
for caches of associativity 4 are shown.

We observed that the benefits
from more aggressive 3DWL stem from the longer length of the global wires in the X direction
as compared to the Y direction before 3D partitioning is performed. The preference for shorter
bitlines for delay minimization in each of the sub-arrays, and the resulting wider sub-arrays in the
optimal 2D configuration, is the reason for the difference in wire lengths along the two directions.
For example, in Fig. 3.8(a), the best sub-array configuration for the 1MB cache in 2D design
results in the global wire length in the X direction being longer. Consequently, when wordlines
are divided along the third dimension, more significant reduction in critical global wiring lengths
can be achieved. Note that because 3DCacti is exploring partitioning across the dimensions si-
multaneously, some configurations can result in 2D configurations that have wire lengths greater
in the Y direction (see Fig. 3.8(c)), as in the 1MB cache 1x2 configuration for two layers. The
3DBL helps in reducing the global wire delays by reducing the Y-direction length. How-
ever, it is still not as effective as the corresponding 2x1 configuration, as both the bitline delays in
the core and the routing delays are larger (see Fig. 3.6 and Fig. 3.7). These trends are difficult to
analyze without the help of a tool that partitions across multiple dimensions simultaneously. The en-
ergy reduction for the corresponding best delay configurations tracks the delay reduction in many
cases. For example, the energy of the 1MB cache increases when moving from an 8x1 configuration
to a 1x8 configuration.

Figure 3.6: Access time breakdown (predecoder driver, decoder, wordline driver, wordline charge,
bitline, sense amplifier, and output) of a 1MB cache, corresponding to the results shown in
Fig. 3.4 [40].

Figure 3.7: Energy breakdown of a 1MB cache, corresponding to the results shown in Fig. 3.5 [40].

In these cases, the capacitive loading that affects delay also determines
the energy trends. In some cases, the energy reduces significantly when changing configurations
and does not track performance behavior. For example, for the 512KB cache using 8 layers, the
energy reduces when moving from the 2x4 to the 1x8 configuration. This stems from the difference in
the number of sense amplifiers activated in these configurations, due to the different number of
bitlines in each sub-array in the different configurations and the presence of the column decoders
after the sense amplifiers.

Figure 3.8: Critical paths in 3DWL and 3DBL for a 1MB cache [40]. Dashed lines represent the
routing of address bits from the pre-decoder to the local decoder, while the solid arrow lines are the
routing paths from the address inputs to the predecoders.

Specifically, the optimum (Ndwl, Ndbl, Nspd) for the 512KB case is
(32,1,16) for the 2x4 case and (32,1,8) for the 1x8 configuration. Consequently, the number of
sense amplifiers activated per access for the 1x8 configuration is only half as much as that of the 2x4
configuration, resulting in a smaller energy.
Puttaswamy et al. provided a good study of 3D integrated SRAM components for high-
performance microprocessors [34]. In this paper, they explored various design options of 3D
integrated SRAM arrays with functional block partitioning. They studied two different types of
SRAM arrays, which are banked SRAM arrays (e.g., caches) and multiported SRAM arrays (e.g.,
register files).
For the banked SRAM arrays, the paper discussed methods of 3D bank stacking and 3D array
splitting. The bank stacking is straightforward, as the SRAM arrays have already been parti-
tioned into banks in the 2D case. The decision for a horizontal or vertical split largely depends on
which dimension has more wire delay. The array splitting method may help to reduce the lengths
of either the wordlines or the bitlines. When the wordline is split, die-to-die vias are required because
the row decoder needs to drive the wordlines on both of the dies, and the column select multi-
plexers are split across the two dies. For the row-stacked SRAM, a row decoder needs to be
partitioned across the two dies. If the peripheral circuitry, such as the sense amplifiers, is shared between
the two dies, extra multiplexers may be required. In their work, latency and energy results were
evaluated for caches from 16KB to 4MB. The results show that array-split configurations provide
additional latency reductions as compared to the bank-stacked configurations. The 3D organiza-
tion provides more benefits to the larger arrays because these structures have substantially longer
global wires to route signals between the array edge and the farthest bank. For large-sized (2-4
MB) caches, the 3D organization exhibits the greatest reduction in global bank-level routing la-
tency, since it is the dominant component of latency. On the other hand, moderate-sized (64-512
KB) cache delays are not dominated by the global routing. In these instances, a 3D organization
that targets the intra-SRAM delays provides more benefit. Similar to the latency reduction, the
3D organization reduces the energy in the most wire-dominated portions of the arrays, namely
the bank-level routing and the bitlines in different configurations [34].
For the multiported SRAM, Puttaswamy et al. took the register file (RF) as an example to
show different 3D stacking strategies. First, an RF can be split so that half of the entries are stacked on
the other half; this is called register partitioning (RP). The bitlines and row decoder are halved so
that the access latency is reduced. Second, a bit-partitioned (BP) 3D register file stacks the higher-or-
der and lower-order bits of the same register across different dies. Such a strategy reduces the load
on the wordline. Third, the large footprint of a multiported SRAM cell provides the opportunity to
allocate one or two die-to-die vias for each cell. Consequently, it is feasible that each die contains
bitlines, wordlines, and access transistors for half of the ports. This strategy is called port split-
ting (PS), which provides benefits in reducing the area footprint. Figure 3.9 gives an illustration of the
3D structure of the two-port cell. The area reduction translates into latency and energy savings.
The register files with sizes ranging from 16 up to 192 entries were simulated in the work. The
results show that, for 2-die stacks, the BP design provides the largest latency reduction benefit
when compared to the corresponding RP and PS designs. The wordline is heavily loaded by the
two access transistors per bitline column. Splitting the wordline across two dies reduces a major
component of the wire latency. As the number of entries increases, benefits of latency reduction
from BP designs decrease because the height of the overall structure increases, which makes the
row decoder and bitline/sense-amplifier delay increasingly critical. The benefits of using the other
two strategies, however, increase for the same reason. The 3D configuration that minimizes en-
ergy consumption is not necessarily the configuration that has the lowest latency. For a smaller
number of entries, BP organization requires the least energy. As the number of entries increases
above 64, RP organization provides the most benefits by effectively reducing bitline length [34].
When there are more than two stacked layers, the situation is more complicated because of the many
stacking options. More details can be found in the paper.
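The three strategies can be summarized by how they map a register-file cell and port onto a die. The following minimal Python sketch assumes the two-layer case discussed above; the 64-entry, 64-bit, two-port sizes and the function name are illustrative, not from [34].

    NUM_ENTRIES, NUM_BITS, NUM_PORTS, NUM_LAYERS = 64, 64, 2, 2

    def layer_for(entry, bit, port, strategy):
        """Return the die (0 or 1) holding a given register-file cell/port.

        RP: register partitioning -- split the entries across dies.
        BP: bit partitioning      -- split high/low-order bits across dies.
        PS: port splitting        -- give each die the wiring of one port."""
        if strategy == "RP":
            return entry // (NUM_ENTRIES // NUM_LAYERS)
        if strategy == "BP":
            return bit // (NUM_BITS // NUM_LAYERS)
        if strategy == "PS":
            return port % NUM_LAYERS
        raise ValueError(strategy)

    # Example: entry 40, bit 5, port 1 lands on different dies depending
    # on the chosen partitioning strategy (RP -> die 1, BP -> die 0, PS -> die 1).
    for s in ("RP", "BP", "PS"):
        print(s, layer_for(40, 5, 1, s))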
Figure 3.10: (a) Planar floorplan of a deeply pipelined microprocessor with the register-read-to-
execute paths shown; (b) 3D floorplan of the planar microprocessor in (a) [35].
power, the power per instruction is reduced because of the elimination of pipeline stages. The
temperature increase, however, is a serious problem because of the doubling of power density in
3D stacked logic. The “worst case” shows a 26-degree increase if there were no power savings from
the 3D floorplan and the stacking were to result in a 2x power density. With a simple iterative process
of placing blocks, the observed result is a 1.3x power density increase and a 14-degree temperature
increase [35].
Balaji et al. also explored design methodologies for processor components in 3D technol-
ogy, in order to maximize the performance and power efficiency of these components [12]. They
proposed systematic partitioning strategies for custom design of several example components: a)
instruction scheduler, b) Kogge-Stone adder, and c) logarithmic shifter. In addition, they also de-
veloped a 3D ASIC design flow leveraging both widely used 2D CAD tools (Synopsys Design
Compiler, Cadence Silicon Ensemble, and Synopsys Prime Time) and emerging 3D CAD tools
(MIT PR3D and in-house netlist extraction and timing tool).
The custom designs are implemented with MIT 3D Magic—a layout tool customized for
3D designs—together with in-house netlist and timing extraction scripts. For the instruction
scheduler, they found that the tag drive latency is a major component of the overall latency and
sensitive to wire delay. Thus they proposed to either horizontally or vertically partition the tag
line to reduce the tag drive latency. From the experimental results, horizontal partitioning achieves
significant latency reduction (44% when moving from 2D to 2-tier 3D implementations) while
vertical partitioning only achieves marginal improvement (4%). For both the KS adder and the
logarithmic shifter, the intensive wiring in the design becomes the limiting factor in performance
and power in advanced technology nodes. In the 3D implementation, the KS adder is partitioned
horizontally and the logarithmic shifter is partitioned vertically. Significant latency reductions
were observed: 20.23%–32.7% for the KS adder and 13.4% for the logarithmic shifter as the
number of tiers varies. In addition to the custom design, they also experimented with the proposed
ASIC design flow with a range of arithmetic units. e implementation results show that signifi-
cant latency reductions (9.6%–42.3%) are achieved for the 3D designs generated by the proposed
flow. Last but not least, it is observed that moving from one tier to two tiers produces the most sig-
nificant performance improvements, while this improvement saturates when more tiers are used.
The architectural performance impact of the 3D components is evaluated by simulating a 3D pro-
cessor running SPEC2000 benchmarks. The data path width of the processor is scaled up, since
the 3D components have much lower latency than 2D components with the same widths. Therefore,
simulation results show an average IPC speedup of 11% for the 3D processor.
So far, the partitionings of the components in the 3D designs are all pre-specified, and most par-
titions are limited to two layers. In order to further explore the 3D design space, Ma et al.
proposed a microarchitecture-physical codesign framework to handle fine-grain 3D design [42].
First, the components in the design are modeled in multiple silicon layers. The effects of different
partitioning approaches are analyzed with respect to area, timing, and power. All these approaches
are considered as the potential design strategies. Note that the number of layers that a compo-
nent can be partitioned into is not fixed; in particular, the single-layer design of a component is
also included in the total design space. In addition, the authors analyzed the impact of scaling the
sizes of different architectural structures. Given these design alternatives for the components,
an architectural building engine is used to choose optimized configurations from a wide
range of implementations. Some heuristic methods are proposed to speed up the convergence
and reduce redundant searches on infeasible solutions. At the same time, a thermal-aware pack-
ing engine with a temperature simulation tool is employed to optimize the thermal characteristics of
the entire design so that the hotspot temperatures are below the given thermal thresholds. With
these engines, the framework takes the frequency target, architectural netlist, and a pool of alter-
native block implementations as the inputs and finds the optimized design solution in terms of
performance, temperature, or both. In the experiments, the authors used a superscalar processor
as an example of design space exploration using the framework. Critical components such as
the issue queue and caches could be partitioned into two to four layers. The partitioning methods
included block folding and port partitioning, as introduced before. The experimental results show
a 36% performance improvement over the traditional 2D design and 14% over a 3D design with
single-layer unit implementations.
Figure 3.11: The conceptual view of a 3D stacked register file: (a) data bits on the lower three layers
are all zeros; (b) data in all layers are non-zeros [36].

We have discussed that 3D stacking aggravates thermal issues and that thermal-aware
placement can help alleviate the problem [35]. There are some architecture-level techniques,
which can be used to control the thermal hotspots. Puttaswamy et al. proposed thermal herding
techniques for the fine-grain partitioned 3D microarchitecture. For different components in the
processor, various techniques are introduced to reduce 3D power density and locate a majority of
the power on the die (layer) closest to the heat sink.
Assume there are four 3D stacked layers and the processor is 64-bit based. Some 64-bit
based components (e.g., register files, arithmetic units) are equally partitioned, and 16 bits are placed
in each layer, as shown in Fig. 3.11. Such a partition not only reduces the access latency, but also
provides the opportunities for thermal herding. For example, in Fig. 3.11(a) the most significant
48 bits (MSBs) of the data are zero. We don’t have to load/store these zero bits into register files, or
process these bits in the arithmetic unit, such as the adder. If we only process the least significant
16 bits (LSBs), the power is reduced. In addition, if the LSBs are located in the layer next to the
heat sink, the temperature is also decreased. For the data shown in Fig. 3.11(b), however, we have
to process all bits since they are non-zero in all four layers.
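A minimal Python sketch of the width decision described above, assuming the 4-layer, 16-bits-per-layer organization of Fig. 3.11: only the layers that actually hold non-zero bits (plus the layer next to the heat sink) need to be activated. The hex values in the example are illustrative.

    LAYER_WIDTH = 16          # bits per device layer
    NUM_LAYERS = 4            # 64-bit datapath folded across four layers

    def active_layers(value):
        """Return which layers must be activated to process `value`.

        Layer 0 (next to the heat sink) holds the least significant 16 bits
        and is always active; a higher layer is active only if its 16-bit
        slice of the operand is non-zero."""
        layers = [0]
        for i in range(1, NUM_LAYERS):
            if (value >> (i * LAYER_WIDTH)) & 0xFFFF:
                layers.append(i)
        return layers

    # A value with zero upper 48 bits keeps only the heat-sink layer busy;
    # a value with non-zero bits in every 16-bit slice activates all four.
    print(active_layers(0x531F))
    print(active_layers(0x0215_2711_15A2_3F10))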
For some other components, such as the instruction scheduler and data caches, entries are par-
titioned and placed in different layers. The accesses to these components are controlled so that
the entries located in the layer next to the heat sink are accessed more frequently.
Consequently, the energy is herded toward the heat sink and the temperature is reduced. Extra
hardware, however, is required to implement these techniques. For example, an extra bit is introduced
in the register file to indicate whether the MSBs are zero. This bit is propagated to other com-
ponents for further control. Compared to a conventional planar processor, the 3D processor
achieves a 47.9% frequency increase, which results in a 47.0% performance improvement, while
simultaneously reducing total power by 20%. Without Thermal Herding techniques, the worst-
case 3D temperature increases by 17 degrees. With Thermal Herding techniques, the temperature
increase is only 12 degrees [36].
CHAPTER 4
Coarse-granularity 3D
Processor Design
In this chapter, we will focus on the “memory+logic” strategy in multi-core processors. Memory
of various technologies can be stacked on top of cores as caches or even as on-chip main memories.
Different from the research in the previous chapter, which focuses on fine-granularity optimizations
(e.g., wire length reduction), the approaches in this chapter consider the memories as
a whole structure and explore high-level improvements, such as access interfaces, replacement
policies, etc.
Figure 4.1: Memory stacked options: (a) 4MB baseline; (b) 8MB stacked for a total of 12MB; (c)
32MB of stacked DRAM with no SRAM; (d) 64MB of stacked DRAM [35].
Figure 4.2: Structure of the multi-queue cache management scheme for (a) a single core and (b)
multiple cores [30].
as shown in Fig. 4.2(b). The data loaded into the LLC is first inserted into these queues before being
placed into the cache ways following “LRU” policies. A “u-bit” is employed to represent whether
the cache line is re-used (hit) during its stay in these queues. When cache lines are evicted from the
queues, they are moved to the LRU-based cache ways only if they have been re-used. Otherwise,
these cache lines are evicted from the queues directly. Under some cache behaviors, these queues can
effectively prevent useful data from being evicted by newly inserted data that are not reused.
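A minimal Python sketch of that insertion path, assuming a FIFO queue placed in front of a small LRU set; the structure, names, and parameter values are illustrative and do not reproduce the exact policy of [30].

    from collections import OrderedDict, deque

    class QueueFilteredSet:
        """One cache set: lines enter a FIFO queue first and are promoted
        to the LRU ways only if they were re-used (u-bit set) while queued."""

        def __init__(self, queue_len=4, num_ways=8):
            self.queue = deque()               # entries: [tag, u_bit]
            self.queue_len = queue_len
            self.ways = OrderedDict()          # LRU order: oldest first
            self.num_ways = num_ways

        def access(self, tag):
            if tag in self.ways:               # hit in the LRU ways
                self.ways.move_to_end(tag)
                return True
            for entry in self.queue:           # hit while still queued
                if entry[0] == tag:
                    entry[1] = True            # set the u-bit
                    return True
            self._insert(tag)                  # miss: insert into the queue
            return False

        def _insert(self, tag):
            if len(self.queue) == self.queue_len:
                old_tag, used = self.queue.popleft()
                if used:                       # promote only re-used lines
                    if len(self.ways) == self.num_ways:
                        self.ways.popitem(last=False)
                    self.ways[old_tag] = True
            self.queue.append([tag, False])

    s = QueueFilteredSet()
    for t in ("A", "B", "A", "C", "D", "E", "F"):
        s.access(t)
    # "A" was re-used while queued, so it is promoted to the LRU ways;
    # "B" was not re-used and is dropped when it leaves the queue.
    print(list(s.ways), [e[0] for e in s.queue])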
The first level of queue is core-based, so that the cache requests of different cores are placed
into separate queues. Consequently, the cache replacements of different cores are isolated from
each other in the first-level queues. Such a scheme can help to improve the utilization efficiency of
caches because, without the isolation, a core with a high access rate can quickly evict the cache lines of another core from
the shared cache. This scheme raises another issue of how to allocate the
size of the first-level queue for each core. In order to solve this problem, the paper proposed an
adaptive multi-queue (AMQ) method, which leveraged the set-dueling principle [30] to dynam-
ically allocate the size of each first-level queue. There are several pre-defined configurations of
queue sizes, and the AMQ can dynamically decide which configurations should be used accord-
ing to the actual cache access pattern. A stability control mechanism was used to avoid unstable and
rapid switching across many different configurations of the first-level queues. The results showed
that the AMQ management policy can improve the performance by 27.6% over the baseline of
simply using DRAM as the LLC. This method, however, incurs extra overhead for managing multiple
queues. For example, it requires a head pointer for each queue. In addition, it may not work well
when the cache associativity is not large enough (64-way was used in the paper), because the queue
size needs to be large enough to record the re-use of a cache line. Since the first-level queue is
core-based, it has a scalability limitation for the same reason.
cache occurs on 4KByte boundaries rather than 64 bytes. Such a structure makes L2 cache banks
“aligned” to their own MSHR and MC so that the communication is reduced when we have
multiple ranks and MCs. Figure 4.3 shows the structure of the 4-core processor with the aligned
L2 cache banks and the corresponding MSHRs and MCs. The results show that we can obtain another
1.75x speedup in addition to that provided by the true 3D structure.
Loh’s work also pointed out that the significant increase in memory system performance
makes the L2 miss handling architecture (MHA) a new bottleneck. The experiments showed that
simply increasing the capacity of the MSHR cannot improve the performance consistently. Conse-
quently, a novel data structure called the Vector Bloom Filter (VBF) with dynamic MSHR capacity
tuning was proposed to achieve a scalable MSHR. The VBF-based MSHR is effectively a hash
table with linear probing, which speeds up searching the MSHR. The experimental results
show that the VBF-based dynamic MSHR can provide a robust, scalable, high-performance
L2 MHA for the 3D-stacked memory architecture. Overall, a 23.0% improvement is observed on
memory-intensive workloads of SPEC2006 over the baseline L2 MSHR architecture for the
dual-MC (quad-MC) configuration [31].
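The following toy Python sketch shows an MSHR organized as an open-addressed hash table with linear probing; it is only a simplified illustration of that idea and omits the Vector Bloom Filter itself and the dynamic capacity tuning of [31].

    class LinearProbingMSHR:
        """Toy MSHR modelled as an open-addressed hash table with linear
        probing; a real VBF-based MSHR adds structures this sketch omits."""

        EMPTY = None

        def __init__(self, capacity=32):
            self.slots = [self.EMPTY] * capacity

        def _probe(self, block_addr):
            idx = hash(block_addr) % len(self.slots)
            for _ in range(len(self.slots)):
                yield idx
                idx = (idx + 1) % len(self.slots)

        def lookup_or_allocate(self, block_addr):
            """Merge into an existing entry for the same block, allocate a
            new entry otherwise, or return None if the MSHR is full (stall)."""
            for idx in self._probe(block_addr):
                entry = self.slots[idx]
                if entry is self.EMPTY:
                    self.slots[idx] = {"addr": block_addr, "waiters": 1}
                    return "allocated", self.slots[idx]
                if entry["addr"] == block_addr:
                    entry["waiters"] += 1
                    return "merged", entry
            return None

    mshr = LinearProbingMSHR()
    print(mshr.lookup_or_allocate(0x40)[0])   # allocated
    print(mshr.lookup_or_allocate(0x40)[0])   # merged with the pending miss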
Kgil et al. studied the advantages of 3D stacked main memory, with respect to the energy
efficiency of processors [43]. In their work, the structure of the DRAM-based main memory was
not changed, and it was stacked directly on the processor. The bus width of the 3D main memory
was assumed to be up to 1024 bits. With the large bus width, one interesting observation was that
the 3D main memory could achieve a similar bandwidth as the L2 cache. Consequently, the pa-
per pointed out that the L2 cache was no longer necessary and the space could be saved to insert
more processing cores. With more processing cores, the frequency of each core could be lowered
without degrading computing bandwidth. The energy efficiency could be increased because the
power consumption was reduced as the frequency decreased, especially for applications with high
thread-level parallelism. The paper provided comprehensive comparisons among different config-
urations of 2D and 3D processors with similar die areas, with respect to processing bandwidth
and power consumption.
Woo et al. further explored the high bandwidth of 3D stacked main memory by modifying
the structure of the L2 cache and the corresponding interface between cache and 3D stacked
main memory [32]. The paper first revisited the impact of cache line size on cache miss rate when
there was no constraint on the bandwidth of main memory. The experimental results show the
interesting conclusion that most modern applications benefit from a smaller L1 cache line
size, but a larger cache line is found to be helpful for a much larger L2 cache. In particular, the
maximum line size, an entire page (4KB), is found to be very effective in a large cache. Then, the
paper proposed a technique named “SMART-3D” to exploit the benefits of a large cache line size.
The L2 cache line was still kept at 64B, and reads/writes from the L1 cache to the L2 cache were
performed over the traditional 64B bus structures. The bus between the L2 cache and main memory,
however, was designed to be 4KB wide, so 64 cache lines could be filled with data from main memory
at the same time. In order to achieve highly parallel data filling, the L2 cache was divided into
64 subarrays, and one cache line in each subarray was written in parallel. Two cache eviction
policies were proposed so that either one or 64 cache lines could be evicted on demand. Besides the
modification to the L2 cache, the 3D stacked DRAM was carefully re-designed, since a large number
(32K) of TSVs were required in SMART-3D. The paper also noted the potential exacerbation
of the false sharing problem caused by SMART-3D. The cache coherence policy was modified
so that either one or 64 cache lines are fetched, depending on the case.
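A minimal Python sketch of that wide fill path, assuming a 4KB page scattered across 64 L2 subarrays; the function and data-structure names are hypothetical, not from [32].

    LINE_BYTES = 64
    PAGE_BYTES = 4096
    NUM_SUBARRAYS = PAGE_BYTES // LINE_BYTES    # 64 lines filled at once

    def fill_page(l2_subarrays, page_addr, read_page_from_stacked_dram):
        """Illustrative SMART-3D-style fill: one 4KB read over the wide
        TSV bus is scattered across 64 L2 subarrays, one 64B line each.

        `read_page_from_stacked_dram` is a hypothetical callback supplying
        the 4KB of data; l2_subarrays is a list of 64 dicts keyed by the
        line address (a stand-in for the real subarray hardware)."""
        data = read_page_from_stacked_dram(page_addr)
        for i in range(NUM_SUBARRAYS):
            line_addr = page_addr + i * LINE_BYTES
            line_data = data[i * LINE_BYTES:(i + 1) * LINE_BYTES]
            l2_subarrays[i][line_addr] = line_data    # parallel in hardware

    subarrays = [dict() for _ in range(NUM_SUBARRAYS)]
    fill_page(subarrays, 0x10000, lambda addr: bytes(PAGE_BYTES))
    print(sum(len(s) for s in subarrays), "lines filled")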
The experimental results show that the performance is improved with the help of SMART-
3D. For a single core, the average speedup of SMART-3D is 2.14x over the 2D case for memory-
intensive benchmarks from SPEC2006, which is much higher than that of the baseline 3D stacked
DRAM. In addition, using a 1MB L2 and 3D stacked DRAM with SMART-3D achieves a 2.31x
speedup for a two-core processor, whereas a 4MB L2 cache without SMART-3D only achieves
1.97x over the 2D case. Furthermore, the analysis shows that SMART-3D can even lower the
energy consumption in the L2 cache and 3D DRAM because it reduces the total number of row
buffer misses.
It is known that periodic refresh is required for DRAMs to maintain the information
stored in them. Since the read access is self-destructive, the refresh of a DRAM row involves
reading the stored data in each cell and immediately rewriting it back to the same cell. The refresh
incurs considerable power and bandwidth overhead. 3D integration is a promising technique that
benefits DRAM design, particularly from the capacity and performance perspectives. Nevertheless,
3D stacked DRAMs potentially further exacerbate the power and bandwidth overheads incurred by
the refresh process. Ghosh et al. proposed “Smart Refresh” [33], a low-cost technique imple-
mented in the memory controller to eliminate unnecessary periodic refresh operations and
mitigate the energy consumption overhead in DRAMs. By employing a time-out counter in the
memory controller for each memory row of a DRAM module, a DRAM row that was recently
accessed will not be refreshed during the next periodic refresh operation. The simulation results
show that the proposed technique can reduce refresh operations by 53% on average with a 2GB
DRAM, and achieves a 52.6% energy saving for refresh operations and a 12.13% overall energy sav-
ing on average. An average 9.3% energy saving can be achieved for a 64MB 3D DRAM with a
64ms refresh rate, and a 6.8% energy saving can be achieved for the same DRAM capacity with a
32ms refresh rate.
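A toy Python sketch of the per-row time-out counters described above; the counter width and bookkeeping are illustrative simplifications of the scheme in [33].

    class SmartRefreshController:
        """Toy version of the Smart Refresh idea: each row has a small
        time-out counter that is reset by a normal access; only rows whose
        counter has expired are refreshed at the refresh deadline."""

        def __init__(self, num_rows, timeout_periods=1):
            self.timeout = timeout_periods
            self.counters = [0] * num_rows     # 0 means "recently written"

        def on_access(self, row):
            self.counters[row] = 0             # a read/write implicitly refreshes

        def on_refresh_tick(self):
            refreshed = []
            for row, c in enumerate(self.counters):
                if c >= self.timeout:          # stale row: must refresh
                    refreshed.append(row)
                    self.counters[row] = 0
                else:                          # recently accessed: skip
                    self.counters[row] = c + 1
            return refreshed

    ctrl = SmartRefreshController(num_rows=8)
    print(ctrl.on_refresh_tick())              # [] (all rows freshly written)
    ctrl.on_access(3)                          # row 3 read between refreshes
    print(ctrl.on_refresh_tick())              # row 3 is skipped this period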
Chen et al. developed an architecture-level modeling tool, CACTI-3DD [47], which es-
timates the timing, area, and power of 3D die-stacked off-chip DRAM main memory. This tool
is designed based on the DRAM main memory model of CACTI [48]; it introduces TSV mod-
els, improves the modeling of 2D off-chip DRAM main memory, and includes 3D integration
modeling on top of the original CACTI memory models. CACTI-3DD enables the analysis of
a full spectrum of 3D DRAM design from coarse-grained rank-level 3D stacking to bank-level
3D stacking. It allows memory designers to perform in-depth studies of 3D die-stacked DRAM
main memory, in terms of architecture-level tradeoffs of timing, area, and power. Their study also
demonstrated the usage of the proposed tool by re-architecting the DRAM dies at a fine granu-
larity under the guidance of modeling results. e redesigned 3D-stacked DRAM main memory
can achieve significant timing and power improvements compared with coarse-grained baseline.
4.3 3D ON-CHIP STACKED MEMORY: CACHE OR MAIN MEMORY?
4.3.1 ON-CHIP MAIN MEMORY
Dong et al. [49] observed that adopting on-chip DRAM as the LLC is less practical than using it as a portion of the main memory, due to the non-trivial design effort required to support accesses to such a large cache capacity. Commodity DRAM dies are optimized for cost and do not employ specialized tag arrays that automatically determine a cache hit/miss and forward the request to the corresponding data array. Because the size of a tag array can reach a hundred megabytes or more for a multi-gigabyte cache, storing the tags on the processor die requires an impractically large tag space. The only alternative is to place the LLC tags inside the on-chip DRAM. However, doing so requires two DRAM accesses upon each cache hit: one to look up the tag and the other to return the data, so each cache hit takes approximately twice the time of a single access to the on-package DRAM. Their experimental results showed that there is almost no benefit, in terms of cache miss rate, from enlarging the LLC capacity. While accessing the LLC and the main memory in parallel can help hide the long LLC access latency, there is not sufficient off-chip bandwidth to access the off-chip memory speculatively and simultaneously with every reference to the on-chip cache. Furthermore, off-chip references consume significantly more power and should generally be avoided when possible.
To avoid such issues, Dong et al. proposed a heterogeneous main memory architecture, which leverages the on-chip 3D-stacked DRAM as a fast portion of the main memory. In addition, the main memory adopts four DDR3 channels connected to traditional off-chip DRAM Dual In-line Memory Modules (DIMMs) to extend the total memory capacity. Figure 4.4 illustrates an overview of this heterogeneous main memory architecture. DRAM dies are placed beside the processor die using 2.5D integration technology; the flip-chip system-in-package (SiP) can provide a die-to-die bandwidth of at least 2 Tbps [50]. To reduce the memory access latency, the on-chip DRAM is slightly modified from commodity DRAM products to implement a many-bank structure, which further increases the signal I/O speed by taking advantage of the high-speed on-chip interconnects. The proposed design does not employ a custom tag structure and adopts only a single on-chip DRAM design, in order to reduce design cost and maximize the effective capacity of the on-package DRAM. The on-chip memory controller is connected to the off-chip DIMMs through the conventional 64-bit DDRx bus and to the on-chip memory through a customized memory bus. The MSBs of the physical memory address are used to decode the target location. For example, if 1GB of a 32-bit memory space is on-package, Addr[31:30] = 00 is mapped to on-chip memory while Addr[31:30] = 01, 10, or 11 is mapped to the off-chip DIMMs.
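The MSB-based decoding is simple enough to express directly. The snippet below is a minimal sketch, using the 1GB on-package / 32-bit address example from the text; the function name and channel labels are illustrative, not part of the proposed design.

```python
# Sketch of MSB-based routing for the example above: a 32-bit physical address
# space with the lowest 1 GB (Addr[31:30] == 0b00) mapped on-package and the
# rest to off-chip DIMMs. Names and labels are illustrative.

ON_PACKAGE_BYTES = 1 << 30   # 1 GB of fast, on-package DRAM

def route(addr):
    msbs = (addr >> 30) & 0b11
    if msbs == 0b00:
        return ("on-package", addr)                    # fast 3D/2.5D DRAM
    else:
        return ("off-chip", addr - ON_PACKAGE_BYTES)   # conventional DDRx DIMMs

print(route(0x0000_1000))   # ('on-package', 4096)
print(route(0x8000_0000))   # ('off-chip', 0x40000000)
```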
Their experiments with all ten workloads in the NAS Parallel Benchmark Suite 3.3 [51] showed that directly mapping 1GB of on-chip DRAM into the main memory space always achieves better performance than using the on-chip DRAM to add a new L4 cache, when the application memory footprint fits into the on-chip memory. However, for workloads with much larger working sets, the performance improvement achieved by such static mapping was marginal: for example, the improvement for DC.B is only 16% and that for FT.C is only 20.7%, both less than the improvements achieved by using the on-chip DRAM as an LLC.
To solve this issue, they proposed adding data migration functionality to the memory controller so that frequently used data remains on-chip with higher probability. Compared with other work on data migration [52–56]: (1) their data migration is implemented by introducing another layer of address translation; (2) depending on the data granularity, they proposed either a pure-hardware implementation or an OS-assisted implementation; and (3) a novel migration algorithm is used to hide the data migration overhead. They used the term macro page for the unit of data migration; the macro page size can be much larger than the typical 4KB page size used in most operating systems.
In particular, their data migration algorithm is based on a hottest-coldest swapping mechanism, which monitors the LRU (least recently used) on-chip macro page (the coldest) and the MRU (most recently used) off-chip macro page (the hottest) during the latest period of execution, and triggers a migration after each monitoring epoch if the off-chip MRU page was accessed more frequently than the on-chip LRU page. The migration algorithm can be implemented in either a pure-hardware or an OS-assisted manner, depending on the migration granularity. The pure-hardware solution is preferred when the macro page size is relatively large, so that the number of macro pages remains small enough to track within limited hardware resources. If a finer migration granularity is required, the number of macro pages becomes too large for hardware to handle, and the OS-assisted scheme is used to track the access information of each macro page. In essence, data migration is achieved by keeping an extra layer of address translation that maps each physical address to its actual machine address: the pure-hardware scheme keeps this translation table in hardware, while the OS-assisted scheme keeps it in software.
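The sketch below illustrates the hottest-coldest swapping policy just described: at the end of each monitoring epoch, the MRU off-chip macro page is swapped with the LRU on-chip macro page if it was accessed more often. The data structures and names are illustrative simplifications, not the authors' hardware or OS implementation.

```python
# Minimal sketch of hottest-coldest macro-page swapping via an extra layer of
# address translation. Structures and names are illustrative assumptions.

def end_of_epoch(on_chip_counts, off_chip_counts, translation):
    """counts: {macro_page: accesses this epoch}; translation: the extra layer of
    address translation mapping physical macro pages to machine locations."""
    coldest_on  = min(on_chip_counts,  key=on_chip_counts.get)   # LRU on-chip
    hottest_off = max(off_chip_counts, key=off_chip_counts.get)  # MRU off-chip
    if off_chip_counts[hottest_off] > on_chip_counts[coldest_on]:
        # Swap the two macro pages' machine locations via the translation table,
        # so the hot page now lives in the on-chip (stacked) DRAM.
        translation[coldest_on], translation[hottest_off] = (
            translation[hottest_off], translation[coldest_on])
    # Reset the access counters for the next monitoring epoch.
    for counts in (on_chip_counts, off_chip_counts):
        for page in counts:
            counts[page] = 0
    return translation
```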
Their evaluation results demonstrate that the heterogeneous main memory uses the on-package memory efficiently, achieving an effectiveness of 83% on average.
Figure 4.4: e conceptual view of the System-in-Package (SiP) solution with one microprocessor
die and nine DRAM dies connecting off-package DIMMs (one on-package DRAM die is for ECC).
Loh and Hill proposed a stacked DRAM cache design built around two mechanisms. First, in their design, a single physical DRAM row holds both the tags and the data, as illustrated in Fig. 4.5(a). Therefore, the memory controller can access the data and the corresponding tag with a single compound operation: during the access to the row, the memory controller prevents subsequent memory requests from closing the row by reserving the row buffer, so any updates to the data and the corresponding tag also hit in the row buffer. Second, they developed a MissMap mechanism to avoid the DRAM cache access on a cache miss, by bypassing the DRAM cache and directing the request immediately to the off-chip main memory. Their solution employs a MissMap data structure (Fig. 4.5(b)) that tracks the cache lines belonging to the same page stored in the DRAM cache; each MissMap entry holds a bit vector with one bit per cache line. A zero bit, or the absence of a corresponding entry in the MissMap, indicates a DRAM cache miss, in which case the request is issued directly to the off-chip main memory.
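A MissMap lookup can be sketched in a few lines. The segment size, structure names, and bit-vector encoding below are illustrative assumptions, not the parameters used in the original design.

```python
# Sketch of a MissMap lookup: each entry covers one memory segment (here 4 KB)
# and keeps one bit per cache line indicating whether that line currently
# resides in the stacked DRAM cache. Sizes and names are illustrative.

LINE_SIZE = 64
SEGMENT_SIZE = 4096
LINES_PER_SEGMENT = SEGMENT_SIZE // LINE_SIZE

miss_map = {}   # segment base index -> bit vector (int) of present lines

def dram_cache_probe_needed(addr):
    """True only if the line may be present in the stacked DRAM cache;
    otherwise the request can go straight to off-chip main memory."""
    segment = addr // SEGMENT_SIZE
    line_idx = (addr % SEGMENT_SIZE) // LINE_SIZE
    bits = miss_map.get(segment)
    if bits is None:                 # no entry -> definite DRAM-cache miss
        return False
    return bool(bits & (1 << line_idx))

def on_cache_fill(addr):             # set the bit when a line is installed
    segment, line_idx = addr // SEGMENT_SIZE, (addr % SEGMENT_SIZE) // LINE_SIZE
    miss_map[segment] = miss_map.get(segment, 0) | (1 << line_idx)

on_cache_fill(0x1040)
print(dram_cache_probe_needed(0x1040), dram_cache_probe_needed(0x2000))
```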
Their experimental results showed that compound-access scheduling improves the performance of a DRAM L4 cache by half and provides 92.9% of the performance delivered by an ideal SRAM-tag configuration, measured relative to having no DRAM L4 cache. With the MissMap, their design offers 97.1% of the performance of the ideal SRAM-tag configuration.
Qureshi and Loh [58] further improved the DRAM cache design with a latency-optimized cache architecture that differs from conventional cache organizations. They observed that Loh and Hill's design can substantially increase the DRAM cache access latency by serializing the tag accesses and the accesses to the MissMap. Their design leverages DRAM bursts, which do not exist in conventional SRAM caches, by streaming the tag and data together in a single burst. It organizes the DRAM cache as a direct-mapped cache and tightly couples the tag and data into one unit to avoid the tag serialization penalty. Doing so effectively eliminates the delay caused by tag serialization and improves performance, at the cost of a lower DRAM cache hit rate.
Figure 4.5: The conceptual view of the proposed stacked DRAM cache management mechanisms. (a) A single physical DRAM row holds both tags and data; (b) the MissMap structure used to predict whether a memory request will hit or miss in the stacked DRAM cache.
To address the performance penalty introduced by MissMap accesses, they proposed a memory access predictor that requires only 96 bytes of storage per core and provides 98% of the performance of a perfect predictor. Their evaluation across various benchmarks in the SPEC2006 suite showed that the proposed design outperforms Loh and Hill's work by 24% and the ideal SRAM-tag design by 8.9%.
Figure 4.6: The proposed CAMEO architecture (c), compared with prior designs using 3D-stacked DRAM as an LLC (a) and as a portion of main memory (b) [59].
CAMEO [59] manages the stacked DRAM as part of the main memory address space, but dynamically swaps memory lines between the stacked DRAM and the off-chip DRAM so that recently accessed lines migrate on-chip. Such swapping sustains the latency and bandwidth benefits of the stacked memory. However, it may change the physical location of a memory line without notifying the OS. To address this issue, the proposed design employs a line location table (LLT) to track the physical location of all memory lines. CAMEO uses two methods to reduce the storage overhead while sustaining low memory access latency. First, CAMEO stores the corresponding LLT entries in the stacked DRAM, residing in the same rows as the memory lines they describe. Second, CAMEO adopts a Line Location Predictor (LLP), a hardware structure of less than 1KB per core, to predict whether a memory line is exclusively stored in the off-chip DRAM and to identify the likely physical address of such off-chip lines. With these two methods, CAMEO can access a memory line with the latency of one memory reference, regardless of where the line resides: if a memory line is predicted to be off-chip, CAMEO fetches the predicted physical address in parallel with the LLT access.
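The access flow can be sketched as follows: the predictor's guess launches a speculative off-chip fetch while the authoritative LLT entry is read from the stacked DRAM. The data structures and names below are illustrative simplifications of the hardware described in [59], not its actual organization.

```python
# Minimal sketch of the CAMEO access flow. The LLT maps each line to its
# current location; the LLP is a small per-core predictor. Names are illustrative.

llt = {}   # line address -> ("stack" | "offchip", current physical address)
llp = {}   # line address -> predicted ("stack" | "offchip", physical address)

def access_line(line_addr):
    pred_loc, pred_addr = llp.get(line_addr, ("offchip", line_addr))
    fetch_in_flight = pred_addr if pred_loc == "offchip" else None  # speculative fetch

    loc, phys_addr = llt.get(line_addr, ("offchip", line_addr))     # authoritative
    llp[line_addr] = (loc, phys_addr)                               # train predictor

    # If the speculation matched, the data arrives with roughly the latency of a
    # single memory reference even though the line lives off-chip.
    speculation_hit = (loc == "offchip" and fetch_in_flight == phys_addr)
    return loc, phys_addr, speculation_hit

print(access_line(0x1000))   # default mapping: off-chip, speculation hit
```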
Experiments conducted across various capacity- and latency-limited benchmarks show that CAMEO improves system performance by 69%, whereas employing the stacked DRAM purely as an LLC or purely as additional main memory provides improvements of only 51% and 44%, respectively. Furthermore, CAMEO achieves 98% of the performance of an idealized system in which 1GB of stacked DRAM serves as both an LLC and a main memory capacity extension.
4.4 PICOSERVER
This section introduces PicoServer [20], an architecture that uses 3D stacking technology to reduce power and energy consumption.
The basic idea is to stack on-chip DRAM as main memory instead of using the stacked memory as a larger L2 cache. The on-chip DRAM is connected to the L1 caches of each core through a
shared bus architecture, which offers wide, low-latency buses to the processor cores and eliminates the need for an L2 cache; the silicon area saved is allocated to accommodate more cores. Increasing the number of cores improves computation throughput while allowing each core to run at a much lower frequency, resulting in an energy-efficient many-core design. PicoServer is a chip multiprocessor consisting of several single-issue in-order cores. Each core runs at 500MHz and contains an instruction cache and a data cache, kept coherent with a MESI protocol. The study showed that, because each core's caches are small, the majority of the bus traffic is caused by cache misses rather than cache coherence; this is one motivation to stack a large on-chip DRAM, of hundreds of megabytes, using 3D technology.
In PicoServer, a wide shared-bus architecture is adopted to provide high memory bandwidth and to fully exploit 3D stacking. The design space is explored through simulations that vary the width of the single shared bus from 128 bits to 2048 bits, and the impact of bus width on PicoServer is determined by measuring network performance. The results show that a relatively wide data bus is needed to achieve good performance and to satisfy the outstanding cache-miss requests; narrowing the bus increases bus traffic and therefore latency. Wide buses also speed up DMA transfers, since more data can be copied in one transaction. The simulations further show that a 1024-bit bus is a reasonable choice for configurations of 4, 8, and 12 processors, as performance saturates at this width.
The stacked on-chip DRAM contains four layers, providing a total capacity of 256 MB, which may be sufficient depending on the workload; more on-chip DRAM capacity can be obtained with more aggressive die stacking. To fully take advantage of 3D stacking, the conventional DDR2 DRAM interface must be modified for PicoServer's 3D stacked on-chip DRAM. Conventional DDR2 DRAMs assume a small pin count and use address multiplexing and burst-mode transfers to compensate for the limited number of pins. With 3D stacking, address multiplexing is unnecessary, so the additional logic to latch and mux the narrow address/data signals can be removed. In servers with large network pipes such as PicoServer, one common problem is handling the large number of packets that arrive every second. Interrupt coalescing, which combines non-critical events to reduce the number of interrupts, is one solution; however, even with this technique, the number of interrupts received by a low-frequency PicoServer processor is still large. To address this issue, multiple network interface controllers (NICs) are used, with their interrupt lines routed to different processors; one NIC is provided for every two processors to fully utilize each processor. Such a NIC should support multiple interface IP addresses or an intelligent method for load-balancing packets across processors, and it needs to keep track of network protocol states at the session level.
Thermal evaluation showed that the maximum junction temperature increase is only about 5 to 10°C for the 5-layer PicoServer architecture. This modest increase results from the power and energy reduction obtained by lowering the core clock frequency while relying on the high on-chip memory bandwidth for throughput.
CHAPTER 5
3D GPU Architecture
Graphics processing units (GPUs) are an attractive solution for both graphics and general purpose
workloads, which demand high computational performance. 3D integration is an attractive tech-
nology in developing high-performance, power-efficient GPU systems. Recently, 3D integration
has been explored by both academia and industry as a promising solution to improve GPU per-
formance and address increasingly critical GPU power issues. In this section, we will introduce
recent research and implementation efforts in 3D GPU system design.
5.1 3D-STACKED GPU MEMORY
Figure 5.1: Peak memory bandwidth and maximum total DRAM power consumption of 2GB graph-
ics memory with various interface configurations.
The proposed design integrates in-package graphics DRAM with the GPU processor using 3D stacking or silicon interposer technologies. The authors performed a thermal analysis with a GPU system configuration based on the NVIDIA Quadro 6000 [64]. They computed the maximum power consumption of the GPU processor and memory controllers by subtracting the DRAM power from the reported maximum power consumption of the Quadro 6000 [64], resulting in 136W; the power of 6GB of DRAM is calculated as 68W, based on Hynix's GDDR5 memory [68]. The areas of the different GPU components are obtained from the GPU die photo, with a 529 mm² die area, and the ambient temperature is assumed to be 40°C. They used the HotSpot thermal simulation tool [69] to conduct the analysis. The maximum steady-state temperature of the GPU (without DRAMs) is 71.2°C. With 6GB of interposer-mounted DRAM (four layers of memory cells plus one layer of logic) placed beside the GPU processor, as shown in Fig. 5.3, the maximum temperature is 76.6°C; thus, interposer-based memory integration is thermally feasible. Vertically stacking the memory on top of the GPU incurs a much greater temperature increase than the silicon interposer-based approach: stacking the same DRAMs on top of the GPU processor raises the temperature to 83.8°C, a 7.2°C increase compared to the interposer-mounted DRAMs. Moreover, the temperature rise can further increase system-wide power consumption due to the temperature dependence of leakage power. Therefore, the proposed GPU system design employs 2.5D (interposer-based) memory integration.
A memory interface with fixed bus width and frequency cannot satisfy the varied memory utilization requirements of different applications. Even a single application can have variable
memory access patterns during execution. To accommodate the varying graphics memory access demands, the proposed design uses a reconfigurable memory interface, which can dynamically adapt to the demands of different applications based on dynamically observed memory access and performance information. To maximize system energy efficiency, the proposed design configures the memory interface to minimize DRAM power while maintaining the system's instructions-per-cycle (IPC) rate. To improve system throughput under a given power budget, the design co-optimizes the memory configuration and the GPU clock frequency by shifting power saved from the memory interface over to the GPU. The proposed design employs two reconfiguration mechanisms, EOpt and PerfOpt, to optimize system energy efficiency and system throughput, respectively. These mechanisms can effectively accommodate both memory-intensive and compute-intensive applications.
EOpt: The reconfigurable memory interface dynamically detects which of the two types of access pattern the application is currently exhibiting and applies the appropriate strategy. During memory-intensive execution periods, both IPC and memory power are affected by changes to the memory interface: decreasing the memory frequency (and consequently increasing the memory access latency) results in significant IPC degradation, even when wider buses are provided to keep the same peak memory bandwidth; the corresponding IPC typically stays much lower than that of compute-intensive periods, because the continuous memory demands significantly slow down instruction execution. Therefore, EOpt chooses configurations that maintain high memory clock frequencies to minimize IPC degradation; given the memory frequency constraint, the bus width is then configured to minimize memory power consumption. During compute-intensive execution periods, IPC is stable across memory interface configurations, so the design simply adopts the memory frequency and bus width configuration that minimizes memory power.
PerfOpt explores GPU system performance (instruction throughput, i.e., the executed instruc-
tions per second) optimization under a given power budget. During compute-intensive execution
periods, PerfOpt always employs the memory interface configuration that minimizes the DRAM
power. Any power saved is transferred to scale up the GPU core clock frequency/supply voltage
to improve the system performance. During memory-intensive periods, their design employs two
strategies. First, because the memory interface configuration directly affects system performance
during memory-intensive periods, they choose the memory configuration that delivers the high-
est system performance while staying within the system power budget. Second, sometimes an
application can be relatively memory-intensive while still having significant compute needs as
well. In these cases, reconfiguring the memory interface to free up more power for the GPU can
still result in a net performance benefit despite the reduced raw memory performance. Based on
the predicted benefit, PerfOpt will choose the better of these two strategies.
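The flavor of the per-epoch decision can be illustrated with a small sketch of EOpt-style selection. The candidate configuration list, threshold structure, and placeholder power model below are illustrative assumptions; the real mechanisms use hardware counters and the controller hardware described next.

```python
# Sketch of an EOpt-style configuration choice. CONFIGS, the intensity metric,
# and dram_power() are illustrative assumptions, not the paper's actual model.

CONFIGS = [                       # (bus width in bits, memory clock in MHz)
    (256, 1000), (512, 500), (1024, 250),
]

def dram_power(width, freq_mhz, intensity):
    # Placeholder model: interface power is dominated by the clock rate, with a
    # smaller static cost per bus bit. Any monotone model would do for a sketch.
    return 0.002 * width + 0.01 * freq_mhz * (0.2 + intensity)

def eopt(memory_intensive, intensity):
    if memory_intensive:
        # Keep the memory clock high to protect IPC, then pick the width that
        # minimizes DRAM power at that frequency.
        freq = max(f for _, f in CONFIGS)
        candidates = [(w, f) for (w, f) in CONFIGS if f == freq]
    else:
        # IPC is insensitive to the interface: minimize DRAM power outright.
        candidates = CONFIGS
    return min(candidates, key=lambda c: dram_power(c[0], c[1], intensity))

print(eopt(memory_intensive=True, intensity=0.8))    # (256, 1000)
print(eopt(memory_intensive=False, intensity=0.1))   # (512, 500)
```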
Implementation: Figure 5.4 illustrates the hardware design of the proposed reconfigurable memory interface. The design makes several modifications to the interface between the GPU processor and the 3D die-stacked graphics memories, including adding a central controller, control signals to the bus drivers, and controls for dynamic voltage/frequency scaling. The central controller in Fig. 5.4(a) collects global information on both GPU performance and memory accesses. A vector of counters is maintained in the controller, including an instruction counter, a cycle counter, and memory access counters, which gather performance and memory access information from the GPU hardware performance counters and the memory controllers. A threshold register vector stores the various thresholds and initial values used by the reconfiguration mechanisms. The calculator module computes the system energy efficiency based on the collected performance information and the estimated power consumption; the results are stored in result registers for comparison. Figures 5.4(b) and (c) illustrate the data bus implementation. The basic topology of a bi-directional point-to-point data bus is a set of transmission lines, with transmitter and receiver devices at both ends of each bit line. The control signals of the I/O drivers are connected to the central controller; these signals switch the drivers in the transmitters on and off to change the effective bus width.
The experimental results show that even a static (non-reconfigurable) in-package graphics memory solution can improve the energy efficiency (performance per Watt) of the GPU system by up to 21%. Of course, fixing the system bandwidth to equal that of the off-chip solution does not really take advantage of the wide interface provided by the in-package memory.
Figure 5.4: Hardware implementation of reconfigurable memory interface. (a) Central controller;
(b) connection between memory controller and 3D die-stacking graphics memory (one channel); (c)
reconfigurable data bus.
By increasing the memory interface clock speed to provide bandwidths of 360 GB/s and 720 GB/s
(note even at these higher bandwidths, the power of in-package memory can still be lower than
the off-chip GDDR5 due to the lower clock frequency), performance on the memory-intensive
applications can be brought back up.
With EOpt, system power is reduced for almost all applications, and by up to 12% on average. The overall system performance-per-Watt improves for all benchmarks, although the improvement for non-memory-intensive applications (16%) is not as great as for memory-intensive applications (44%); the reason is that system throughput improves significantly for memory-intensive applications but stays almost the same for non-memory-intensive ones. Overall, EOpt improves the system energy efficiency of all baseline configurations, including the peak-memory-bandwidth configurations. Across all low- and high-intensity applications, performance-per-Watt improves by 26% on average.
Evaluation with a variety of GPU system power budgets shows that PerfOpt can adjust the memory power consumption to fit the application's memory needs, and that the saved power can be effectively redeployed to improve GPU core performance by up to 31%. For non-memory-intensive applications, a higher power budget directly leads to more performance improvement: because the memory interface is always configured to minimize DRAM power, the extra power is available to increase the GPU core clock frequency. The throughput improvement for memory-intensive applications is not as significant, yet PerfOpt still yields an average 8% throughput improvement for the three most memory-intensive applications under a 220W power budget, and more improvement with higher power budgets.
5.2 3D-STACKED GPU PROCESSOR
One aggressive approach to adopt 3D integration in GPU system design is to stack GPU caches
and cores using 3D technology. For example, Maashri et al. [70] proposed a 3D GPU design with
cache stacking. Their work presents a comprehensive study of the performance, cost, power, and thermal behavior of a 3D-stacked GPU system. The study showed that a 3D-stacked GPU can sustain low cache access latency while increasing the cache capacity, and that it can achieve up to a 45% speedup over the 2D planar architecture without a significant increase in peak temperature.
Future Products. In-package graphics memory has been considered one of the most promising
and practical solutions for energy-efficient GPU systems. Top GPU vendors, such as NVIDIA
and AMD, have recently invested significant effort in investigating in-package graphics mem-
ory in their next generation products. For example, NVIDIA recently announced that their next
generation Pascal GPU products will adopt 3D-stacked memory [3]. In its roadmap, NVIDIA plans to assemble stacks of DRAM chips into dense modules with wide interfaces and integrate them inside the same package as the GPU processor. Doing so will not only boost GPU system throughput and energy efficiency, by allowing the GPU to access data from memory quickly, but will also allow the vendor to build more compact GPUs with much larger graphics memory capacity.
CHAPTER 6
3D Network-on-Chip
Network-on-chip (NoC) is a general-purpose on-chip interconnection network architecture that has been proposed to replace traditional design-specific global on-chip wiring, using switching fabrics or routers to connect processor cores or processing elements (PEs). Typically, the PEs communicate with each other using a packet-switched protocol, as illustrated in Fig. 6.1.
Figure 6.1: For a network-on-chip (NoC) architecture, processing elements (PEs) are connected via
a packet-based network.
6.1 3D NOC ROUTER DESIGN
Table 6.1: Area and power comparison of the crossbar switches implemented in 90 nm technology
Symmetric NoC Router Design. The natural and simplest extension of the baseline NoC router to a 3D layout is to add two additional physical ports to each router, one for Up and one for Down, along with the associated buffers, arbiters (VC arbiters and switch arbiters), and crossbar extension. A traditional NoC fabric can then be extended to the third dimension by simply using such routers at each layer. We call this architecture a 3D Symmetric NoC, since both intra- and inter-layer movement have identical hop-by-hop traversal characteristics; for example, moving from the bottom layer of a 4-layer chip to the top layer requires three network hops. This architecture, while simple to implement, has a few major inherent drawbacks.
• It wastes the beneficial attribute of the negligible inter-wafer distance in 3D chips (for example, the thickness of a die can be as small as tens of µm). Since traveling in the vertical dimension is multi-hop, it takes as much time as moving within a layer. Of course, the average number of hops between a source and a destination does decrease as a result of folding a 2D design into multiple stacked layers, but inter-layer and intra-layer hops are indistinguishable. Furthermore, each flit must undergo buffering and arbitration at every hop, adding to the overall delay of moving up or down the layers.
• The addition of two extra ports necessitates a larger 7x7 crossbar. Crossbars scale upward very inefficiently, as illustrated in Table 6.1, which lists the area and power budgets of all the crossbar types investigated in this section, based on synthesized implementations in 90 nm technology. Clearly, a 7x7 crossbar incurs significant area and power overhead over all the other organizations. Therefore, the 3D Symmetric NoC implementation is a somewhat naive extension of the baseline 2D network.
3D NoC-Bus Hybrid Router Design [77]. There is an inherent asymmetry between the delays of the fast vertical interconnects and the horizontal interconnects that connect neighboring cores, due to the difference in wire lengths (a few tens of µm in the vertical direction compared to a few thousand µm in the horizontal direction). Consequently, a symmetric NoC architecture with multi-hop communication in the vertical (inter-layer) dimension is not desirable. Given the very small inter-layer distance, single-hop vertical communication is, in fact, feasible. This realization opens the door to a very popular shared-medium interconnect, the bus: because the vertical distance is negligible compared to intra-layer distances, a bus can provide single-hop traversal between any two layers. The NoC router can thus be hybridized with a bus link in the vertical dimension to create a 3D NoC-Bus Hybrid structure, as shown in Fig. 6.3. This hybrid organization provides both performance and area benefits. Instead of an unwieldy 7x7 crossbar, it requires only a 6x6 crossbar (Fig. 6.3), since the bus adds a single additional port to the generic 2D 5x5 crossbar. The additional link forms the interface between the NoC domain and the (vertical) bus domain. The bus link has its own dedicated queue, which is controlled by a central arbiter; flits from different layers wishing to move up or down must arbitrate for access to the shared medium.
Figure 6.3: The 3D NoC-bus hybrid architecture proposed by Li et al. [29]. (a) The vertical buses interconnecting all nodes in a pillar; (b) the dTDMA bus and its arbiter; (c) the 3D router with support for vertical communication.
Despite its marked benefits over the 3D Symmetric NoC router, the bus approach also suffers from a major drawback: it does not allow concurrent communication in the third dimension. Since the bus is a shared medium, it can be used by only a single flit at any given time, which severely increases contention and blocking probability under high network load. Therefore, while single-hop vertical communication does improve overall latency, inter-layer bandwidth suffers. More details on the 3D NoC-Bus Hybrid architecture can be found in [77].
True 3D Router Design. Moving beyond the previous options, we can envision a true 3D crossbar implementation, which enables seamless integration of the vertical links into the overall router operation. The traditional definition of a crossbar, in the context of a 2D physical layout, is a switch in which each input is connected to each output through a single connection point. Extending this definition to a physical 3D structure, however, would imply a switch of enormous complexity and size, given the increased number of input- and output-port pairs associated with the various layers. Therefore, a simpler structure is chosen, which allows an input port to be connected to an output port through more than one connection point. While such a configuration can be viewed as a multi-stage switching network, we still call this structure a crossbar for simplicity. The vertical links are now embedded in the crossbar and extend to all layers. This implies the use of a 5x5 crossbar, since no additional physical channels need to be dedicated to inter-layer communication.
As shown in Table 6.1, a 5x5 crossbar is significantly smaller and less power-hungry than the 6x6 crossbar of the 3D NoC-Bus Hybrid and the 7x7 crossbar of the 3D Symmetric NoC. Interconnection between the various links in a 3D crossbar is provided by dedicated Connection Boxes (CBs) at each layer; these connection points link the vertical and horizontal channels, allowing flexible flit traversal within the 3D crossbar.
The 2D crossbars of all the layers are physically fused into a single three-dimensional crossbar. Multiple internal paths are present, and a traveling flit goes through a number of switching points and links between the input and output ports. Moreover, flits re-entering another layer do not go through an intermediate buffer; instead, they connect directly to the output port of the destination layer. For example, a flit can move from the western input port of layer 2 to the northern output port of layer 4 in a single hop.
However, despite this encouraging property, there is another side to the coin which paints a rather bleak picture. Adding a large number of vertical links in a 3D crossbar to increase NoC connectivity results in increased path diversity, i.e., multiple possible paths between each source and destination pair. While this increased diversity may initially look like a positive attribute, it actually leads to a dramatic increase in the complexity of the central arbiter, which coordinates inter-layer communication in the 3D crossbar. The arbiter now needs to decide among a multitude of possible interconnections, and it requires an excessive number of control signals to enable all of them.
Even if the arbiter functionality could be distributed among multiple smaller arbiters, the coordination between these arbiters would become complex and time-consuming. Alternatively, if dynamism is sacrificed in favor of static path assignments, the exploration space is still daunting when deciding how to efficiently assign those paths to each source-destination pair. Furthermore, a full 3D crossbar implies 25 (i.e., 5x5) Connection Boxes per layer; a four-layer design would therefore require 100 CBs. Given that each CB consists of 6 transistors, the whole crossbar structure would need 600 control signals for the pass transistors alone. Such control and wiring complexity would most certainly dominate the operation of the NoC router. Pre-programming static control sequences for all possible input-output combinations would result in an oversized table/index, and searching through such a table would incur significant delays as well as area and power overhead. The vast number of possible connections hinders the otherwise streamlined functionality of the switch. Note that the prevailing tendency in NoC router design is to minimize operational complexity in order to facilitate very short pipelines and very high frequency. A full crossbar, with its overwhelming control and coordination complexity, poses a stark contrast to this frugal and highly efficient design methodology. Moreover, the redundancy offered by the full connectivity is rarely utilized by real-world workloads and is, in fact, design overkill [78].
3D Dimensionally Decomposed NoC Router Design [78]. Given the tight latency and area constraints in NoC routers, vertical (inter-layer) arbitration should be kept as simple as possible. Consequently, a true 3D router design, as described in the previous subsection, is not a realistic option. The design complexity can be reduced by using a limited number of inter-layer links. This subsection describes a modular 3D decomposable router, called the Row-Column-Vertical (RoCoVe) router [78].
In a typical two-dimensional NoC router, the 5x5 crossbar has five inputs/outputs corresponding to the four cardinal directions and the connection to the local PE, and this crossbar is the major contributor to the latency and area of the router. It has been shown [83] that through the use of a preliminary switching process known as Guided Flit Queuing, incoming traffic can be decomposed into two independent streams: (a) East-West traffic (i.e., packet movement in the X dimension), and (b) North-South traffic (i.e., packet movement in the Y dimension). Such segregation of traffic allows the use of smaller crossbars and the isolation of the two flows in two independent router sub-modules, called the Row Module and the Column Module [83]. Following the same idea, the traffic in a 3D NoC can be decomposed into three independent streams, with a third traffic flow in the Z dimension (i.e., inter-layer communication). An additional module, called the Vertical Module, is required to handle all traffic in the third dimension. In addition, there must be links between the Vertical Module and the Row/Column Modules, to allow packets to move from the Vertical Module to the Row and Column Modules. Consequently, such a dimensionally decomposed approach allows for much smaller crossbars (4x2), resulting in a much faster and more power-efficient 3D NoC router design. The architectural view of this 3D dimensionally decomposed NoC router is shown in Fig. 6.4(b); more details can be found in [78].
Multi-layer 3D NoC Router Design [72]. All the 3D router design options discussed so far (the symmetric 3D router, the 3D NoC-Bus hybrid router, the true 3D router, and the 3D dimensionally decomposed router) are based on the assumption that the processing element (PE) itself, which could be a processor core or a cache bank, is still a 2D design. In a fine-granularity
Figure 6.4: Two 3D NoC router design [84]. (a) A true 3D crossbar; (b) the dimensionally decom-
posed (DimDe) architecture.
3D design, one can split a PE across multiple layers. For example, 3D cache design [40] and
3D functional units [5] have been proposed before. Consequently, a PE in the NoC architecture can itself be implemented with such a fine-granularity approach. Although multi-layer stacking of a PE is considered aggressive in current technology, it could become possible with monolithic 3D integration or with very small TSVs.
With such multi-layer stacking of processing elements in the NoC architecture, it is necessary to design a multi-layer 3D router that spans multiple layers of the 3D chip. Logically, an NoC architecture with multi-layer PEs and multi-layer routers is identical to a traditional 2D NoC with the same number of nodes, albeit with a smaller area for each PE and router and a shorter distance between routers. Consequently, the multi-layer router requires no additional functionality compared to a 2D router; it only requires distributing that functionality across multiple layers.
In this multi-layer router design, the router components are classified as separable and non-separable modules. The separable modules are the input buffer, the crossbar, and the inter-router links; the non-separable modules are the arbitration and routing logic. To decompose the separable modules, the input buffer (Fig. 6.5), the crossbar (Fig. 6.6), and the inter-router links (Fig. 6.7) are designed as bit-sliced modules, such that the data width of each component is reduced from W to W/n, where n is the number of layers. Bit-slicing the input buffer reduces the length of the word-line and saves power; it also allows unused slices of the buffer to be selectively switched off at run time to save power. The same techniques apply to the crossbar and the inter-router links. As a result of this 3D partitioning, the area of the router decreases and the available link bandwidth per router increases; this excess bandwidth is leveraged in this work to construct express physical links, which are shown to significantly improve performance. In addition, with the reduced crossbar size and link length, the latencies of both the crossbar and link stages decrease, and the two can be combined into one pipeline stage without violating timing constraints. The proposed architecture with express links performs best among the architectures examined (2D NoC, 3D hop-by-hop NoC, and the proposed architecture with and without express links): the latency improvements are up to 51% and 38%, and the power improvements are up to 42% and 67%, for synthetic traffic and real workloads, respectively [72].
Figure 6.5: Decomposition of the input buffer [85]. (a) The 2D/3D baseline input buffer; (b) the 3D decomposed input buffer; and (c) the decomposed buffer with the unused portion powered off.
6.2 3D NOC TOPOLOGY DESIGN
Figure 6.6: Decomposition of the crossbar [85]: the 2D baseline crossbar (2DB), the 3D decomposed crossbar (3DM), and the 3D decomposed crossbar with support for express channels (3DM-E), which slightly increases the crossbar size.
In a 3D mesh-based NoC, multiple PEs in a layer can share one router to reduce the network diameter (for example, an eight-port router, with four ports to local PEs and the other four to the cardinal directions). With such a topology, the 3D NoC-bus hybrid approach would result in a 9-port router design. Such high-radix routers are power-hungry and have degraded performance, even though the hop count between PEs is reduced. Consequently, a topology-router co-design method for 3D NoC is desirable, so that both the hop count between any two PEs and the radix of the 3D router are kept as small as possible. Xu et al. [73] proposed a 3D NoC topology with low hop count (i.e., low diameter) and low-radix routers: one level of the 2D mesh is replaced with a network of long links connecting nodes that are at least m mesh-hops apart, where m is a design parameter. In such a topology, long-distance communication can leverage the long physical wires and the vertical links to reach its destination, achieving a low total hop count while keeping the radix of the router low. For application-specific NoC architectures, Yan et al. [79] also proposed a 3D NoC synthesis algorithm, based on a rip-up and reroute formulation for routing flows and a router-merging procedure for network optimization, to reduce the hop count.
6.4 IMPACT OF 3D TECHNOLOGY ON NOC DESIGNS
Since TSVs contend with active devices for silicon area, there are constraints on the number of such vias per unit area. Consequently, NoC design should be performed holistically, in conjunction with other system components, such as the power delivery and clock networks, that contend for the same vertical interconnect resources.
3D integration using TSVs (through-silicon vias) can be classified into two categories: (1) the monolithic approach and (2) the stacking approach. The first involves a sequential device process, in which the front-end processing (to build the device layer) is repeated on a single wafer to build multiple active device layers before the back-end processing builds the interconnects among devices. The second approach (which could be wafer-to-wafer, die-to-wafer, or die-to-die stacking) processes each active device layer separately using conventional fabrication techniques; the multiple device layers are then assembled into 3D ICs using bonding technology. Dies can be bonded face-to-face (F2F) or face-to-back (F2B). The microbumps in face-to-face bonding do not go through a thick buried silicon layer and can therefore be fabricated with a higher pitch density. In stacking-based bonding, the dimensions of the TSVs are not expected to scale at the same rate as the feature size, because the alignment tolerance and the thinned die/wafer height during bonding limit the scaling of the vias.
The TSV (or micropad) size, length, and pitch density, as well as the bonding method (face-to-face or face-to-back bonding, SOI-based or bulk-CMOS-based 3D), can have a significant impact on 3D NoC topology design. For example, relatively large TSVs can hinder partitioning a design at very fine granularity across multiple device layers, making the true 3D router design less feasible. On the other hand, monolithic 3D integration provides more flexibility in the vertical connections, because the vertical vias can potentially scale down with feature size thanks to the use of local wires for the connections; the availability of such technology makes it possible to partition a design at a very fine granularity. Furthermore, face-to-face bonding and SOI-based 3D integration offer a smaller via pitch and higher via density than face-to-back bonding and bulk-CMOS-based integration. The influence of these 3D technology parameters on NoC topology design should be thoroughly studied, and suitable NoC topologies for different 3D technologies should be identified with respect to performance, power, thermal, and reliability optimization.
CHAPTER 7
Thermal Analysis and Thermal-Aware Design
7.1 THERMAL ANALYSIS
3D Thermal Analysis Based on the Finite Difference Method or Finite Element Method. Sapatnekar et al. proposed a detailed 3D thermal model [89]. The heat equation (7.1), a parabolic partial differential equation (PDE), describes on-chip thermal behavior at the macroscale:

ρ c_p ∂T(r, t)/∂t = k_t ∇²T(r, t) + g(r, t),     (7.1)
where ρ is the density of the material (in kg/m³), c_p is the heat capacity of the chip material (in J/(kg·K)), T is the temperature (in K), r is the spatial coordinate of the point at which the temperature is determined, t is time (in s), k_t is the thermal conductivity of the material (in W/(m·K)), and g is the power density per unit volume (in W/m³). The solution of Eq. (7.1) is the transient thermal response. In the steady state, all derivatives with respect to time go to zero and the PDE reduces to the well-known Poisson's equation, whose solution gives the steady-state temperature distribution.
A set of boundary conditions must be added to obtain a well-defined solution to Eq. (7.1); this typically involves building a package macro-model and assuming a constant ambient temperature with which the model interacts. There are two methods to discretize the chip and form a system of linear equations representing the temperature and power density distributions: the finite difference method (FDM) and the finite element method (FEM). The difference between them is that FDM discretizes the differential operator, while FEM discretizes the temperature field. Both can handle complicated material structures, such as non-uniform interconnect distributions in a chip.
FDM adopts heat-transfer theory to build an equivalent thermal circuit through the thermal-electrical analogy. The steady-state equations describe a network in which thermal resistors connect the nodes and thermal current sources map to the power sources. The node voltages, which correspond to temperatures, are computed by solving this circuit; the ground node is treated as a constant-temperature node, typically at the ambient temperature.
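A tiny numerical example makes the thermal-electrical analogy concrete: temperatures at the nodes play the role of voltages, thermal resistances connect neighboring nodes, injected power plays the role of current, and the ambient is the constant-temperature ground. The three-layer stack and all parameter values below are illustrative assumptions, not taken from any of the cited tools.

```python
# Steady-state solve of a small thermal circuit (FDM-style nodal equations).
# Three stacked layers, each dissipating power, with the bottom layer attached
# to a heat sink at ambient temperature. All values are illustrative.
import numpy as np

R_v = 2.0          # vertical thermal resistance between adjacent layers (K/W)
R_sink = 0.5       # resistance from the bottom layer to the heat sink (K/W)
T_amb = 40.0       # ambient temperature (deg C)
P = np.array([10.0, 5.0, 2.0])   # power injected at layers 0 (bottom) .. 2 (top)

# Nodal conductance equations G * T = P + (ground injection):
# layer 0 connects to the sink and layer 1; layer 1 to layers 0 and 2; layer 2 to layer 1.
g, gs = 1.0 / R_v, 1.0 / R_sink
G = np.array([[gs + g, -g,     0.0],
              [-g,      2 * g, -g ],
              [0.0,     -g,     g ]])
rhs = P + np.array([gs * T_amb, 0.0, 0.0])
T = np.linalg.solve(G, rhs)
print(T)   # steady-state layer temperatures; the upper layers run hotter here
```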
Since FDM formulations are similar to power-grid analysis problems, similar solution techniques can be used, such as multigrid-based approaches. Li et al. [90] proposed multigrid (MG) techniques for fast chip-level steady-state and transient thermal simulation. This approach avoids an explicit construction of the matrix problem, which is intractable for most full-chip problems. Specific MG treatments are proposed to cope with the strong anisotropy of the full-chip thermal problem created by the vast differences in material thermal properties and chip geometries. Importantly, this work demonstrates that only with careful thermal modeling assumptions and appropriate choices for the grid hierarchy, MG operators, and smoothing steps across grid points can a full-chip thermal problem be analyzed accurately and efficiently.
The FEM provides another avenue for solving Poisson's equation. In finite element analysis, the design space is first discretized, or meshed, into elements; different element shapes such as tetrahedra and hexahedra can be used. For the on-chip problem, where all heat sources are modeled as rectangular, a reasonable FEM discretization divides the chip into 8-node rectangular hexahedral elements [91]. The temperatures at the nodes of the elements constitute the unknowns computed during finite element analysis, and the temperature within an element is calculated using an interpolation function that approximates the solution of the heat equation within the element.
Compact 3D IC Thermal Modeling. 3D IC thermal analysis with FDM or FEM can be very time-consuming and is therefore unsuitable for design space exploration, where the thermal analysis must be performed iteratively; a compact thermal model for 3D ICs is thus desirable. For traditional 2D designs, one widely used compact thermal model is HotSpot [92], which is based on an equivalent circuit of thermal resistances and capacitances corresponding to microarchitecture components and the thermal package. In the well-known duality between heat transfer and electrical phenomena, heat flow can be described as a "current," while a temperature difference is analogous to a "voltage": the temperature difference is caused by heat flowing through a thermal resistance. Thermal capacitances must also be included to model transient behavior, capturing the delay before the temperature reaches steady state after a change in power consumption. Like electrical RC constants, the thermal RC time constants characterize the rise and fall times determined by the thermal resistances and capacitances. The rationale is that electrical current and heat flow are governed by exactly the same differential equations with respect to a potential difference. These equivalent circuits are called compact models, or dynamic compact models if thermal capacitors are included. For a microarchitecture unit, the dominant mechanisms determining its temperature are heat conduction to the thermal package and to neighboring units.
In HotSpot, the temperature is tracked at the granularity of individual microarchitectural units, and the equivalent RC circuit has at least one node for each unit. The thermal model component values do not depend on the initial temperature or on the particular configuration. HotSpot is a simple library that generates the equivalent RC circuit automatically and computes the temperature at the center of each block, given the power dissipation, over any chosen time step.
Based on the original HotSpot model, a 3D thermal estimation tool named HS3D was introduced [93]. HS3D enables 3D thermal evaluation while leaving the computational model and methods of HotSpot largely unchanged; inter-layer thermal vias are approximated by changing the vertical thermal resistance of the materials. The HS3D library accepts incompletely specified floorplans as input and ensures accurate thermal modeling of large floorplan blocks. Many routines were rewritten to optimize loop accesses, cache locality, and memory paging; these improvements reduce memory usage and cut runtime by over three orders of magnitude when simulating a large number of floorplans. To validate the correctness and efficiency of the HS3D library, it was compared against a commercial FEM tool. First, a 2D sample device and package were used for verification: the difference in average chip temperature between HS3D and the FEM software is only 0.02°C. Multi-layer (3D) device modeling was then verified using 10 µm thick silicon layers and 2 µm thick interlayer material, with a test case of two layers containing a sample processor in each layer. The experimental results show that the average temperature misestimation is 3°C. In addition, the thermal analysis takes seven minutes with the FEM software but only one second with HS3D, indicating that HS3D provides not only high accuracy but also high performance (low runtime). The HS3D extensions were integrated into later versions of HotSpot to support 3D thermal modeling.
Figure 7.1: (a) Tile stack array; (b) single tile stack; (c) tile stack analysis (R_lateral denotes the lateral thermal resistance between neighboring tile stacks).
Cong and Zhang derived a closed-form compact thermal model for thermal via planning in 3D ICs [94]. In their thermal resistive model, a tile structure is imposed on the circuit stack, with each tile the size of a via pitch, as shown in Fig. 7.1(a). Each tile stack contains a column of tiles, one for each device layer, as shown in Fig. 7.1(b); a tile either contains one via at its center or no via at all. A tile stack is modeled as a resistive network, as shown in Fig. 7.1(c): a voltage source represents the isothermal base, and current sources in each silicon layer represent the heat sources. Neighboring tile stacks are connected by lateral resistances. The values of the resistances in the network are determined using a commercial FEM-based thermal simulation tool.
To further improve analysis accuracy without sacrificing efficiency, Yang et al. proposed an incremental and adaptive chip-package thermal analysis flow named ISAC [95]. During thermal analysis, both time complexity and memory usage grow linearly or superlinearly with the number of thermal elements. ISAC therefore incorporates an efficient technique for adapting the spatial resolution of the thermal elements during analysis: incremental refinement generates a tree of heterogeneous rectangular parallelepipeds that supports fast thermal analysis without loss of accuracy. Within ISAC, this technique is combined with an efficient multigrid numerical analysis method, yielding a comprehensive steady-state thermal analysis solution.
7.2 THERMAL-AWARE FLOORPLANNING FOR 3D PROCESSORS
(3) Move, which moves a module.
The first three perturbations are the original moves defined in [101]. Since these moves only affect the floorplan within a single layer, additional interlayer moves, (5) and (6), are needed to explore the 3D floorplan solution space.
Figure 7.2: (a) An example floorplan; (b) the corresponding B*-tree [101].
Temperature Approximation
Although HS3D [93] can be used to provide temperature feedback, it is not wise to invoke the time-consuming temperature calculation for every candidate solution evaluated during the simulated annealing procedure. Instead of using the actual temperature values, we adopt the power density metric as a thermal-conscious mechanism in our floorplanner. The temperature is heavily dependent on power density, based on the general temperature-power relation T = P · R = P · (t/(k · A)) = (P/A) · (t/k) = d · (t/k), where t is the thickness of the chip, k is the thermal conductivity of the material, R is the thermal resistance, and d is the power density. Thus, according to the equation above, we can substitute power density for temperature to approximate the 3-tie temperature function C_T = (T − T_o)/T_o proposed in [97] to reflect the thermal effect on a chip. As such, the 3-tie power density function is defined as ΔP = (P_max − P_avg)/P_avg, where P_max is the maximum power density and P_avg is the average power density over all modules. The cost function for the 2D architecture used in simulated annealing can be written as

cost = α · area + β · wl + γ · ΔP.     (7.3)

Table 7.1: Floorplanning results of 2D architecture

             2D                                2D (thermal)
Circuit   wire (µm)  area (mm²)  peakT (°C)    wire (µm)  area (mm²)  peakT (°C)
Alpha      339672     29.43      114.50         381302     29.68      106.64
xerox      542926     19.69      123.75         543855     19.84      110.45
hp         133202      8.95      119.34         192512      8.98      116.91
ami33       44441      1.21      128.21          51735      1.22      116.97
ami49      846817     37.43      119.42         974286     37.66      108.86
For 3D architectures, we adopt the same temperature approximation for each layer as the horizontal thermal consideration. However, since a 3D architecture has multiple layers, the horizontal consideration alone is not enough to capture the vertical coupling of heat. The vertical relation among modules must also be taken into account and is defined as OP(TP_m) = Σ_i (P_m + P_m_i) × overlap_area(m, m_i), where OP(TP_m) is the sum, over all modules m_i in other layers that overlap module m, of their combined power density (P_m + P_m_i) multiplied by the corresponding overlapped area.
The rationale behind this is that, for a module with relatively high power density in one layer, we want to minimize its accumulated power density contributed by overlapping modules located in other layers. We can define the set of modules to be inspected, so the total overlap power density is TOP = Σ_i OP(TP_i) over all modules in this set. The cost function for the 3D architecture is thus modified as follows:

cost = α · area + β · wl + γ · dev(F) + δ · ΔP + ε · TOP.     (7.4)
At the end of algorithm execution, the actual temperature profile is reported by our HS3D tool.
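As an illustration of how these thermal terms enter the annealing objective, the sketch below (in Python, with an illustrative module representation rather than our actual C++ floorplanner data structures) computes ΔP, TOP, and the weighted cost of Eq. (7.4); the weighting values and the overlap_area helper are assumptions.

```python
def delta_p(power_densities):
    """3-tie power density term of Eq. (7.3): (Pmax - Pavg) / Pavg."""
    p_max = max(power_densities)
    p_avg = sum(power_densities) / len(power_densities)
    return (p_max - p_avg) / p_avg

def total_overlap_power(modules, overlap_area):
    """TOP: for each inspected module m, accumulate (P_m + P_mi) weighted by
    the area by which module mi (on a different layer) overlaps m."""
    top = 0.0
    for m in modules:
        for mi in modules:
            if mi is m or mi["layer"] == m["layer"]:
                continue
            top += (m["density"] + mi["density"]) * overlap_area(m, mi)
    return top

def cost_3d(area, wl, dev_f, dp, top, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted 3D cost of Eq. (7.4); the weights stand in for the Greek
    coefficients in the text and would be tuned per design."""
    alpha, beta, gamma, lam, delta = weights
    return alpha * area + beta * wl + gamma * dev_f + lam * dp + delta * top
```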
Experimental Results
We implemented the proposed floorplanning algorithm in C++. The thermal model is based on HS3D. In order to effectively explore the architecture-level interconnect power consumption of a modern microprocessor, we need a detailed model that can represent current-generation high-performance microprocessor designs. We have used IVM (http://www.crhc.uiuc.edu/ACS/tools/ivm), a register-transfer-level Verilog implementation of an Alpha-like architecture (denoted as Alpha in the rest of this section), to evaluate the impact of both interconnect and module power consumption at the granularity of functional modules. A diagram of the processor is shown in Fig. 7.4. Each functional block in Fig. 7.4 represents a module used in our floorplanner. The registers between pipeline stages are also modeled but not shown in the figure.

Table 7.2: Floorplanning results of 3D architecture

Circuit   3D: wire (um)   area (mm²)   peakT (°C)   3D(thermal): wire (um)   area (mm²)   peakT (°C)
Alpha     210749          15.49        135.11       240820                   15.94        125.47
xerox     297440          9.76         137.51       294203                   9.87         127.31
hp        124819          4.45         137.90       110489                   4.50         134.39
ami33     27911           0.613        165.61       27410                    0.645        155.57
ami49     547491          18.55        137.71       56209                    18.71        132.69
The microprocessor implementation has been mapped to a commercial 160 nm standard cell library with Design Compiler, and placed and routed with First Encounter under a 1 GHz performance requirement. In total, 34 functional modules and 168 nets were extracted from the processor design. The area and power consumption from the actual layout served as inputs to our algorithm. Besides the Alpha processor, we have also used the MCNC benchmarks to verify our approach. An approach similar to [99] is used to assign the average power density of each module in the range of 2.2×10^4 W/m² to 2.4×10^6 W/m². The total net power is assumed to be 30% of the total module power, due to the lack of such information for the MCNC benchmarks, and the total wire length used for scaling during floorplanning is the average over 100 test runs that consider the area factor alone. The widely used half-perimeter bounding box model is adopted to estimate the wire length. Throughout the experiments, a two-layer 3D architecture was assumed, due to the limited number of functional modules and the excessively high power density beyond two layers; however, our approach is capable of dealing with multi-layer architectures.
Tables 7.1 and 7.2 show the experimental results of our approach when considering the traditional metrics (area and wire) and the thermal effect. When taking the thermal effect into account, our thermal-aware floorplanner reduces the peak temperature by 7% on average, while increasing the wirelength by 18% and providing a comparable chip area, compared to the floorplan generated using the traditional metrics alone.
When we move to 3D architectures, the peak temperature increases by 18% on average compared to the 2D floorplan, due to the increased power density. However, the wire length and chip area are reduced by 32% and 50%, respectively. The adverse effect of the increased power density in the 3D design can be mitigated by our thermal-aware 3D floorplanner, which lowers the peak temperature by an average of 8 degrees with little area increase, compared to the 3D floorplanner that does not account for thermal behavior. As expected, the chip temperature is higher when we step from 2D to 3D architecture without thermal consideration. Although the wire length is reduced when moving to 3D, and interconnect power consumption is accordingly reduced, the temperature of the 3D architecture is still relatively high due to the accumulated power densities and the smaller chip footprint. After applying our thermal-aware floorplanner, the peak temperature is lowered to a moderate level through the separation of high-power-density modules into different layers.
Register Files: Each 64-bit entry in the register file is partitioned across four layers, with the least significant bits closest to the heat sink. The width prediction is used to determine the gating control signals for the other three layers before the actual register file access. If a low-width instruction is predicted, then only the top-layer portion of the register file is active. In this case, the power density is similar to that of a 2D register file design. A width memorization bit is also provided for each entry in the top layer, indicating whether the remaining three layers contain non-zero values. The processor compares this bit to the predicted width when reading the entry. If the predicted width is low but the actual width is full, then the processor stalls the stages preceding the register file and activates the logic on the other three layers. It also corrects the prediction to prevent further stalls.
Arithmetic Units: Only the 3D integer adder is presented, but the concept can be extended to other arithmetic units. A tree-based adder is partitioned into four layers, with the least significant bits residing on the top layer. The processor uses the width prediction information to decide whether the clock gating for the other three layers should be enabled. Two possible unsafe width mispredictions must be handled properly. One is a misprediction on an instruction's input operands. In this scenario, since the arithmetic unit is not fully active at the start of execution, a one-cycle stall is needed to activate the upper 48 bits. The other scenario is a misprediction on the output, in which case the width misprediction is not known at the beginning of the computation. The instruction needs to be executed again to guarantee correctness, causing a performance penalty. Therefore, the accuracy of the width predictor is very important.
Bypass Network: The bypass network does not need additional circuitry to handle the width prediction, since unsafe mispredictions are handled by the arithmetic units. With a correct low-width prediction, only the drivers/wires on the top layer consume dynamic power. In addition, the 3D partitioning reduces the wire length, so the latency and power are reduced accordingly.
Instruction Scheduler: In the instruction scheduler, there is one reservation station (RS) entry for each instruction that has been dispatched but not yet executed. When an instruction is ready to issue, it broadcasts its destination identifier to notify dependent instructions. The instruction scheduler is partitioned based on the RS entries so that the length of the broadcast buses is significantly reduced, resulting in latency and power reductions. A modified allocation algorithm is also adopted to move instructions to the top layer so that the active entries are closer to the heat sink. If there are no available entries in the top layer, the allocator checks the layer that is next closest to the heat sink.
Load and Store Queues: The load and store queues are used to track the data and addresses of in-flight memory instructions. The queues are partitioned in a similar fashion to the main datapath because of their similarity to the register file. Load and store addresses are normally full-width values, but the upper bits of the addresses do not change frequently. To take advantage of this behavior, a partial address memorization (PAM) scheme is proposed. The low-order 16 bits of a load or store's address are broadcast on the top layer, together with an indication of whether the remaining 48 bits are the same as those of the most recent store address. To summarize, the PAM approach tries to herd the address broadcasts and comparisons to the top layer.
Data Cache: The L1 data cache is organized in a word-partitioned manner because of its similarity to the register file. Memorization bits are also provided to detect unsafe width mispredictions. In addition, the definition of a "low-width" value for load and store instructions is broadened to increase the frequency of low-width values. Two bits, instead of a single width memorization bit, are stored to encode the upper 48 bits. Value "00" means the upper 48 bits are all zeros. Value "01" indicates they are all ones. Value "10" means the upper bits are identical to those of the referencing address. Value "11" means the upper bits are not encodeable.
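A minimal sketch of this two-bit encoding follows (the function name and the software formulation are illustrative; the hardware would do this with simple comparators):

```python
def encode_upper_bits(value64, ref_addr_upper48):
    """Two-bit encoding of the upper 48 bits of a 64-bit cached word:
    0b00 = all zeros, 0b01 = all ones, 0b10 = same as the upper bits of the
    referencing address, 0b11 = not encodeable (the literal bits must be kept
    on the other layers)."""
    upper = (value64 >> 16) & ((1 << 48) - 1)
    if upper == 0:
        return 0b00
    if upper == (1 << 48) - 1:
        return 0b01
    if upper == ref_addr_upper48:
        return 0b10
    return 0b11
```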
Front End: Data-centric approaches for partitioning the front end are not effective, since no data values are handled here. The register alias table (RAT) is implemented so that the ports of each instruction are placed on different layers, avoiding unnecessary die-to-die vias. The instruction that requires the most register name comparisons is placed on the top layer, so that most switching activity is moved to the top layer. For a branch predictor based on two-bit saturating counters, the counters are partitioned into two separate arrays: a direction-bit array and a hysteresis array. The more frequently used direction-bit array is placed on the layer closer to the heat sink. Since most branch targets are located relatively close to the originating branch, the branch target buffers (BTBs) are organized like the data cache. The low-order 16 bits are placed on the top layer
with one extra target memorization bit indicating whether the bits on the other three layers should
be accessed.
3D thermal herding techniques can improve IPC by reducing the pipeline depth and the L2 latency, but they can also reduce IPC due to width mispredictions. Experiments on several benchmarks show that the overall performance is improved, since the pipeline-reduction benefits outweigh the performance penalties caused by width mispredictions. The power consumption is also reduced for two reasons: wire lengths are shortened by 3D stacking, and thermal herding reduces switching activity through clock gating. The thermal experiments show that thermal herding techniques successfully control power density and mitigate 3D thermal issues.
Conclusion. Increasing power density in 3D ICs can result in higher on-chip temperatures, which can have a negative impact on the performance, power, reliability, and cost of the chip. Consequently, thermal modeling and thermal-aware design techniques are critical for future 3D architecture designs. This chapter presented an overview of thermal modeling for 3D ICs and outlined architecture design schemes to overcome the thermal challenges.
CHAPTER 8
Cost Analysis for 3D ICs

The adoption of 3D integration raises several cost-related questions for designers:
• Do all the benefits of 3D IC design come with a much higher cost? For example, 3D bonding incurs extra process cost, and the Through-Silicon Vias (TSVs) may increase the total die area, which has a negative impact on the cost; however, the smaller die sizes in 3D ICs may result in higher yield than that of a larger 2D die, and thus reduce the cost.
• How can 3D integration be done in a cost-effective way? For example, re-designing a small chip may not gain the cost benefit of the improved yield resulting from 3D integration. In addition, if a chip is to be implemented in 3D, how many layers of 3D integration would be cost effective? And should one use wafer-to-wafer or die-to-wafer stacking [5]?
• Are there any design options to compensate for the extra 3D bonding cost? For example, in a 3D IC, since some global interconnects are now implemented by TSVs, it may be feasible to use fewer metal layers for each 2D die. In addition, heterogeneous integration enabled by 3D could also help reduce cost.
Cost analysis for 3D ICs at the early design stage is critical to answer these questions. Considering that 3D integration has both positive and negative effects on the manufacturing cost, there are no clear answers yet on whether 3D integration is cost-effective or how to achieve 3D integration in a cost-effective way. In order to maximize the positive effects and compensate for the negative effects on cost, it becomes critical to analyze the 3D IC cost at the early design stage. This early-stage cost analysis can help chip designers decide whether 3D integration should be used, or which 3D partitioning strategy should be adopted. Cost-efficient design is key to the future wide adoption of emerging 3D IC design, and 3D IC cost analysis needs close coupling between the 3D IC design and the 3D IC process.¹
¹IC cost analysis needs a close interaction between designers and foundry. We work closely with our industrial partners to perform 3D IC cost
analysis. However, we cannot disclose absolute numbers for the cost, and therefore in this book, we either use arbitrary units (a.u.) or normalized
values to present the data.
In this chapter, we first propose a 3D IC cost analysis methodology with a complete set of
cost models that include wafer cost, 3D bonding cost, package cost, and cooling cost. Using this
cost analysis methodology along with the existing performance, power, area estimation tools, such
as McPAT [103], we estimate the 3D IC cost in two cases: one is for fully customized ASICs,
the other is for many-core microprocessor designs. Through these case studies, we derive some guidelines on what a cost-effective 3D IC fabrication option should look like in the future.
Note that 3D integration is not yet a mature technology with very well-developed and
tested cost models; the optimal condition concluded by this work is subject to parameter changes.
However, the major contribution of this work is to provide a cost estimation methodology for
3D ICs. To the best of our knowledge, this is the first effort to model the 3D IC cost with package and
cooling cost included, while the majority of the existing 3D IC research activities mainly focused
on the circuit performance and power consumption.
In this chapter, we propose a complete set of 3D cost models, which includes not only the wafer cost and the 3D bonding cost, but also two other critical cost components, the package cost and the cooling cost. Neither of them should be ignored, because while 3D integration can potentially reduce the package cost by having a smaller chip footprint, the multiple 3D-stacked dies might increase the cooling cost due to the increased power density. By using the proposed cost model set, we conduct two case studies on fully customized ASIC and many-core microprocessor designs, respectively. The experimental results show that our 3D IC cost model makes it feasible to estimate the final system-level cost of the target chip design at the early design stage. More importantly, it gives some guidelines to determine what a cost-effective 3D fabrication option should be.
type of microprocessor cores (e.g., in-order or out-of-order); and (3) the number of cache levels and the cache capacity of each level. All these specifications are at the architectural level. Based on previous design experience, it is feasible to obtain a rough estimate of the gate count for logic-intensive cores and the cell count for memory-intensive caches, respectively.
Consequently, it is very likely that at the early design stage, the cost estimation is simply based on the design style and a rough gate count as the initial starting point. In this section, we describe how to translate the logic gate count (or the memory cell count) into higher-level estimates, such as the die area, the metal layer count, and the TSV count, given a specified fabrication process node. The die area estimator and the metal layer estimator are two key components in our cost model: (i) a larger die area usually causes lower die yield, and thus leads to higher chip cost; (ii) the fewer metal layers required, the fewer fabrication steps (and fabrication masks) needed, which reduces the chip cost. It is also important to note that the 3D partitioning strategy can directly affect these two numbers: (1) 3D integration can partition the original 2D design into several smaller dies; (2) TSVs can potentially shorten the total interconnect length and thus reduce the number of metal layers.
Die Area Estimation. At the early design stage, the relationship between the die area and the gate count can be roughly described as follows,

A_die = N_gate × A_gate    (8.1)

where N_gate is the gate count and A_gate is an empirical parameter that captures the proportional relationship between area and gate count. Based on empirical data from many industrial designs, in this work we assume that A_gate is equal to 3125λ², where λ is half of the feature size of the specific technology node. Although this area estimation methodology is straightforward and highly simplified, it accords with real-world measurements quite well.²
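A quick sketch of this estimator follows; the unit handling and the helper name are mine, not part of the model itself.

```python
def die_area_mm2(n_gate, feature_size_nm):
    """Eq. (8.1): A_die = N_gate * A_gate, with A_gate = 3125 * lambda^2 and
    lambda equal to half the feature size of the technology node."""
    lam_mm = 0.5 * feature_size_nm * 1e-6      # half feature size, in mm
    a_gate = 3125.0 * lam_mm ** 2              # empirical per-gate area, mm^2
    return n_gate * a_gate

# e.g. die_area_mm2(125e6, 65) gives a rough area for a 125-million-gate
# design at 65 nm, on the order of a few hundred mm^2.
```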
Wire Length Estimation. As the only input at the early design stage is the gate count estimate, it is necessary to further estimate the design complexity, which can be represented by the total length of the wire interconnects. Rent's Rule [105] is a well-known and powerful tool that reveals the trend between the number of signal terminals and the number of internal gates. Rent's Rule can be expressed as follows,

T = k × N_gate^p    (8.2)

where k and p are Rent's coefficient and exponent, and T is the number of signal terminals. Although Rent's Rule is an empirical result based on observations of previous designs, and it is not proper to use it for non-traditional designs, it does provide a useful framework to compare similar architectures, and it is plausible to use it as part of our 3D IC cost model.
²Note that it is necessary to update A_gate as technology advances, because the gate length does not scale strictly linearly.
Based on Rent's Rule, Donath discovered that it can be used to estimate the average wire length [106], and Davis et al. found that it can be used to estimate the wire length distribution in VLSI chips [107]. As a result, in this work we use the derivation of Rent's Rule to predict the wire length distribution function i(l) [107], which has the following form,

Region I (1 ≤ l ≤ √N_gate):
i(l) = (αk/2) × (l³/3 − 2√N_gate × l² + 2N_gate × l) × l^(2p−4)

Region II (√N_gate ≤ l < 2√N_gate):
i(l) = (αk/6) × (2√N_gate − l)³ × l^(2p−4)    (8.3)

where l is the interconnect length in units of the gate pitch, and α is the fraction of the on-chip terminals that are sink terminals, related to the average fanout of a gate (f.o.) as follows,

α = f.o. / (f.o. + 1)    (8.4)
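The distribution is easy to evaluate numerically; the sketch below assumes illustrative Rent parameters (k = 4, p = 0.6) and an average fanout of 3, which are placeholders rather than the calibrated values of Table 8.1.

```python
import math

def wire_length_distribution(l, n_gate, k=4.0, p=0.6, fanout=3.0):
    """i(l) of Eq. (8.3); l is measured in gate pitches."""
    alpha = fanout / (fanout + 1.0)                      # Eq. (8.4)
    sqrt_n = math.sqrt(n_gate)
    if 1.0 <= l <= sqrt_n:                               # Region I
        return (alpha * k / 2.0) * (l ** 3 / 3.0
                                    - 2.0 * sqrt_n * l ** 2
                                    + 2.0 * n_gate * l) * l ** (2.0 * p - 4.0)
    if sqrt_n < l < 2.0 * sqrt_n:                        # Region II
        return (alpha * k / 6.0) * (2.0 * sqrt_n - l) ** 3 * l ** (2.0 * p - 4.0)
    return 0.0
```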
Metal Layer Estimation. After estimating the die area and the wire length, we are able to further predict the number of metal layers required to route all the interconnects within the die area constraint. The number of required metal layers depends on the complexity of the interconnects. A simplified metal layer estimate can be derived from the average wire length [104] as follows,

n_wire = (f.o. × R̄_m × w / η) × √(N_gate / A_die)    (8.6)

where f.o. refers to the average gate fanout, w to the wire pitch, η to the utilization efficiency of the metal layers, R̄_m to the average wire length (in gate pitches), and n_wire to the number of required metal layers.

Such a simplified model is based on the assumptions that each metal layer has the same utilization efficiency and the same wire width [104]. However, such assumptions may not be valid in real designs [108]. To improve the estimation of the number of metal layers needed for feasible routing, a more sophisticated metal layer estimation method is derived from the wire length distribution rather than from the simplified average wire length. The basic flow of this method is as follows (a short numerical sketch is given after the list):
• Estimate the available routing area of each metal layer with the expression:

K_i = (η_i × A_die − 2 × A_v,i × N_gate × f.o. × I(l_i)) / w_i    (8.7)

where i is the metal layer index, η_i is the layer's utilization efficiency, w_i is the layer's wire pitch, A_v,i is the layer's via blockage area, and the function I(l) is the cumulative integral of the wire length distribution function i(l) given in Eq. 8.3.
• Assume that shorter interconnects are routed on lower metal layers. Starting from Metal 1, route as many interconnects as possible on the current metal layer until its available routing area is used up. The interconnects routed on metal layer i are those with lengths between l_{i−1} and l_i, where the boundary l_i satisfies

χ × (L(l_i) − L(l_{i−1})) = K_i    (8.8)

Here χ = 4/(f.o. + 3) is a factor accounting for the sharing of wires between interconnects on the same net [107, 109], and the function L(l) is the first-order moment of i(l).

• Repeat the same calculation for each metal layer in a bottom-up manner until all the interconnects are routed.
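The following sketch walks through that bottom-up assignment numerically. It deliberately ignores the via-blockage term of Eq. (8.7) and uses placeholder values for the utilization efficiency and wire pitch, so it illustrates the flow rather than reproducing the calibrated model.

```python
import math

def estimate_metal_layers(n_gate, a_die_m2, i_of_l, fanout=3.0,
                          eta=0.4, wire_pitch_m=2.0e-7, max_layers=20):
    """Bottom-up metal layer estimate: assign the shortest unrouted wires to
    the current layer until its routable wire length is exhausted, then move
    on to the next layer, until every interconnect is placed."""
    chi = 4.0 / (fanout + 3.0)                      # wire-sharing factor
    gate_pitch = math.sqrt(a_die_m2 / n_gate)       # metres per gate pitch
    l, l_max, dl = 1.0, 2.0 * math.sqrt(n_gate), 1.0
    for layer in range(1, max_layers + 1):
        capacity = eta * a_die_m2 / wire_pitch_m    # routable wire length (m)
        used = 0.0
        while l < l_max and used < capacity:
            # wires of length l (gate pitches) demand chi * i(l) * l of track
            used += chi * i_of_l(l, n_gate) * l * gate_pitch * dl
            l += dl
        if l >= l_max:
            return layer
    return max_layers

# Example (hypothetical numbers):
# estimate_metal_layers(5e6, 1e-4, wire_length_distribution)
```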
By applying the estimation methodology introduced above, we can predict the die area and the number of metal layers at the early design stage, when we only have the gate count as the input. Table 8.1 lists the values of all the related parameters [108, 110]. Figure 8.1 shows an example that estimates the area and the number of metal layers of 65nm designs with different gate counts.
Figure 8.1 also shows an important implication for 3D IC cost reduction: when a large 2D chip is partitioned into multiple smaller dies with 3D stacking, each smaller die requires fewer metal layers to satisfy the interconnect routability requirements. Such a metal layer reduction can potentially offset the cost of the extra steps caused by 3D integration, such as 3D bonding and test.
TSV Estimation. The existence of TSVs in a 3D IC affects the wafer cost as well. These effects are twofold:
1. In 3D ICs, some global interconnects are now implemented by TSVs going between stacked dies. This can reduce the total wire length, and provides opportunities for metal layer reduction on each smaller die;
2. On the other hand, 3D stacking with TSVs may increase the total die area, since the silicon area where TSVs punch through may not be utilized for building devices or 2D metal layer connections.³
³Based on current TSV fabrication technologies, the diameter of TSVs ranges from 0.2µm to 10µm [113]. In this work, we use a TSV diameter of 10µm, and assume that the keepout area diameter is 2.5X the TSV diameter, which is 25µm.
Figure 8.1: An example of early design-stage estimation of the die area and the metal layer count under a 65nm process [111]. This estimation correlates well with state-of-the-art microprocessor designs. For example, the Sun SPARC T2 [112] contains about 500 million transistors (roughly equivalent to 125 million gates), with an area of 342mm² and 11 metal layers.
While the first effect has already been illustrated by Fig. 8.1, which shows that the existence of TSVs leads to a potential metal layer reduction in the horizontal direction, modeling the second effect requires extra work: estimating the number of TSVs and their associated silicon area overhead.

To predict the number of required TSVs for a certain partition pattern, a derivation of Rent's Rule [106] describing the relationship between the interconnect count (X) and the gate count (N_gate) can be used. This interconnect-gate relationship is formulated as follows,

X = α × k × N_gate × (1 − N_gate^(p−1))    (8.9)

When the design is partitioned into two dies with gate counts N_1 and N_2, the interconnect counts internal to the two dies are

X_1 = α × k_1 × N_1 × (1 − N_1^(p_1−1))    (8.10)
X_2 = α × k_2 × N_2 × (1 − N_2^(p_2−1)).    (8.11)
Table 8.1: The parameters used in the metal layer estimation model
Figure 8.2: The basic idea of how to estimate the number of TSVs [111].
Irrespective of the partition pattern, the total interconnect count of a given design always remains constant. Thus, the number of TSVs can be calculated as follows,

X_TSV = α × k_1,2 × (N_1 + N_2) × (1 − (N_1 + N_2)^(p_1,2 − 1))
        − α × k_1 × N_1 × (1 − N_1^(p_1 − 1)) − α × k_2 × N_2 × (1 − N_2^(p_2 − 1))    (8.12)

where k_1,2 and p_1,2 are the equivalent Rent's coefficient and exponent, derived as follows [114],

p_1,2 = (p_1 × N_1 + p_2 × N_2) / (N_1 + N_2)    (8.13)

k_1,2 = (k_1^N_1 × k_2^N_2)^(1/(N_1 + N_2))    (8.14)
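A compact numerical sketch of Eqs. (8.9) through (8.14) follows; the Rent parameters passed in are assumptions, and the weighted geometric mean of Eq. (8.14) is computed in log space to avoid overflow for realistic gate counts.

```python
import math

def tsv_count(n1, n2, k1, k2, p1, p2, fanout=3.0):
    """TSVs needed when a design is split into two dies (Eq. 8.12)."""
    alpha = fanout / (fanout + 1.0)

    def interconnects(n, k, p):
        # Eqs. (8.9)-(8.11): X = alpha * k * N * (1 - N^(p-1))
        return alpha * k * n * (1.0 - n ** (p - 1.0))

    n12 = n1 + n2
    p12 = (p1 * n1 + p2 * n2) / n12                                 # Eq. (8.13)
    k12 = math.exp((n1 * math.log(k1) + n2 * math.log(k2)) / n12)   # Eq. (8.14)
    x12 = alpha * k12 * n12 * (1.0 - n12 ** (p12 - 1.0))
    return x12 - interconnects(n1, k1, p1) - interconnects(n2, k2, p2)
```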
Figure 8.3: Overview of the proposed 3D cost model, which includes four components: wafer cost
model, 3D bonding cost model, package cost model, and cooling cost model [111].
Accounting for this overhead, the TSV-included die area is

A_3D = A_die + N_TSV/die × A_TSV    (8.15)

where A_die is calculated by the die area estimator, N_TSV/die is the equivalent number of TSVs on each die, and A_TSV is the area of a TSV (including its keepout area). In practice, the final TSV-included die area, A_3D, is used later in the wafer cost estimation, while A_die is used in Eq. 8.7 for metal layer estimation, because that is the die area actually available for routing.
After estimating the die area of each layer, the number of required metal layers, and the number of TSVs across the planar dies in the previous section, a complete set of cost models is proposed in this section. Our 3D IC cost estimation is composed of four separate parts: the wafer cost model, the 3D bonding cost model, the package cost model, and the cooling cost model. An overview of the proposed cost model set is illustrated in Fig. 8.3.
Wafer Cost Model. The wafer cost model estimates the cost of each separate planar die. This cost estimation includes all the cost incurred before the 3D bonding process. Before estimating the die cost, we first model the wafer cost.

The wafer cost is determined by several parameters, such as the fabrication process type (i.e., CMOS logic or DRAM memory), the process node (from 180nm down to 22nm), the wafer diameter (200mm or 300mm), and the number of metal layers (polysilicon, aluminum, or copper, depending on the fabrication process type). We model the wafer cost by dividing it into two parts: a fixed part that is determined by the process type, process node, wafer diameter, and the actual fabrication vendor, and a variable part that is mainly attributed to the number of metal layers. We call the fixed part the silicon cost and the variable part the metal cost. Therefore, the wafer cost is expressed as follows:

C_wafer = C_silicon + C_metal × N_metal    (8.16)

where C_wafer is the wafer cost, C_silicon is the silicon cost (the fixed part), C_metal is the metal cost per layer, and N_metal is the number of metal layers. All the parameters used in this model are taken from those collected by IC Knowledge LLC [115], which considers several factors including material cost, labor cost, foundry margin, number of reticles, cost per reticle, and other miscellaneous costs. Figure 8.4 shows the predicted wafer cost of 90nm, 65nm, and 45nm processes, with 9 or 10 metal layers, for three different foundries.
Besides the number of metal layers affecting the cost, as demonstrated by Eq. 8.16, the die area is another key factor that affects the bare die cost, since a smaller die area implies more dies per wafer. The number of dies per wafer is formulated by Eq. 8.17 [116] as follows,

N_die = π × (φ_wafer / 2)² / A_die − π × φ_wafer / √(2 × A_die)    (8.17)

where N_die is the number of dies per wafer, φ_wafer is the diameter of the wafer, and A_die is the die area. Having estimated the wafer cost by Eq. 8.16 and the number of dies per wafer by Eq. 8.17, the bare die cost can be expressed as follows,

C_die = C_wafer / N_die.    (8.18)

In addition, the die area also relates to the die yield, which in turn affects the net die cost. Assuming a rectangular defect density distribution, the relationship between the die area and the die yield can be formulated as follows [117],

Y_die = Y_wafer × (1 − e^(−2 × A_die × D_0)) / (2 × A_die × D_0)    (8.19)

where Y_die and Y_wafer are the yields of dies and wafers, respectively, and D_0 is the wafer defect density. Therefore, the net die cost is C_die / Y_die.
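Putting Eqs. (8.16) through (8.19) together, a small sketch of the net die cost calculation follows; the default yield and defect density numbers are placeholders, not the IC Knowledge data.

```python
import math

def net_die_cost(c_silicon, c_metal_per_layer, n_metal,
                 wafer_diameter_mm, a_die_mm2, y_wafer=0.98, d0_per_mm2=0.002):
    """Wafer cost (Eq. 8.16), dies per wafer (Eq. 8.17), die yield (Eq. 8.19)
    and the resulting net die cost C_die / Y_die."""
    c_wafer = c_silicon + c_metal_per_layer * n_metal                  # Eq. (8.16)
    n_die = (math.pi * (wafer_diameter_mm / 2.0) ** 2 / a_die_mm2
             - math.pi * wafer_diameter_mm / math.sqrt(2.0 * a_die_mm2))  # Eq. (8.17)
    c_die = c_wafer / n_die                                            # Eq. (8.18)
    y_die = y_wafer * (1.0 - math.exp(-2.0 * a_die_mm2 * d0_per_mm2)) \
            / (2.0 * a_die_mm2 * d0_per_mm2)                           # Eq. (8.19)
    return c_die / y_die
```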
Figure 8.4: A batch of data calculated by the wafer cost model [111]. The wafer cost varies across different processes, numbers of metal layers, foundries, and other factors.
3D Bonding Cost Model. The 3D bonding cost model estimates the cost incurred during the process that integrates several planar 2D dies together using TSV-based technology. The extra fabrication steps required by 3D integration consist of TSV forming, thinning, and bonding.

There are two ways to build TSVs: laser drilling or etching. Laser drilling is only suitable for a small number of TSVs (hundreds to thousands), while etching is suitable for a large number of TSVs. Furthermore, there are two approaches for TSV etching: (1) TSV-first: TSVs are formed during the 2D die fabrication process, before the Back-End-of-Line (BEOL) processes, as shown in Fig. 8.5(a); (2) TSV-last: TSVs are formed after the completion of 2D fabrication, after the BEOL processes, as shown in Fig. 8.5(b). Either
Figure 8.5: Fabrication steps for 3D ICs: (a) TSVs are formed before the BEOL process, thus TSVs only punch through the silicon substrate but not the metal layers; (b) TSVs are formed after the BEOL process, thus TSVs punch through not only the silicon substrate but the metal layers as well [111].
approach has advantages and disadvantages. The TSV-first approach builds TSVs before the metal layers; thus there are no TSVs punching through metal layers and hence no TSV area overhead. The TSV-last approach has TSV area overhead, but it isolates the die fabrication from the 3D bonding, so the traditional fabrication process does not need to change. In order to separate the wafer cost model and the 3D bonding cost model, we assume that the TSV-last approach is used in 3D IC fabrication. The TSV area overhead caused by the TSV-last approach is modeled by Eq. 8.15. The data for the 3D bonding cost, including the cost of TSV forming, wafer thinning, and wafer (or die) bonding, are obtained from our industry partner, with the assumption that the yield of each 3D process step is 99%.
Combined with the wafer cost model, the 3D bonding cost model can be used to estimate the cost of a 3D-stacked chip with multiple dies. In addition, the entire 3D-stacked chip cost depends on some design options. For instance, it depends on whether die-to-wafer (D2W) or wafer-to-wafer (W2W) bonding is used, and on whether face-to-face (F2F) or face-to-back (F2B) bonding is used. If D2W bonding is selected, the cost of the Known-Good-Die (KGD) test should also be included [118].
For D2W bonding, the cost of a bare N-layer 3D-stacked chip before packaging is calculated as follows,

C_D2W = [ Σ_{i=1..N} (C_die,i + C_KGDtest) / Y_die,i + (N − 1) × C_bonding ] / Y_bonding^(N−1)    (8.20)

where C_KGDtest is the KGD test cost, which we model as C_wafersort / N_die, and the wafer sort cost, C_wafersort, is a constant value for a specific process in a specific foundry in our model. Note that the testing cost for 3D ICs is itself a complicated problem. Our other study has demonstrated a more detailed test cost analysis model with design-for-test (DFT) circuitry and various testing strategies, showing the variation of testing cost estimates [119]. For example, adding extra DFT circuitry to improve the yield of each die in D2W stacking can help the cost
reduction, but the increased area may increase the cost. In this book, we adopt the above testing cost assumptions to simplify the total cost estimation, due to space limitations.
For W2W bonding, the cost is calculated as follows,

C_W2W = [ Σ_{i=1..N} C_die,i + (N − 1) × C_bonding ] / [ (Π_{i=1..N} Y_die,i) × Y_bonding^(N−1) ].    (8.21)
In order to support multiple-layer bonding, the default bonding mode is F2B. If F2F mode is used, there is one die that does not need the thinning process, and the thinning cost of this die is subtracted from the total cost.
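A sketch of Eqs. (8.20) and (8.21) follows; the per-die costs, yields, and the 99% per-step bonding yield are supplied by the caller rather than reproduced here.

```python
def stacked_chip_cost_d2w(die_costs, die_yields, c_bonding, c_kgd_test,
                          y_bonding=0.99):
    """Eq. (8.20): bare cost of an N-layer D2W stack before packaging."""
    n = len(die_costs)
    numer = sum((c + c_kgd_test) / y for c, y in zip(die_costs, die_yields)) \
            + (n - 1) * c_bonding
    return numer / y_bonding ** (n - 1)

def stacked_chip_cost_w2w(die_costs, die_yields, c_bonding, y_bonding=0.99):
    """Eq. (8.21): bare cost of an N-layer W2W stack before packaging."""
    n = len(die_costs)
    prod_yield = 1.0
    for y in die_yields:
        prod_yield *= y
    numer = sum(die_costs) + (n - 1) * c_bonding
    return numer / (prod_yield * y_bonding ** (n - 1))
```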
Package Cost Model. The package cost model estimates the cost incurred during the packaging process, which is determined by three factors: package type, package area, and pin count.

As an example, we select the package type to be flip-chip Land Grid Array (fcLGA) in this book. fcLGA is a common package type for microprocessors, while other types of packages, such as fcPGA, PGA, and pPGA, are also available in our model. Besides the package type, the package area and the pin count are the other two key factors that determine the package cost. Using the early-stage estimation methodology described in Sec. 8.1, the package area can be estimated from the die area, A_3D in Eq. 8.15, and the pin count can be estimated by Rent's Rule in Eq. 8.2. We analyzed actual package cost data and found that the number of pins becomes the dominant factor in the package cost when the die area is much smaller than the total area of the pin pads. This is easy to understand, since there is a base material and production cost per pin that is not reduced as the die area shrinks.
Figure 8.6 shows the sample package cost data obtained from IC Knowledge LLC [115]. Based on this set of data, we use curve fitting to derive a package cost function (Eq. 8.22) with the die area, A_3D, and the pin count, N_p, as parameters; the two coefficients and the exponent of the fitted function are 0.00515, 0.141, and 0.35, respectively. From Eq. 8.22, we can also see that the pin count dominates the package cost if the die area is sufficiently small, since the order of the pin count term is higher than that of the area term. Figure 8.6 also shows the curve fitting result, which is well aligned with the raw data.
Cooling Cost Model. Although the combination of the aforementioned wafer cost model, 3D bonding cost model, and package cost model can already estimate the cost of a 3D-stacked chip after packaging, we add a cooling cost model, because it is widely recognized that a 3D-stacked chip has a higher working temperature than its 2D counterpart and might need a more powerful cooling solution, which incurs extra cost.

Gunther et al. [120] noted that the cooling cost is related to the power dissipation of the system. Furthermore, the system power dissipation is highly related to the chip working tem-
Figure 8.6: The package cost depends on both pin count and die area [111].
perature [69]. Our cooling cost estimation is based on the peak steady state temperature of the
targeted 3D-stacked multiprocessor.
There are many types of cooling solutions, ranging from a simple extruded aluminum heatsink to an elaborate vapor-phase refrigeration system. Depending on the cooling mechanism, the cooling solutions used today can be classified as convection cooling, phase-change cooling, thermoelectric cooling (TEC), or liquid cooling [121]. Typical convection cooling solutions are heatsinks and fans, which are widely adopted for the microprocessor chips in today's desktop computers. Phase-change cooling solutions, such as heatpipes, might be used in laptop computers. In addition, thermoelectric cooling and liquid cooling are used in some high-end computers. We collected the costs of these cooling solutions from the commercial market by searching the data from Digikey [122] and Heatsink Factory [123]. As expected, we find that more powerful types of cooling solutions generally lead to higher costs. Table 8.2 lists typical prices of these cooling solutions.
In our model, we further assume that the cooling cost increases linearly with the rise of chip temperature if the same type of cooling solution is adopted. Based on this assumption and the data listed in Table 8.2, the cooling cost is estimated as follows:

C_cooling = K_c × ΔT + c    (8.23)

where ΔT is the temperature by which the cooling solution must bring the chip down, and K_c and c are the cooling cost parameters, which are determined from Table 8.3.

Table 8.2: The cost of various cooling solutions

Table 8.3: The values of K_c and c in Eq. (8.23), which depend on the chip temperature

Chip Temperature (°C)   K_c   c
< 60                    0.2   6
60–90                   0.4   16
90–120                  0.2   2
120–150                 1.6   170
150–180                 2     230
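A small sketch of the band lookup behind Eq. (8.23) follows; the band table repeats the values printed in Table 8.3, and how ΔT is measured for a given design is left to the caller.

```python
def cooling_cost(delta_t, peak_temp_c, bands):
    """Eq. (8.23): C_cooling = K_c * dT + c, with (K_c, c) chosen by the
    temperature band that the chip's peak steady-state temperature falls in."""
    for upper_bound, k_c, c in sorted(bands):
        if peak_temp_c < upper_bound:
            return k_c * delta_t + c
    raise ValueError("temperature beyond the modeled cooling range")

# Rows (upper bound in deg C, K_c, c) as printed in Table 8.3:
TABLE_8_3 = [(60, 0.2, 6), (90, 0.4, 16), (120, 0.2, 2),
             (150, 1.6, 170), (180, 2.0, 230)]
```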
Figure 8.7 shows the cost of these five types of cooling solutions. It can be observed that chips with higher steady-state temperatures require more powerful cooling solutions, which lead to higher costs. Figure 8.7 also illustrates that the cooling cost is not a globally linear function of the temperature; rather, there are several regions within which the cost increases linearly. Each region corresponds to one type of cooling solution.
Figure 8.7: A plot of the cooling cost model: the cooling cost increases linearly with the chip temperature if the same cooling solution is used; more powerful cooling solutions result in higher costs [111].
and memory). In this section, we demonstrate how to use the remaining parts of the proposed 3D cost model to estimate the cost of 3D many-core microprocessors. We use the 45nm IBM Common Platform cost model in this case study.
Figure 8.8: Structure of the baseline 2D multiprocessor, which consists of 64 tiles [111].
Two 3D partitioning strategies are considered: homogeneous and heterogeneous. As shown in Fig. 8.9(a), all the layers after homogeneous partitioning are identical, while heterogeneous partitioning leads to core (logic) layers and cache (memory) layers, as shown in Fig. 8.9(b).
Table 8.4: The configuration of the SPARC-like core in a tile of the baseline 2D multiprocessor
Processor Cores
Clock frequency 1.2 GHz
Architecture type in-order
INT pipeline 6 stages
FP pipeline 6 stages
ALU count 1
FPU count 1
Empirical gate count 1.7 million
Routers
Type 4 ports
Empirical gate count 1.0 million
Caches
Cache line size 64B
L1 I-cache capacity 32 KB
L1 D-cache capacity 32 KB
L2 cache capacity 256 KB
Empirical cell count 2.6 million
where N_tiles/layer is the number of tiles per layer, the factor of 2 accounts for the bi-directional channel between two tiles, and 50% is added due to our assumption that half of the TSVs are used for power delivery. After this calculation, the TSV area overhead is around 3.8%.
Figure 8.10: The conceptual view of a 4-layer homogeneous stacking composed of 4 identical layers [111].
Figure 8.11: The cost breakdown of a 16-core microprocessor design with different homogeneous partitionings [111].
In the next step, we use the cost model set to estimate the cost of the wafer, 3D bonding, package, and cooling, respectively. We analyze microprocessor designs of different scales. Figures 8.11 to 8.14 show the estimated cost breakdown of 16-core, 32-core, 64-core, and 128-core microprocessors when they are fabricated with 1-layer (2D), 2-layer, 4-layer, 8-layer, and 16-layer processes, respectively. In these figures, each cost is divided into five parts (the first bar, which represents the 2D planar process, only has three parts, as the 2D process has only one die and does not need the bonding process). From bottom up, they are the cost of one die, the cost of the remaining dies, the bonding cost, the package cost, and the cooling cost.

As the figures illustrate, the cost of one die decreases as the number of layers increases. This descending trend is mainly due to the yield improvement brought by the smaller die size. However, this trend flattens at the end, since the yield improvement is not significant
Figure 8.12: The cost breakdown of a 32-core microprocessor design with different homogeneous partitionings [111].
Figure 8.13: The cost breakdown of a 64-core microprocessor design with different homogeneous partitionings [111].
any more when the die size is sufficiently small. As a result, the cumulative cost of all the dies does not follow a descending trend. Combined with the 3D bonding cost, which is an ascending function of the layer count, the pre-packaging cost usually reaches its minimum at some intermediate point. For example, if only the wafer cost and the 3D bonding cost are considered, the optimal numbers of layers for the 16-core, 32-core, 64-core, and 128-core designs are 2, 4, 4, and 8, respectively.

However, the cost before packaging is not the final cost. System-wise, the final cost needs to include the package cost and the cooling cost. As mentioned, the package cost is mainly determined by the pin count. Thus, the package cost remains almost constant in all the cases, since the pin count does not change from one partitioning to another. The cooling cost mainly depends on the chip temperature. As more layers are vertically stacked, the chip temperature rises, so the cooling cost grows with the number of 3D layers; this growth is quite fast, because a 3D-stacked chip with more than eight layers can easily exceed a peak temperature of 150°C. In fact, the extreme cases in this experiment (such as 16-layer
Figure 8.14: The cost breakdown of a 128-core microprocessor design with different homogeneous partitionings [111].
stacking) are not practical at all, as the underlying chip is severely overheated. In this experiment, we first obtain the power data of each component (the core, router, L1 cache, and L2 cache) from a power estimation tool, McPAT [103]. Combined with the area data obtained from the early-design-stage estimation, we get the power density data, feed them into HotSpot [121], a 3D-aware thermal estimation tool, and finally obtain the estimated chip temperature and the corresponding cost of the cooling solution. Figures 8.11 to 8.14 also show the cost of packaging and cooling. As we can see, the cooling cost is just a small portion of the total for few-layer stacking (such as 2-layer), but it starts to dominate the total cost for aggressive stacking (such as 8-layer and 16-layer). Therefore, the optimal partitioning in terms of the minimum total cost differs from the one based only on the cost before packaging. From the results, the optimal numbers of layers for the 16-core, 32-core, 64-core, and 128-core designs become 1, 2, 2, and 4, respectively. This result also explains our motivation for including the package cost and the cooling cost in the decision on the optimal number of 3D stacking layers.
Figure 8.15: The conceptual view of a 4-layer heterogeneous stacking composed of two logic layers and two memory layers [111].
For the cache (memory) modules, our metal layer estimation shows that the required number of metal layers is only 6, while the logic modules (such as cores and routers) usually need more than 10 metal layers. Hence, we reevaluate the many-core microprocessor cost using heterogeneous integration, in which each tile is broken into logic parts (i.e., core and router) and memory parts (i.e., L1 and L2 caches).
Figure 8.15 illustrates a conceptual view of 4-layer heterogeneous partitioning, where logic modules (cores and routers) and memory modules (L1 and L2 caches) are on separate dies. As shown in Fig. 8.15, there are two types of TSVs in the heterogeneous stacking: one is for the NoC mesh interconnect, which is the same as in the homogeneous stacking case; the other is for the interconnect between cores and caches, which is caused by the separation of the logic and memory modules. The number of these extra core-to-cache TSVs is calculated as follows,

N_extraTSV = (Data + Σ_i Address_i) × N_tiles/layer    (8.25)

where Data is the cache line size (in bits) and Address_i is the address width of cache i. Note that the number of tiles per layer, N_tiles/layer, is doubled in the heterogeneous partitioning compared to its homogeneous counterpart. For our baseline configuration (Table 8.4), in which the L1 I-cache is 32KB, the L1 D-cache is 32KB, the L2 cache is 256KB, and the cache line size is 64B, Data + Σ_i Address_i is 542. As a side effect, the extra core-to-cache TSVs increase the TSV area overhead from 3.8% to 15.7%.
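A one-line sketch of Eq. (8.25) follows; the function name is mine, and the example splits the 30 address bits among the three caches arbitrarily so that, with a 64-byte (512-bit) line, the per-tile term matches the 542 quoted above.

```python
def extra_core_to_cache_tsvs(cache_line_bits, address_widths, tiles_per_layer):
    """Eq. (8.25): the data bus plus the address buses of every cache,
    replicated for each tile on a layer."""
    return (cache_line_bits + sum(address_widths)) * tiles_per_layer

# Example with placeholder address widths summing to 30 bits:
# extra_core_to_cache_tsvs(512, [10, 10, 10], 32) == 542 * 32
```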
Compared to its homogeneous counterpart, the advantages and disadvantages of partitioning an NoC-based many-core microprocessor in the heterogeneous way are:
• Fabricating logic and memory dies separately reduces the wafer cost;
• Putting the logic dies closer to the heat sink reduces the cooling cost;
Figure 8.16: The cost breakdown of a 16-core microprocessor design with different heterogeneous partitionings [111].
Figure 8.17: The cost breakdown of a 32-core microprocessor design with different heterogeneous partitionings [111].
• The higher TSV area overhead increases the wafer cost and the package cost.
Figure 8.18: The cost breakdown of a 64-core microprocessor design with different heterogeneous partitionings [111].
Figure 8.19: The cost breakdown of a 128-core microprocessor design with different heterogeneous partitionings [111].
dies cannot offset the extra cost caused by the increased die area, the cooling cost is reduced by putting the logic layers closer to the heat sink (or other cooling solution).
While our experiment shows that heterogeneous integration has cost advantages from cheaper memory layers and lower chip temperature, in reality some designers might still prefer homogeneous integration, which has an identical layout on each die. The reciprocal design symmetry (RDS) technique was proposed by Alam et al. [126] to re-use one mask set for multiple 3D-stacked layers. RDS can relieve the design effort and thus reduce the design time, which means a cost reduction in human resources. However, this personnel cost is not included in our cost model set and is out of the scope of this book.
Figure 8.20: The cost comparison between homogeneous and heterogeneous integration for a 64-core microprocessor using a 45nm 2-layer 3D process [111].
Table 8.5: The optimal number of 3D layers for many-core microprocessor designs (45nm)
Homogeneous Heterogeneous
16-core 1 1
32-core 2 2
64-core 2 4
128-core 4 8
We list the optimal partitioning options for 45nm many-core microprocessor designs in
Table 8.5, which shows how our proposed cost estimation methodology helps decide the 3D
partitioning at the early design stage.
Conclusion. To overcome the barriers in technology scaling, three-dimensional integrated circuits (3D ICs) are emerging as an attractive option for future microprocessor designs. However, fabrication cost is one of the important considerations for the wide adoption of 3D integration. System-level cost analysis at the early design stage is therefore critical, as it helps decide whether 3D integration should be used for a given application.
To facilitate the early design stage cost analysis, we propose a set of cost models that in-
clude wafer cost, 3D bonding cost, package cost, and cooling cost. Based on the cost analysis, we
identify the design opportunities for cost reduction in 3D ICs, and provide a few design guide-
lines on cost-effective 3D IC designs. Our research is complementary to the existing research on 3D microprocessors that focuses on other design goals, such as performance and power consumption.
CHAPTER 9
Conclusion
In this book, we have reviewed the background of emerging 3D integration technologies and explored various architecture designs that employ 3D integration. 3D integration technologies promise high-performance, low-power, low-cost, and high-density microprocessor architecture solutions. 3D integration is an attractive approach for developing high-performance, energy-efficient, thermal-aware, and cost-effective chip-multiprocessors and GPU systems. In particular, for chip-multiprocessors, 3D integration provides low wire latency in connecting processor cores and caches. For GPU systems, 3D integration is promising for developing high-bandwidth, low-power graphics memory interfaces. 3D integration also enlarges the capacity of on-chip memory, which can be employed as the last-level cache, as a portion of main memory, or as a combination of both.
Although 3D integration brings great benefits and opportunities to architecture design, several challenges remain on the way toward its wide adoption in future computer systems. This book has discussed two of them, thermal and cost, and explored thermal-aware and cost-aware microprocessor design techniques. Beyond these two challenges, the following two design challenges are also critical for the wide adoption of 3D integration technologies.

Design tools and methodologies. 3D integration technologies will not be commercially viable without the support of electronic design automation (EDA) tools and methodologies. Given particular design goals, efficient EDA tools and methodologies can help architects and circuit designers decide whether to adopt 3D or 2D integration. They can also help designers address design trade-offs in performance, power, and cost when 3D integration is adopted. Furthermore, 3D ICs may require new layout rules that might be driven by features on adjacent dies; with their larger sizes compared with conventional vias, TSVs introduce significant new layout features; power planning in 3D ICs still requires substantial study due to the complexity introduced by the third dimension; place-and-route tools need to take the thermal constraints of 3D ICs into consideration to avoid hot spots; and efficient analysis tools are required to handle electromagnetic interference concerns in 3D ICs. To efficiently exploit the benefits of 3D technologies, design tools and methodologies that support 3D integration are imperative [6].
Testing methodologies. In 3D IC design, various testing strategies and integration methods can affect system performance, power, and cost dramatically [127]. An obstacle to the adoption of 3D technologies is therefore the insufficient understanding of 3D testing issues and the lack of design-for-testability (DFT) techniques for 3D ICs. Without considering test during the design phase, effective tests on 3D ICs become impossible. It is therefore crucial to study options in 3D testing methodologies. Both new standards and design tools are required to diagnose issues properly. 3D IC fabrication incorporates many more intermediate steps than conventional 2D IC fabrication, such as die stacking and TSV bonding. These extra steps require wafer tests before final assembly and packaging. However, wafer testing for 3D ICs is challenging in three ways. First, existing probe technology cannot handle the fine pitch and dimensions of TSV tips, and is limited to several hundred probes, far fewer than the number of TSVs that need to be contacted. Second, a wafer test may require creating a known-good-die (KGD) stack, which risks damage when the wafer probe contacts the highly thinned wafer. Third, 3D ICs can also introduce intra-die defects, which may be caused by thermal issues and by new manufacturing steps such as wafer thinning and bonding the top of a TSV to another wafer; consequently, new fault models are required to address this issue. While 3D IC testing is a crucial problem, it has remained largely unexplored in the research community and requires substantial research effort.
Bibliography
[1] J. Zhao, C. Xu, and Y. Xie, “Bandwidth-aware reconfigurable cache design with hybrid
memory technologies,” in Proceedings of the International Conference on Computer-Aided
Design (ICCAD), 2011, pp. 48–55. DOI: 10.1109/ICCAD.2011.6105304. xi
[2] “MICRON, micron collaborates with Intel to enhance Knights landing with a high per-
formance, on-package memory solution,” http://investors.micron.com/releasedetail.cfm?ReleaseID=856057. xii
[3] “NVIDIA,” http://blogs.nvidia.com/blog/2013/03/19/nvidia-ceo-updates-nvidias-roadmap/. xii, 45
[4] “AMD, what is heterogeneous system architecture (HSA)?” http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/. xii
[5] Y. Xie, G. Loh, B. Black, and K. Bernstein, “Design space exploration for 3D architectures,” ACM Journal of Emerging Technologies in Computing Systems, 2006. DOI: 10.1145/1148015.1148016. 2, 7, 10, 16, 53, 65, 73
[6] Y. Xie, J. Cong, and S. Sapatnekar, Three-Dimensional Integrated Circuit Design: EDA, Design and Microarchitectures. Springer, 2009. DOI: 10.1007/978-1-4419-0784-4. 2, 97
[7] P. Garrou, Handbook of 3D Integration: Technology and Applications using 3D Integrated
Circuits. Wiley-CVH, 2008, ch. Introduction to 3D Integration. 2, 3
[8] W. R. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, A. M. Sule, M. Steer, and
P. D. Franzon, “Demystifying 3D ICs: the pros and cons of going vertical,” IEEE Design
and Test of Computers, vol. 22, no. 6, pp. 498– 510, 2005. DOI: 10.1109/MDT.2005.136.
3
[9] J. Burns, G. Carpenter, E. Kursun, R. Puri, J. Warnock, and M. Scheuermann, “Design, CAD and technology challenges for future processors: 3D perspectives,” in Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, June 2011, pp. 212–212. DOI: 10.1145/2024724.2024772. 4, 5
[10] J. Joyner, P. Zarkesh-Ha, and J. Meindl, “A stochastic global net-length distribution for a
three-dimensional system-on-a-chip (3D-SoC),” in Proc. 14th Annual IEEE International
ASIC/SOC Conference, Sep. 2001. DOI: 10.1109/ASIC.2001.954688. 7
[11] Y.-f. Tsai, F. Wang, Y. Xie, N. Vijaykrishnan, and M. J. Irwin, “Design space exploration
for three-dimensional cache,” IEEE Transactions on Very Large Scale Integration Systems,
2008. DOI: 10.1109/ASIC.2001.954688. 7, 10
[12] B. Vaidyanathan, W.-L. Hung, F. Wang, Y. Xie, V. Narayanan, and M. J. Irwin, “Ar-
chitecting microprocessor components in 3D design space,” in Intl. Conf. on VLSI Design,
2007, pp. 103–108. DOI: 10.1109/VLSID.2007.41. 7, 8, 13, 23
[13] K. Puttaswamy and G. H. Loh, “Scalability of 3D-integrated arithmetic units in high-
performance microprocessors,” in Design Automation Conference, 2007, pp. 622–625. DOI:
10.1109/DAC.2007.375238.
[14] J. Ouyang, G. Sun, Y. Chen, L. Duan, T. Zhang, Y. Xie, and M. Irwin, “Arithmetic unit
design using 180nm TSV-based 3D stacking technology,” in IEEE International 3D System
Integration Conference, 2009. DOI: 10.1109/3DIC.2009.5306565. 8
[15] R. Egawa, J. Tada, H. Kobayashi, and G. Goto, “Evaluation of fine grain 3D integrated
arithmetic units,” in IEEE International 3D System Integration Conference, 2009. DOI:
10.1109/3DIC.2009.5306566. 7, 8
[16] B. Black et al., “Die stacking 3D microarchitecture,” in MICRO, 2006, pp. 469–479. DOI:
10.1109/MICRO.2006.18. 8, 9
[17] G. H. Loh, “3d-stacked memory architectures for multi-core processors,” in In-
ternational Symposium on Computer Architecture (ISCA), 2008, pp. 453–464. DOI:
10.1145/1394608.1382159. 8, 9
[18] P. Jacob et al., “Mitigating memory wall effects in high clock rate and multi-core CMOS 3D ICs: Processor memory stacks,” Proceedings of the IEEE, vol. 96, no. 10, 2008.
[19] G. Loh, Y. Xie, and B. Black, “Processor design in three-dimensional die-stacking tech-
nologies,” IEEE Micro, vol. 27, no. 3, pp. 31–48, 2007. DOI: 10.1109/MM.2007.59. 8
[20] T. Kgil, S. D’Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Rein-
hardt, and K. Flautner, “PicoServer: using 3D stacking technology to enable a com-
pact energy efficient chip multiprocessor,” in ASPLOS, 2006, pp. 117–128. DOI:
10.1145/1168857.1168873. 9, 37
[21] S. Vangal et al., “An 80-tile Sub-100-W TeraFLOPS processor in 65-nm CMOS,”
IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 29–41, 2008. DOI:
10.1109/JSSC.2007.910957. 9
[22] G. Loh, “Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy,” in International Symposium on Microarchitecture (MICRO), Dec. 2009. DOI: 10.1145/1669112.1669139. 9
[23] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, N. Vijaykrishnan, and M. Kandemir, “Design
and management of 3D chip multiprocessors using network-in-memory,” in International
Symposium on Computer Architecture (ISCA’06), 2006. DOI: 10.1145/1150019.1136497.
10
[24] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen, “Circuit and microarchitecture
evaluation of 3D stacking Magnetic RAM (MRAM) as a universal memory replacement,”
in Design Automation Conference, 2008, pp. 554–559. DOI: 10.1145/1391469.1391610.
10
[25] X. Wu, J. Li, L. Zhang, E. Speight, and Y. Xie, “Hybrid cache architec-
ture,” in International Symposium on Computer Architecture (ISCA), 2009. DOI:
10.1145/1555815.1555761. 10, 13
[26] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, “A novel 3D stacked MRAM cache archi-
tecture for CMPs,” in International Symposium on High Performance Computer Architecture,
2009. DOI: 10.1109/HPCA.2009.4798259. 10
[28] X. Dong and Y. Xie, “Cost analysis and system-level design exploration for 3D ICs,” in Asia
and South Pacific Design Automation Conference, 2009. DOI: 10.1145/1509633.1509700.
11
[30] G. H. Loh, “Extending the effectiveness of 3D-stacked DRAM caches with an adaptive
multi-queue policy,” in MICRO 42: Proceedings of the 42nd Annual IEEE/ACM Interna-
tional Symposium on Microarchitecture. New York, NY, USA: ACM, 2009, pp. 201–212.
DOI: 10.1145/1669112.1669139. 13, 27, 28, 29
[33] M. Ghosh and H.-H. S. Lee, “Smart Refresh: An enhanced memory controller design for
reducing energy in conventional and 3D die-stacked DRAMs,” in MICRO 40: Proceed-
ings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. Wash-
ington, DC, USA: IEEE Computer Society, 2007, pp. 134–145. DOI: 10.1109/MI-
CRO.2007.38. 13, 29, 32
[36] K. Puttaswamy and G. H. Loh, “Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3D-integrated processors,” in HPCA ’07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture. Washington, DC, USA: IEEE Computer Society, 2007, pp. 193–204. DOI: 10.1109/HPCA.2007.346197. 25, 26
[37] B. Black, D. W. Nelson, C. Webb, and N. Samra, “3D processing technology and its impact
on iA32 microprocessors,” in ICCD ’04: Proceedings of the IEEE International Conference on
Computer Design. Washington, DC, USA: IEEE Computer Society, 2004, pp. 316–318.
13
[38] G. Sun, X. Dong, Y. Xie et al., “A novel architecture of the 3D stacked MRAM L2 cache
for CMPs,” in Proceedings of the International Symposium on High Performance Computer
Architecture, 2009, pp. 239–249. DOI: 10.1109/HPCA.2009.4798259. 13
[40] Y. Tsai, Y. Xie, V. Narayanan, and M. J. Irwin, “Three-dimensional cache design exploration using 3DCacti,” Proceedings of the IEEE International Conference on Computer Design (ICCD 2005), pp. 519–524, 2005. DOI: 10.1109/ICCD.2005.108. 15, 17, 18, 19, 20, 53
[41] P. Shivakumar et al., “Cacti 3.0: An Integrated Cache Timing, Power, and Area Model,”
Western Research Lab Research Report, 2001. 15
[42] Y. Ma, Y. Liu, E. Kursun, G. Reinman, and J. Cong, “Investigating the effects of fine-grain three-dimensional integration on microarchitecture design,” J. Emerg. Technol. Comput. Syst., vol. 4, pp. 17:1–17:30, November 2008. [Online]. Available: http://doi.acm.org/10.1145/1412587.1412590 DOI: 10.1145/1412587.1412590. 24
[43] T. Kgil, A. Saidi, N. Binkert, S. Reinhardt, K. Flautner, and T. Mudge, “PicoServer: Using 3D stacking technology to build energy efficient servers,” J. Emerg. Technol. Comput. Syst., vol. 4, no. 4, pp. 1–34, 2008. DOI: 10.1145/1412587.1412589. 29, 31
[44] C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari, “Bridging the processor-memory performance gap with 3D IC technology,” IEEE Design and Test of Computers, vol. 22, no. 6, pp. 556–564, Nov. 2005. DOI: 10.1109/MDT.2005.134.
[48] P. Shivakumar and N. Jouppi, “Cacti 3.0: An Integrated Cache Timing, Power, and Area Model,” Western Research Lab Research Report, 2001/2. 32
[49] X. Dong, Y. Xie, N. Muralimanohar, and N. P. Jouppi, “Simple but effective heterogeneous
main memory with on-chip memory controller support,” in Proceedings of the International
Conference for High Performance Computing, 2010, pp. 1–11. DOI: 10.1109/SC.2010.50.
33
[54] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler, “A NUCA substrate
for flexible CMP cache sharing,” in Proceedings of the 19th Annual International Conference
on Supercomputing, 2005, pp. 31–40. DOI: 10.1145/1088149.1088154.
[55] N. Rafique, W.-T. Lim, and M. Thottethodi, “Architectural support for operating system-driven CMP cache management,” in Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, 2006, pp. 2–12. DOI: 10.1145/1152154.1152160.
[56] S. Cho and L. Jin, “Managing distributed, shared L2 caches through OS-level page allo-
cation,” in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microar-
chitecture, 2006, pp. 455–468. DOI: 10.1109/MICRO.2006.31. 34
[57] G. Loh and M. Hill, “Supporting very large DRAM caches with compound-access scheduling and MissMap,” IEEE Micro, pp. 70–78, May 2012. DOI: 10.1109/MM.2012.25. 34
[59] C. Chou, A. Jaleel, and M. K. Qureshi, “CAMEO: A two-level memory organization with
capacity of main memory and flexibility of hardware-managed cache,” in Proceedings of the
47th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-47,
2014. DOI: 10.1109/MICRO.2014.63. 36, 37
[60] S. Keckler, W. Dally, B. Khailany, M. Garland, and D. Glasco, “GPUs and the future of parallel computing,” IEEE Micro, vol. 31, no. 5, pp. 7–17, Sept. 2011. DOI: 10.1109/MM.2011.89. 39
[61] Q. Zou, M. Poremba, R. He, W. Yang, J. Zhao, and Y. Xie, “Heterogeneous architecture design with emerging 3D and non-volatile memory technologies,” in Proceedings of the 20th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan. 2015, pp. 785–790. DOI: 10.1109/ASPDAC.2015.7059106. 39
[62] Y.-J. Lee and S. K. Lim, “On GPU bus power reduction with 3D IC technologies,” in Proceedings of the Conference on Design, Automation & Test in Europe, ser. DATE ’14, 2014, pp. 175:1–175:6. DOI: 10.7873/DATE.2014.188. 39
[63] AMD, “AMD Radeon™ HD 7970 graphics,” 2012. [Online]. Available: http://www.amd.com/us/products/desktop/graphics/7000/7970/Pages/radeon-7970.aspx 39
[64] NVIDIA, “Quadro 6000 - workstation graphics card for 3D design, styling, visualization, CAD, and more,” 2010. [Online]. Available: http://www.nvidia.com/object/product-quadro-6000-us.html 39, 41
[65] J. Zhao, G. Sun, G. Loh, and Y. Xie, “Optimizing GPU energy efficiency with 3D die-
stacking graphics memory and reconfigurable memory interface,” ACM Transactions on
Architecture and Code Optimization (TACO), vol. 10, no. 4, pp. 24:1–24:25, 2013. DOI:
10.1145/2541228.2555301. 39
[66] E. Vick, S. Goodwin, G. Cunningham, and D. S. Temple, “Vias-last process technology for thick 2.5D Si interposers,” in 3D Systems Integration Conference, 2012, pp. 1–4. DOI: 10.1109/3DIC.2012.6262990. 40
[67] G. H. Loh, “3D-stacked memory architectures for multi-core processors,” in Proceedings of the International Symposium on Computer Architecture, 2008, pp. 453–464. DOI: 10.1145/1394608.1382159. 40
[68] Hynix, “Hynix GDDR5 SGRAM datasheet,” 2009. [Online]. Available: http://www.hynix.com/products/graphics/ 41
[75] A. Jantsch and H. Tenhunen, Networks on Chip. Kluwer Academic Publishers, 2003. 48
[76] G. De Micheli and L. Benini, Networks on Chips. San Francisco, CA: Morgan Kaufmann, 2006. 48
[81] I. Loi, S. Mitra, T. H. Lee, S. Fujita, and L. Benini, “A low-overhead fault tolerance scheme for TSV-based 3D network on chip links,” in Proceedings of the International Conference on Computer-Aided Design, 2008, pp. 598–602. DOI: 10.1145/1509456.1509589.
[82] V. F. Pavlidis and E. G. Friedman, “3-D topologies for networks-on-chip,” IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 10, pp. 1081–1090, 2007.
DOI: 10.1109/TVLSI.2007.893649. 48
in Proceedings of International Symposium on Computer Architecture, 2006, pp. 4–15. DOI:
10.1145/1150019.1136487. 52
[86] Y. Ye, L. Duan, J. Xu, J. Ouyang, M. K. Hung, and Y. Xie, “3D optical networks-on-chip (NoC) for multiprocessor systems-on-chip (MPSoC),” in IEEE International 3D System Integration Conference, Sep. 2009, pp. 1–6. DOI: 10.1109/3DIC.2009.5306588. 56
[90] P. Li, L. T. Pileggi, M. Asheghi, and R. Chandra, “IC thermal simulation and
modeling via efficient multigrid-based approaches,” IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, vol. 25, pp. 1763–1776, 2006. DOI:
10.1109/TCAD.2005.858276. 60
[93] G. M. Link and N. Vijaykrishnan, “Thermal trends in emerging technologies,” in Proceedings of the International Symposium on Quality Electronic Design, 2006, pp. 625–632. DOI: 10.1109/ISQED.2006.136. 61, 66
[94] J. Cong and Y. Zhang, “Thermal-driven multilevel routing for 3D ICs,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), vol. 1, 2005, pp. 121–126. DOI: 10.1109/ASPDAC.2005.1466143. 62
[95] Y. Yang, Z. Gu, C. Zhu, R. P. Dick, and L. Shang, “ISAC: Integrated space-and-time-adaptive chip-package thermal analysis,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 1, pp. 86–99, 2007. DOI: 10.1109/TCAD.2006.882589. 63
[96] W. Hung, G. Link, Y. Xie, V. Narayanan, and M. J. Irwin, “Interconnect and thermal-aware floorplanning for 3D microprocessors,” in Proceedings of the International Symposium on Quality Electronic Design, 2006. DOI: 10.1109/ISQED.2006.77. 63, 64, 69
[98] C. Chu and D. Wong, “A matrix synthesis approach to thermal placement,” in Proceedings
of the ISPD, 1997. DOI: 10.1145/267665.267708. 64
[99] C. Tsai and S. Kang, “Cell-level placement for improving substrate thermal distribution,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 2000. DOI: 10.1109/43.828554. 64, 68
[100] P. Shiu and S. K. Lim, “Multi-layer floorplanning for reliable system-on-package,” in Proc.
of IEEE International Symposium on Circuits and Systems (ISCAS), 2004. DOI: 10.1109/IS-
CAS.2004.1329460. 64, 66
[101] Y. Chang, Y. Chang, G.-M. Wu, and S.-W. Wu, “B*-trees: a new representation for non-slicing floorplans,” in Proceedings of the Annual ACM/IEEE Design Automation Conference, 2000. DOI: 10.1109/DAC.2000.855354. 64, 65
[102] K. Puttaswamy and G. Loh, “Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3D-integrated processors,” in Proceedings of the IEEE 13th International Symposium on High Performance Computer Architecture (HPCA), Feb. 2007, pp. 193–204. DOI: 10.1109/HPCA.2007.346197. 70
[103] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures,” in Proceedings of the International Symposium on Microarchitecture, 2009. DOI: 10.1145/1669112.1669172. 74, 92
[105] B. S. Landman and R. L. Russo, “On a pin versus block relationship for partitions of logic graphs,” IEEE Transactions on Computers, vol. 20, no. 12, pp. 1469–1479, 1971. DOI: 10.1109/T-C.1971.223159. 75
[107] J. A. Davis, V. K. De, and J. D. Meindl, “A stochastic wire-length distribution for gigascale integration (GSI), Part I: derivation and validation,” IEEE Transactions on Electron Devices, vol. 45, no. 3, pp. 580–589, 1998. DOI: 10.1109/16.661219. 76, 77
[109] P. Chong and R. K. Brayton, “Estimating and optimizing routing utilization in DSM
design,” in Proceedings of the Workshop on System-Level Interconnect Prediction, 1999, pp.
97–102. 77
[110] H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley, 1990.
77
[111] X. Dong, J. Zhao, and Y. Xie, “Fabrication cost analysis and cost-aware design space exploration for 3-D ICs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 12, pp. 1959–1972, Dec. 2010. DOI: 10.1109/TCAD.2010.2062811. 78, 79, 80, 82, 83, 85, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96
[112] M. Tremblay and S. Chaudhry, “A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC(R) processor,” in Proceedings of the International Solid State Circuit Conference, 2008, pp. 82–83. DOI: 10.1109/ISSCC.2008.4523067. 78, 87
[113] G. Loh, Y. Xie, and B. Black, “Processor design in three-dimensional die-stacking tech-
nologies,” IEEE Micro, vol. 27, no. 3, pp. 31–48, 2007. DOI: 10.1109/MM.2007.59. 77
[114] P. Zarkesh-Ha, J. Davis, W. Loh, and J. Meindl, “On a pin versus gate relationship for
heterogeneous systems: heterogeneous Rent’s rule,” in Proceedings of the IEEE Custom In-
tegrated Circuits Conference, 1998, pp. 93–96. DOI: 10.1109/CICC.1998.694914. 79
[115] IC Knowledge LLC., “IC Cost Model, 2009 Revision 0906,” 2009. 81, 84
[116] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits. Prentice-Hall, 2003. 81
[117] B. Murphy, “Cost-size optima of monolithic integrated circuits,” Proceedings of the IEEE, vol. 52, no. 12, pp. 1537–1545, Dec. 1964. DOI: 10.1109/PROC.1964.3442. 81
[118] L. Smith, G. Smith, S. Hosali, and S. Arkalgud, “3D: it all comes down to cost,” in Pro-
ceedings of RTI Conference of 3D Architecture for Semiconductors and Packaging, 2007. 83
[119] Y. Chen, D. Niu, X. Dong, Y. Xie, and K. Chakrabarty, “Testing cost analysis in three-dimensional (3D) integration technology,” Technical Report, CSE Department, Penn State, http://www.cse.psu.edu/yuanxie/3d.html, 2010. 83
[120] S. Gunther, F. Binns, D. M. Carmean, and J. C. Hall, “Managing the impact of increasing
microprocessor power consumption,” Intel Technology Journal, vol. 5, no. 1, pp. 1–9, 2001.
84
[121] W. Huang, K. Skadron, S. Gurumurthi et al., “Differentiating the roles of IR measure-
ment and simulation for power and temperature-aware design,” in Proceedings of the Inter-
national Symposium on Performance Analysis of Systems and Software, 2009, pp. 1–10. DOI:
10.1109/ISPASS.2009.4919633. 85, 92
[122] Digikey, 2009, www.digikey.com. 85
[123] Heatsink Factory, 2009, www.heatsinkfactory.com. 85
[124] X. Dong, X. Wu, G. Sun et al., “Circuit and microarchitecture evaluation of 3D stacking
Magnetic RAM (MRAM) as a universal memory replacement,” in Proceedings of the Design
Automation Conference, 2008, pp. 554–559. DOI: 10.1145/1391469.1391610. 92
[125] S. Borkar, “3D technology: a system perspective,” in Proceedings of the International 3D-
System Integration Conference, 2008, pp. 1–14. 92
[126] S. M. Alam, R. E. Jones, S. Pozder, and A. Jain, “Die/wafer stacking with reciprocal design
symmetry (RDS) for mask reuse in three-dimensional (3D) integration technology,” in
Proceedings of the International Symposium on Quality of Electronic Design, 2009, pp. 569–
575. DOI: 10.1109/ISQED.2009.4810357. 95
[127] Cadence, “3D ICs with TSVs – design challenges and requirements,” http://www.cadence.com/rl/resources/white_papers/3dic_wp.pdf. 97
Authors’ Biographies
YUAN XIE
Yuan Xie received his B.S. degree in electronic engineering from Tsinghua University, Beijing,
in 1997, and his M.S. and Ph.D. degrees in electrical engineering from Princeton University in
1999 and 2002, respectively. He is currently a Professor in the Electrical and Computer Engi-
neering department at the University of California at Santa Barbara. Before joining UCSB in Fall
2014, he was with the Pennsylvania State University from 2003 to 2014, and with IBM Micro-
electronic Division’s Worldwide Design Center from 2002 to 2003. Prof. Xie is a recipient of the National Science Foundation CAREER Award, the SRC Inventor Recognition Award, an IBM Faculty Award, and several Best Paper Awards and Best Paper Award Nominations at IEEE/ACM conferences. He has published more than 100 research papers in journals and refereed conference proceedings in the areas of EDA, computer architecture, VLSI circuit design, and embedded systems. His current research projects include: three-dimensional integrated cir-
cuits (3D ICs) design, EDA, and architecture; emerging memory technologies; low power and
thermal-aware design; reliable circuits and architectures; and embedded system synthesis. He is
currently an Associate Editor for the ACM Journal of Emerging Technologies in Computing Systems (JETC), IEEE Transactions on Very Large Scale Integration Systems (TVLSI), IEEE Transactions on Computer-Aided Design of Integrated Circuits (TCAD), IEEE Design and Test of Computers, and IET Computers and Digital Techniques (IET CDT). He is a Fellow of the IEEE.
JISHEN ZHAO
Jishen Zhao received her B.S. and M.S. degrees from Zhejiang University, and her Ph.D. degree from Pennsylvania State University. She is currently an Assistant Professor at the University of
California, Santa Cruz. Her research interests include a broad range of computer architecture
topics with an emphasis on memory systems, high-performance computing, and energy efficiency.
She is also interested in electronic design automation and VLSI design for three-dimensional
integrated circuits and nonvolatile memories.