A Survey of High-Performance Computing Scaling Challenges
Abstract
Commodity clusters revolutionized high-performance computing when they first appeared two decades ago. As scale and complexity have grown, new challenges in reliability and systemic resilience, energy efficiency and optimization, and software complexity have emerged that suggest the need for re-evaluation of current approaches. This paper reviews the state of the art and reflects on some of the challenges likely to be faced when building trans-petascale computing systems, using insights and perspectives drawn from operational experience and community debates.
Keywords
Reliability, exascale systems, cloud computing, energy efficiency, data center
Finally, the complexity of software development continues to rise. Multidisciplinary models increasingly combine diverse algorithms and data representations, all of which must be optimized for deep memory hierarchies, multicore processors and functional accelerators. The rise of "big data" (Microsoft Research, 2009) further exacerbates application development and performance optimization. This is particularly true given the divergent software ecosystems – programming languages, libraries and toolkits – for data analytics and scientific computing.

Similarly, two architectural paths are emerging in the trans-petascale era as options to address the three key challenges of reliability, energy consumption and software complexity (Geist and Lucas, 2009). The first is characterized by a large number of homogeneous nodes composed of manycore processors. The second is characterized by an order of magnitude smaller number of heterogeneous nodes composed of graphics processing units (GPUs) and central processing units (CPUs).

With this backdrop, the remainder of the paper is organized as follows. We begin in Section 2 with a description of the two architectural paths, how they have emerged and how they are projected to evolve in the trans-petascale era. This is followed in Section 3 by a description of how resilience is potentially addressed on each path and the challenges that architecture poses to resilient computation.

In turn, Section 4 explores how each path approaches energy efficiency and what improvements will be required by each architecture to keep power consumption under 30 MW at 1000 petaflops (i.e., exascale computing). In Section 5, we examine the implications for software complexity on each of the two architectural paths. In each of these sections, we will explore how alternative approaches from commercial cloud computing may provide insights to exascale hardware provisioning, resilience and energy management. Finally, Section 6 summarizes our observations and identifies potential research directions for the computational science community.

2 Two roads into the future

The promise of commodity clusters has been fully realized. Inexpensive, yet powerful, microprocessors and their integration in Beowulf clusters paved the way for today's commodity HPC infrastructure. When coupled with high-capacity secondary storage systems, high-speed networks, inexpensive dynamic random access memory (DRAM) and powerful functional accelerators, these clusters dominate almost all aspects of HPC. We have progressed from leading edge systems containing a few hundred processors to ones now containing tens of thousands of nodes and millions of cores. Yet not all is well in the future of supercomputing.

Today, the end of classical Dennard scaling (Dennard et al., 1974) has profoundly influenced both chip design and system design. A decade ago, multiprocessors emerged to provide continued performance increases, subject to chip power and clock frequency constraints. More recently, accelerators have offered greater operations/joule and higher performance for some data parallel workloads. Looking forward, the semiconductor fabrication challenges of creating ever smaller transistors are increasingly difficult. Fabrication process variation is increasingly leading to performance, energy and resilience variations across multicore chips. The long-term future is uncertain, spanning options as diverse as analog or neuromorphic computing and quantum devices (Fuller and Millett, 2011).

In the nearer term, two diverse architectural designs are emerging in the trans-petascale era. One design is based on systems with heterogeneous nodes. Today, these are typified by nodes combining powerful CPUs with accelerators. Examples of this design are the two largest systems on the TOP500 list. China's Tianhe-2 system, currently ranked first on the TOP500 list, contains 32,000 Intel Xeon CPUs and 48,000 Xeon Phi accelerators (Meuer et al., 2014). Similarly, the Titan system at Oak Ridge National Laboratory contains 18,688 AMD Opteron CPUs and 18,688 NVIDIA Tesla accelerators.

Other basic features of this design are fewer, more powerful nodes, a multi-level memory, and a more complex, split programming model – one for the CPU and one for the accelerator. Typically, the CPU and accelerators each have different types of memory with different performance and fault tolerance characteristics. In the future, the node packaging may get denser, but the basic features of this design are expected to remain largely unchanged.

The second system design is based on homogeneous nodes. The basic features of this design are large numbers of smaller nodes, each with a pool of homogeneous cores and a single level of memory. Examples of the second design are the third through fifth systems on the TOP500 list: the Sequoia system at Lawrence Livermore National Laboratory, the K computer at RIKEN in Japan and the Mira system at Argonne National Laboratory.

Sequoia has just under 100,000 nodes, each with 16 IBM Power cores. The K computer has 88,000 nodes, each with 8-core SPARC processors. Mira has about 50,000 Blue Gene/Q nodes, each with 16 IBM Power cores.

Looking forward, it is likely that the nodes of this second design will be composed of processors with hundreds of low-power cores per node. In addition, the memory architecture may become more complex, with multiple levels and/or coherency domains.
The introduction of "burst buffers" and stacked memory subsystems is an early example of this trend, emphasizing the need for higher bandwidth, greater locality and reduced data movement.

2.1 Exascale challenges

As node counts for trans-petascale clusters continue growing, overall system reliability and energy consumption are increasingly critical issues. Indeed, such large systems are likely to have a mean time to failure of only hours unless more effective and resilient software is developed.

Likewise, the rising energy requirements of ever-larger HPC systems now pose limits on the practicality of their deployment, due to both energy availability and cost. Today's largest systems consume several megawatts of power. For example, China's Tianhe-2 uses 17 MW when fully utilized, and other systems on the TOP500 list are not far behind that level.

The top HPC sites around the world are built for a maximum of about 20 MW of power consumption for a single system, plus the associated cooling capability and power for data storage. Many of these facilities are constrained to specific locations for either national security or historical reasons. This is in striking contrast to commercial cloud data centers, which have been placed to maximize the geographical advantages of free environmental cooling and inexpensive energy (e.g., locating near hydroelectric plants or wind farms).

In HPC circles, the general rule of thumb is that a megawatt-year costs roughly one million US dollars. In the Tianhe-2 case, this would correspond to a cost of US$50,000/day just for the electricity consumed by the system. Simply put, both the annual cost and available infrastructure constrain the amount of electricity trans-petascale systems can consume. Consequently, energy consumption is a major driver in the emergence of the two architectural designs.
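As a quick sanity check of that rule of thumb, the arithmetic is straightforward; the sketch below simply restates the figures quoted above, and actual tariffs, utilization and cooling overheads vary widely by site.

```python
# Back-of-the-envelope check of the megawatt-year rule of thumb quoted above.
# Actual tariffs, utilization and cooling overheads vary widely by site.
MEGAWATT_YEAR_USD = 1_000_000     # rule of thumb: one megawatt-year costs ~US$1M
TIANHE2_POWER_MW = 17             # reported full-load power draw

annual_cost_usd = TIANHE2_POWER_MW * MEGAWATT_YEAR_USD
daily_cost_usd = annual_cost_usd / 365

print(f"Annual electricity cost: ${annual_cost_usd / 1e6:.0f}M")
print(f"Daily electricity cost: ${daily_cost_usd:,.0f}")   # ~$47,000/day, i.e. roughly US$50,000/day
```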
Both energy and resilience have been cited as among the top ten challenges of exascale computing (Fuller and Millett, 2011). Despite this, our software and operating models for petascale clusters remain rooted in their historical origins in uniprocessor operating systems. Operating systems, utilities and libraries for today's parallel systems are variants of those same Linux operating systems that spawned commodity clusters 20 years ago (Becker et al., 1999).

These implementation choices are driven both by the cost to develop software tailored to parallel systems and the desire to maintain compatibility and user familiarity with sequential systems. However, this uniprocessor software ecosystem still presumes reliable or nearly reliable operation of the hardware; failures are assumed to be anomalous events.

Unlike today's deus ex machina system software, which presumes centralized control, data volumes and communication costs (the time and energy required to transmit data) on petascale and putative exascale systems will also likely preclude global knowledge and control. This has profound implications for managing and coordinating large numbers of nodes and parallel tasks.

To make larger systems usable and cost effective, it seems probable that we will need to develop and adopt new design and operational models that embody two important realities of large-scale systems: (a) frequent hardware component failures are part of normal operation; and (b) system and application optimization must be multivariate, including energy cost and efficiency as complements to performance and scalability.

2.2 Cloud scaling lessons and futures

Because the number of nodes in trans-petascale systems is already comparable to that found in commercial cloud computing systems (Barga et al., 2011), it seems likely that there are useful lessons to be learned. These insights might suggest new, adaptable designs for systemic resilience and energy efficiency.

Although they share common scales, HPC and commercial cloud services do differ in some marked ways. While both are driven by cost, reliability and energy efficiency imperatives, commercial cloud computing requires continuous service, even in the face of substantial hardware failures. In addition, the time value of money and service demands necessitate just-in-time system deployment with minimal time for on-site configuration and testing. Finally, as noted earlier, bounds on energy availability and costs have focused cloud optimizations on energy minimization and cost-effective cooling.

Despite the clear potential applicability of cloud computing approaches to HPC system configuration and operation, only some of the approaches have been tailored to or adopted by the HPC community. Two possible approaches, described below, include (a) reliability and hardware provisioning models for redundancy and systemic resilience based on field-replaceable modules (FRMs) and (b) energy-aware batch schedulers and energy cost models for resource allocation and scheduling.

3 Redundancy and resilience

With the advent of monitoring infrastructures on both massive cloud data centers and petascale HPC facilities, commodity component and system reliability data are now being captured at unprecedented scale. Analysis of that data has revealed trends and behaviors that contradict many widely held beliefs and practices.

An early analysis of Los Alamos HPC system failure data by Schroeder and Gibson (2006) showed that 50% of node failures were due to hardware, a consistent theme across systems of varying size and architecture.
They also observed that "... the failure rate of a system grows proportional to the number of processor chips in the system. Second, there is little indication that systems and their hardware get more reliable over time as technology changes."

Separate analysis of cloud data center disk and memory failures showed that the oft-assumed bathtub model of component failure (i.e., high early failure rates, then a steady state of lower failure rates, followed by end-of-life higher failure rates) was incorrect (Sankar et al., 2013). DRAM errors were observed to be much more frequent than commonly believed, and operating temperature proved less critical than long believed in minimizing failures (Schroeder et al., 2011; Tiwari et al., 2015). Simply put, the conventional wisdom about the types and frequencies of hardware failures in the field at scale proved incorrect, and incorrect in some surprising ways.

In the world of HPC, fears that trans-petascale systems with large numbers of components would experience continuous failure turned out to be largely true, but the vendors have made many errors transparent to users. In Sridharan and Liberty (2012) and Sridharan et al. (2013), the error correcting code (ECC) error rate in Titan's 600 TB of DRAM was studied in depth. The system logged 8000 single-bit errors every hour (all corrected by the ECC).

Cosmic ray flux can account for at least half of the errors observed on Titan. One double-bit error occurred every 24 hours (which is better than the predicted DRAM FIT rate), and these double-bit errors were repaired via chip-kill (Xun and Kumar, 2013). The study also showed that bit flips were often clustered, which is consistent with a cosmic ray strike that disrupts several bits within a region.

Resilience extends to every component of the system, not just the complex CPU and memory circuits, and failures can affect systems at many levels and from sometimes surprising sources. For example, the ORNL system preceding Titan was called Jaguar. Jaguar contained the same number of racks as Titan, but without the GPUs.

Jaguar had a serious resilience challenge due to failing voltage regulators. Although voltage regulators are simple circuits, there were over 18,000 of them in Jaguar and, unlike Titan, a single regulator failure could corrupt the entire system. Despite substantial on-site analysis, it remains a mystery why they would fail. Studies showed that they did not fail under load, nor when they were idle, but would fail randomly after a load had been removed.

3.1 Systemic resilience

The large number of detected DRAM errors in today's large-scale systems raises the specter of a potentially even larger number of undetected (silent) errors in current and future trans-petascale systems. The worst case scenario is not that a silent error triggers a system failure, but rather that the silent error results in an incorrect answer being calculated. Because, by definition, the error is undetected, the vendor and/or system software cannot correct silent errors.

Mathematicians have begun developing numerical algorithms that can be proved to converge to the correct answer despite silent error corruption during computations (Stoyanov and Webster, YEAR). The next step would be to incorporate resilient numerical algorithms into HPC applications so they can tolerate some number of silent errors and system failures. This is but a first step toward a broader recognition that computations are themselves samples of a solution space, just as experiments are samples from a measurement space. The key is maximizing the probability that the computational samples are not biased by hardware failures, algorithmic features or software errors.

Because overall system reliability is the product of the individual component reliabilities, even though the individual reliabilities may be high, the large number of components can make system reliability low. Beyond a certain size, the overall reliability of a large system can be too low to be usable. This is especially true for traditional data parallel message passing interface (MPI) applications where the data is non-redundantly stored and periodic checkpointing is required to preserve recoverable states. As system size grows, checkpointing can consume an ever-larger fraction of available computing time. This has motivated consideration of non-volatile random access memory (NVRAM) burst buffers and novel checkpointing schemes for future HPC systems.
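To put rough numbers on that scaling argument, a first-order estimate (assuming independent node failures, which the discussion above already flags as optimistic) is that system MTBF shrinks in proportion to node count; the widely used Young/Daly approximation then indicates how short checkpoint intervals must become. The sketch below uses hypothetical node counts and checkpoint costs; it is not data from any of the systems discussed here, and the Young/Daly formula is a standard estimate rather than one taken from this paper.

```python
import math

# First-order estimates only: independent node failures with exponential
# inter-arrival times, an assumption the surrounding text notes is optimistic.
node_mtbf_hours = 10 * 365 * 24        # hypothetical: one failure per node per 10 years
node_count = 100_000                   # Sequoia-class node count

system_mtbf_hours = node_mtbf_hours / node_count
print(f"System MTBF: {system_mtbf_hours:.2f} hours")   # well under a day at this scale

# Young/Daly approximation: tau_opt ~ sqrt(2 * checkpoint_cost * MTBF).
checkpoint_cost_hours = 0.05           # hypothetical: 3 minutes to write a checkpoint
tau_opt = math.sqrt(2 * checkpoint_cost_hours * system_mtbf_hours)
overhead_fraction = checkpoint_cost_hours / tau_opt

print(f"Checkpoint roughly every {tau_opt * 60:.0f} minutes")
print(f"Time spent checkpointing: about {overhead_fraction:.0%} of the run")
```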
Ideally, systems would be homeostatic, adapting to hardware and software failures without human intervention and continuing to operate until the number of failures became too large to meet performance goals. In this spirit, the commercial cloud services community long ago embraced systemic resilience and continued operation even during the inevitable hardware, software and operator (user error) failures. Indeed, systems such as Netflix's Chaos Monkey (Bennett and Tseitlin, 2012) intentionally inject faults into the infrastructure to test operational resilience.

Modern HPC systems have embraced a related approach. Despite the fact that ORNL's Titan experiences a node failure roughly every two days, the system has not had an unscheduled outage in over 8 months. Cray's resilience software detects the failure and reconfigures the system around the failed node. It then places the application(s) affected by the failure at the head of the batch scheduling queue to re-execute from their last checkpoint. All of this occurs without user intervention.

Nodes and other FRMs can all be hot-swapped while the Titan system is running.
Periodically, technicians remove the failed nodes and replace them with new nodes. When replaced, Cray's resilience software detects the new node and reconfigures it for use by applications – all without system interruption. Because failed components are periodically replaced, the HPC system never degrades to a point that users notice.

3.2 Modeling failures and repairs

Both commercial cloud data centers and large HPC systems can be described via a simple over-provisioning model that treats nodes as independent members of a computation pool, with a pool of spares. This approach manages failures and performance at two distinct timescales.

The first timescale is the system's nominal deployment lifetime, typically 3–5 years. The procurement and operations goal is to maintain the system above the baseline performance target for the system's expected lifetime. Simply put, there should be enough operational hardware to meet performance expectations during the entire system lifetime. Vendors and system operators manage this via contracts that specify service level agreements and a level of on-site spares. As described earlier, these spares are used to replace failed components, and additional spares are ordered and deployed as required by the contract.

The second timescale is that of day-to-day operations, where the goal is to ensure the system is able to deliver application performance at expected and acceptable levels. Failing nodes are detected and replaced. Hardware repairs to failed nodes can be conducted offline (on site or by the vendor), and (if possible) they can be returned to the spare pool (Vishwanath, 2009).

These two timescales can be conceptualized as a simple and well known birth–death process (Kleinrock, 1975), as indicated by Figure 1. Here, N nodes are allocated for computation and S nodes reserved as spares. A spare is activated each time a node failure is detected, and failed nodes are repaired (if possible) and returned to the pool of spares, or the spare pool is replenished with new nodes. The key parameters for this model are the number of spares S as a function of system size, the failure and repair rates, μ and λ, respectively, and p, the probability of successful repair.

When λ = 0 (i.e., no repairs), this is a simple linear death process, with spares being consumed as nodes fail. This corresponds to a fixed price contract with a finite number of spare nodes. Conversely, if the contract specifies a performance level, then the number of spares is presumed to be "infinite" with respect to the operational lifetime of the system. This simple model does not include the effects of component interdependence, HPC system packaging and interconnects, nor of "catastrophic" failures that can disable large numbers of nodes simultaneously.

Although there are exact analytic solutions to the model of Figure 1 under simplified, non-realistic assumptions (i.e., with negative exponential distributions for failure and repair rates and component failure independence), a general solution for actual, interdependent failure and repair distributions is quite challenging. Several groups have used simulation to study variants of this failure and repair model for HPC systems in the context of determining checkpoint intervals (Bougeret et al., 2011), using both an exponential distribution and a Weibull distribution. The latter has been shown to be more representative of observed hardware failure distributions.
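In the same spirit as those simulation studies, the sketch below runs a small Monte Carlo experiment for the no-repair (λ = 0) case of the model above: N nodes compute, every failure consumes one spare, and the question is whether S on-site spares last the deployment lifetime. It assumes independent, exponentially distributed failures, and every number in it is a hypothetical placeholder rather than data from Titan, Sequoia or any other system.

```python
import random

def prob_spares_last(n_nodes, n_spares, lifetime_hours, node_mtbf_hours, trials=2_000):
    """Monte Carlo sketch of the lambda = 0 ('linear death') case of the spare-pool
    model: N nodes compute, each failure immediately consumes one spare, and the
    system stays at full strength until the spare pool is exhausted. Failures are
    treated as independent and exponential, which the text notes is optimistic.
    Returns the estimated probability that S spares last the whole lifetime."""
    aggregate_failure_rate = n_nodes / node_mtbf_hours   # N active nodes at all times
    survived = 0
    for _ in range(trials):
        t, spares_used = 0.0, 0
        while t < lifetime_hours and spares_used <= n_spares:
            t += random.expovariate(aggregate_failure_rate)  # time to next node failure
            spares_used += 1
        if t >= lifetime_hours:
            survived += 1
    return survived / trials

# Hypothetical inputs: an 18,688-node system, a 5-year lifetime, a per-node MTBF
# of 50 years and a pool of 1,900 on-site spares.
print(prob_spares_last(18_688, 1_900, 5 * 365 * 24, 50 * 365 * 24))
```

With repairs (λ > 0), imperfect repair probability p, or Weibull-distributed failures, the same loop structure applies but the state space grows, which is one reason the studies cited above resort to simulation rather than closed-form analysis.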
3.3 Reliability with heterogeneous nodes

Although the performance efficiency of GPUs is well understood, their resilience characteristics in large-scale computing systems have not been broadly evaluated. One of the first such studies (Tiwari et al., 2015) presents the results of an 18-month study of Titan to qualitatively and quantitatively assess GPU errors on a large-scale, GPU-enabled system. The study observed a very low rate of GPU-related failures – approximately one per day.

This is significant, as more than two failures per day would be expected based on the vendor-specified mean time before failure (MTBF) for the GPU card. Again, this speaks directly to the assumptions behind vendor-specified MTBF; these are based on some combination of experience with previous generation products, extrapolations from accelerated sample testing and vendor marketing. Experimental data at scale has shown repeatedly that these models are approximations, at best.

In the Titan study, the primary causes of GPU failures were double-bit errors and ECC page retirement recording errors. Because all memory in Titan is ECC protected, even the GPU memory, this again highlights the importance of managing multibit DRAM errors in large-scale systems.

Secondly, the study found that 98% of all single-bit errors occurred in only 10 (out of 18,688) GPU cards. This suggests that only a few cards in such large systems may be a significant source of errors. Finding and removing such faulty cards can dramatically improve system MTBF. This speaks to the need to test and bin DRAM components carefully before installation.

Thirdly, although adding GPUs decreased the MTBF of the overall system, the performance advantage of GPUs is so substantial that the system (Titan) delivers much more useful work per unit time than would a CPU-only system, even in the presence of failures.

3.4 Reliability with homogeneous nodes

The biggest reliability challenge for homogeneous systems is the sheer number of components. Because the individual nodes are less powerful (due to design and node energy budget constraints), homogeneous systems of comparable performance have an order of magnitude more nodes than heterogeneous systems. Because the failures in time (FIT) rates of individual chips remain relatively constant over time, reliability is increased by minimizing the number of chips per node. Such systems have utilized system-on-a-chip designs and reduced numbers of DRAM chips to achieve these reductions, as well as to reach node power targets.

The overall experience with the LLNL Sequoia system (the largest system of this class in the world) has shown that about one node fails every couple of days (a rate similar to ORNL's Titan), despite the fact that Sequoia has nearly 100,000 nodes. The Sequoia BlueGene node card is a field-replaceable unit and can be hot-swapped, as is the case for ORNL's Titan.

Reflecting its design point, the packaging density of a homogeneous system is often higher than that of heterogeneous systems. For example, an IBM BlueGene rack such as that in Sequoia contains 1024 nodes. This higher density has led to concerns about thermally triggered failures due to hot spots within the racks. Such failure modes include component overheating and failing outright or aging prematurely.

Higher density also leads to the potential for correlated failures (e.g., a power supply failure disabling a large number of nodes). Despite these risks, empirical data has shown that the IBM BlueGene systems have been remarkably reliable compared to other HPC systems. In large measure, this is due to a design focus on resiliency.

4 Large-scale energy efficiency

Commercial cloud data centers each consume tens of megawatts of power, focusing optimization on both power consumption and efficient energy use. Power Usage Effectiveness (PUE) is a well-established metric for assessing the energy efficiency of such data centers. Intuitively, PUE captures how much of the energy delivered to a facility actually powers the computing equipment; it is defined as the ratio of total facility energy to computing (IT) equipment energy, so an ideal facility has a PUE of 1.
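A small numeric illustration of that definition, using invented facility figures rather than data from any site mentioned here:

```python
# PUE = total facility energy / IT (computing) equipment energy.
# Illustrative, invented numbers: a facility drawing 24 MW in total whose
# computing equipment accounts for 20 MW of that load.
total_facility_mw = 24.0
it_equipment_mw = 20.0

pue = total_facility_mw / it_equipment_mw
compute_fraction = it_equipment_mw / total_facility_mw   # the "fraction" described above

print(f"PUE = {pue:.2f}")                                 # 1.20
print(f"Energy reaching compute: {compute_fraction:.0%}")  # 83%
```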
Design optimizations have reduced industrial data center PUEs from 5–7 to near 1 in state-of-the-art data centers. This has included raising facility operating temperatures, based on experimental data at scale that has shown components can tolerate higher temperatures. It has also included greater reliance on airside economization (i.e., free air cooling) and on packaging and power distribution designed for greater efficiency.

This work has also begun to influence the design and deployment of new HPC data centers, substantially reducing their PUE. For example, the new NSF Blue Waters facility at the National Center for Supercomputing Applications (NCSA) has a PUE near 1.2 (Rath, 2010), due to efficient facility design. Other HPC facilities are now adopting similar strategies, either through retrofits or new facility designs. In addition, hybrid energy-performance measurements such as the Green500 are raising awareness of energy in HPC environments.

As noted earlier, today's petascale systems already consume in excess of ten megawatts of power, and exascale systems will consume more (Kamil et al., 2008), even when based on hypothesized, hyper-efficient hardware. At this scale, energy infrastructure and consumption are a substantial fraction of the total cost of ownership (TCO), just as for commercial data centers.

Computing resource allocations on commercial services are directly measured in currency. In contrast, scientific computing allocations on national research facilities are still denominated in normalized service units. Consequently, few users are aware of the true costs of a computation, which necessarily include
capital acquisition cost, depreciation and operations. The latter also includes the cost of energy to operate the facility.

Energy costs could serve as one cost proxy for computing allocations, encouraging users to manage allocations more wisely and allowing facilities to schedule parallel jobs for energy and thermal efficiency. More broadly, the goal would be to determine when (the time of day), how (which of the possibly heterogeneous resources and at what scale, given sublinear speedups) and where (if different sites have different costs) to compute, and to make the proxy cost of computing manifest to users.

4.1 Energy constrained scheduling

Energy for HPC systems originates from either energy utilities or organizational power plants (i.e., at universities, national laboratories or government facilities). In the first case, market forces and regulatory commissions determine rate schedules. These rates may vary based on the time of day, peak load requirements, limits from the consuming organization, or even auction-based pricing. For government and academic users, contracts usually specify fixed prices, albeit possibly with time of day (peak or off-peak) differentials. In addition, government prices are often substantially lower than those available to commercial users.

Quite clearly, HPC jobs have distinct energy consumption profiles, based on their mixes of computation, communication and input/output phases and the extent of performance optimization for homogeneous or heterogeneous multicore nodes with accelerators. These mixes are also dependent on input parameters and the number of nodes requested when the job is submitted to a batch scheduler. Consider two cases for energy availability and price:

- Peak/off-peak pricing but no constraints on energy availability, up to the maximum the HPC system can consume. In this case, let E_p and E_o denote the peak and off-peak energy prices in cents per kilowatt hour (kWh).
- Fixed pricing but peak/off-peak energy availability, where the off-peak maximum is the maximum the HPC system can consume, but the peak is less (i.e., during peak times the system cannot draw its maximum energy). In this case, let P_p and P_o denote the peak and off-peak energy bounds.

Now consider a sequence of M batch jobs J = {j_1, ..., j_k, ..., j_M}, where j_k = (n_k, p_k, t_k) and n_k, p_k, t_k denote the number of nodes requested, the estimated total energy consumption for the job, and the maximum execution time for job k, respectively. If the HPC system contains N nodes and has a total system energy budget of P, the scheduling objective is to select an optimal subset S of J such that

  \sum_{j_k \in S} p_k \le P  and  \max \sum_{j_k \in S} n_k \le N,  with P \in \{P_p, P_o\}

(i.e., satisfying the maximum power and system resource constraints) at energy cost

  E \sum_{j_k \in S} p_k,  with E \in \{E_p, E_o\},

and ensuring fairness and maximal consistency with current schedulers and user experience.

Backfill scheduling algorithms have been studied extensively, and there is no need to recapitulate that work. Rather, the interesting question is the effect of including energy availability and cost on extant backfill scheduling algorithms, for example:

- FCFS, single queue with priorities;
- FCFS, conservative backfill, single queue with priorities;
- FCFS, aggressive backfill, single queue with priorities;
- Maui backfill, multiple queues mimicking current site configurations.

In practice, this means some modified version of standard, multiple queue backfill scheduling (Jackson et al., 2001; Lawson and Smirni, 2005) (i.e., conservative or EASY) that uses both energy and number of nodes as backfill constraints.

4.2 Adaptive parallelism

Now consider a generalization of the cases described above, where the scheduler can choose the degree of job parallelism within a user-specified range. In this more general case, the scheduling objective is still selecting an optimal subset of jobs, but with a combinatorially larger number of job combinations.

For simplicity's sake, consider the special case of two different job configurations. This could correspond to either (a) two different node parallelism levels or (b) use of nodes with or without accelerators (e.g., GPUs). In an energy- or cost-constrained environment, one or the other might be preferred, based on the characteristics of the jobs in the batch queue and the speedup as a function of node type and number.

Intuitively, one might choose to either run "fast and hot" or "cool and slow" based on energy availability and speedup scaling. Such an approach might involve a learning system that uses data from previous executions to build a performance-energy profile for energy-constrained scheduling and adaptively explores the parameter space to find efficient configurations.
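To make the formulation in Section 4.1 concrete, the sketch below greedily selects jobs for a pricing window subject to the node limit N and an energy bound, and, in the spirit of Section 4.2, lets each job offer two alternative configurations (e.g., CPU-only versus accelerated) with different node counts and energy estimates. It is only an illustration of how the n_k and p_k constraints enter the selection step, not a production backfill scheduler; the function names and all job data are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class JobConfig:
    nodes: int          # n_k: nodes requested in this configuration
    energy_kwh: float   # p_k: estimated total energy use in this configuration
    hours: float        # t_k: maximum execution time

@dataclass
class Job:
    name: str
    configs: List[JobConfig]   # e.g. (CPU-only, GPU-accelerated) alternatives

def select_jobs(queue: List[Job], node_limit: int, energy_budget_kwh: float,
                price_per_kwh: float) -> Tuple[List[Tuple[str, JobConfig]], float]:
    """Greedy sketch of energy-constrained selection: walk the queue in priority
    order (FCFS here), pick the lowest-energy feasible configuration of each job,
    and skip jobs that would exceed the remaining node or energy budget.
    Returns the schedule and its energy cost (sum of p_k times the price E)."""
    schedule, nodes_used, energy_used = [], 0, 0.0
    for job in queue:
        feasible = [c for c in job.configs
                    if nodes_used + c.nodes <= node_limit
                    and energy_used + c.energy_kwh <= energy_budget_kwh]
        if not feasible:
            continue                        # a real backfill scheduler would look deeper
        best = min(feasible, key=lambda c: c.energy_kwh)
        schedule.append((job.name, best))
        nodes_used += best.nodes
        energy_used += best.energy_kwh
    return schedule, energy_used * price_per_kwh

# Invented example: two alternative configurations for two of the three jobs.
queue = [
    Job("climate",  [JobConfig(4096, 9000.0, 12.0), JobConfig(1024, 6500.0, 30.0)]),
    Job("genomics", [JobConfig(2048, 4000.0,  8.0), JobConfig( 512, 3000.0, 20.0)]),
    Job("cfd",      [JobConfig(8192, 15000.0, 6.0)]),
]
plan, cost = select_jobs(queue, node_limit=10_000, energy_budget_kwh=20_000.0,
                         price_per_kwh=0.06)
for name, cfg in plan:
    print(f"{name}: {cfg.nodes} nodes, {cfg.energy_kwh:.0f} kWh")
print(f"Energy cost: ${cost:,.2f}")
```

A real energy-aware backfill scheduler would additionally honor priorities and reservations and would re-evaluate the bounds (P_p, P_o) and prices (E_p, E_o) as pricing windows change; the sketch only shows where those quantities constrain job selection.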