A Survey of High-Performance Computing Scaling Challenges
Abstract
Commodity clusters revolutionized high-performance computing when they first appeared two decades ago. As scale and complexity have grown, new challenges in reliability and systemic resilience, energy efficiency and optimization, and software complexity have emerged that suggest the need for re-evaluation of current approaches. This paper reviews the state of the art and reflects on some of the challenges likely to be faced when building trans-petascale computing systems, using insights and perspectives drawn from operational experience and community debates.
Keywords
Reliability, exascale systems, cloud computing, energy efficiency, data center
Finally, the complexity of software development continues to rise. Multidisciplinary models increasingly combine diverse algorithms and data representations, all of which must be optimized for deep memory hierarchies, multicore processors and functional accelerators. The rise of "big data" (Microsoft Research, 2009) further exacerbates application development and performance optimization. This is particularly true given the divergent software ecosystems – programming languages, libraries and toolkits – for data analytics and scientific computing.

Similarly, two architectural paths are emerging in the trans-petascale era as options to address the three key challenges of reliability, energy consumption and software complexity (Geist and Lucas, 2009). The first is characterized by a large number of homogeneous nodes composed of manycore processors. The second is characterized by an order of magnitude smaller number of heterogeneous nodes composed of graphics processing units (GPUs) and central processing units (CPUs).

With this backdrop, the remainder of the paper is organized as follows. We begin in Section 2 with a description of the two architectural paths, how they have emerged and how they are projected to evolve in the trans-petascale era. This is followed in Section 3 by a description of how resilience is potentially addressed on each path and the challenges that architecture poses to resilient computation.

In turn, Section 4 explores how each path approaches energy efficiency and what improvements will be required by each architecture to keep power consumption under 30 MW at 1000 petaflops (i.e., exascale computing). In Section 5, we examine the implications for software complexity on each of the two architectural paths. In each of these sections, we will explore how alternative approaches from commercial cloud computing may provide insights to exascale hardware provisioning, resilience and energy management. Finally, Section 6 summarizes our observations and identifies potential research directions for the computational science community.

2 Two roads into the future

The promise of commodity clusters has been fully realized. Inexpensive, yet powerful, microprocessors and their integration in Beowulf clusters paved the way for today's commodity HPC infrastructure. When coupled with high-capacity secondary storage systems, high-speed networks, inexpensive dynamic random access memory (DRAM) and powerful functional accelerators, these clusters dominate almost all aspects of HPC. We have progressed from leading edge systems containing a few hundred processors to ones now containing tens of thousands of nodes and millions of cores. Yet not all is well in the future of supercomputing.

Today, the end of classical Dennard scaling (Dennard et al., 1974) has profoundly influenced both chip design and system design. A decade ago, multiprocessors emerged to provide continued performance increases, subject to chip power and clock frequency constraints. More recently, accelerators have offered greater operations/joule and higher performance for some data parallel workloads. Looking forward, the semiconductor fabrication challenges of creating ever smaller transistors are increasingly difficult. Fabrication process variation is increasingly leading to performance, energy and resilience variations across multicore chips. The long-term future is uncertain, spanning options as diverse as analog or neuromorphic computing and quantum devices (Fuller and Millett, 2011).

In the nearer term, two diverse architectural designs are emerging in the trans-petascale era. One design is based on systems with heterogeneous nodes. Today, these are typified by nodes combining powerful CPUs with accelerators. Examples of this design are the two largest systems on the TOP500 list. China's Tianhe-2 system, currently ranked first on the TOP500 list, contains 32,000 Intel Xeon CPUs and 48,000 Xeon Phi accelerators (Meuer et al., 2014). Similarly, the Titan system at Oak Ridge National Laboratory contains 18,688 AMD Opteron CPUs and 18,688 NVIDIA Tesla accelerators.

Other basic features of this design are fewer, more powerful nodes, a multi-level memory, and a more complex, split programming model – one for the CPU and one for the accelerator. Typically, the CPU and accelerators each have different types of memory with different performance and fault tolerance characteristics. In the future, the node packaging may get denser, but the basic features of this design are expected to remain largely unchanged.

The second system design is based on homogeneous nodes. The basic features of this design are large numbers of smaller nodes, each with a pool of homogeneous cores and a single level of memory. Examples of the second design are the third through fifth systems on the TOP500 list: the Sequoia system at Lawrence Livermore National Laboratory, the K computer at RIKEN in Japan and the Mira system at Argonne National Laboratory.

Sequoia has just under 100,000 nodes, each with 16 IBM Power cores. The K computer has 88,000 nodes, each with 8-core SPARC processors. Mira has about 50,000 Blue Gene/Q nodes, each with 16 IBM Power cores.

Looking forward, it is likely that the nodes of this second design will be composed of processors with hundreds of low-power cores per node. In addition, the memory architecture may become more complex, with multiple levels and/or coherency domains.
The introduction of "burst buffers" and stacked memory subsystems is an early example of this trend, emphasizing the need for higher bandwidth, greater locality and reduced data movement.

2.1 Exascale challenges

As node counts for trans-petascale clusters continue growing, overall system reliability and energy consumption are increasingly critical issues. Indeed, such large systems are likely to have a mean time to failure of only hours unless more effective and resilient software is developed.

Likewise, the rising energy requirements of ever-larger HPC systems now pose limits on the practicality of their deployment, due to both energy availability and cost. Today's largest systems consume several megawatts of power. For example, China's Tianhe-2 uses 17 MW when fully utilized, and other systems on the TOP500 list are not far behind that level.

The top HPC sites around the world are built for a maximum of about 20 MW of power consumption for a single system, plus the associated cooling capability and power for data storage. Many of these facilities are constrained to specific locations for either national security or historical reasons. This is in striking contrast to commercial cloud data centers, which have been placed to maximize the geographical advantages of free environmental cooling and inexpensive energy (e.g., locating near hydroelectric plants or wind farms).

In HPC circles, the general rule of thumb is that a megawatt-year costs roughly one million US dollars. In the Tianhe-2 case, this would correspond to a cost of US$50,000/day just for the electricity consumed by the system. Simply put, both the annual cost and available infrastructure constrain the amount of electricity trans-petascale systems can consume. Consequently, energy consumption is a major driver in the emergence of the two architectural designs.
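As a quick sanity check of that rule of thumb, the arithmetic is straightforward; the sketch below simply restates the figures quoted above, and actual tariffs, utilization and cooling overheads vary widely by site.

```python
# Back-of-the-envelope check of the megawatt-year rule of thumb quoted above.
# Actual tariffs, utilization and cooling overheads vary widely by site.
MEGAWATT_YEAR_USD = 1_000_000     # rule of thumb: one megawatt-year costs ~US$1M
TIANHE2_POWER_MW = 17             # reported full-load power draw

annual_cost_usd = TIANHE2_POWER_MW * MEGAWATT_YEAR_USD
daily_cost_usd = annual_cost_usd / 365

print(f"Annual electricity cost: ${annual_cost_usd / 1e6:.0f}M")
print(f"Daily electricity cost: ${daily_cost_usd:,.0f}")   # ~$47,000/day, i.e. roughly US$50,000/day
```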
Both energy and resilience have been cited as among the top ten challenges of exascale computing (Fuller and Millett, 2011). Despite this, our software and operating models for petascale clusters remain rooted in their historical origins in uniprocessor operating systems. Operating systems, utilities and libraries for today's parallel systems are variants of those same Linux operating systems that spawned commodity clusters 20 years ago (Becker et al., 1999).

These implementation choices are driven both by the cost to develop software tailored to parallel systems and the desire to maintain compatibility and user familiarity with sequential systems. However, this uniprocessor software ecosystem still presumes reliable or nearly reliable operation of the hardware; failures are assumed to be anomalous events.

Unlike today's deus ex machina system software, which presumes centralized control, data volumes and communication costs (the time and energy required to transmit data) on petascale and putative exascale systems will also likely preclude global knowledge and control. This has profound implications for managing and coordinating large numbers of nodes and parallel tasks.

To make larger systems usable and cost effective, it seems probable that we will need to develop and adopt new design and operational models that embody two important realities of large-scale systems: (a) frequent hardware component failures are part of normal operation; and (b) system and application optimization must be multivariate, including energy cost and efficiency as complements to performance and scalability.

2.2 Cloud scaling lessons and futures

Because the number of nodes in trans-petascale systems is already comparable to that found in commercial cloud computing systems (Barga et al., 2011), it seems likely that there are useful lessons to be learned. These insights might suggest new, adaptable designs for systemic resilience and energy efficiency.

Although they share common scales, HPC and commercial cloud services do differ in some marked ways. While both are driven by cost, reliability and energy efficiency imperatives, commercial cloud computing requires continuous service, even in the face of substantial hardware failures. In addition, the time value of money and service demands necessitate just-in-time system deployment with minimal time for on-site configuration and testing. Finally, as noted earlier, bounds on energy availability and costs have focused cloud optimizations on energy minimization and cost-effective cooling.

Despite the clear potential applicability of cloud computing approaches to HPC system configuration and operation, only some of the approaches have been tailored to or adopted by the HPC community. Two possible approaches, described below, include (a) reliability and hardware provisioning models for redundancy and systemic resilience based on field-replaceable modules (FRMs) and (b) energy-aware batch schedulers and energy cost models for resource allocation and scheduling.

3 Redundancy and resilience

With the advent of monitoring infrastructures on both massive cloud data centers and petascale HPC facilities, commodity component and system reliability data are now being captured at unprecedented scale. Analysis of that data has revealed trends and behaviors that contradict many widely held beliefs and practices.

An early analysis of Los Alamos HPC system failure data by Schroeder and Gibson (2006) showed that 50% of node failures were due to hardware, a consistent theme across systems of varying size and architecture.
They also observed that "... the failure rate of a system grows proportional to the number of processor chips in the system. Second, there is little indication that systems and their hardware get more reliable over time as technology changes."

Separate analysis of cloud data center disk and memory failures showed that the oft-assumed bathtub model of component failure (i.e., high early failure rates, then a steady state of lower failure rates, followed by end-of-life higher failure rates) was incorrect (Sankar et al., 2013). DRAM errors were observed to be much more frequent than commonly believed, and operating temperature proved less critical than long believed in minimizing failures (Schroeder et al., 2011; Tiwari et al., 2015). Simply put, the conventional wisdom about the types and frequencies of hardware failures in the field at scale proved incorrect, and incorrect in some surprising ways.

In the world of HPC, fears that trans-petascale systems with large numbers of components would experience continuous failure turned out to be largely true, but the vendors have made many errors transparent to users. In Sridharan and Liberty (2012) and Sridharan et al. (2013), the error correcting code (ECC) error rate in Titan's 600 TB of DRAM was studied in depth. The system logged 8000 single-bit errors every hour (all corrected by the ECC).

Cosmic ray flux can account for at least half of the errors observed on Titan. One double-bit error occurred every 24 hours (which is better than the predicted DRAM FIT rate), and these double-bit errors were repaired via chip-kill (Xun and Kumar, 2013). The study also showed that bit flips were often clustered, which is consistent with a cosmic ray strike that disrupts several bits within a region.

Resilience extends to every component of the system, not just the complex CPU and memory circuits, and failures can affect systems at many levels and from sometimes surprising sources. For example, the ORNL system preceding Titan was called Jaguar. Jaguar contained the same number of racks as Titan, but without the GPUs.

Jaguar had a serious resilience challenge due to failing voltage regulators. Although voltage regulators are simple circuits, there were over 18,000 of them in Jaguar and, unlike Titan, a single regulator failure could corrupt the entire system. Despite substantial on-site analysis, it remains a mystery why they would fail. Studies showed that they did not fail under load, nor when they were idle, but would fail randomly after a load had been removed.

3.1 Systemic resilience

The large number of detected DRAM errors in today's large-scale systems raises the specter of a potentially even larger number of undetected (silent) errors in current and future trans-petascale systems. The worst case scenario is not that a silent error triggers a system failure, but rather that the silent error results in an incorrect answer being calculated. Because, by definition, the error is undetected, the vendor and/or system software cannot correct silent errors.

Mathematicians have begun developing numerical algorithms that can be proved to converge to the correct answer despite silent error corruption during computations (Stoyanov and Webster, YEAR). The next step would be to incorporate resilient numerical algorithms into HPC applications so they can tolerate some number of silent errors and system failures. This is but a first step toward a broader recognition that computations are themselves samples of a solution space, just as experiments are samples from a measurement space. The key is maximizing the probability that the computational samples are not biased by hardware failures, algorithmic features or software errors.

Because overall system reliability is the product of the individual component reliabilities, even though the individual reliabilities may be high, the large number of components can make system reliability low. Beyond a certain size, the overall reliability of a large system can be too low to be usable. This is especially true for traditional data parallel message passing interface (MPI) applications where the data is non-redundantly stored and periodic checkpointing is required to preserve recoverable states. As system size grows, checkpointing can consume an ever-larger fraction of available computing time. This has motivated consideration of non-volatile random access memory (NVRAM) burst buffers and novel checkpointing schemes for future HPC systems.
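To put rough numbers on that scaling argument, a first-order estimate (assuming independent node failures, which the discussion above already flags as optimistic) is that system MTBF shrinks in proportion to node count; the widely used Young/Daly approximation then indicates how short checkpoint intervals must become. The sketch below uses hypothetical node counts and checkpoint costs; it is not data from any of the systems discussed here, and the Young/Daly formula is a standard estimate rather than one taken from this paper.

```python
import math

# First-order estimates only: independent node failures with exponential
# inter-arrival times, an assumption the surrounding text notes is optimistic.
node_mtbf_hours = 10 * 365 * 24        # hypothetical: one failure per node per 10 years
node_count = 100_000                   # Sequoia-class node count

system_mtbf_hours = node_mtbf_hours / node_count
print(f"System MTBF: {system_mtbf_hours:.2f} hours")   # well under a day at this scale

# Young/Daly approximation: tau_opt ~ sqrt(2 * checkpoint_cost * MTBF).
checkpoint_cost_hours = 0.05           # hypothetical: 3 minutes to write a checkpoint
tau_opt = math.sqrt(2 * checkpoint_cost_hours * system_mtbf_hours)
overhead_fraction = checkpoint_cost_hours / tau_opt

print(f"Checkpoint roughly every {tau_opt * 60:.0f} minutes")
print(f"Time spent checkpointing: about {overhead_fraction:.0%} of the run")
```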
Ideally, systems would be homeostatic, adapting to hardware and software failures without human intervention and continuing to operate until the number of failures became too large to meet performance goals. In this spirit, the commercial cloud services community long ago embraced systemic resilience and continued operation even during the inevitable hardware, software and operator (user error) failures. Indeed, systems such as Netflix's Chaos Monkey (Bennett and Tseitlin, 2012) intentionally inject faults into the infrastructure to test operational resilience.

Modern HPC systems have embraced a related approach. Despite the fact that ORNL's Titan experiences a node failure roughly every two days, the system has not had an unscheduled outage in over 8 months. Cray's resilience software detects the failure and reconfigures the system around the failed node. It then places the application(s) affected by the failure at the head of the batch scheduling queue to re-execute from their last checkpoint. All of this occurs without user intervention.

Nodes and other FRMs can all be hot-swapped while the Titan system is running.
Periodically, technicians remove the failed nodes and replace them with new nodes. When replaced, Cray's resilience software detects the new node and reconfigures it for use by applications – all without system interruption. Because failed components are periodically replaced, the HPC system never degrades to a point that users notice.

3.2 Modeling failures and repairs

Both commercial cloud data centers and large HPC systems can be described via a simple over-provisioning model that treats nodes as independent members of a computation pool, with a pool of spares. This approach manages failures and performance at two distinct timescales.

The first timescale is the system's nominal deployment lifetime, typically 3–5 years. The procurement and operations goal is to maintain the system above the baseline performance target for the system's expected lifetime. Simply put, there should be enough operational hardware to meet performance expectations during the entire system lifetime. Vendors and system operators manage this via contracts that specify service level agreements and a level of on-site spares. As described earlier, these spares are used to replace failed components, and additional spares are ordered and deployed as required by the contract.

The second timescale is that of day-to-day operations, where the goal is to ensure the system is able to deliver application performance at expected and acceptable levels. Failing nodes are detected and replaced. Hardware repairs to failed nodes can be conducted offline (on site or by the vendor), and (if possible) they can be returned to the spare pool (Vishwanath, 2009).

These two timescales can be conceptualized as a simple and well known birth–death process (Kleinrock, 1975), as indicated by Figure 1. Here, N nodes are allocated for computation and S nodes reserved as spares. A spare is activated each time a node failure is detected, and failed nodes are repaired (if possible) and returned to the pool of spares, or the spare pool is replenished with new nodes. The key parameters for this model are the number of spares S as a function of system size, the failure and repair rates, μ and λ, respectively, and p, the probability of successful repair.

When λ = 0 (i.e., no repairs), this is a simple linear death process, with spares being consumed as nodes fail. This corresponds to a fixed price contract with a finite number of spare nodes. Conversely, if the contract specifies a performance level, then the number of spares is presumed to be "infinite" with respect to the operational lifetime of the system. This simple model does not include the effects of component interdependence, HPC system packaging and interconnects, nor of "catastrophic" failures that can disable large numbers of nodes simultaneously.

Although there are exact analytic solutions to the model of Figure 1 under simplified, non-realistic assumptions (i.e., with negative exponential distributions for failure and repair rates and component failure independence), a general solution for actual, interdependent failure and repair distributions is quite challenging. Several groups have used simulation to study variants of this failure and repair model for HPC systems in the context of determining checkpoint intervals (Bougeret et al., 2011), using both an exponential distribution and a Weibull distribution. The latter has been shown to be more representative of observed hardware failure distributions.
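In the same spirit as those simulation studies, the sketch below runs a small Monte Carlo experiment for the no-repair (λ = 0) case of the model above: N nodes compute, every failure consumes one spare, and the question is whether S on-site spares last the deployment lifetime. It assumes independent, exponentially distributed failures, and every number in it is a hypothetical placeholder rather than data from Titan, Sequoia or any other system.

```python
import random

def prob_spares_last(n_nodes, n_spares, lifetime_hours, node_mtbf_hours, trials=2_000):
    """Monte Carlo sketch of the lambda = 0 ('linear death') case of the spare-pool
    model: N nodes compute, each failure immediately consumes one spare, and the
    system stays at full strength until the spare pool is exhausted. Failures are
    treated as independent and exponential, which the text notes is optimistic.
    Returns the estimated probability that S spares last the whole lifetime."""
    aggregate_failure_rate = n_nodes / node_mtbf_hours   # N active nodes at all times
    survived = 0
    for _ in range(trials):
        t, spares_used = 0.0, 0
        while t < lifetime_hours and spares_used <= n_spares:
            t += random.expovariate(aggregate_failure_rate)  # time to next node failure
            spares_used += 1
        if t >= lifetime_hours:
            survived += 1
    return survived / trials

# Hypothetical inputs: an 18,688-node system, a 5-year lifetime, a per-node MTBF
# of 50 years and a pool of 1,900 on-site spares.
print(prob_spares_last(18_688, 1_900, 5 * 365 * 24, 50 * 365 * 24))
```

With repairs (λ > 0), imperfect repair probability p, or Weibull-distributed failures, the same loop structure applies but the state space grows, which is one reason the studies cited above resort to simulation rather than closed-form analysis.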
3.3 Reliability with heterogeneous nodes

Although the performance efficiency of GPUs is well understood, their resilience characteristics in large-scale computing systems have not been broadly evaluated. One of the first such studies (Tiwari et al., 2015) presents the results of an 18-month study of Titan to qualitatively and quantitatively assess GPU errors on a large-scale, GPU-enabled system. The study observed a very low rate of GPU-related failures – approximately one per day.

This is significant, as more than two failures per day would be expected based on the vendor-specified mean time before failure (MTBF) for the GPU card. Again, this speaks directly to the assumptions behind vendor-specified MTBF; these are based on some combination of experience with previous generation products, extrapolations from accelerated sample testing and vendor marketing. Experimental data at scale has shown repeatedly that these models are approximations, at best.

In the Titan study, the primary causes of GPU failures were double-bit errors and ECC page retirement recording errors. Because all memory in Titan is ECC protected, even the GPU memory, this again highlights the importance of managing multibit DRAM errors in large-scale systems.

Secondly, the study found that 98% of all single-bit errors occurred in only 10 (out of 18,688) GPU cards. This suggests that only a few cards in such large systems may be a significant source of errors. Finding and removing such faulty cards can dramatically improve system MTBF. This speaks to the need to test and bin DRAM components carefully before installation.

Thirdly, although adding GPUs decreased the MTBF of the overall system, the performance advantage of GPUs is so substantial that the system (Titan) delivers much more useful work per unit time than would a CPU-only system, even in the presence of failures.

3.4 Reliability with homogeneous nodes

The biggest reliability challenge for homogeneous systems is the sheer number of components. Because the individual nodes are less powerful (due to design and node energy budget constraints), homogeneous systems of comparable performance have an order of magnitude more nodes than heterogeneous systems. Because the failures in time (FIT) rates of individual chips remain relatively constant over time, reliability is increased by minimizing the number of chips per node. Such systems have utilized system-on-a-chip designs and reduced numbers of DRAM chips to achieve these reductions, as well as to reach node power targets.

The overall experience with the LLNL Sequoia system (the largest system of this class in the world) has shown that about one node fails every couple of days (a rate similar to ORNL's Titan), despite the fact that Sequoia has nearly 100,000 nodes. The Sequoia BlueGene node card is a field-replaceable unit and can be hot-swapped, as is the case for ORNL's Titan.

Reflecting its design point, the packaging density of a homogeneous system is often higher than that of heterogeneous systems. For example, an IBM BlueGene rack such as that in Sequoia contains 1024 nodes. This higher density has led to concerns about thermally triggered failures due to hot spots within the racks. Such failure modes include component overheating and failing outright or aging prematurely.

Higher density also leads to the potential for correlated failures (e.g., a power supply failure disabling a large number of nodes). Despite these risks, empirical data has shown that the IBM BlueGene systems have been remarkably reliable compared to other HPC systems. In large measure, this is due to a design focus on resiliency.

4 Large-scale energy efficiency

Commercial cloud data centers each consume tens of megawatts of power, focusing optimization on both power consumption and efficient energy use. Power Usage Effectiveness (PUE) is a well-established metric for assessing the energy efficiency of such data centers. Intuitively, PUE captures how much of the energy delivered to a facility actually powers the computing equipment; it is defined as the ratio of total facility energy to computing (IT) equipment energy, so an ideal facility has a PUE of 1.
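A small numeric illustration of that definition, using invented facility figures rather than data from any site mentioned here:

```python
# PUE = total facility energy / IT (computing) equipment energy.
# Illustrative, invented numbers: a facility drawing 24 MW in total whose
# computing equipment accounts for 20 MW of that load.
total_facility_mw = 24.0
it_equipment_mw = 20.0

pue = total_facility_mw / it_equipment_mw
compute_fraction = it_equipment_mw / total_facility_mw   # the "fraction" described above

print(f"PUE = {pue:.2f}")                                 # 1.20
print(f"Energy reaching compute: {compute_fraction:.0%}")  # 83%
```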
Design optimizations have reduced industrial data center PUEs from 5–7 to near 1 in state-of-the-art data centers. This has included raising facility operating temperatures, based on experimental data at scale that has shown components can tolerate higher temperatures. It has also included greater reliance on airside economization (i.e., free air cooling) and on packaging and power distribution designed for greater efficiency.

This work has also begun to influence the design and deployment of new HPC data centers, substantially reducing their PUE. For example, the new NSF Blue Waters facility at the National Center for Supercomputing Applications (NCSA) has a PUE near 1.2 (Rath, 2010), due to efficient facility design. Other HPC facilities are now adopting similar strategies, either through retrofits or new facility designs. In addition, hybrid energy-performance measurements such as the Green500 are raising awareness of energy in HPC environments.

As noted earlier, today's petascale systems already consume in excess of ten megawatts of power, and exascale systems will consume more (Kamil et al., 2008), even when based on hypothesized, hyper-efficient hardware. At this scale, energy infrastructure and consumption are a substantial fraction of the total cost of ownership (TCO), just as for commercial data centers.

Computing resource allocations on commercial services are directly measured in currency. In contrast, scientific computing allocations on national research facilities are still denominated in normalized service units. Consequently, few users are aware of the true costs of a computation, which necessarily include
capital acquisition cost, depreciation and operations. The latter also includes the cost of energy to operate the facility.

Energy costs could serve as one cost proxy for computing allocations, encouraging users to manage allocations more wisely and allowing facilities to schedule parallel jobs for energy and thermal efficiency. More broadly, the goal would be to determine when (the time of day), how (which of the possibly heterogeneous resources and at what scale, given sublinear speedups) and where (if different sites have different costs) to compute, and to make the proxy cost of computing manifest to users.

4.1 Energy constrained scheduling

Energy for HPC systems originates from either energy utilities or organizational power plants (i.e., at universities, national laboratories or government facilities). In the first case, market forces and regulatory commissions determine rate schedules. These rates may vary based on the time of day, peak load requirements, limits from the consuming organization, or even auction-based pricing. For government and academic users, contracts usually specify fixed prices, albeit possibly with time of day (peak or off-peak) differentials. In addition, government prices are often substantially lower than those available to commercial users.

Quite clearly, HPC jobs have distinct energy consumption profiles, based on their mixes of computation, communication and input/output phases and the extent of performance optimization for homogeneous or heterogeneous multicore nodes with accelerators. These mixes are also dependent on input parameters and the number of nodes requested when the job is submitted to a batch scheduler. Consider two cases for energy availability and price:

- Peak/off-peak pricing but no constraints on energy availability, up to the maximum the HPC system can consume. In this case, let E_p and E_o denote the peak and off-peak energy prices in cents per kilowatt hour (kWh).
- Fixed pricing but peak/off-peak energy availability, where the off-peak maximum is the maximum the HPC system can consume, but the peak is less (i.e., during peak times the system cannot draw its maximum energy). In this case, let P_p and P_o denote the peak and off-peak energy bounds.

Now consider a sequence of M batch jobs J = {j_1, ..., j_k, ..., j_M}, where j_k = (n_k, p_k, t_k) and n_k, p_k, t_k denote the number of nodes requested, the estimated total energy consumption for the job, and the maximum execution time for job k, respectively. If the HPC system contains N nodes and has a total system energy budget of P, the scheduling objective is to select an optimal subset S of J such that

  \sum_{j_k \in S} p_k \le P  and  \max \sum_{j_k \in S} n_k \le N,  with P \in \{P_p, P_o\}

(i.e., satisfying the maximum power and system resource constraints) at energy cost

  E \sum_{j_k \in S} p_k,  with E \in \{E_p, E_o\},

and ensuring fairness and maximal consistency with current schedulers and user experience.

Backfill scheduling algorithms have been studied extensively, and there is no need to recapitulate that work. Rather, the interesting question is the effect of including energy availability and cost on extant backfill scheduling algorithms, for example:

- FCFS, single queue with priorities;
- FCFS, conservative backfill, single queue with priorities;
- FCFS, aggressive backfill, single queue with priorities;
- Maui backfill, multiple queues mimicking current site configurations.

In practice, this means some modified version of standard, multiple queue backfill scheduling (Jackson et al., 2001; Lawson and Smirni, 2005) (i.e., conservative or EASY) that uses both energy and number of nodes as backfill constraints.

4.2 Adaptive parallelism

Now consider a generalization of the cases described above, where the scheduler can choose the degree of job parallelism within a user-specified range. In this more general case, the scheduling objective is still selecting an optimal subset of jobs, but with a combinatorially larger number of job combinations.

For simplicity's sake, consider the special case of two different job configurations. This could correspond to either (a) two different node parallelism levels or (b) use of nodes with or without accelerators (e.g., GPUs). In an energy- or cost-constrained environment, one or the other might be preferred, based on the characteristics of the jobs in the batch queue and the speedup as a function of node type and number.

Intuitively, one might choose to either run "fast and hot" or "cool and slow" based on energy availability and speedup scaling. Such an approach might involve a learning system that uses data from previous executions to build a performance-energy profile for energy-constrained scheduling and adaptively explores the parameter space to find efficient configurations.
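To make the formulation in Section 4.1 concrete, the sketch below greedily selects jobs for a pricing window subject to the node limit N and an energy bound, and, in the spirit of Section 4.2, lets each job offer two alternative configurations (e.g., CPU-only versus accelerated) with different node counts and energy estimates. It is only an illustration of how the n_k and p_k constraints enter the selection step, not a production backfill scheduler; the function names and all job data are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class JobConfig:
    nodes: int          # n_k: nodes requested in this configuration
    energy_kwh: float   # p_k: estimated total energy use in this configuration
    hours: float        # t_k: maximum execution time

@dataclass
class Job:
    name: str
    configs: List[JobConfig]   # e.g. (CPU-only, GPU-accelerated) alternatives

def select_jobs(queue: List[Job], node_limit: int, energy_budget_kwh: float,
                price_per_kwh: float) -> Tuple[List[Tuple[str, JobConfig]], float]:
    """Greedy sketch of energy-constrained selection: walk the queue in priority
    order (FCFS here), pick the lowest-energy feasible configuration of each job,
    and skip jobs that would exceed the remaining node or energy budget.
    Returns the schedule and its energy cost (sum of p_k times the price E)."""
    schedule, nodes_used, energy_used = [], 0, 0.0
    for job in queue:
        feasible = [c for c in job.configs
                    if nodes_used + c.nodes <= node_limit
                    and energy_used + c.energy_kwh <= energy_budget_kwh]
        if not feasible:
            continue                        # a real backfill scheduler would look deeper
        best = min(feasible, key=lambda c: c.energy_kwh)
        schedule.append((job.name, best))
        nodes_used += best.nodes
        energy_used += best.energy_kwh
    return schedule, energy_used * price_per_kwh

# Invented example: two alternative configurations for two of the three jobs.
queue = [
    Job("climate",  [JobConfig(4096, 9000.0, 12.0), JobConfig(1024, 6500.0, 30.0)]),
    Job("genomics", [JobConfig(2048, 4000.0,  8.0), JobConfig( 512, 3000.0, 20.0)]),
    Job("cfd",      [JobConfig(8192, 15000.0, 6.0)]),
]
plan, cost = select_jobs(queue, node_limit=10_000, energy_budget_kwh=20_000.0,
                         price_per_kwh=0.06)
for name, cfg in plan:
    print(f"{name}: {cfg.nodes} nodes, {cfg.energy_kwh:.0f} kWh")
print(f"Energy cost: ${cost:,.2f}")
```

A real energy-aware backfill scheduler would additionally honor priorities and reservations and would re-evaluate the bounds (P_p, P_o) and prices (E_p, E_o) as pricing windows change; the sketch only shows where those quantities constrain job selection.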