Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Michael Alexander
Pasqua D'Ambra
Adam Belloum
George Bosilca
Mario Cannataro
Marco Danelutto
Beniamino Di Martino
Michael Gerndt
Emmanuel Jeannot
Raymond Namyst
Jean Roman
Stephen L. Scott
Jesper Larsson Träff
Geoffroy Vallée
Josef Weidendorfer (Eds.)
Euro-Par 2011:
Parallel Processing
Workshops
CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC,
HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC
Bordeaux, France, August 29 – September 2, 2011
Revised Selected Papers, Part II
Volume Editors
Vice-Chair
Luc Bougé ENS Cachan, France
European Representatives
José Cunha New University of Lisbon, Portugal
Marco Danelutto University of Pisa, Italy
Emmanuel Jeannot INRIA, France
Paul Kelly Imperial College, UK
Harald Kosch University of Passau, Germany
Thomas Ludwig University of Heidelberg, Germany
Emilio Luque University Autonoma of Barcelona, Spain
Tomàs Margalef University Autonoma of Barcelona, Spain
Wolfgang Nagel Dresden University of Technology, Germany
Rizos Sakellariou University of Manchester, UK
Henk Sips Delft University of Technology,
The Netherlands
Domenico Talia University of Calabria, Italy
Honorary Members
Ron Perrott Queen’s University Belfast, UK
Karl Dieter Reinartz University of Erlangen-Nuremberg, Germany
Program Committee
Pasquale Cantiello Second University of Naples, Italy
Maria Fazio University of Messina, Italy
Florin Fortis West University of Timisoara, Romania
Francesco Moscato Second University of Naples, Italy
Viorel Negru West University of Timisoara, Romania
Massimo Villari University of Messina, Italy
Program Committee
Patrick Bridges UNM, USA
Thierry Delaitre The University of Westminster, UK
Christian Engelmann ORNL, USA
Douglas Fuller ORNL, USA
Ada Gavrilovska Georgia Tech, USA
Jack Lange University of Pittsburgh, USA
Adrien Lebre Ecole des Mines de Nantes, France
Laurent Lefevre INRIA, University of Lyon, France
Jean-Marc Menaud Ecole des Mines de Nantes, France
Christine Morin INRIA, France
Thomas Naughton ORNL, USA
Dimitrios Nikolopoulos University of Crete, Greece
Josh Simons VMWare, USA
Samuel Thibault LaBRI, France
Program Committee
David Bader Georgia Institute of Technology, USA
Martti Forsell VTT, Finland
Jim Held Intel, USA
Peter Hofstee IBM, USA
Magnus Jahre NTNU, Norway
Chris Jesshope University of Amsterdam, The Netherlands
Ben Juurlink Technical University of Berlin, Germany
Jörg Keller University of Hagen, Germany
Christoph Kessler University of Linköping, Sweden
Avi Mendelson Microsoft, Israel
Vitaly Osipov Karlsruhe Institute of Technology, Germany
Martti Penttonen University of Eastern Finland, Finland
Sven-Bodo Scholz University of Hertfordshire, UK
Jesper Larsson Träff University of Vienna, Austria
Theo Ungerer University of Augsburg, Germany
Uzi Vishkin University of Maryland, USA
Sponsors
VTT, Finland http://www.vtt.fi
University of Vienna http://www.univie.ac.at
Euro-Par http://www.euro-par.org
Program Committee
Patrick Amestoy University of Toulouse, France
Peter Arbenz ETH Zurich, Switzerland
Rob Bisseling Utrecht University, The Netherlands
Daniela di Serafino Second University of Naples and ICAR-CNR,
Italy
Jack Dongarra University of Tennessee, USA
Salvatore Filippone University of Rome Tor Vergata, Italy
Laura Grigori INRIA, France
Andreas Grothey University of Edinburgh, UK
Mario Rosario Guarracino ICAR-CNR, Italy
Sven Hammarling University of Manchester and NAG Ltd., UK
Mike Heroux Sandia National Laboratories, USA
Gerardo Toraldo University of Naples Federico II and
ICAR-CNR, Italy
Bora Ucar CNRS, France
Rich Vuduc Georgia Tech, USA
Ulrike Meier Yang Lawrence Livermore National Laboratory, USA
Program Committee
Jacques Bahi University of Franche-Comté, France
Jorge Barbosa FEUP, Portugal
George Bosilca Innovative Computing Laboratory - University
of Tennessee, Knoxville, USA
Andrea Clematis IMATI CNR, Italy
Michel Dayde IRIT - INPT / ENSEEIHT, France
Frederic Desprez INRIA, France
Pierre-Francois Dutot Laboratoire LIG, France
Alfredo Goldman University of São Paulo - USP, Brazil
Program Committee
Pratul K. Agarwal Oak Ridge National Laboratory, USA
David A. Bader College of Computing, Georgia Institute of Technology, USA
Ignacio Blanquer Universidad Politécnica de Valencia,
Valencia, Spain
Daniela Calvetti Case Western Reserve University, USA
Werner Dubitzky University of Ulster, UK
Ananth Y. Grama Purdue University, USA
Concettina Guerra University of Padova, Italy
Vicente Hernández Universidad Politécnica de Valencia, Spain
Salvatore Orlando University of Venice, Italy
Omer F. Rana Cardiff University, UK
Richard Sinnott National e-Science Centre, University of
Glasgow, Glasgow, UK
Fabrizio Silvestri ISTI-CNR, Italy
Erkki Somersalo Case Western Reserve University, USA
Paolo Trunfio University of Calabria, Italy
Albert Zomaya University of Sydney, Australia
Program Committee
Nazim Agoulmine University of Evry, France
Michael Brenner Leibniz Supercomputing Centre, Germany
Ewa Deelman University of Southern California, USA
Karim Djemame University of Leeds, UK
Thomas Fahringer University of Innsbruck, Austria
Alex Galis University College London, UK
Dieter Kranzlmüller Ludwig-Maximilians-Universität, Germany
Laurent Lefevre INRIA, France
Edgar Magana CISCO research labs, USA
Patricia Marcu Leibniz Supercomputing Centre, Germany
Carlos Merida Barcelona Supercomputing Center, Spain
Steven Newhouse European Grid Initiative, The Netherlands
Omer F. Rana Cardiff University, UK
Stefan Wesner High Performance Computing Center
Stuttgart, Germany
Philipp Wieder Technische Universität Dortmund, Germany
Ramin Yahyapour Technische Universität Dortmund, Germany
Program Committee
Andreas Knüpfer TU Dresden, Germany
Dieter an Mey RWTH Aachen, Germany
Jens Doleschal TU Dresden, Germany
Karl Fürlinger University of California at Berkeley, USA
Michael Gerndt TU München, Germany
Allen Malony University of Oregon, USA
Program Committee
Vassil Alexandrov Barcelona Supercomputing Center, Spain
David E. Bernholdt Oak Ridge National Laboratory, USA
George Bosilca University of Tennessee, USA
Jim Brandt Sandia National Laboratories, USA
Patrick G. Bridges University of New Mexico, USA
Greg Bronevetsky Lawrence Livermore National Laboratory, USA
Franck Cappello INRIA/UIUC, France/USA
Kasidit Chanchio Thammasat University, Thailand
Zizhong Chen Colorado School of Mines, USA
Nathan DeBardeleben Los Alamos National Laboratory, USA
Jack Dongarra University of Tennessee, USA
Christian Engelmann Oak Ridge National Laboratory, USA
Yung-Chin Fang Dell, USA
Kurt B. Ferreira Sandia National Laboratories, USA
Ann Gentile Sandia National Laboratories, USA
Cecile Germain University Paris-Sud, France
Rinku Gupta Argonne National Laboratory, USA
Paul Hargrove Lawrence Berkeley National Laboratory, USA
Xubin He Virginia Commonwealth University, USA
Larry Kaplan Cray, USA
Daniel S. Katz University of Chicago, USA
Thilo Kielmann Vrije Universiteit Amsterdam, The Netherlands
Dieter Kranzlmueller LMU/LRZ Munich, Germany
Zhiling Lan Illinois Institute of Technology, USA
Chokchai (Box) Leangsuksun Louisiana Tech University, USA
Xiaosong Ma North Carolina State University, USA
Celso Mendes University of Illinois at Urbana Champaign,
USA
Steering Committee
Lars Bengtsson Chalmers University, Sweden
Ren Wu HP Labs, Palo Alto, USA
Program Committee
David A. Bader Georgia Tech, USA
Michael Bader Universität Stuttgart, Germany
Denis Barthou Université de Bordeaux, France
Lars Bengtsson Chalmers, Sweden
Karl Fürlinger LMU, Munich, Germany
Dominik Göddeke TU Dortmund, Germany
Georg Hager University of Erlangen-Nuremberg, Germany
Anders Hast University of Gävle, Sweden
Ben Juurlink TU Berlin, Germany
Rainer Keller HLRS Stuttgart, Germany
Gaurav Khanna University of Massachusetts Dartmouth, USA
Harald Köstler University of Erlangen-Nuremberg, Germany
Dominique Lavenier INRIA, France
Manfred Mücke University of Vienna, Austria
Andy Nisbet Manchester Metropolitan University, UK
Ioannis Papaefstathiou Technical University of Crete, Greece
Franz-Josef Pfreundt Fraunhofer ITWM, Germany
Additional Reviewers
Antony Brandon Delft University of Technology,
The Netherlands
Roel Seedorf Delft University of Technology,
The Netherlands
Program Committee
Padmashree Apparao Intel Corp., USA
Hassan Barada Khalifa University, UAE
Volker Buege University of Karlsruhe, Germany
Isabel Campos IFCA, Spain
Stephen Childs Trinity College Dublin, Ireland
William Gardner University of Guelph, Canada
Derek Groen UVA, The Netherlands
Ahmad Hammad FZK, Germany
Sverre Jarp CERN, Switzerland
Xuxian Jiang NC State, USA
Kenji Kaneda Google, Japan
Krishna Kant Intel, USA
Yves Kemp DESY Hamburg, Germany
Marcel Kunze Karlsruhe Institute of Technology, Germany
Mario Cannataro
Bioinformatics Laboratory,
Department of Medical and Surgical Sciences,
University Magna Græcia of Catanzaro,
88100 Catanzaro, Italy
cannataro@unicz.it
Foreword
The availability of high-throughput technologies, such as microarray and mass
spectrometry, and the diffusion of genomics and proteomics studies to large
populations, are producing an increasing amount of experimental and clinical
data. Biological databases and bioinformatics tools are essential for organizing
and exploring such biological and biomedical data with the aim of discovering
new knowledge in biology and medicine. However, the storage, preprocessing, and
analysis of experimental data are becoming the main bottleneck of the analysis
pipeline.
High-performance computing may play an important role in many phases
of life sciences research, from raw data management and processing, through data
integration and analysis, to data exploration and visualization. Consequently, well-known
high-performance computing techniques such as Parallel and Grid Computing, as
well as emerging computational models such as Graphics Processing (GPU) and Cloud
Computing, are increasingly used in bioinformatics.
The huge size of experimental data is the first reason to implement
large distributed data repositories, while high-performance computing is necessary
both to cope with the complexity of bioinformatics algorithms and to allow
the efficient analysis of huge data sets. In such a scenario, novel parallel architectures
(e.g. Cell processors, GPUs, FPGAs, hybrid CPU/FPGA) coupled with
emerging programming models may overcome the limits posed by conventional
computers to the mining and exploration of large amounts of data.
The second edition of the Workshop on High Performance Bioinformatics
and Biomedicine (HiBB) aimed to bring together scientists in the fields of high
performance computing, computational biology and medicine to discuss the par-
allel implementation of bioinformatics algorithms, the application of high per-
formance computing in biomedical applications, and the organization of large-scale
databases in biology and medicine. As in the past, this year's workshop was
organized in conjunction with Euro-Par, the main European (yet international)
conference on all aspects of parallel processing.
Presentations were organized in three sessions. The first session (Bioinformatics
and Systems Biology) comprised two papers discussing the parallel
October 2011
Mario Cannataro
On Parallelizing On-Line Statistics
for Stochastic Biological Simulations
1 Introduction
is involved) and data storage (e.g., when the amounts of each species for each
time sample of a simulation have to be tracked).
A single stochastic simulation represents just one possible way in which the
system might react over the entire simulation time-span. Many simulations are
usually needed to get a representative picture of how the system behaves on the
whole. Multiple simulations exhibit a natural independence that would allow
them to be treated in a rather straightforward parallel way. On a multicore platform,
however, they might exhibit serious performance degradation due to the concurrent
usage of underlying memory and I/O resources.
In [2] we presented a highly parallelized simulator for the Calculus of Wrapped
Compartments (CWC) [5], which efficiently exploits multi-core architectures using
the FastFlow programming framework [8]. The simulator relies on selective memory [1],
i.e. a data structure designed to perform the on-line alignment and reduction of
multiple computations. FastFlow itself is organized as a stack of layers that progressively
abstract the shared-memory parallelism at the level of cores up to the definition
of useful programming constructs supporting structured parallel programming
on cache-coherent shared-memory multi- and many-core architectures.
Even in distributed computing, the data processing of hundreds (or even thousands)
of simulations is often demoted to a secondary aspect of the computation and
delegated to off-line post-processing tools. The storage and processing of simulation
data, however, may require a huge amount of storage space (linear in
the number of simulations and in the size of the observed time courses) and an
expensive post-processing phase, since data must be retrieved from permanent
storage and processed.
In this paper, we adapt the approach presented in [2] to support concurrent
real-time data analysis and mining. Namely, we enrich the parallel version of
the CWC simulator with on-line (parallel) statistics tools for the analysis of results
on cache-coherent, shared-memory multicores. To this aim, we exploit the
FastFlow framework, which makes it possible not only to run multiple parallel
stochastic simulations but also to combine their results on the fly according to
user-defined analysis functions, e.g. statistical filtering or clustering. In this respect,
it is worth noticing that while running independent simulations is an embarrassingly
parallel problem, running them aligned at the simulation time and
combining their trajectories with on-line procedures definitely is not, as it amounts
to merging high-frequency data streams. This, in turn, requires enforcing that simulations
proceed aligned according to the simulation time in order to avoid the explosion
of the working set of the statistical and mining reduction functions.
The Calculus of labelled Wrapped Compartments (CWC) [5,2] has been designed
to describe biological entities (like cells and bacteria) by means of a nested
structure of ambients delimited by membranes.
The terms of the calculus are built on a set of atoms (representing species,
e.g. molecules, proteins or DNA strands), ranged over by a, b, . . ., and on a set
[Fig. 1. Architecture of the tool: left box, the CWC parallel simulator (a farm of Simulation Engines scheduled over simulation instances, with selective memory aligning and buffering the produced simulation objects); right box, the parallel on-line filtering stage (a pipeline that buffers dataset windows and dispatches them to a farm of Statistic Engines computing, e.g., mean, variance, and k-means). Only the diagram labels were recoverable; the figure itself is omitted.]
The CWC parallel simulator, which is extensively discussed in [2] and sketched
in Fig. 1 (left box), employs the selective memory concept, i.e. a data structure
supporting the on-line reduction of time-aligned trajectory data by way of one
or more user-defined associative functions (e.g. statistic and mining operators).
Selective memory differs from a standard parallel reduce operation because
it works on (possibly unbounded) streams, and aligns simulation points (i.e. stream
items) according to simulation time before reducing them: since different simulations
do not proceed in lockstep, simulation points coming from different simulations
cannot simply be reduced as soon as they are produced [1].
In this work, we further extend the selective memory concept by making it
parallel via a FastFlow accelerator [8], which makes it possible to offload selective
memory operators onto a parallel on-line statistical tool implementing the same
functions in a parallel fashion. The pipeline has two stages: 1) statistic buffering,
and 2) a farm of statistic engines. The first stage creates dataset windows (i.e. a
number of arrays of simulation-time-aligned trajectory data from different
simulations). The second stage farms out the execution of one or more filtering or
mining functions, which are independently executed on different (possibly overlapping)
dataset windows. Additional filtering functions can easily be plugged in
by simply extending the list of statistics with additional (reentrant) sequential
or parallel functions (i.e. adding a function pointer to that list). Overall, the
parallel simulation (Fig. 1, left box) and the parallel on-line filtering (Fig. 1, right
box) work, in turn, as a two-stage pipeline.
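The following fragment is a conceptual sketch, in Python, of this alignment-and-dispatch idea; it is not the FastFlow/C++ code of the actual simulator, and the class and function names are invented for illustration. Samples are buffered per simulation time, and a window is reduced only once every simulation instance has delivered its value for that time, which keeps the working set bounded.

```python
# Conceptual sketch of the selective memory idea (NOT the FastFlow/C++
# implementation of [2,1]; all names here are illustrative).
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


class SelectiveMemory:
    def __init__(self, n_simulations, filters=(mean, variance)):
        self.n_simulations = n_simulations
        self.filters = filters             # user-defined reduction functions
        self.pending = defaultdict(dict)   # simulation time -> {sim_id: value}

    def add(self, sim_id, t, value):
        """Called by a Simulation Engine for every produced sample."""
        window = self.pending[t]
        window[sim_id] = value
        if len(window) < self.n_simulations:
            return None                    # window not yet time-aligned: keep buffering
        values = list(window.values())
        del self.pending[t]                # discard the raw data once reduced
        return t, {f.__name__: f(values) for f in self.filters}
```

In the actual tool the reduction itself is offloaded through the FastFlow accelerator to the farm of Statistic Engines, so several dataset windows are filtered concurrently.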
Oscillatory Systems. Many processes in living organisms are oscillatory (e.g. the
beating of the heart or, on a microscopic scale, the cell cycle). In these systems
molecular noise plays a fundamental role inducing oscillations and spikes. We
are currently working on statistical tools to synthesize the qualitative behavior
of oscillations through peak detection and frequency analysis [16].
4 Examples
We now consider two motivating examples that illustrate the effectiveness of the
presented real-time statistical and mining reduction functions.
Simple Crystallization. Consider a simplified CWC set of rules for the crystallization of species "a":
: 2 ∗ a −→ b  (rate 1e−7)        : a c −→ d  (rate 1e−7)
We here show how to reconstruct the first two moments of species "c" using the
on-line statistics, based upon 100 simulations running for 100 time units with
a sampling time ΔS = 1 time unit. The starting term was: T = 10^6 ∗ a  10 ∗ c.
Figure 2(a) shows the on-line computation of the mean and standard deviation
for species c. Notice that in these cases of mono-stable behavior, the mean of the
stochastic simulations overlaps the solution of the corresponding deterministic
simulation using ODEs.
[Fig. 2. On-line mean and standard deviation over the raw simulation trajectories: (a) number of "c" molecules in the crystallization example over 100 time units, with the ODE solution overlaid; (b) number of "a" molecules (×10^5) in the stable switch over 2·10^−4 time units. Plots omitted; only axis labels and legends were recoverable.]
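For a single aligned time sample, the first and second moments can be maintained incrementally as the trajectories arrive; the sketch below uses the classical Welford update and is a textbook routine, not code from the simulator.

```python
class OnlineMoments:
    """One-pass mean/variance (Welford's update) for the values observed at
    one aligned simulation time; a textbook sketch, not the simulator's code."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 0 else 0.0
```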
Switches. We here consider two sets of CWC rules abstracting the behavior of
a stable and an unstable biochemical switch [4], showing how to reconstruct the
equilibria of the species using on-line clustering techniques on the filtered
trajectories. The stable switch with two competing agents a and c is based on a
very simple population model (with only 3 agents) that computes the majority
value quickly, provided the initial majority is sufficiently large. The essential idea
of the model is that when two agents a and c with different preferences meet,
one drops its preference and enters a special “blank” state b; b then adopts the
preference of any non-blank agent it meets. The rules modeling this case are:
: a c −→ c b    : c a −→ a b    : b a −→ a a    : b c −→ c c    (each with rate 10)
The unstable switch is based on a direct competition where a species a catalyzes
the transformation of another species c into a and, in turn, c catalyzes the
transformation of a into c. In this example any perturbation of a stable state
can initiate a random walk towards the other stable state. The set of CWC rules
modeling this case is:
: a c −→ a a    : c a −→ c c    (each with rate 10)
In these cases, simple mean and standard deviation do not meaningfully summarize
the overall behavior. For instance, in Fig. 2(b) the mean is not representative
of any simulation trajectory.
Figures 3(a) and (b) show the resulting clusters (black circles) computed on-line
using K-means on the stable switch and QT on the unstable switch for
species a over 60 stochastic simulations. The stable switch was run for 2 · 10^−4
time units with ΔS = 4 · 10^−6. The number of clusters for K-means was set to
2. The starting term was: T = 10^5 ∗ a  10^5 ∗ c. The unstable switch was run for
0.1 time units with ΔS = 2 · 10^−3. The threshold of the clustering diameter for QT
was set to 100. The starting term was: T = 100 ∗ a  100 ∗ c. Circle diameters are
proportional to each cluster size.
[Fig. 3. On-line clustering results (black circles) on the stable and unstable switches: (a) K-means clustering on the stable switch (number of "a" molecules ×10^5 over 2·10^−4 time units), (b) QT clustering on the unstable switch (number of "a" molecules over 0.1 time units). The figures also report the raw simulations. Plots omitted.]
K-means is suitable for stable systems where the number of clusters and their
tendencies are known in advance; in the other cases QT, although more computationally
expensive, can build accurate partitions of the trajectories, giving evidence
of instabilities with a dynamic number of clusters.
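A simplified, one-dimensional variant of QT clustering over the values of one dataset window might look as follows. This is only an illustrative sketch (the real filters run inside the Statistic Engines); the cluster diameter is taken here as the spread max − min of the grouped values.

```python
def qt_cluster_1d(values, diameter):
    """Greedy quality-threshold clustering of scalar values: repeatedly extract
    the largest group of sorted values whose spread (max - min) does not exceed
    `diameter`. A simplified 1-D sketch in the spirit of QT clustering [11]."""
    points = sorted(values)
    clusters = []
    while points:
        best = (0, 0)                                   # (start, end) of the best run
        j = 0
        for i in range(len(points)):
            if j < i:
                j = i
            # two-pointer scan: extend the run while its spread stays within diameter
            while j < len(points) and points[j] - points[i] <= diameter:
                j += 1
            if j - i > best[1] - best[0]:
                best = (i, j)
        clusters.append(points[best[0]:best[1]])        # largest admissible cluster
        points = points[:best[0]] + points[best[1]:]    # remove it and repeat
    return clusters


# e.g. qt_cluster_1d(window_values, diameter=100) for the unstable switch
```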
Figure 4 shows the speedup of the simulation engines equipped with mean,
standard deviation, quantiles, K-means, and QT filters on an 8-core Intel platform
against the number of Simulation Engines with one and two Statistic Engines,
respectively, for varying numbers of simulations and sampling rates. The first
experiment shows the ability of selective memory to reduce I/O traffic, as the
speedup remains stable with an increasing number of simulations, and thus output size.
In the second experiment, the speedup decreases as the number of samples
increases, highlighting that the bottleneck of the system is in the data analysis
stage of the pipeline: any further increase in the number of Simulation Engines
does not bring performance benefits.
5 Related Work
The parallelisation of stochastic simulators has been extensively studied in the
last two decades. Many of these efforts focus on distributed architectures. Our
work differs from these efforts in three aspects: 1) it addresses multicore-specific
parallelisation issues; 2) it advocates a general parallelisation schema rather than
a specific simulator; 3) it addresses on-line data analysis, and thus is designed
to manage large streams of data. To the best of our knowledge, many related
works cover some of these aspects, but few address all three.
The Swarm algorithm [14], which is well suited for biochemical pathway optimisation,
has been used in a distributed environment, e.g., in Grid Cellware [7], a
grid-based modelling and simulation tool for the analysis of biological pathways
that offers an integrated environment for several mathematical representations
ranging from stochastic to deterministic algorithms.
[Fig. 4 plots omitted; only axis labels and legends were recoverable: speedup vs. number of Simulation Engines (2–12) against the ideal, for 100 vs. 200 simulations (left) and 200 vs. 1000 samples (right).]
Fig. 4. Speedup on the stable switch simulation with 1 Statistic Engine for different
numbers of parallel simulations and 200 samples (left), and with 2 Statistic Engines for
different sampling rates and 200 simulations (right). The grey region delimits the available
platform parallelism (Intel x86_64 with 8 cores).
6 Conclusions
Starting from the Calculus of Wrapped Compartments and its parallel simulator,
we have discussed the problem of analysing stochastic simulation results,
which can be complex to interpret due to the intrinsic stochastic "noise" and
to the overlap of the many experiments required by the Monte Carlo method.
To this aim, we characterised some patterns of behaviour of biological system
dynamics, e.g. monostable, multi-stable, and oscillatory systems, and we
exemplified them with minimal yet paradigmatic examples from the literature.
For these, we identified data filters able to provide statistically significant
information to biological scientists in order to simplify the data analysis.
Both the simulations and the on-line statistic filters, which are parallel
and pipelined, can easily be extended with new simulation algorithms and filters
References
1. Aldinucci, M., Bracciali, A., Liò, P., Sorathiya, A., Torquati, M.: StochKit-FF: Ef-
ficient Systems Biology on Multicore Architectures. In: Guarracino, M.R., Vivien,
F., Träff, J.L., Cannataro, M., Danelutto, M., Hast, A., Perla, F., Knüpfer, A., Di
Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010. LNCS, vol. 6586, pp.
167–175. Springer, Heidelberg (2011)
2. Aldinucci, M., Coppo, M., Damiani, F., Drocco, M., Torquati, M., Troina, A.:
On designing multicore-aware simulators for biological systems. In: Proc. of Intl.
Euromicro PDP 2011: Parallel Distributed and Network-Based Processing, pp.
318–325. IEEE, Ayia Napa (2011)
3. Barnat, J., Brim, L., Safránek, D.: High-performance analysis of biological systems
dynamics with the divine model checker. Briefings in Bioinformatics 11(3), 301–312
(2010)
4. Cardelli, L.: On switches and oscillators (2011), http://lucacardelli.name
5. Coppo, M., Damiani, F., Drocco, M., Grassi, E., Troina, A.: Stochastic Calculus
of Wrapped Compartments. In: QAPL 2010, vol. 28, pp. 82–98. EPTCS (2010)
6. CWC Simulator website (2010), http://cwcsimulator.sourceforge.net/
7. Dhar, P.K., et al.: Grid cellware: the first grid-enabled tool for modelling and
simulating cellular processes. Bioinformatics 7, 1284–1287 (2005)
8. FastFlow website (2009), http://mc-fastflow.sourceforge.net/
9. Gillespie, D.: Exact stochastic simulation of coupled chemical reactions. J. Phys.
Chem. 81, 2340–2361 (1977)
10. Hartigan, J., Wong, M.: A k-means clustering algorithm. Journal of the Royal
Statistical Society C 28(1), 100–108 (1979)
11. Heyer, L., Kruglyak, S., Yooseph, S.: Exploring expression data: identification and
analysis of coexpressed genes. Genome Research 9(11), 1106 (1999)
12. Klingbeil, G., Erban, R., Giles, M., Maini, P.: Stochsimgpu: parallel stochastic
simulation for the systems biology toolbox 2 for matlab. Bioinformatics 27(8),
1170 (2011)
13. Petzold, L.: StochKit: stochastic simulation kit web page (2009),
http://www.engineering.ucsb.edu/~cse/StochKit/index.html
14. Ray, T., Saini, P.: Engineering design optimization using a swarm with an intelli-
gent information sharing among individuals. Eng. Opt. 33, 735–748 (2001)
15. Regev, A., Shapiro, E.: Cells as computation. Nature 419, 343 (2002)
16. Sciacca, E., Spinella, S., Genre, A., Calcagno, C.: Analysis of calcium spiking in
plant root epidermis through cwc modeling. Electronic Notes in Theoretical Com-
puter Science 277, 65–76 (2011)
Scalable Sequence Similarity Search and Join
in Main Memory on Multi-cores
1 Introduction
Similarity-based searches and joins are important for many applications such as
document clustering or plagiarism detection [7,16]. In bioinformatics, similarity-
based queries are used for sequence read alignment or for finding homologous
sequences between different species. In recent years, much effort has been spent
on developing tools to speed up similarity-based queries on sequences. Many
prominent tools use sophisticated index structures and filter techniques that
enable significant runtime improvements [2,8,9].
A challenge arises from the immense growth of sequence databases in the
past few years. For example, the number of sequences stored in EMBL grows
exponentially every year and sums up to more than 300 billion nucleotides as
of May 2011. One strategy to deal with this huge amount of data is to divide
it into smaller parts and perform analyses partition-wise in parallel. For this
scenario, Google developed the programming paradigm MapReduce to enable a
massively-parallel processing of huge data sets in large distributed systems of
commodity hardware [4]. However, the main bottlenecks of distributed MapReduce
are network bandwidth and disk I/O. Therefore, another option is to design
data structures and algorithms that adapt the MapReduce paradigm for many-
core servers [11]. We argue that modern many-core servers, combined with the
constantly falling prices for main memory, are perfectly suited to perform many
2 Preliminaries
Let Σ∗ be the set of all strings of any finite length over an alphabet Σ. The
length of a string s ∈ Σ∗ is denoted by |s|. A substring s[i . . . j] of s starts at
position i and ends at position j (1 ≤ i ≤ j ≤ |s|). Any substring of length q ∈ N
is called a q-gram. Conceptually, we ground our algorithms on operators for
similarity search and similarity join, which are defined as follows:
Let s be a string, R a bag of strings, d a distance function, and k a threshold. The
similarity-based search operator is defined as simsearch(s, R, k) = {r | d(r, s) ≤ k, r ∈ R}.
Similarly, for two bags of strings R, S, the similarity-based join operator
is defined as simjoin(R, S, k) = {(r, s) | d(r, s) ≤ k, r ∈ R, s ∈ S}.
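As a point of reference, both operators can be written directly as brute-force scans; the sketch below assumes some distance function d (e.g. edit distance) and performs none of the index-based pruning that PeARL applies.

```python
def simsearch(s, R, k, d):
    """All strings of the bag R within distance k of the query string s."""
    return [r for r in R if d(r, s) <= k]


def simjoin(R, S, k, d):
    """All pairs (r, s) from the bags R and S within distance k."""
    return [(r, s) for r in R for s in S if d(r, s) <= k]
```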
In PeARL, we support Hamming and edit distance as similarity measures. We
focus on edit distance based operations in this paper, but see [12] for the key
ideas on queries using Hamming distance. In general, the edit distance of r and
s is computed in O(|r| ∗ |s|) using dynamic programming. As we are mostly
interested in finding highly similar strings within a previously defined distance
threshold k, we use the k-banded alignment algorithm [5] with time complexity
O(k ∗ max{|r|, |s|}) instead.
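A generic sketch of such a banded computation is given below: only a diagonal band of width 2k+1 is filled, and k+1 is returned as soon as the distance is known to exceed the threshold. This is a standard formulation of banded alignment, not PeARL's implementation.

```python
def banded_edit_distance(r, s, k):
    """Edit distance of r and s restricted to a band of width 2k+1 around the
    main diagonal; returns k + 1 whenever the true distance exceeds k."""
    if abs(len(r) - len(s)) > k:
        return k + 1
    if not r or not s:
        return max(len(r), len(s))        # <= k here thanks to the check above
    inf = k + 1
    prev = [j if j <= k else inf for j in range(len(s) + 1)]
    for i in range(1, len(r) + 1):
        lo, hi = max(1, i - k), min(len(s), i + k)
        cur = [inf] * (len(s) + 1)
        cur[0] = i if i <= k else inf
        for j in range(lo, hi + 1):
            cost = 0 if r[i - 1] == s[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # consume a character of r
                         cur[j - 1] + 1,       # consume a character of s
                         prev[j - 1] + cost,   # match / substitution
                         inf)                  # cap values at k + 1
        prev = cur
        if min(prev[lo:hi + 1]) > k:           # whole band exceeds k: give up early
            return k + 1
    return prev[len(s)]
```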
Figure 1 displays a PeARL index for strings over Σ = {A, C, G, T }. Grey nodes
are string nodes, white nodes are infix nodes. Edge labels are not stored in
the index itself, but are displayed for better comprehensibility only. Displayed
q-gram sets indicate which bits in qGr are set.
3.1 Algorithms
Building the PeARL index for a set of strings R works as follows: In a first step,
R is sorted lexicographically, UIDs are assembled, and R is split into multiple
partitions based on shared prefixes. For each partition Ri ⊆ R, we start with an
empty trie TRi and iteratively insert each string contained in Ri using preorder
DFS traversal. After all strings from Ri have been inserted, we iterate once over
the whole trie and update the information min/max, fv, and qGr.
Similar to indexing, our algorithms for similarity-based searches and joins are
also grounded on preorder DFS traversal of all trie partitions. Each algorithm
is equipped with filtering strategies. These filters, namely prefix and edit dis-
tance pruning [14], character frequency pruning [1], and q-gram filtering [6],
have been introduced in slightly different contexts before. Their concrete usage
and efficiency for trie-based search and join queries is shown in [12]. Therefore,
we only briefly summarize our search and join strategies in the following and
concentrate on our novel parallelization scheme later.
Similarity search starts with a given search string q and traverses each trie
partition in a PeARL index starting at root. Whenever a new child of the current
node is reached, we first check whether we can prune this node (see [12] for details
on filtering). If all filters have been passed successfully, we compute the edit
distance between the query and the prefix of the node. If the distance exceeds a
threshold k, we start a backtracking routine and traverse the remaining, not yet
examined paths in the trie. Otherwise we descend forward to the leaves. When
a string node x is reached and d(q, x) ≤ k holds, we report a match.
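The following sketch shows the generic textbook scheme behind this kind of traversal: a dynamic-programming row is carried along the DFS and a subtree is abandoned once every cell of the row exceeds k. It illustrates only the prefix/edit-distance pruning; PeARL's nodes additionally carry the min/max, fv, and qGr information used by the frequency and q-gram filters [12], and this code is not PeARL's implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_string = False      # True if an indexed string ends at this node


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_string = True
    return root


def trie_search(node, query, k, prefix="", row=None, matches=None):
    """DFS over the trie carrying one edit-distance DP row per node prefix;
    prune a subtree when every cell exceeds k (distance grows with prefix length)."""
    if row is None:
        row = list(range(len(query) + 1))   # distances for the empty prefix
        matches = []
    if node.is_string and row[-1] <= k:     # string node within distance k: report
        matches.append((prefix, row[-1]))
    if min(row) > k:                        # prefix pruning: abandon this subtree
        return matches
    for ch, child in node.children.items():
        new_row = [row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            new_row.append(min(new_row[j - 1] + 1, row[j] + 1, row[j - 1] + cost))
        trie_search(child, query, k, prefix + ch, new_row, matches)
    return matches


# e.g. trie_search(build_trie(["ACGT", "ACCT"]), "ACTT", k=1)
```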
Similarity join for two sets R, S takes two PeARL indices PR , PS as input.
Each trie partition TRi is joined with each partition TSj . Recall that both tries
are partitioned by prefixes. We first check the partition prefixes on edit distance;
if k is already exceeded, we skip the corresponding trie pair.
Otherwise, we compute the similarity-based intersection of
both partitions. As for search, we start at the root nodes and traverse both tries
concurrently. When unseen nodes are reached, we check all filters and prune, if
possible. Whenever two string nodes x ∈ TRi , y ∈ TSj are reached, and given
that d(x, y) ≤ k holds, we report a match.
Each map thread has access to mapTaskList and extracts one task (TRi, TSj)
out of this list. After some initialization steps, map calls the join routine, which
executes the similarity join of TRi and TSj and returns the set of all similar
string pairs contained in (TRi, TSj) within the given distance k. These items are
inserted into an intermediate data structure. For each similar string pair (r, s),
an intermediate key is set to the UID of r. When one map iteration has finished
and as long as mapTaskList is not empty, the map thread extracts the next
(TRi , TSj ) pair out of this list and again computes the similarity join.
When all map tasks have been processed, the master partitions all inter-
mediate data on the basis of intermediate keys and passes each partition to a
separate reduce thread. This ensures that all similar string pairs which involve
r are assigned to the same intermediate partition. Finally, reduce sorts all (r, s)
pairs by edit distance. Optionally, reduce can also emit the number of
similar strings found in S for each r, or filter the results found for r on best
score.
Parallelizing similarity searches is analogous to the parallelization of similarity
joins. The main difference is that PS is replaced with one search sequence or a
list of search sequences. If not already present, each search pattern is assigned a
unique ID. For searches, the mapTaskList contains <ki, vi> pairs where ki is the
partition prefix of TRi and vi consists of TRi and the search sequence(s). As for
the join, similarity search is performed in the map phase.
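The sketch below mimics this map/reduce scheme with Python's multiprocessing instead of the in-memory MapReduce runtime used in the paper; the trie partitions are replaced by plain lists of strings, the intermediate key is the string r itself (standing in for its UID), and the reduce step is carried out in the master.

```python
from collections import defaultdict
from multiprocessing import Pool


def edit_distance(r, s):
    # plain O(|r|*|s|) DP, used only to keep the sketch self-contained
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        cur = [i]
        for j, cs in enumerate(s, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cr != cs)))
        prev = cur
    return prev[-1]


def map_task(task):
    R_i, S_j, k = task                     # one pair of string partitions
    out = []
    for r in R_i:
        for s in S_j:
            d = edit_distance(r, s)
            if d <= k:
                out.append((r, (s, d)))    # intermediate key = r (stands in for its UID)
    return out


def parallel_simjoin(task_list, workers=4):
    with Pool(workers) as pool:
        groups = defaultdict(list)
        for emitted in pool.map(map_task, task_list):          # map phase
            for key, value in emitted:
                groups[key].append(value)                       # partition by key
        # reduce phase: for each r, sort its join partners by edit distance
        return {r: sorted(vals, key=lambda v: v[1]) for r, vals in groups.items()}


# if __name__ == "__main__":
#     tasks = [(R_part, S_part, 2) for R_part in R_partitions for S_part in S_partitions]
#     result = parallel_simjoin(tasks)
```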
4 Evaluation
1 www.ncbi.nlm.nih.gov/dbEST/
5 Related Work
Morrison [10] introduced prefix trees as an index structure for storing strings and
exact string matching. Shang et al. [14] extended prefix trees with dynamic pro-
gramming techniques to perform inexact matching. Prefix pruning was studied
in [14] and is based on the observation that edit distance can only grow with pre-
fix length. Aghili et al. [1] proposed character frequency distance based filtering
to reduce candidate sets for similarity-based string searches. Indexing methods
based on q-grams restrict search spaces efficiently for edit distance based operations.
They take advantage of the observation that two strings can be within a small
edit distance only if they share a large number of q-grams [15].
The MapReduce programming model for parallel data analysis was initially
proposed by Dean and Ghemawat [4]. Vernica et al. [17] present an algorithm for
set-similarity string joins with distributed MapReduce. We could not compare
to their solution, since no in-memory version was available. Ranger et al. [11]
developed a MapReduce-based programming framework for shared-memory multi-core
servers with a scalability almost reaching that of hand-coded solutions.
A main application for similarity-based string searches and joins in bioinfor-
matics is read alignment. Almost all tools follow the seed-and-extend approach.
BLAST [2] seeds the alignment with hash-table indices and extends the initially
ungapped seeds with a banded local alignment algorithm. However, algorithms
that use only ungapped seeds might miss some valuable alignments. BWA-SW [8]
is one tool that allows gaps and mismatches in the seeds. We also applied PeARL
to read alignment and compared the execution times to BWA-SW. BWA-SW
significantly outperforms PeARL (data not shown), but it must be noted that it
is a heuristic that misses solutions, while PeARL solves the alignment problem
exactly. CloudBurst [13] is another tool for read alignment using MapReduce
on top of Hadoop [3]. A comparison between PeARL and CloudBurst is
pending.
References
1. Aghili, S.A., Agrawal, D., El Abbadi, A.: BFT: Bit Filtration Technique for Ap-
proximate String Join in Biological Databases. In: Nascimento, M.A., de Moura,
E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 326–340. Springer,
Heidelberg (2003)
2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local align-
ment search tool. J. Molecular Biology 215(3), 403–410 (1990)
3. Bialecki, A., Cafarella, M., Cutting, D., O’Malley, O.: Hadoop,
http://hadoop.apache.org/
4. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters.
Communications of the ACM 51(1), 107 (2008)
5. Fickett, J.W.: Fast optimal alignment. Nucl. Acids Res. 12(1 Part 1), 175–179 (1984)
6. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Sri-
vastava, D.: Approximate string joins in a database (Almost) for free. In: Proc.
VLDB, pp. 491–500 (2001)
7. Hoad, T.C., Zobel, J.: Methods for identifying versioned and plagiarized documents.
J. American Society for Information Science and Technology 54, 203–215 (2003)
8. Li, H., Durbin, R.: Fast and accurate long-read alignment with Burrows–Wheeler
transform. Bioinformatics 26(5), 589–595 (2010)
9. Li, H., Homer, N.: A survey of sequence alignment algorithms for next-generation
sequencing. Briefings in Bioinformatics 11(5), 473–483 (2010)
10. Morrison, D.R.: PATRICIA – practical algorithm to retrieve information coded in
alphanumeric. Journal of the ACM 15(4), 514–534 (1968)
11. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G.R., Kozyrakis, C.: Evalu-
ating mapreduce for multicore and multiprocessor systems. In: Proc. HPCA, pp.
13–24 (2007)
12. Rheinländer, A., Knobloch, M., Hochmuth, N., Leser, U.: Prefix Tree Indexing for
Similarity Search and Similarity Joins on Genomic Data. In: Gertz, M., Ludäscher,
B. (eds.) SSDBM 2010. LNCS, vol. 6187, pp. 519–536. Springer, Heidelberg (2010)
13. Schatz, M.C.: Cloudburst. Bioinformatics 25, 1363–1369 (2009)
14. Shang, H., Merrett, T.H.: Tries for approximate string matching. IEEE TKDE 8(4),
540–547 (1996)
15. Sutinen, E., Tarhio, J.: Filtration with q-Samples in Approximate String Matching.
In: Hirschberg, D.S., Meyers, G. (eds.) CPM 1996. LNCS, vol. 1075, pp. 50–63.
Springer, Heidelberg (1996)
16. Vakali, A., Pokorný, J., Dalamagas, T.: An Overview of Web Data Clustering
Practices. In: Lindner, W., Fischer, F., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds.)
EDBT 2004. LNCS, vol. 3268, pp. 597–606. Springer, Heidelberg (2004)
17. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapre-
duce. In: Proc. SIGMOD, pp. 495–506 (2010)
Enabling Data and Compute Intensive Workflows
in Bioinformatics
Gaurang Mehta1, Ewa Deelman1, James A. Knowles2, Ting Chen3, Ying Wang3,5,
Jens Vöckler1, Steven Buyske4, and Tara Matise4
1 USC Information Sciences Institute
2 Keck School of Medicine of USC
3 University of Southern California
4 Rutgers University
5 Xiamen University, P.R. China
{gmehta,deelman}@isi.edu
1 Introduction
Advances in the fields of molecular chemistry, molecular biology, and computational
biology have resulted in accelerated growth in bioinformatics research. In the last
decade there have been rapid developments in genome sequencing technology,
enabling large volumes of RNA and DNA to be sequenced from humans, animals,
and plants. Advances in biochemistry have also enabled protein analysis and bacterial
RNA studies to be carried out on a larger scale than ever before. A sharp drop in the
cost of genome sequencing instruments is enabling a larger number of scientists to
sequence genomes from a wide variety of species.
These developments have resulted in petabytes of raw data being generated in
individual laboratories. These massive data need to be analyzed quickly and in an
easy, efficient manner. At the same time, there is an increase in the availability of
large-scale clusters at most universities as well as national grid infrastructures, and
cheap and easily accessible cloud computing resources. Thus, scientists are looking
for simple tools and techniques to manage and analyze their data to produce scientific
results along with their provenance. This paper provides the motivation for the use of
workflow technologies in bioinformatics, followed by a description of the Pegasus
Workflow Management System (WMS) [1,2,28] and its application to the data
management and analysis issues arising in a few bioinformatics projects. The paper
concludes with related work and future plans.
2 Motivation
3 Workflow Technology
Workflows are defined as a collection of computational tasks linked via data and
control dependencies. Each task in a workflow is either a single invocation of an
executable or a sub-workflow containing more tasks. Several workflow technologies
have been developed over the last decade, each tackling different problems [22].
Business workflows attempt to coordinate business processes and are generally highly
customized for a specific company. Scientific workflows, on the other hand, tend to
be shared more frequently with collaborators and run on various types of platforms.
To enable scientific workflows, there is a wide variety of software systems, from
GUI-based drag-and-drop workflow systems [19,20,21] to web services-based
workflow enactors [19,21]. Pegasus WMS was originally developed to enable large-scale
physics experiments in the GriPhyN project [24]. As the scale of data and
analysis in bioinformatics applications has grown, it has been a natural fit to apply
the experience and technology of Pegasus to these projects as well.
The Pegasus Workflow Management System is a software system that supports
the development of large-scale scientific workflows and manages their execution
across local, grid [1,2,28], and cloud [3] resources simultaneously. Pegasus provides
APIs in Java, Python, and Perl to create workflow descriptions in the Abstract
Directed Acyclic Graph in XML (DAX) format. A DAX contains information about
all the steps or tasks in the workflow, including the arguments used to invoke the task
and the input and output datasets used and generated, as well as any relationships between
the tasks. DAXes are abstract descriptions of the workflow that are agnostic of the
resources available to run it and of the location of the input data and executables.
Pegasus compiles these abstract workflows into executable workflows by querying
information catalogs that contain information about the available resources, and it
sends computations across local and distributed computing infrastructures such as
the TeraGrid [29], the Open Science Grid [30], campus clusters, and emerging commercial
and community cloud environments [31] in an easy and reliable manner using Condor
[5] and DAGMan [6]. Fig. 1 shows the block diagram of Pegasus WMS.
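As an illustration, a minimal DAX could be produced with the Pegasus Python API roughly as follows (DAX3-style classes; exact class and method names may vary across Pegasus releases, and the transformation and file names used here are hypothetical).

```python
# Minimal two-step workflow description written with the Pegasus Python API.
# The transformation names ("alignment", "count_reads") and file names are
# placeholders, not tools shipped with Pegasus.
from Pegasus.DAX3 import ADAG, Job, File, Link

dax = ADAG("rnaseq-example")

reads = File("sample.fastq")
ref = File("reference.fa")
aligned = File("sample.sam")
counts = File("sample.counts")

align = Job(name="alignment")
align.addArguments("-r", ref, "-i", reads, "-o", aligned)
align.uses(ref, link=Link.INPUT)
align.uses(reads, link=Link.INPUT)
align.uses(aligned, link=Link.OUTPUT, transfer=True)
dax.addJob(align)

count = Job(name="count_reads")
count.addArguments(aligned, "-o", counts)
count.uses(aligned, link=Link.INPUT)
count.uses(counts, link=Link.OUTPUT, transfer=True)
dax.addJob(count)

# Control/data dependency: counting runs only after alignment has finished.
dax.depends(parent=align, child=count)

with open("rnaseq-example.dax", "w") as f:
    dax.writeXML(f)
```

Pegasus then plans this abstract description against the catalogs of available sites, data, and executables to produce the executable workflow submitted through Condor/DAGMan.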
Pegasus WMS optimizes data movement by leveraging existing grid and cloud
technologies via flexible, pluggable interfaces. It provides advanced features to
manage data transfers, data reuse, and automatic cleanup of data generated on remote
resources. It also provides for optimization of the execution by allowing several small
tasks to be clustered into larger jobs, thus minimizing execution overheads. Pegasus
interfaces with several job-scheduling systems via the Condor-G [4] interface,
allowing the various tasks in the workflow to be executed on a variety of resources.
Reproducibility is a very important part of computational science. To enable
scientists to track the progress of their workflows and tackle data reproducibility
issues, Pegasus captures the full provenance of the workflow, from the compilation
stage through execution to the data generated. Pegasus also monitors and captures
statistics during the run of the workflow, allowing scientists to accurately measure the
performance of their workflow.
Pegasus WMS also supports the use of hierarchical workflows, allowing users to
divide large pipelines into several smaller, more manageable sub-workflows. Each
sub-workflow is planned and executed only when all the necessary dependencies for
that sub-workflow have been satisfied. As a result, an application can cause different
sub-workflows to execute based on previous analysis in the upper-level workflow.
Pegasus WMS is a very reliable and robust system with several options for failure
recovery. Cloud and grid environments are inherently unreliable, as are the applications
themselves. In order to manage this, Pegasus automatically resubmits failed tasks to the
same or another resource several times before the task is declared failed. Pegasus will
also finish as many tasks and sub-workflows as possible regardless of one or more failed
tasks. When the workflow can proceed no further, a rescue workflow is created that can
be resubmitted after fixing whatever caused the failures. If re-planning of the workflow is
required (e.g. to make use of additional or new resources), Pegasus will reduce the
original workflow, eliminating tasks that have completed successfully, leaving only those
tasks that previously failed or were not submitted due to dependencies on the failed tasks.
4 Workflows in Bioinformatics
Recently, an ever-increasing number of bioinformatics applications have started
adopting workflows and workflow technologies to help them in their continuous
analysis of the large-scale data generated by scientific studies. Below we present a
variety of bioinformatics projects, including RNA sequencing, protein studies, and
quality control in population epidemiology studies, which are among the many
bioinformatics projects that use Pegasus WMS for their work.
Fig. 2. a) Pegasus workflow template. b) Implementation of workflow for five shotgun proteomic
data sets. c) Hierarchical cluster analysis of shotgun proteomic data.
The MassMatrix workflow was generated using the Pegasus Python API, which
produced the required XML workflow description, and executed on the available
distributed resources [8], which include high-performance clusters at the Ohio State
University and Amazon EC2. Fig. 2 shows a MassMatrix workflow template, its
instantiation for 5 shotgun datasets, and the final result shown as a hierarchical cluster
analysis. Currently MassMatrix is looking at ways to optimize the allocation and
efficient usage of computational resources for executing these workflows on a larger
scale, by balancing the cost and execution-time requirements as well as dynamically
modifying the parallelism in the workflows [1].
The production run computed approximately 225 lanes of Brain RNA sequences,
using about 50 days' worth of CPU time and producing approximately 10 TB of data.
Table 1 shows the number of lanes, files used and generated, and data size from the
workflow runs. A production pipeline using PerM that aligns sequences to the
transcriptome and the human genome, and computes advanced differential analysis
[12], is currently being run.
Fig. 4. Cancer Atlas RNA Seq Alignment and Variant Calls using Pegasus in SeqWare
One of the requirements of SeqWare for running their workflows is the capability to
easily run similar workflows on the local campus cluster, on Amazon EC2, or inside a
simple Virtual Machine, enabling the user to scale the analysis in a flexible way. Also
due to strict data privacy issues, SeqWare wanted to use their own mechanisms for data
and views are created as necessary for later QC steps. Each of these QC steps
comprises a sub-workflow containing several steps to verify the submitted data.
Failure of some steps is considered a critical failure resulting in rejection of the
submitted data, while other steps may flag interesting data that requires verification by
the study. Additionally, the QC for association results is only performed if the QC for
SNP summaries and phenotypes succeeded. Finally, an aggregated report for each
study data set submitted is produced and provided to the study for further manual
analysis and verification.
A large number of bioinformatics projects deal with human data. These data have
strict requirements regarding who can access the data, how it must be stored, etc.
Because of these restrictions it can be difficult to have a hosted workflow service for
users where they can upload their datasets for analysis. In order to provide users with
an easy way to utilize existing workflows for analyzing their data, we have bundled
Pegasus WMS with several workflow pipelines [12] that users can install and run
directly on their laptops, desktops, or in a cloud environment. The virtual machine
(VM) image is built and shipped as a vmdk file. This file can be used directly with
Virtual Box [16], VMware [17], or kvm [18] software. Simple scripts are provided to
upload data into the VM, configure the workflows and execute them in a few steps.
Users can also use these virtual machines as an easy way to evaluate several
different algorithms for their analysis, or as a way to get their application code and data
ready to be used in cloud environments. Currently we have two virtual machines
available: one with two RNA sequence analysis workflows, and the other with a portal
interface that includes several smaller workflows such as copy number variation
detection, association testing, and imputation.
6 Related Work
Several workflow systems [22] provide a way to automate bioinformatics pipelines to
aid the burgeoning field of bioinformatics. A few of the ones that are most popular are
mentioned below. Galaxy [20] is a Python-based GUI that allows a user to create
bioinformatics pipelines by creating Python wrapper modules. Galaxy is primarily a
desktop tool, but support is now available to run Galaxy on clusters and clouds.
Galaxy only supports scheduling tasks on a single set of resources that it is
preconfigured to use. Taverna [21] is a GUI-based workflow manager that primarily
supports web services-based pipelines. Recent support for non-web services
workflows has been added by providing automatic wrappers around non-web service
executables. While several bioinformatics projects have used Taverna to create and
share small workflows, it has not been suitable for creating and running large-scale
pipelines. Kepler [19], a workflow framework based on Ptolemy II [27], provides both a
GUI and a command-line interface to create and run workflows.
References
1. Deelman, E., Mehta, G., Singh, G., Su, M.H., Vahi, K.: Pegasus: Mapping Large-Scale
Workflows to Distributed Resources. In: Workflows for e-Science (2007)
2. Deelman, E., et al.: Pegasus: a Framework for Mapping Complex Scientific Workflows
onto Distributed Systems. Scientific Programming Journal 13, 219–237 (2005)
3. Juve, G., Deelman, E., Vahi, K., Mehta, G., et al.: Data Sharing Options for Scientific
Workflows on Amazon EC2. In: Proceedings of the 2010 ACM/IEEE International
Conference for High Performance Computing, Networking, Storage and Analysis (2010)
4. Frey, J., Tannenbaum, T., Livny, M., Foster, I., Tuecke, S.: Condor-G: a computation
management agent for multi-institutional grids. In: Proceedings 10th IEEE International
Symposium on High Performance Distributed Computing, vol. 5(3), pp. 55–63 (2002)
5. Litzkow, M.J., Livny, M., Mutka, M.W.: Condor: A Hunter of Idle Workstations. In: 8th
International Conference on Distributed Computing Systems (1988)
6. Couvares, P., Kosar, T., Roy, A., et al.: Workflow in Condor. In: Taylor, I., Deelman, E.,
et al. (eds.) Workflows for e-Science. Springer Press (January 2007)
7. Xu, H., Freitas, M.A.: Bioinformatics 25(10), 1341–1343 (2009)
8. Freitas, M.A., Mehta, G., et al.: Large-Scale Proteomic Data Analysis via Flexible Scalable
Workflows. In: RECOMB Satellite Conference on Computational Proteomics (2010)
9. Transcriptional Atlas of the Developing Human Brain,
http://www.brainspan.org/
10. Illumina Eland Alignment Algorithm, http://www.illumina.com
11. Chen, Y., Souaiaia, T., Chen, T.: PerM: Efficient mapping of short sequencing reads with
periodic full sensitive spaced seeds. Bioinformatics 25(19), 2514–2521 (2009)
12. Wang, Y., Mehta, G., Mayani, R., Lu, J., Souaiaia, T., et al.: RseqFlow: Workflows for
RNA-Seq data analysis. Submission: Oxford Bioinformatics-Application Notes
13. O’Connor, B., Merriman, B., Nelson, S.: SeqWare Query Engine: storing and searching
sequence data in the cloud. BMC Bioinformatics 11(suppl. 12), S2 (2010)
14. Matise, T.C., Ambite, J.L., et al.: For the PAGE Study. Population Architecture using
Genetics and Epidemiology. Am. J. Epidemiol (2011), doi:10.1093/aje/kwr160
15. Mailman, M.D., Feolo, M., Jin, Y., Kimura, M., Tryka, K., et al.: The NCBI dbGaP
Database of Genotypes and Phenotypes. Nat Genet. 39(10), 1181–1186 (2007)
16. Virtual Box, http://www.virtualbox.org/
17. VMware, http://www.vmware.com/
18. Kivity, A., Kamay, Y., Laor, D., Lublin, U., Liguori, A.: kvm: the Linux virtual machine
monitor. In: OLS 2007: The 2007 Ottawa Linux Symposium, pp. 225–230 (July 2007)
19. Ludascher, B., Altintas, I., Berkley, C., et al.: Scientific Workflow Management and the
Kepler System. Concurrency and Computation: Practice & Experience (2005)
20. Blankenberg, D., et al.: Galaxy: a web-based genome analysis tool for experimentalists. In:
Current Protocols in Molecular Biology, ch. 19, Unit 19.10.1-21 (2010)
21. Hull, D., Wolstencroft, K., Stevens, R., Goble, C., et al.: Taverna: a tool for building and
running workflows of services. Nucleic Acids Research 34, 729–732 (2006)
22. Romano, P.: Automation of in-silico data analysis processes through workflow
management systems. Briefings in Bioinformatics 9(1), 57–68 (2008)
23. Nakata, K., Lipska, B.L., Hyde, T.M., Ye, T., et al.: DISC1 splice variants are upregulated
in schizophrenia and associated with risk polymorphisms. PNAS, August 24 (2009)
24. Deelman, E., Kesselman, C., Mehta, G., et al.: GriPhyN and LIGO, Building a Virtual
Data Grid for Gravitational Wave Scientists. In: 11th Int. Symposium HPDC, HPDC11
2002, p. 225 (2002)
25. Eng, J.K., McCormack, A.L., Yates III, J.R.: An Approach to Correlate Tandem Mass
Spectral Data of Peptides with Amino Acid Sequences in a Protein Database. J. Am. Soc.
Mass. Spectrom. 5(11), 976–989 (1994)
26. Perkins, D.N., Pappin, D.J., et al.: Probability-based protein identification by searching
sequence databases using mass spectrometry data. Electrophoresis 20(18), 3551–3567 (1999)
27. Eker, J., Janneck, J., Lee, E.A., Liu, J., et al.: Taming heterogeneity - the Ptolemy
approach. Proceedings of the IEEE 91(1), 127–144 (2003)
28. Pegasus Workflow Management System, http://pegasus.isi.edu/wms
29. Teragrid, http://www.teragrid.org
30. Open Science Grid, http://www.opensciencegrid.org
31. FutureGrid, http://www.futuregrid.org
32. Nagavaram, A., Agrawal, G., et al.: A Cloud-based Dynamic Workflow for Mass
Spectrometry Data Analysis. In: Proceedings of the 7th IEEE International Conference on
e-Science (e-Science 2011) (December 2011)
Homogenizing Access to Highly
Time-Consuming Biomedical Applications
through a Web-Based Interface
1 Introduction
Complex diseases are explained by the interaction of many genetic factors together
with the environment. To shed light on which factors increase the risk of developing
a complex disease and how they interact with each other, genome-wide association
studies [1] as well as gene expression profiling [2], or a combination of both [4],
are currently being used.
The main feature of these data is their large size, which makes their analysis a
highly time-consuming task. As an example, genome-wide data sets of thousands of
gigabytes are becoming a common source of data to be analyzed to
detect genetic factors of complex diseases. Analyses are usually performed in high
easily adding any new software that can be useful and removing tools that are not used any more; and (4) simplicity, as the tool has been designed to be used by non-expert users. Compared with more complex frameworks, the tool's functionality and design are simple. Therefore, for it to run in a grid configuration there must be a software layer between the Web server and the OS providing high-level networking protocols and more stringent security capabilities.
Section 2 explains the main features of BioBench, the framework developed for the automatic construction of Web-based computational workbenches. In Section 3 we introduce BiosH, a workbench (http://bios.ugr.es/BiosH) that has been created to provide and use software applications through a Web-based user interface. The main conclusions appear in Section 4.
2.1 Description
Visitors may see their account information and ask to be promoted to a standard or expert user. Standard users can also run any software already registered in the Web-based workbench, see information about other registered users and manage their own user profiles. As standard users may want to know which other users use the applications on the server, with a view to further collaboration, we have added the possibility for them to get that information. In order to launch an application, they have to choose the software to be run, the server path where the input files are placed, the server path where the output files should be placed, if necessary, and the software arguments. Expert users have the same rights as standard users plus the ability to register a new application and to delete or modify software created by the same user. The software to be registered must already be installed on the server. The expert user has to provide several pieces of information to the system, such as the server path where the software can be found or a description of the parameters required for the software to be run. Additional responsibilities of the system administrator are to promote or step down a user and to install and export BioBench, the only function that has to be performed through a text-based user interface (unix terminal). Its main functions are summarized in Table 1.
2.2 Design
The logic architecture of BioBench is structured in three main layers, according
to the model-view-controller design pattern, in order to separate responsibilities
and ensure that the code implementing the software functionality and accessing the data (the model) is independent of the user interface provided to access this information (the view). The controller layer updates the view every time
a change is performed in the model. The model layer is divided into two sub-
systems: the user subsystem and the application subsystem. Figure 2 shows the
architecture of the model layer. The user subsystem contains the user model,
responsible for the representation and management of users. The application
subsystem is subsequently divided into three models: (1) the application model,
in charge of all the software tools for which BioBench provides a unified Web
interface, (2) the parameters model, in charge of the management of the param-
eters for each software application and (3) the folder model, which represents
and controls disk units and folders accessed by users and applications.
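As a purely illustrative aid (not part of BioBench, which is implemented in PHP), the following minimal Java sketch shows one possible shape of these model-layer entities; all class and field names are assumptions made for the example.

  // Illustrative sketch of the BioBench model layer (hypothetical names).
  // The user subsystem holds registered users and their roles; the application
  // subsystem holds registered tools, their parameters and the folders they may access.
  import java.util.List;

  class User {
      String login;
      String role;               // "visitor", "standard" or "expert"
  }

  class Parameter {
      String name;
      String type;               // e.g. input path, output path, flag
      String description;
  }

  class Application {
      String name;
      String serverPath;         // where the tool is installed on the server
      boolean runInBackground;
      List<Parameter> parameters;
      User registeredBy;         // only this expert user may modify or delete it
  }

  class Folder {
      String path;               // disk unit/folder accessed by users and applications
      User owner;
  }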
The physical architecture of BioBench and its interaction with users, other software and hardware is shown in Fig. 3. BioBench can be used to create a Web-based computational workbench on any server with a unix-like OS, Apache server, php and MySQL. However, for it to work in a cluster or, even more, a grid configuration, so that applications and/or data from more than one computer can be accessed by it, other software and additional security procedures are required. Therefore, the Apache capability has to be complemented with an extended version named General Remote Access, which considers the URI of the linked servers (cluster nodes or grids), the list of granted users and their credentials. According to this architecture, we need to install a small application on each linked server to act as client software that interacts with the main server. This also enables the possibility of monitoring the processes running on the server.
html code from the php functions. We also used the Prototype library (version 1.6) in order to benefit from its rich functionality and to simplify the implementation task. The application requires a database to store all the data, such as information about all the applications and their parameters and the system users, and to set up a permission protocol to model relationships between actions and user roles. BioBench uses a relational database with tables created from a set of 9 entities: Action, Application, Args, Description-App, Description-Arg, Labels-act, Labels-arg, Types and User. As the database management system we chose MySQL. Each php object creates a connection with the database using ADOdb, a database abstraction library for PHP (and Python) and MySQL. We have adapted a simple library, called eXplorer, which allows remote management of folders and files, interacting with the Xajax library. BioBench has been developed under the GNU General Public License (GNU GPL) 3.0. A Web site (http://bios.ugr.es/BioBench), from which the application can be downloaded, has been built at bios.ugr.es, a linux server where several bioinformatic applications have been installed for biomedical analyses.
Input
  I1. Text file with transcriptions for a set of i individuals (columns) and g genes (rows)
  I2. Text file (makeped format) with genotypes from the same population
  I3. Text file (makeped format) with genotypes from another population to select major alleles
  I4. One-column text file with p rows with genetic positions to compute Spearman correlations
  I5. The number of permutations to be performed in order to assess statistical evidence
Output
  O1. Text file with gene Spearman correlation coefficients and p values
  O2. Text file with [I5] rows and g × p columns with the Spearman value for each permutation
Applications
  1. ImportFormat PED [I2] [I2].gou
  2. ImportFormat PED [I3] [I3].gou
  3. SelectCommonSNPs [I2].gou [I2]Selection.gou I4
  4. SelectCommonSNPs [I3].gou [I3]Selection.gou I4
  5. Genetranassoc [I1] [I2]Selection.gou [I3]Selection.gou [O1].t
  6. Transpose [O1].t.csv [O1]
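As an illustration only, the following sketch (in Java, using ProcessBuilder) shows how the six steps above could be chained on the server; the concrete file names are hypothetical and the tools are assumed to be on the PATH. BioBench/BiosH itself invokes the registered applications through its Web forms rather than through such a driver.

  // Illustrative driver for the six-step analysis listed above (hypothetical file names).
  import java.io.IOException;

  public class GenetranassocPipeline {
      static void run(String... cmd) throws IOException, InterruptedException {
          // Run one tool, inheriting stdout/stderr so progress is visible.
          int rc = new ProcessBuilder(cmd).inheritIO().start().waitFor();
          if (rc != 0) throw new IOException("step failed: " + String.join(" ", cmd));
      }

      public static void main(String[] args) throws Exception {
          String i1 = "expression.txt", i2 = "pop1.ped", i3 = "pop2.ped",
                 i4 = "positions.txt", o1 = "correlations.txt";
          run("ImportFormat", "PED", i2, i2 + ".gou");
          run("ImportFormat", "PED", i3, i3 + ".gou");
          run("SelectCommonSNPs", i2 + ".gou", i2 + "Selection.gou", i4);
          run("SelectCommonSNPs", i3 + ".gou", i3 + "Selection.gou", i4);
          run("Genetranassoc", i1, i2 + "Selection.gou", i3 + "Selection.gou", o1 + ".t");
          run("Transpose", o1 + ".t.csv", o1);
      }
  }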
and was added by the Web administrator using the option under the ‘Settings’ link available to administrators to add users. Figure 4 shows the screenshot with the first form that was filled in to add Genetranassoc to BiosH. The main information that had to be provided, besides the application name and the path where it is installed, was whether the application has to be run in the background, the arguments required, and their type and description. On the left of the figure, we can observe all the functionality an expert user has regarding the applications (named programs in the workbench). As Genetranassoc accepts 5 arguments, 5 further forms collecting information for each parameter were filled in for the application to be used through BiosH. Once all the applications required for the
task in Tab. 2 were in BiosH, the biologist at the laboratory interested in that task was able to do it without any help and under the role of standard user, provided that she had a user account and enough disk space to store output results in the linux system where BiosH is installed. Figure 5 shows as an example the screenshot with the form that was filled in by her to perform step 3 described in Tab. 2.

Fig. 4. Screenshot showing the first form filled in to add the application Genetranassoc to BiosH

Fig. 5. Screenshot showing the form that has to be filled in order to run the application SelectCommonSNPs through BiosH
4 Conclusions
The quick growth of scientific research in the biomedical field and the huge amount of data from genomes, transcriptomes, etc. that has to be processed are significantly changing the way researchers work. Manual processing is not conceivable any more, and software applications are constantly being developed to be used in the laboratory. These applications are usually run on high-performance computers with several processors and a large central memory under a unix-like OS. However, the sharp increase in workload that bioinformaticians, biostatisticians and other researchers face as software developers is forcing them to write applications with a simple text-based user interface and no user documentation, which are usually only understood and used by their creators. Moreover, many biomedical researchers are not used to text-based interfaces
on unix servers, and they usually ask somebody to run the applications for them. Therefore, easing the use of bioinformatic applications is a task in high demand among biomedical researchers, so that they, rather than the software developers, can run these applications themselves. BioBench reaches this goal as a tool to create workbenches able to provide a friendly and homogeneous Web interface to the applications installed by any user on a server. In contrast to the very few similar tools, BioBench can be installed on any unix-like OS with a mysql+php Web server, and every user can add their self-written software so that it can be easily shared.
References
1. Abad-Grau, M.M., Medina-Medina, N., Montes-Soldado, R., Moreno-Ortega, J.,
Matesanz, F.: Genome-wide association filtering using a highly locus-specific trans-
mission/disequilibrium test. Human Genetics 128(3), 325–344 (2010)
2. Alekseev, O.M., Richardson, R.T., Alekseev, O., O’Rand, M.G.: Analysis of gene
expression profiles in HeLa cells in response to overexpression or siRNA-mediated
depletion of NASP. Reprod. Biol. Endocrinol. 7, 45 (2009)
3. Blanchet, C., Combet, C., Daric, V., Deléage, G.: Web Services Interface to Run
Protein Sequence Tools on Grid, Testcase of Protein Sequence Alignment. In:
Maglaveras, N., Chouvarda, I., Koutkias, V., Brause, R. (eds.) ISBMDA 2006.
LNCS (LNBI), vol. 4345, pp. 240–249. Springer, Heidelberg (2006)
4. Dimas, A.S., Deutsch, S., Stranger, B.E., Montgomery, S.B., Borel, C., Attar-
Cohen, H., Ingle, C., Beazley, C., Arcelus, M.G., Sekowska, M., Gagnebin, M.,
Nisbett, J., Deloukas, P., Dermitzakis, E., Antonarakis, S.E.: Common regula-
tory variation impacts gene expression in a cell type dependent manner. Sci-
ence 325(5945), 1246–1250 (2009)
5. Fox, J.A., McMillan, S., Ouellete, B.F.: A compilation of molecular biology web
servers: 2006 update on the bioinformatics links directory. Nucleic Acids Research
34, W3–W5 (2006)
6. Scheet, P., Stephens, M.: A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644 (2006)
7. Sebastiani, P., Abad-Grau, M.M.: Bayesian estimates of linkage disequilibrium.
BMC Genetics 8, 1–13 (2007)
8. Sebastiani, P., Abad-Grau, M.M., Alpargu, G., Ramoni, M.F.: Robust Transmis-
sion Disequilibrium Test for incomplete family genotypes. Genetics 168, 2329–2337
(2004)
9. Slottow, J., Korambath, P., Jin, K.: The integration of ajax, interactive x windows
applications and application input generation into the ucla grid portal. In: Proceed-
ings of the IEEE International Symposium on Parallel and Distributed Processing
(2008)
10. Stephens, M., Smith, N.J., Donnelly, P.: A new statistical method for haplotype
reconstruction from population data. Am. J. Hum. Genet. 68, 978–989 (2001)
Distributed Management and Analysis
of Omics Data
1 Introduction
The term omics refers to different biology disciplines such as, for instance, genomics, proteomics, or interactomics. The suffix -ome is used to indicate the objects of study of such disciplines, for instance the genome, proteome, or interactome, and usually refers to a totality of some sort. The main omics disciplines are thus genomics, proteomics, and interactomics, which respectively study the genome, proteome and interactome. The term omics data is used here to refer to experimental data regarding the genome, proteome or interactome of an organism.
The development of novel technologies for the investigation of the omics disciplines has caused an increased availability of omics data. Consequently, the need arises for both support and space for data storage, as well as procedures and structures for data exchange. The resulting scenario is thus characterized by the introduction of a set of methodologies and tools enabling the management of data stored in geographically distributed databases, using distributed tools often implemented as services.
– the introduction of a common shared data model able to capture both raw
data of the experiment and related metadata;
– the definition of a uniform and widely accepted access and manipulation
strategy for such large datasets;
– the design of algorithms that are aware of data distribution and thus may
improve their performance;
– the design of ad-hoc infrastructures for efficient data transfer.
For instance the distributed processing of protein interaction data involves the
following activities: (i) Sharing and dissemination of PPI data among different
databases; (ii) Collection of data stored in heterogeneous databases; and (iii)
Parallel and distributed analysis of data.
The first activity requires the development of both standards and tools to manage the process of data curation and exchange between interaction databases. Currently there is an ongoing project, namely the International Molecular Exchange Consortium (IMEx)1, that aims to standardize the exchange of interactomics data. The second activity requires solving the classical bioinformatics problem of linking identical data identified with different primary keys. Finally, the rationale for the third activity lies in the algorithmic nature of problems regarding graphs. A large class of algorithms that mine interaction data can be reduced to classical problems of graph and subgraph isomorphism, which are computationally hard. So the need for high-performance computational platforms as well as parallel algorithms arises.
The rest of the paper is structured as follows. Section 2 discusses the management issues of omics data and presents some omics databases. Section 3 recalls the main techniques for analysing omics data, while Section 4 describes some parallel and distributed bioinformatics tools for the analysis of omics data. Finally, conclusions and future work are reported in Section 5.
These databases store information about the primary sequence of proteins. Each sequence is generally annotated with several pieces of information, e.g. the name of the scientist who discovered the sequence or the post-translational modifications. Users can query these databases by using a protein identifier or a fragment of sequence in order to retrieve the most similar proteins.
1
http://imex.sourceforge.net
The typical size of microarray datasets is growing for two main reasons: the size of the files generated when using a single chip and the number of arrays involved in a single experiment are both increasing. Let us consider, for instance, two common Affymetrix microarray files (also known as CEL files): the older Human 133 chip CEL file has a size of 5 MB and contains 20000 different genes, while the newer Human Gene 1.0 ST has a typical size of 10 MB and contains 33000 genes. Moreover, a single array of the Exon family (e.g. Human Exon or Mouse Exon) can be up to 100 MB in size. In addition, the recent trend in genomics is to perform microarray experiments considering a large number of samples (e.g. coming from patients and controls) [1].
From this scenario, the need arises for tools and technologies to process such a huge volume of data in an efficient way. A possible way to achieve efficient preprocessing of microarray data is the parallelization of existing algorithms on multicore architectures. In such a scenario the whole computation is distributed onto different processors, which perform computations on smaller sets of data, and the results are finally integrated. Such a scenario requires the design of new algorithms for summarisation and normalisation that take advantage of the underlying parallel architectures. Nevertheless, a first step in this direction can be the replication, on different nodes, of existing preprocessing software that runs on smaller datasets.
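To make the replication idea concrete, the sketch below distributes a list of CEL files over a pool of workers, each running an existing single-file preprocessing routine; the preprocess() call and file names are placeholders and do not correspond to the affyPara or μ-CS interfaces.

  // Illustrative sketch: parallel preprocessing of microarray (CEL) files by
  // splitting the file list over a fixed pool of workers (placeholder logic).
  import java.util.*;
  import java.util.concurrent.*;

  public class ParallelCelPreprocessing {
      // Placeholder for an existing single-file preprocessing step
      // (background correction, normalisation, summarisation, ...).
      static double[] preprocess(String celFile) {
          return new double[0]; // assumption: real code would call an existing tool here
      }

      public static void main(String[] args) throws Exception {
          List<String> celFiles = List.of("sample1.CEL", "sample2.CEL", "sample3.CEL");
          ExecutorService pool = Executors.newFixedThreadPool(
                  Runtime.getRuntime().availableProcessors());
          List<Future<double[]>> partial = new ArrayList<>();
          for (String f : celFiles) {
              partial.add(pool.submit(() -> preprocess(f)));   // one task per file
          }
          for (Future<double[]> p : partial) {
              double[] expr = p.get();   // integrate the partial results here
          }
          pool.shutdown();
      }
  }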
Despite its relevance, the parallel processing of microarray data is a relatively new field. An important work is affyPara [15], a Bioconductor package for parallel preprocessing of Affymetrix microarray data. It is freely available from the Bioconductor project. Similarly, the μ-CS project presents a framework for the analysis of microarray data based on a distributed
7
http://cbio.mskcc.org/software/cpath
Once an interaction network is modeled using graphs, the study of biological properties can be done using graph-based algorithms [6], associating graph properties to biological properties of the modeled PPI network. Algorithms for the analysis of local properties of graphs may be used to analyze local properties of PPI networks, e.g. a dense distribution of nodes in a small graph region may be associated with proteins (nodes) and interactions (edges) relevant to representing biological functions.
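As a small illustration of such a local property, the following sketch computes, for a protein in a PPI network stored as an adjacency map, the edge density of the subgraph induced by the protein and its neighbours; a density close to 1 hints at a locally dense region. The identifiers are illustrative.

  // Illustrative local-density computation on a PPI graph stored as an adjacency map.
  import java.util.*;

  public class LocalDensity {
      // Density of the subgraph induced by node v and its neighbours:
      // (#edges among {v} ∪ N(v)) / (#possible edges in that set).
      static double neighbourhoodDensity(Map<String, Set<String>> g, String v) {
          Set<String> region = new HashSet<>(g.getOrDefault(v, Set.of()));
          region.add(v);
          int n = region.size();
          if (n < 2) return 0.0;
          int edges = 0;
          for (String a : region) {
              for (String b : g.getOrDefault(a, Set.of())) {
                  if (region.contains(b)) edges++;   // each undirected edge counted twice
              }
          }
          return (edges / 2.0) / (n * (n - 1) / 2.0);
      }

      public static void main(String[] args) {
          Map<String, Set<String>> ppi = Map.of(
                  "P1", Set.of("P2", "P3"),
                  "P2", Set.of("P1", "P3"),
                  "P3", Set.of("P1", "P2"));
          // A density close to 1.0 hints at a candidate functional module.
          System.out.println(neighbourhoodDensity(ppi, "P1"));
      }
  }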
The rationale for the distributed analysis of PPI data lies in the algorithmic nature of problems regarding graphs. A large class of algorithms that mine interaction data may be addressed using classical algorithms for solving the graph and subgraph isomorphism problems, which are computationally hard. So the need for high-performance computational platforms arises. Currently, different software tools that mine protein interaction networks are available through web interfaces. For instance, NetworkBlast8 and Graemlin9, which allow the comparison of multiple interaction networks, are both available through a web interface. Alignment algorithms usually employ different heuristics to cope with the subgraph isomorphism problem. Even so, they are usually time consuming and the size of the input data is still growing, so the development of high-performance architectures will be an important challenge in the future.
8
http://www.cs.tau.ac.il/~ bnet/networkblast.htm
9
http://graemlin.stanford.edu
4.2 MS-Analyzer
The analysis of Mass Spectrometry proteomics data requires the combination of
large storage systems, effective preprocessing techniques, and data mining and
visualization tools. The collection, storage and analysis of huge mass spectra pro-
duced in different laboratories can leverage the services of Computational Grids,
that offer efficient data transfer primitives, effective management of large data
stores, and large computing power. MS-Analyzer [7] is a software platform that
uses ontologies and workflows to combine spectra preprocessing tools, efficient
spectra management techniques, and off-the-shelf data mining tools to analyze
proteomics data on the Grid. Domain ontologies are used to model bioinformat-
ics knowledge about: (i) biological databases; (ii) experimental data sets; (iii)
bioinformatics software tools; and (iv) bioinformatics processes. MS-Analyzer
adopts the Service Oriented Architecture and provides both specialized spectra
management services and public available off-the-shelf data mining and visu-
alization software tools. Composition and execution of such services is carried
out through an ontology-based workflow editor and scheduler, and services are
discovered with the help of the ontologies. Finally, spectra are managed by a
specialized database.
4.3 IMPRECO
Starting from protein interaction data, a number of algorithms for the identification of biologically meaningful modules have been introduced, such as algorithms for the prediction of protein complexes. Protein complexes are sets of mutually interacting proteins that play a common biological role. The identification of
4.4 OntoPIN
PPI databases are often publicly available on the Internet, offering users the possibility to retrieve data of interest through simple querying interfaces. Users, in fact, can conduct a search through the insertion of: (i) one or more protein identifiers, (ii) a protein sequence, or (iii) the name of an organism. Results may consist of, respectively, a list of proteins that interact directly with the seed protein or that are at distance k from the seed protein, or the list of all the interactions of an organism. Often it is impossible to formulate even simple queries involving biological concepts, such as all the interactions that are related to glucose synthesis.
The OntoPIN project [2], conversely, demonstrates the effectiveness of using ontologies for annotating interactions, starting from the annotation of nodes and their subsequent use for querying interaction data. The OntoPIN project is based on three main modules:
cellular process annotation, (iv) cellular compartment. The user can insert a list of parameters that will be joined in a conjunctive way, i.e. the system will retrieve interactions whose participants are annotated with all the selected terms.
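A minimal sketch of this conjunctive filtering is given below: an interaction is returned only if both participants are annotated with every selected term. The data structures, term identifiers and method names are assumptions for the example and do not reproduce the OntoPIN implementation.

  // Illustrative conjunctive query over ontology-annotated interactions.
  import java.util.*;

  public class ConjunctiveInteractionQuery {
      record Interaction(String proteinA, String proteinB) {}

      static List<Interaction> query(List<Interaction> interactions,
                                     Map<String, Set<String>> annotations,
                                     Set<String> selectedTerms) {
          List<Interaction> result = new ArrayList<>();
          for (Interaction i : interactions) {
              // Conjunctive semantics: every selected term must annotate both participants.
              if (annotations.getOrDefault(i.proteinA(), Set.of()).containsAll(selectedTerms)
               && annotations.getOrDefault(i.proteinB(), Set.of()).containsAll(selectedTerms)) {
                  result.add(i);
              }
          }
          return result;
      }

      public static void main(String[] args) {
          Map<String, Set<String>> ann = Map.of(
                  "P1", Set.of("GO:0006094"),                 // hypothetical annotations
                  "P2", Set.of("GO:0006094", "GO:0005829"));
          List<Interaction> all = List.of(new Interaction("P1", "P2"));
          System.out.println(query(all, ann, Set.of("GO:0006094")));
      }
  }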
References
1. Guzzi, P.H., Cannataro, M.: Challenges in microarray data management and analy-
sis. In: Proceedings of the 24th IEEE International Symposium on Computer-Based
Medical Systems, Bristol, United Kingdom, June 27-30 (2011)
2. Cannataro, M., Guzzi, P.H., Veltri, P.: Using ontologies for querying and analysing
protein-protein interaction data. Procedia CS 1(1), 997–1004 (2010)
3. Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O’Donovan, C., Apweiler, R.:
The GOA database in 2009–an integrated Gene Ontology Annotation resource.
Nucleic Acids Research 37, D396–D403 (2009)
4. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L.: Gen-
Bank. Nucleic Acids Research 36(Database issue) (2008)
5. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.-C.C., Estreicher, A.,
Gasteiger, E., Martin, M.J., Michoud, K., O’Donovan, C., Phan, I., Pilbout,
S., Schneider, M.: The SWISS-PROT protein knowledgebase and its supplement
TrEMBL in 2003. Nucleic Acids Research 31(1), 365–370 (2003)
6. Cannataro, M., Guzzi, P.H., Veltri, P.: Protein-to-protein interactions: Technolo-
gies, databases, and algorithms. ACM Comput. Surv. 43 (2010)
7. Cannataro, M., Guzzi, P.H., Mazza, T., Tradigo, G., Veltri, P.: Using ontologies
for preprocessing and mining spectra data on the grid. Future Generation Comp.
Syst. 23(1), 55–60 (2007)
8. Cannataro, M., Guzzi, P.H., Veltri, P.: Impreco: Distributed prediction of protein
complexes. Future Generation Comp. Syst. 26(3), 434–440 (2010)
9. Cerami, E., Bader, G., Gross, B.E., Sander, C.: Cpath: open source software for
collecting, storing, and querying biological pathways. BMC Bioinformatics 7(497),
1–9 (2006)
10. Chaurasia, G., Iqbal, Y., Hanig, C., Herzel, H., Wanker, E.E., Futschik, M.E.:
UniHI: an entry gate to the human protein interactome. Nucl. Acids Res. 35(suppl.
1), D590–D594 (2007)
11. The UniProt Consortium: The universal protein resource (UniProt) in 2010. Nu-
cleic Acids Research 38(suppl. 1), D142–D148 (2010)
12. Craig, R., Cortens, J.P., Beavis, R.C.: Open source system for analyzing, validating,
and storing protein identification data. Journal of Proteome Research 3(6), 1234–
1242 (2004)
13. Desiere, F., Deutsch, E.W., King, N.L., Nesvizhskii, A.I., Mallick, P., Eng, J., Chen,
S., Eddes, J., Loevenich, S.N., Aebersold, R.: The peptideatlas project. Nucleic
Acids Research 34(suppl. 1), D655–D658 (2006)
14. Guzzi, P.H., Cannataro, M.: mu-cs: An extension of the tm4 platform to manage
affymetrix binary data. BMC Bioinformatics 11, 315 (2010)
15. Schmidberger, M., Vicedo, E., Mansmann, U.: Affypara: a bioconductor package
for parallelized preprocessing algorithms of affymetrix microarray data
16. Taylor, C.F., Hermjakob, H., Julian, R.K., Garavelli, J.S., Aebersold, R., Apweiler,
R.: The work of the human proteome organisation’s proteomics standards initiative
(HUPO PSI). OMICS 10(2), 145–151 (2006)
Managing and Delivering Grid Services (MDGS)
The aim of the MDGS workshop is to bring together Grid experts from the (Grid) infrastructure community with experts in IT service management in order to present and discuss the state of the art in managing the delivery of ICT services and how to apply these concepts and techniques to Grid environments. Up to now, work in this area has proceeded mostly on a best-effort basis. Little effort has been put into adopting the processes and approaches of professional (often commercial) IT service management (ITSM).
The workshop creates a platform for both the users of Grid-based services (e.g., high performance distributed computing users) and the people involved in contributing to Grids and their operation (e.g., members of grid initiatives, resource providers) to share their views on the topic of managed service delivery and the related requirements and constraints. This reveals the need for defined service levels in the form of service level agreements (SLAs) in Grid environments. Based on this, the workshop provides insight into the ITSM frameworks, and focuses on the exchange of ideas on how the Grid community may adopt and adapt the concepts and mechanisms of these frameworks (and the ITSM domain in general) to benefit from them. In this context, the specific features and characteristics of Grid environments are taken into account.
Contributions to MDGS 2011 describe ongoing work on various topics related to Service Level Management in Grid-based systems. The accepted papers cover current best practices in grid Service Level Management, problems faced, potential models from commercial IT Service Management to be adopted, and specific case studies highlighting the full complexity of the situation.
Resource Allocation for the French National Grid
Initiative
and a possible basis for establishing policies and procedures in the medium term, specific to France's context and based on international collaboration.
projects should be revised to ensure the visibility and sustainability of the French Grid Infrastructure. Beyond that, there is a clear need for accountability. In particular, France Grilles needs to be able to:
- Assess how resources and services are delivered to the French community;
- Justify that resources delivered to international communities are not wasted, and that there is a return on investment.
- New communities can join in and use resources without necessarily being filtered, provided their needs are reasonable (filtering is done above a given threshold in terms of how many resources the user asks for, if a precise amount is requested)
- Established user communities provide the scientific expertise needed to validate resource allocation above this threshold
- There is a unique point of contact for all users requesting resources
- The complexity of the model is not visible to users
- The whole model allows measuring and reporting on resource usage for both new and established communities, whether French or international
Fig. 1. A priori analysis for resource allocation requests from new users
Fig. 2. Establishment of an NGI based resource allocation agreement
5 Next Steps
In parallel with this work on a national resource allocation strategy, we are currently implementing a national VO for France Grilles. Establishing such a VO addresses the need for an easier integration of new users by providing a VO supported nationwide and open to all. Our usage scenario is to add this national VO to the offer already provided by local and regional VOs, so as to give the French community a larger spectrum of possibilities to answer their needs. This way we also build upon the existing structure and manpower set up in regional grids to remain as close as possible to the end user.
Resource allocation through the national VO can be seen as a possible implementation of the NGI based resource allocation agreement described earlier (see Fig. 2).
As mentioned in Section 4, a deeper study of the modalities of an a posteriori analysis is also needed to make any further progress. Part of our effort in the months to come will be dedicated to that.
Also, as explained earlier, the use of a tool to monitor and follow negotiations between resource providers and user communities is currently under study. This could lead to the set-up of an a posteriori usage dashboard and the possibility to drive the a priori allocation process for resource allocation through the national VO. Such assessments could also be used for the real-time implementation of the resource allocation.
References
1. EGI-Inspire web site, http://www.egi.eu/projects/egi-inspire/
2. EGEE web site, http://www.eu-egee.org
3. France Grilles web site, http://www.france-grilles.fr
4. Ferrari, T.: EGI Operations Architecture, EU deliverable D4.1,
https://documents.egi.eu/public/ShowDocument?docid=218
5. GENCI web site, http://www.genci.fr
6. Rivière, C.: GENCI: Grand Equipement National de Calcul Intensif. Rencontre GENCI
ORAP, Paris (2007), http://www.genci.fr/spip.php?article13
7. Rivière, C.: Rapport annuel 2009 de GENCI,
http://www.genci.fr/spip.php?article92
8. WLCG web site, http://lcg.web.cern.ch
9. WLCG MoU, Annex 9 “Rules of Procedure for the Resources Scrutiny Group (RSG)”,
http://lcg.web.cern.ch/LCG/mou.htm
10. PL-Grid web site, http://www.plgrid.pl
11. Szepieniec, T., Radecki, M., Tomanek, M.: A Resource Allocation-centric Grid Operation
Model. In: Proceedings of the ISGC 2010 Conference, Taipei, Taiwan (2010)
12. Bazaar Project Web Page, http://grid.cyfronet.pl/bazaar
On Importance of Service Level Management
in Grids
1 Introduction
Since the 1990s, when the term ‘Grid’ was coined, Grids have changed from early
prototype implementations to production infrastructures. However, despite ma-
turing considerably during this time, Grids still suffer from the lack of service
management solutions that would be suited to an infrastructure of the size and
user base of the current Grid. The maturing Grid technologies need to incor-
porate understanding of the business models of the users and service providers.
When possible, they should be composed from standard business solutions that
support service management and delivery.
5 Actors Perspective
In this section we analyse how each actor of the SLM model for Grids would
benefit from introducing Service Level Management solutions. We will also assess the cost of such an operation.
From the beginning of the realisation of the Grid concepts, the key Grid customers were large-scale projects with worldwide collaboration. For such projects, implementation of at least some processes from Service Level Management seems unavoidable. Taking the example of the main customers of the European Grid Initiative (EGI) [7], we show how SLM was necessary for them and how it was realised.
The most representative example of a large European project using Grids is the Large Hadron Collider (LHC) built at CERN. The LHC is the largest research device worldwide, gathering thousands of researchers in four different experiments. Each of these experiments requires petabytes of storage space and thousands of CPU cores to process the data. Data are produced by the experiments continuously while the LHC is running. Therefore, enough resources capable of handling large data volumes and throughput have to be supplied, on both the short and the long term. This includes computational and storage resources, as well as network facilities. Thus, the long-term goals require a special focus on infrastructure planning.
The LHC way of defining contracts related to resources was to launch the process of Memorandum of Understanding (MoU) preparation and signing. The process was extremely hard and problematic – it required many face-to-face
meetings and took several months. MoUs had to be signed by each Resource Provider. It was possible to agree only on very general metrics related to the capacity of the resources provided in the long term. The signing of the MoUs was planned to be a single action. However, the MoUs required fulfilling other quality metrics defined by OLAs acknowledged by the sites entering the Grid infrastructure. These metrics were not related to any specific customer.
Even with these simple means, the result of signing the MoUs was a considerable increase in the job success rate (a factor describing the fraction of tasks submitted to the Grid that could be completed normally) [9].
7
https://twiki.cern.ch/twiki/pub/ArdaGrid/ITUConferenceIndex/
C5-May2006-RRC06-2.ppt
8
http://www.urbanflood.eu
7 Related Works
assessment and report the SLA fulfilment to the customer. Also, these level-3 processes cover the lifecycle of the management and resolution of possible SLA violations.
Performance management is considered not only in the relationship between the service provider and their customers, but also between the service provider and their partners/suppliers. In this sense, it is worth mentioning the level-2 process called Supplier/Partner Performance Management. It decomposes further into five level-3 processes covering aspects like the performance assessment, its reporting and the actions to undertake in case the contracted quality drops below established thresholds. The performance of the service to be provided by a supplier or a partner is also captured in SLAs (Supplier/Partner SLAs).
9 Summary
References
1. Schwiegelshohn, U., et al.: Perspectives on Grid computing. Future Generation Computer Systems 26(8), 1104–1115 (2010)
2. Taylor, S., Lloyd, V., Rudd, C.: ITIL v3 - Service Design, Crown, UK (2007)
3. ISO/IEC 20000-1:2011 IT Service Management System Standard
4. Foster, I.: What is the Grid: A Three Point Checklist, Grid Today, July 20 (2002)
5. Plaszczak, P., Wellner, R.: Grid computing: the savvy manager’s guide. Elsevier
(2006) ISBN: 978-0-12-742503-0
6. Leff, A., Rayfield, J.T., Dias, D.M.: Service-Level Agreements and Commercial
Grids. IEEE Internet Computing 7(4), 44–50 (2003)
7. Candiello, A., Cresti, D., Ferrari, T., et al.: A Business Model for the Establishment
of the European Grid Infrastructure. In: The Proc. of the 17th Int. Conference on
Computing in High Energy and Nuclear Physics (CHEP 2009), Prague (March
2009)
8. Kryza, B., Dutka, L., Slota, R., Kitowski, J.: Dynamic VO Establishment in Dis-
tributed Heterogeneous Business Environments. In: Allen, G., Nabrzyski, J., Seidel,
E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009, Part II. LNCS,
vol. 5545, pp. 709–718. Springer, Heidelberg (2009)
9. Moscicki, J., Lamanna, M., Bubak, M., Sloot, P.: Processing moldable tasks on
the Grid: late job binding with lightweight User-level Overlay. Accepted for pub.
in FGCS (2011)
10. Business Process Framework, Release 8.0, GB921, TMForum (June 2009)
11. SLA Management Handbook, Release 3.0, GB917. TMForum (May 2010)
12. WS-Agreement Specification, version 1.0. Approved by OGF,
http://forge.gridforum.org/
13. Szepieniec, T., Tomanek, M., Twarog, T.: Grid Resource Bazaar. In: Cracow Grid
Workshop 2009 Proc., Krakow (2010)
On-Line Monitoring of Service-Level
Agreements in the Grid
Grid infrastructures federate resources from different providers [11], hence Ser-
vice Level Agreements between computing centers comprising the Grid, and
users running jobs, are needed to ensure the desired quality of service [10,7]. An
essential phase in SLA management is the monitoring of SLA fulfillment. The
prevailing approach is off-line SLA monitoring: data about resource usage and
performance is periodically sampled, stored, and subsequently analyzed for SLA
violations, like in the European EGI/EGEE infrastructure [13]. In on-line SLA
monitoring, on the other hand, resource usage and performance are analyzed
on the fly which allows for immediate alerts or corrective actions when an SLA
violation is detected or predicted.
We present a framework for on-line monitoring of SLA contracts in the Grid.
The solution is based on leveraging Complex Event Processing for on-line mon-
itoring in the Grid – GEMINI2 [1]. In this approach, basic SLA performance
metrics are collected on-line, while complex SLA metrics can be defined on demand as queries in a general-purpose continuous query language (EPL), and calculated in real time in a CEP engine. Advanced query capabilities are afforded by this approach: value aggregations, filtering, distributed correlations, joining of multiple streams of basic metrics, etc. Furthermore, client-perspective SLA monitoring is made possible. The capabilities of the solution are demonstrated in a case study: SLA monitoring of data-intensive Grid jobs.
This paper is organized as follows. Section 2 presents related work. Section 3
describes the framework for on-line SLA Monitoring in the Grid. In section 4,
SLA monitoring of data intensive jobs is studied. Section 5 concludes the paper.
2 Related Work
In [9], the authors propose the timed automata formalism to express SLA violations, and automatically generate monitors for these violations. Exactly the same can be achieved with Complex Event Processing: a continuous query language enables one to express SLA violations, while installing a query in a generic CEP engine is equivalent to creating a new monitor. However, CEP has the advantage of the availability of mature and efficient technologies. Moreover, a continuous query engine is more user-friendly and arguably no less expressive than timed automata. In fact, automata are formalisms often used in the implementation of CEP engines [8].
3.1 Architecture
Fig. 1 presents a high-level view over the architecture of the on-line SLA mon-
itoring framework. The SLA Monitoring Service and the Resource Information
Registry are the core components of the framework. Also shown are Resources
of the Grid Infrastructure (computers, storage devices, software services), a Re-
source Provider, and a Service-Level Management Service which uses the SLA
Monitoring Service to define SLA Metrics, and takes corrective actions when an SLA violation takes place or is predicted.
The resources of the Grid infrastructure provide event streams of basic SLA
metrics, such as current CPU load, current memory consumption, current data
transfer rate, response time to the latest client service request, etc. Additional
metrics can also be provided by the client side (response times, transfer rates
measured by the client, etc.).
The streams of basic metrics are consumed by the SLA Monitoring Service
wherein they can be transformed into composite metrics derived from one or
more basic streams. The composite metrics are defined on demand using the
continuous query language, and calculated in real time in the CEP engine. Examples of composite metrics include aggregations of basic metrics over time windows, joins of multiple basic-metric streams, and metrics combining resource-side and client-side measurements (concrete EPL examples are given in Section 4).
The SLA Monitoring Service is designed and implemented on the basis of the
GEMINI2 monitoring system [1]. GEMINI2 provides a framework for on-line
monitoring which encompasses a CEP-based monitoring server (GEMINI2 Mon-
itor) and local sensors (GEMINI2 Sensors). Monitoring data is represented as
events (collections of name – value pairs) which typically contain at least a unique
resource identifier (e.g. a host name), and a set of associated metrics (e.g. current
CPU load on the host).
Sensors are responsible for measuring the metrics and publishing the associ-
ated events to a Monitor. The Monitor contains a CEP engine (Esper [2]) and
exposes a service to formulate queries in the Event Processing Language (EPL).
The event streams from Sensors are processed against the queries in the CEP
engine which results in derived complex metrics returned to the requester.
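For concreteness, the sketch below shows how such a query could be installed in an Esper engine and attached to a listener, in the spirit of what the GEMINI2 Monitor exposes as a service; the event classes, attribute names and the EPL statement are assumptions based on the stream names used in this paper, and the code uses the classic Esper client API rather than the GEMINI2 interface itself.

  // Illustrative use of the (classic) Esper API to install an on-demand composite metric.
  import com.espertech.esper.client.*;

  public class SlaMetricExample {
      // Hypothetical event classes mirroring the basic-metric streams of this paper.
      public static class HostMs { public String hostName; public double cpuLoad; }
      public static class ProcessMs { public String hostName; public String processId; public double cpuUsage; }

      public static void main(String[] args) {
          Configuration cfg = new Configuration();
          cfg.addEventType("HostMs", HostMs.class);
          cfg.addEventType("ProcessMs", ProcessMs.class);
          EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(cfg);

          // Composite metric: join host load with per-process usage on the host name,
          // over a sliding 5-minute window (illustrative query, not taken from the paper).
          String epl = "select h.hostName, h.cpuLoad, p.processId, p.cpuUsage "
                     + "from HostMs.win:time(5 min) as h, ProcessMs.win:time(5 min) as p "
                     + "where h.hostName = p.hostName";
          EPStatement stmt = engine.getEPAdministrator().createEPL(epl);
          stmt.addListener((newEvents, oldEvents) -> {
              // Here an SLM service could check the derived metric against SLA thresholds.
              if (newEvents != null) System.out.println(newEvents[0].getUnderlying());
          });
      }
  }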
Besides monitoring event streams, Sensors also periodically publish Advertise-
ment events in the Monitor. These events register a resource with the Monitor,
along with their static attributes.
This request selects attributes from two streams: HostMs (which contains host
name and host metrics such as the current CPU load), and ProcessMs (which
contains a process identifier, host name on which the process is running, and
metrics, such as the CPU usage). The streams are joined on the value of their common attribute: the host name.
3.4 Registry
Registry is a database associated with a Monitor which contains information
about resources, specifically their static attributes (metadata) which are not
published in the monitoring event streams.
In order to combine data from the event streams and the Registry, the EPL request can contain an embedded SQL query (an example of such a combined request is given in Section 4).
Fig. 2. Example deployment of resources and SLA Monitoring system components for
the monitoring of data-intensive jobs scenario
1. Return the average read transfer rate for a disk array with a particular ID for the last 80 minutes.
select avg(currentReadTransferRate)
from DAMs(id='IP:mountDir').win:time(80 min);
2. Every 5 minutes return average read transfer rate for those disk arrays for
which it exceeded 100MB/s within the last 40 minutes.
select serverName, id, avg(currentReadTransferRate)
from DAMs.win:time(40 min) group by serverName, id
having avg(currentReadTransferRate) > 100
output all every 5 minutes;
3. Return current free capacity and average write transfer rate for all disk arrays managed by server zeus.cyfronet.pl. This request may be useful, e.g., to predict running out of disk space.
select id, freeCapacity, avg(currentWriteTransferRate)
from DAMs(serverName='zeus.cyfronet.pl').win:time(5 min)
group by id
output all every 5 minutes;
The next example shows a metric which combines data from event streams and
the Registry. The request selects HSM devices which currently undergo high
write transfer rates. In addition, the historical average for the device is returned.
select hsm.id, avg(hsm.currentWriteTransferRate), hsmreg.avgWriteTransferRate
from HSMMs.win:time(5 min) as hsm,
  sql:Registry ["select avg_write_transfer_rate as avgWriteTransferRate
                 from HSM
                 where res_id = ${hsm.id}"] as hsmreg
having avg(hsm.currentWriteTransferRate) > 60
Finally, the following example demonstrates SLA Monitoring that includes client-
side metrics. Let us assume that the user running and steering the simulation would like two requirements to be satisfied:
– The simulation is sufficiently responsive to user steering actions.
– The simulation results are delivered to GUI with transfer rate large enough
for real-time visualization.
Consequently, the following SLA could be requested: (a) average response time
of user interactions does not exceed 100ms, AND (b) average data transfer rate
from the processing job to the GUI does not drop below 128KB/s. Expressed in
EPL:
select avg(a.responseTime, 90), avg(b.inTransferRate)
from pattern [ every (a=ClientPerfMs(appId='app1') or
                      (b=DataTransferPerfMs(port='1111'))) ].win:time(5 min)
having avg(a.responseTime, 90) > 100 or
       avg(b.inTransferRate) < 128
This request consumes two event streams mentioned earlier: ClientPerfMs, which contains, among others, the response time of the latest simulation steering request, and DataTransferPerfMs, which contains performance metrics of data transfers to/from a host. The first stream also contains the attribute appId, which identifies the particular simulation session and is used to filter the stream. The second stream is also filtered against port number 1111, on which the GUI receives the simulation results. The request defines an event pattern ‘A or B’ – fulfilled if either of the two events happens.
5 Conclusion
This paper presents a novel and generic solution for efficient, near real time mon-
itoring of Service Level Agreements in the Grid. This solution is based on the
application of Complex Event Processing principles and supporting technologies.
We have elaborated a generic framework in which event streams represent indi-
vidual performance metrics which, in turn, can be combined into high-level com-
posite metrics. The main features of the monitoring framework are: on-demand
definition of SLA metrics using a high-level query language, real-time calcula-
tion of the defined SLA metrics and advanced query capabilities which allow for
defining high-level complex metrics derived from basic metrics. Resource infor-
mation registry complements the functionality of the framework by providing
a space for storing historical or long-term metrics, as well as resource metadata.
The information from the Registry can also be used in continuous queries, fur-
ther enhancing the capabilities of the framework in terms of definition of complex
SLA metrics. The case study of the data-intensive application has demonstrated the feasibility of this approach.
Future work involves the investigation of an efficient way of mapping high-level metrics into SLA obligations, improvement of the performance of the framework, and investigation of other on-line SLA monitoring use cases.
References
1. Balis, B., Kowalewski, B., Bubak, M.: Real-time Grid monitoring based on complex
event processing. Future Generation Computer Systems 27(8), 1103–1112 (2011),
http://www.sciencedirect.com/science/article/pii/S0167739X11000562
2. Berhardt, T., Vasseur, A.: Complex Event Processing Made Simple Using Esper
(April 2008), http://www.theserverside.com/news/1363826/
Complex-Event-Processing-Made-Simple-Using-Esper
(last accessed June 30, 2011)
3. Gorla, A., Mariani, L., Pastore, F., Pezzè, M., Wuttke, J.: Achieving Cost-Effective
Software Reliability Through Self-Healing. Computing and Informatics 29(1), 93–
115 (2010)
4. Litke, A., Konstanteli, K., Andronikou, V., Chatzis, S., Varvarigou, T.: Manag-
ing service level agreement contracts in OGSA-based Grids. Future Generation
Computer Systems 24(4), 245–258 (2008)
5. Menychtas, A., Kyriazis, D., Tserpes, K.: Real-time reconfiguration for guarantee-
ing QoS provisioning levels in Grid environments. Future Generation Computer
Systems 25(7), 779–784 (2009)
6. Michlmayr, A., Rosenberg, F., Leitner, P., Dustdar, S.: Comprehensive QoS moni-
toring of Web services and event-based SLA violation detection. In: Proceedings of
the 4th International Workshop on Middleware for Service Oriented Computing,
pp. 1–6. ACM (2009)
7. Moscicki, J., Lamanna, M., Bubak, M., Sloot, P.: Processing moldable tasks on
the grid: Late job binding with lightweightuser-level overlay. Future Generation
Computer Systems 27(6), 725–736 (2011),
http://www.sciencedirect.com/science/article/pii/S0167739X11000057
8. Mühl, G., Fiege, L., Pietzuch, P.: Distributed Event-Based Systems. Springer (Au-
gust 2006)
9. Raimondi, F., Skene, J., Emmerich, W.: Efficient online monitoring of web-service
slas. In: Proceedings of the 16th ACM SIGSOFT International Symposium on
Foundations of Software Engineering, pp. 170–180. ACM (2008)
10. Sahai, A., Graupner, S., Machiraju, V., van Moorsel, A.: Specifying and Monitoring
Guarantees in Commercial Grids through SLA. In: CCGRID 2003: Proceedings of
the 3rd International Symposium on Cluster Computing and the Grid, p. 292. IEEE
Computer Society, Washington, DC (2003)
11. Schwiegelshohn, U., Badia, R.M., Bubak, M., Danelutto, M., Dustdar, S.,
Gagliardi, F., Geiger, A., Hluchy, L., Kranzlmüller, D., Laure, E., Priol, T., Reine-
feld, A., Resch, M., Reuter, A., Rienhoff, O., Rüter, T., Sloot, P., Talia, D., Ull-
mann, K., Yahyapour, R., von Voigt, G.: Perspectives on grid computing. Future
Generation Computer Systems 26(8), 1104–1115 (2010),
http://www.sciencedirect.com/science/article/pii/S0167739X10000907
12. Smith, M., Schwarzer, F., Harbach, M., Noll, T., Freisleben, B.: A Streaming Intru-
sion Detection System for Grid Computing Environments. In: HPCC 2009: Pro-
ceedings of the 2009 11th IEEE International Conference on High Performance
Computing and Communications, pp. 44–51. IEEE Computer Society, Washing-
ton, DC (2009)
13. Szepieniec, T., Tomanek, M., Twaróg, T.: Grid Resource Bazaar: Efficient
SLA Management. In: Proc. Cracow Grid Workshop 2009, pp. 314–319. ACC
CYFRONET AGH, Krakow (2009)
14. Truong, H.L., Fahringer, T.: SCALEA-G: a Unified Monitoring and Performance
Analysis System for the Grid. Scientific Programming 12(4), 225–237 (2004)
15. Truong, H., Samborski, R., Fahringer, T.: Towards a framework for monitoring and
analyzing QoS metrics of grid services. In: Second IEEE International Conference
on e-Science and Grid Computing, e-Science 2006, p. 65. IEEE (2006)
16. Wright, H., Crompton, R., Kharche, S., Wenisch, P.: Steering and visualization:
Enabling technologies for computational science. Future Generation Computer Sys-
tems 26(3), 506–513 (2010)
Challenges of Future e-Infrastructure
Governance
Dana Petcu
1 Introduction
The e-Infrastructure landscape is changing to comply with the service-oriented paradigm, which enables increased innovation potential and cost-efficient access for a widening range of users, thereby strengthening the socio-economic impact. On the other hand, the sustainability of current e-Infrastructures has become a global concern, and the key role is played by their governance. Efficient, effective, transparent and accountable operations are nowadays the main topics of e-Infrastructure governance. These trends are recognized at national and European levels, with forceful e-Infrastructure agendas or strategies to promote an efficient governance for the research ecosystem. Further strategic development of e-Infrastructures should respond to the demand for and the necessity of Green IT, the need for massive computational power (exascale computing), the increasing amount of data, the seamless access to services for users, the internationalization of scientific research and the involvement of the user communities in the governance of e-Infrastructures. Aligned with these efforts and requirements, the e-Infrastructure Reflection Group (e-IRG) has recently analyzed the structures as well as organizational and relational aspects of current e-Infrastructures together with the governance process, distinguishing strategic processes from operational management and the various functional aspects of governance, e.g. the supporting legal and financing structures.
2 e-IRG Recommendations
The topics presented in the e-IRG white paper address several questions related to e-Infrastructures: (1) what are the appropriate governance models for e-Infrastructures; (2) how to advance research networks; (3) how to facilitate access; (4) how to deal with the increasing energy demands of computing; (5) what software is needed to fully harness the power of future HPC systems; (6) how to adopt and implement new e-Infrastructure services; (7) how to discover and share large and diverse sources of scientific data. Each question is treated in what follows.
In the context of an open ecosystem, one of the objectives of the governance of an authentication and authorization infrastructure (AAI) is to establish and maintain the level of mutual trust amongst users and service providers. The current requirements are, according to the e-IRG study: (1) improved usability, lowering the threshold for researchers to use the services; (2) improved security and accountability (often conflicting with the usability requirement); (3) leveraging of existing identification systems; (4) enhanced sharing, allowing users to minimize the burden of policy enforcement; (5) reduced management costs, freeing resources for other service or research activities, and providing a basis for accounting; (6) improved alliance with the commercial Internet, which also improves interaction between scientists and society.
In the case of identity recognition, there are several models. European NRENs operate identity federations, and provide services to a large number of users within academic and research communities. Based on open standards, these national identity federations focus on providing access to web-based resources, such as data repositories. The user typically acts as a consumer. A full e-Infrastructure should also allow the user to act as a producer of information. In this context, clear and simple mechanisms for accessing and managing authorization policies are required. Moreover, the connection of different national identity federations into a common identity space that supports real-time access to web resources across Europe is an ongoing task, as the maturity of the national AAIs differs substantially between countries. On the other hand, players outside academia include providers of user-centric identity management models (like OpenID, used in web 2.0 applications), as well as governments offering identity infrastructures rooted in a legally recognized and authoritative framework.
Several other technical problems need to be solved quickly: (a) support for the management of distributed dynamic virtual organizations; (b) robust and open accounting solutions to monitor e-Infrastructure services; (c) integration of user-centric and governmental infrastructures with academic AAIs.
In this context, e-IRG recommends:
The massive increase in the quantity of digital data leads to the urgent need to integrate data sources in order to build a sustainable way of providing a good level of information and knowledge – this feature is currently missing from the
4 Conclusions
While the topics presented in this paper refer to a variety of concerns related to future e-Infrastructures, a general trend towards service orientation can be concluded. Only this orientation can ensure that future e-Infrastructures will reach a wider European community of users. This vision has been captured in e-IRG's recent white paper, which has been presented and re-interpreted in this paper from the perspective of the researchers who will be involved in developing, delivering or using the future e-Infrastructures.
References
1. European Commission, A Digital Agenda for Europe (2010),
http://ec.europa.eu/information_society/digital-agenda/
2. European Commission, Work Programme 2011-2012. Cooperation. Theme 3. ICT -
Information and Communication Technologies (2011),
http://cordis.europa.eu/fp7/ict/
3. European Commission, Work Programme 2011. Capacities. Part 1: Research Infras-
tructures (2010), http://cordis.europa.eu/fp7/ict/e-infrastructure/
4. European Strategy Forum on Research Infrastructures, Strategy Report and
Roadmap Update (2010), http://ec.europa.eu/research/infrastructures/
5. e-Infrastructure Reflection Group, White paper (2011), http://www.e-irg.org
Influences between Performance Based Scheduling
and Service Level Agreements
1 Introduction
1 http://www.wrf-model.org/
2 http://www.gromacs.org/
Figure 1 shows the general principle of job submissions in the context addressed
here (see also [1]). The job submitted by a customer/user is placed in the queue of a
global scheduler. The main goal of the global scheduler is to decide on which infra-
structure component this job should be computed. As of now, this decision is often
taken based only on the current filling state of the local queues of all available
resources and a very coarse-grained classification of these resources, e.g., CPU- or
GPU-based computation units. After the decision is taken, the job is moved from the global queue
to the local queue of the selected computation unit.
[Fig. 1: a submitted job enters the global queue; the global scheduler dispatches jobs to the local queues of the computation units – diagram labels only]
were encountered. However, this will require either a notification system or starting
the benchmarks manually. An alternative strategy is to start the benchmarks
periodically. This eliminates the necessity of an event messaging system, but it
bears the risk of interference with productive jobs. Therefore, this strategy is
often combined with additionally defined policies, e.g., scheduling benchmarks only
when the local queues are empty. For our work, both approaches could be adopted,
and we abstain from recommendations and further discussion of this topic.
The extended scheduling engine is outlined in Figure 2.b). The result prediction
component is the core of the engine. In the first place, it takes into account the infor-
mation about fine-grained resource performance, the states of the local queues, and
the job description. During the submission phase, the job description should include
a specification of the class of computations to which this particular job belongs.
This information is needed in order to achieve a better match with the benchmark
tests used for the resource ranking. Based on this information and the scheduling
policies, the device for executing the job is selected. After the job is scheduled, the
performance evaluator component is in charge of qualitatively monitoring the job
execution. This information can be used for the verification of the performance goals
stated in the SLA. Further, the evaluation of the job execution performance – together
with the previous predictions – should be used in the prediction verification compo-
nent. The purpose of this component is to determine the deviation of the results from
their predictions. The deviation in turn can be used in the result prediction component
to reduce the prediction error before signing any SLA.
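As an illustration of how the selection step could combine these inputs, the following C++ sketch picks the resource with the smallest predicted response time for a job's declared computation class. The data structures and the simple additive cost model are our own illustration and are not taken from the authors' implementation.

#include <limits>
#include <map>
#include <string>
#include <vector>

// Illustrative sketch: combine a per-resource execution-time prediction
// (derived from the benchmark ranks for the job's computation class) with
// the estimated waiting time of the resource's local queue.
struct Resource {
    std::string name;
    std::map<std::string, double> predictedExecTime; // seconds, per class
    double estimatedQueueWait;                       // seconds
};

struct Job {
    std::string computationClass;  // declared by the user at submission time
};

// Pick the resource with the smallest predicted response time for this job.
const Resource* selectResource(const Job& job, const std::vector<Resource>& resources) {
    const Resource* best = nullptr;
    double bestTime = std::numeric_limits<double>::max();
    for (const auto& r : resources) {
        auto it = r.predictedExecTime.find(job.computationClass);
        if (it == r.predictedExecTime.end()) continue;   // no matching benchmark
        double responseTime = it->second + r.estimatedQueueWait;
        if (responseTime < bestTime) { bestTime = responseTime; best = &r; }
    }
    return best;   // nullptr if no resource matches the job's class
}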
Therefore, in order to fulfill the end-user requirements specified in the SLA, it is ne-
cessary to take into account two main pieces of information: the estimated execution
time at the different available resources and the estimated waiting time of the related
queues. For both
Figure 1. It is reasonable to base the job allocation strategy on the classical round-
robin procedure. We further considered the rank of the resources based on an
established application benchmark, i.e., ISO and HPL ranks.
To test the two components added to the global scheduler, we collected perform-
ance values of five resources under our domain/access, considering both levels of
benchmarks. To simulate the chosen scenarios and to compare the scheduling strate-
gies we employed the Java Modelling Tools [18], an open source tool for perform-
ance evaluation and workload characterization of computer and communication sys-
tems based on queuing networks. In the remainder of this section, we present the re-
sources and experimental results. Please note that in order to focus on the evaluation
of the overall concept we simplify the job allocation component by removing the
feedback loop consisting of the prediction and verification components.
The double-level benchmark was run to gain a precise description of the actual per-
formance offered by the computational systems along different metric axes. Figures
3 and 4 depict the performance values of the respective micro and application bench-
marks; we briefly discuss them in the following.
As Figure 3 outlines, the resources provide different performance with respect to
the considered benchmarks. For example, SC1458 achieves almost the best ranks for
the aggregated values and the interconnection performance but performs poorly consider-
ing the ranks of the single cores. For these benchmarks, michelangelo and ibm perform
better.
Figure 4 reports the relative performance of ISO and HPL; each resource is tagged
with a value in the range [1,…,5], where greater values correspond to worse perform-
ance (e.g., ibm and SC1458 rank first according to ISO and HPL, respectively). The
ranking was based on the execution Wall Clock Time (WCT).
Figures 3 and 4 show that, as expected, none of the resources is the best in all
cases; therefore an accurately designed performance-aware scheduling of the jobs is
essential for fulfilling the SLA.
In Figure 5 the response times of each strategy at increasing workloads are shown.
It is immediately clear that the proposed performance-based SLA scheduling strategy
outperforms the other schedulers. This is not surprising, since each resource is exploited
at its best with respect to the incoming workloads, i.e., each application is allocated to
the resources that execute the code in the most efficient way, in our analysis with the
shortest execution time. This leads to faster execution and lower waiting time; both
parameters impact (in this case positively) the response time. An increase of computation-
intensive workloads also influences our scheduling mechanism; however, the growth of
the response time is moderate compared with the other tested strategies.
5 Conclusion
Acknowledgements. The authors would like to thank the members of the Munich
Network Management (MNM) Team for their support and many useful discussions.
As a group of researchers from the Ludwig-Maximilians-Universität München, the
Technische Universität München, the University of the German Federal Armed
Forces, and the Leibniz Supercomputing Centre of the Bavarian Academy of Science
and Humanities, the MNM Team focuses on computer networks, IT management,
High Performance Computing, and inter-organizational distributed systems. The team
is directed by Prof. Dr. Dieter Kranzlmüller and Prof. Dr. Heinz-Gerd Hegering. For
more information please visit http://www.mnm-team.org.
This work has partially been funded by the Seventh Framework Program of the
European Commission (Grants 246703 (DRIHMS) and 261507 (MAPPER)), and by
the project REsource brokering for HIgh performance, Networked and Knowledge
based applications (RE-THINK), P.O.R. Liguria FESR 2007-2013.
References
1. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure, 2nd
edn. Elsevier (2004)
2. Distributed European Infrastructure for Supercomputing Applications (May 10, 2011),
http://www.deisa.eu/science/benchmarking
User Centric Service Level Management in mOSAIC Application
1 Introduction
improving the quality perceived by users, while resource administration and
optimization assume the role of acquiring the right amount of resources,
compliant with the needs of the final users, instead of optimizing the usage
of already acquired ones.
Service Level Agreements (SLAs) aim at offering a simple and clear way to
build up an agreement between the final users and the service provider in order
to establish what is effectively granted. A Service Level Agreement (SLA) is
an agreement between a Service Provider and a Customer that describes the
Service, documents Service Level Targets, and specifies the responsibilities of
the Provider and the Customer.
From the user's point of view, a Service Level Agreement is a contract that
guarantees what he will effectively obtain from the service. From the application
developer's point of view, SLAs are a way to have a clear and formal definition
of the requirements that the application must respect.
But, in such a context, how is it possible for an application developer to take
into account the quality perceived by each final user of his application? This
problem is solved by the adoption of SLA templates (i.e., predefined agreements
offered to final users). This means that developers must identify at design time
the constraints to be fulfilled, the performance indices to be monitored and the
procedures to be activated in risky situations (i.e., when something happens
that may lead to violating the agreement).
At the state of the art, many research efforts have been spent on defining
standards for SLA description (WS-Agreement [1], WSLA [6]) or operative
frameworks for SLA management (SLA@SOI [10,2], WSAG4J [11]). As shown
in more detail in the related work section (Section 2), the need for SLA management
in the cloud context is fully recognized, but there are currently no clear proposals
for an innovative approach to SLA management that takes into account the
user-centric view, which is typical of the cloud environment.
In this context the mOSAIC project [4,7] proposes a new, enhanced program-
ming paradigm that is able to exploit cloud computing features, building
applications which are able to adapt themselves as much as possible to the avail-
able resources and to acquire new ones when needed (more details in Section
3). mOSAIC offers both an API for the development of cloud applications which
are flexible, scalable, fault tolerant and provider-independent, and a framework
enabling their execution and access to different technologies.
The key idea of SLA management in mOSAIC is that it is impossible to offer
a single, static, general-purpose solution for SLA management of any kind of
application, but it is possible to offer a set of micro-functionalities that can be
easily composed in order to build up a solution dedicated to the application
developer's problem. In other words, thanks to the mOSAIC API approach (which
enables easy interoperability between mOSAIC components) it will be possible
to build up applications with user-oriented SLA management features from the
very early development stages.
The remainder of this paper is organized as follows: the next section (Section 2)
summarizes the state of the art of SLA management solutions, while the following
one briefly summarizes the main concepts related to the mOSAIC API and how it
is possible to develop applications using mOSAIC. Section 4 proposes our vision
of the SLA problem in the context of cloud applications, which is detailed in
the section dedicated to the architectural solution (Section 5). A brief section dedicated
to examples (Section 6) shows how the approach has been applied in simple case
studies. The paper ends with a section dedicated to the current status, future
work and conclusions.
2 Related Work
time, to have a clear and full notion of the state of the system, allowing him to
take decisions which lead to offering the guarantees needed in SLA management.
3 mOSAIC API
In mOSAIC a Cloud Application is developed as a composition of inter-connected
building blocks. A Cloud "Building Block" is any identifiable entity inside the
cloud environment. It can be the abstraction of a cloud service or of a software
component. It is controlled by the user, configurable, exhibits a well-defined behav-
ior, implements functionalities and exposes them to other application compo-
nents, and its instances run in a cloud environment consuming cloud resources.
Simple examples of components are: a Java application runnable in a plat-
form-as-a-service environment, or a virtual machine configured with its own
operating system, its web server, its application server and a configured and
customized e-commerce application on it. Components can be developed fol-
lowing any programming language, paradigm or in-process API. An instance of
a cloud component is, in a cloud environment, what an instance of an object
represents in an object-oriented application.
Communication between cloud components takes place through cloud re-
sources (like message queues – AMQP, or Amazon SQS) or through non-cloud
resources (like socket-based applications).
Cloudlets are the way offered to developers by the mOSAIC API to create compo-
nents. A cloudlet runs in a cloudlet container that is managed by the mOSAIC Soft-
ware Platform. A cloudlet can have multiple instances, but it is impossible at run-
time to distinguish between two cloudlet instances. When a message is directed to
a cloudlet, it can be processed by any one of the cloudlet instances. The number of
instances is under the control of the cloudlet container and is managed in order to
guarantee scalability (with respect to the cloudlet workload). Cloudlet instances are stateless.
Cloudlets use cloud resources through connectors. Connectors are an abstrac-
tion of the access model of cloud resources (of any kind) and are technology-in-
dependent. Connectors control the cloud resource through technology-dependent
drivers. As an example, a cloudlet is able to access Key-Value store systems
through a KVstore Connector, which uses an interoperability layer in order to control
a Riak or a MemBase KV driver.
Therefore a Cloud Application is a collection of cloudlets and cloud compo-
nents interconnected through communication resources.
Details about the mOSAIC programming model and about the cloudlet con-
cept, whose detailed description is out of the scope of this paper, can be found
in [4,5,7].
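To make the cloudlet/connector/driver layering more concrete, the following C++-style sketch shows a stateless cloudlet that persists incoming messages through a technology-independent key-value connector backed by a technology-specific driver. The interfaces and names are purely illustrative and do not reproduce the actual mOSAIC API.

#include <memory>
#include <string>
#include <utility>

// Illustrative sketch only -- not the actual mOSAIC API. It mirrors the
// layering described in the text: a stateless cloudlet uses a technology-
// independent KV-store connector, which delegates to a specific driver.
struct KvDriver {                       // e.g. a Riak or a MemBase driver
    virtual ~KvDriver() = default;
    virtual void put(const std::string& key, const std::string& value) = 0;
    virtual std::string get(const std::string& key) = 0;
};

class KvStoreConnector {                // abstraction of the access model
public:
    explicit KvStoreConnector(std::unique_ptr<KvDriver> driver)
        : driver_(std::move(driver)) {}
    void set(const std::string& k, const std::string& v) { driver_->put(k, v); }
    std::string get(const std::string& k) { return driver_->get(k); }
private:
    std::unique_ptr<KvDriver> driver_;
};

class RequestCloudlet {                 // stateless: all state lives in resources
public:
    explicit RequestCloudlet(KvStoreConnector& kv) : kv_(kv) {}
    // Invoked by the cloudlet container for each incoming message.
    void onMessage(const std::string& userId, const std::string& request) {
        kv_.set("last-request/" + userId, request);   // persist, then forward
    }
private:
    KvStoreConnector& kv_;
};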
SLA User Negotiation. This module contains all the cloudlets and compo-
nents which enable interactions between the user and the application in terms
of SLA negotiation.
SLA Monitoring/Warning System. This module contains all the cloudlets
and components needed to detect warning conditions and generate
alerts about difficulties in fulfilling the agreements. It should address both
resource and application monitoring. It is connected with the Cloud Agency.
SLA Autonomic System. This module includes all the cloudlets and com-
ponents needed to manage the elasticity of the application, and modules
that are in charge of making decisions in order to guarantee that the
signed agreements are fulfilled.
The complexity of the application depends on the guarantees offered by the SLA
and on the kind of target application running on top of the job submission
system (as an example: is it easy to predict its response time?). The complexity
due to the application behaviour (its predictability, the actions to take in order
to guarantee application-dependent parameters, ...) cannot be defined in general. On
the other side, the management of the SLA toward the user (negotiation), the
monitoring of the resource status, and the management of the SLA storage are common
to all applications. In the following we will focus on the components offered
in mOSAIC for such requirements.
As a first step we design the SLA that the developer aims at offering to
the users; in order to simplify the approach we model it by just two simple
parameters: the maximum amount of credits the final user wants to pay and
the maximum number of requests the user is allowed to submit. Moreover, the
application assures that the services will be offered on dedicated resources (the
same resources will not be sold to two users). This agreement is represented as
a WS-Agreement template, pieces of which are described in Listing 1.1.
Note that the monitoring of the SLA can be done independently of the state
of the acquired resources, just by tracing the requests. This means that we
will not use the autonomic and monitoring modules of the SLA architecture.
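Because the agreement is reduced to these two parameters, checking an incoming request against it amounts to a simple comparison. The sketch below (C++, illustrative only, not mOSAIC code) shows such an admission test under the assumption that each request has a known credit cost.

// Illustrative only -- not part of the mOSAIC API. The SLA is modeled by
// the two parameters described above; a request is admitted only if it
// keeps the user within both limits, and admitted requests are traced.
struct Agreement { double maxCredits; int maxRequests; };
struct UserState { double creditsUsed = 0.0; int requestsSubmitted = 0; };

bool admit(const Agreement& sla, UserState& user, double requestCost) {
    if (user.requestsSubmitted + 1 > sla.maxRequests) return false;
    if (user.creditsUsed + requestCost > sla.maxCredits) return false;
    user.requestsSubmitted += 1;        // trace the request (monitoring)
    user.creditsUsed += requestCost;
    return true;
}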
Once the SLA has been defined, we briefly show how to design the application,
whose behaviour includes the SLA negotiation and the agreement storing once it
has been signed. For each user request the application evaluates the acquired
resources and the available credit, eventually starts new resources, and then submits
the job to the acquired VC.
Following the micro-functionalities approach, the application can be designed
as in Figure 2. The mOSAIC API offers a simple SLAgw component, which im-
plements the WS-Agreement protocol (toward the final user) and sends messages
on predefined queues in order to update the application. As a consequence the
programmer has to develop a few cloudlets: an Agreement Policy Cloudlet,
which has the role of accepting or rejecting an SLA; a Request Cloudlet, which has
the role of forwarding the user requests to the job submission system; and two
cloudlets, Resource Policy Cloudlet and Guarantee Policy Cloudlet, which
have respectively the roles of tracing the acquired resources and generating warnings
for risky conditions. Cloudlets cooperate only through message exchange, coor-
dinating their actions. As an example, the Agreement Policy Cloudlet receives the
messages from the SLAgw each time a new SLA request takes place. Moreover, it
sends messages to the SLAgw in order to update the agreement state and to
query about the status of the agreements. Message data are represented in
JSON (which helps when data need to be stored in a KV store). As an example,
the messages sent by the SLAgw are JSON representations of ServiceTypes and
GuaranteeTerms extracted from the WS-Agreement. Note that they can be cus-
tomized by the final user (the WS-Agreement standard is open to this) and only the
final user knows how to represent them. The Monitoring Cloudlet regularly checks the
status of each user and eventually applies penalties (not reported in the WSAG
for simplicity and sake of space).
7 Conclusions
As outlined in Section 2, the management of Service Level Agreements is a hot
topic in the cloud environment. In the mOSAIC project, which aims at designing
References
1. Andrieux, A., Czajkowski, K., Dan, A., Keahey, K., Ludwig, H., Nakata, T.,
Pruyne, J., Rofrano, J., Tuecke, S., Xu, M.: Web services agreement specification
(ws-agreement). In: Global Grid Forum. The Global Grid Forum, GGF (2004)
2. Comuzzi, M., Kotsokalis, C., Rathfelder, C., Theilmann, W., Winkler, U., Za-
cco, G.: A Framework for Multi-level SLA Management. In: Dan, A., Gittler,
F., Toumani, F. (eds.) ICSOC/ServiceWave 2009. LNCS, vol. 6275, pp. 187–196.
Springer, Heidelberg (2010),
http://dx.doi.org/10.1007/978-3-642-16132-2-18,
doi:10.1007/978-3-642-16132-2-18
3. CONTRAIL: Contrail: Open computing infrastructures for elastic computing
(2010), http://contrail-project.eu/
4. Leymann, F., Ivanov, I., van Sinderen, M. (eds.): Towards a cross platform Cloud
API. Components for Cloud Federation. SciTePress Science and Technology Publications (2011)
5. IEEE (ed.): Building an Interoperability API for Sky Computing (2011)
6. Keller, A., Ludwig, H.: The wsla framework: Specifying and monitoring service level
agreements for web services. Journal of Network and Systems Management 11(1),
57–81 (2003)
7. mOSAIC: mosaic: Open source api and platform for multiple clouds (2010),
http://www.mosaic-cloud.eu
8. optimis: Optimis: the clouds silver lining (2010),
http://www.optimis-project.eu/
9. Venticinque, S., Aversa, R., Di Martino, B., Petcu, D.: Agent based cloud provi-
sioning and management, design and prototypal implementation. In: Leymann, F.,
et al. (eds.) 1st Int. Conf. Cloud Computing and Services Science (CLOSER 2011),
pp. 184–191. ScitePress (2011)
10. Theilmann, W., Yahyapour, R., Butler, J.: Multi-level SLA Management for
Service-Oriented Infrastructures. In: Mähönen, P., Pohl, K., Priol, T. (eds.) Ser-
viceWave 2008. LNCS, vol. 5377, pp. 324–335. Springer, Heidelberg (2008)
11. Waeldrich, O.: Wsag4j (2008),
https://packcs-e0.scai.fraunhofer.de/wsag4j/
Service Level Management for Executable
Papers
1 Introduction
The idea of the interactive paper is not new; the very first steps in this field were
introduced with the HyperText Markup Language [1]. A reader of a web page
was able to navigate from page to page by simply clicking on the link associated
with a certain concept. The technical details of the systems supporting the Hyper-
Text Markup Language are rather complex; however, the way the HyperText Markup
Language is exposed to both the readers and the writers of web pages is intuitive:
for the reader it is just colour-encoded text, while for the writer it is just a
simple line of code with a very simple syntax. When applets and ECMAScript
(http://en.wikipedia.org/wiki/ECMAScript) were introduced, the concept of hy-
pertext was pushed further: readers of web documents could execute small
applets and client-side scripts to run simple applications. The Executable Paper
(EP) Grand Challenge, organized by Elsevier in the context of the International
Conference on Computational Science (http://www.iccs-meeting.org/), aims to push
this concept one step further to include scientific publications. However, this is
not a trivial transition, as many scientific publications are about complex experi-
ments, which are often computing and data intensive, or require special software
and hardware. Proprietary software used by experiments is also subject to strict
licensing rules. The papers published in the grand challenge workshop propose
various solutions to realize the executable paper concept [6,7,8,9]. The papers
focus on the technical details and technology choices but give little attention
to the operational aspects associated with the deployment of such a service, and
to what the impact on the stakeholders would be of providing a reliable and scalable
service allowing re-execution of published scientific experiments.
The rest of the paper is organized as follows: Section 2 describes the executable
paper lifecycle, Section 3 discusses the exploitation of executable papers, Sec-
tion 4 describes the implementation of executable papers using a Cloud approach,
and Section 5 discusses the SLM needed to achieve a certain QoS.
The concept of EPs is feasible only if the lifecycle governing this concept is clear
and the role of the different actors is well defined throughout the entire lifecycle
of the production of the executable paper. This lifecycle starts from the time the
authors decide to write the paper, goes through the review process, and ends
with the publication of the paper. The role of the authors, in the current publi-
cation cycle, finishes when the paper is accepted for publication. The publisher
is the second actor, as he makes the paper available and accessible to potential
readers. The third actor is not directly active in the creation of the paper but is
still very important, as it provides the author with the infrastructure needed to per-
form the experiments to be included in the paper. The third actor is usually the
institution to which the author belongs at the time he is writing his paper. After
the publication of the paper, maintaining the infrastructure needed to reproduce
scientific experiments is not the primary interest of research institutions. A very
important question is then posed: which actor will take the role of providing
the needed logistics to keep the EP alive? We believe that the publisher is the
only actor capable of taking over this task. However, providing a service
that allows a reader to re-run experiments is completely different from provid-
ing a service that just gives access to a digital version of the paper. In this case
the publisher will have to maintain a rather complex computing and storage
infrastructure that might be beyond the scope of the publisher's actual interests
and expertise. Outsourcing this task to a specialized computing service provider
might be a possible solution, where Service Level Agreements (SLAs) play a vi-
tal role in maintaining an EP and re-running experiments in a timely fashion
so as to maintain an acceptable reader experience. We will develop this
solution further in the rest of the paper.
Fig. 1. Lifecycle of an EP (submission, comments, submission of revision, acceptance for publication): experiment results trigger the writing of scientific papers; it is thus important that readers of these papers are able to explore and, if needed, re-execute these experiments.
There are a couple of everyday scenarios in science where the concept of the executable
paper is indeed needed. The first one is the review process of scientific publications:
reviewers selected by conference organizers and publishers to assess the qual-
ity of newly submitted papers often have to verify the published results. For that, they
need to trace back the path to the initial data or to verify parameters used in a specific
step of the proposed method, and in certain cases even re-run part of the experiment.
The second most common scenario occurs while scientists are reading an
already published paper. Often they are interested in reusing part of the pub-
lished results, whether these results are algorithms, methods or tools. Currently
this is done by contacting the authors and trying to get the needed information, but
often the authors are not reachable or their current research topics are different
from the one published in the paper.
From these exploration scenarios, we can identify the actors active during the
various phases of the lifecycle of the executable paper (Table 1).
With the emergence of reliable virtualization technologies, which are capa-
ble of hiding the intricacies of complex infrastructures, publishers can offer more
than just static access to scientific publications [5,4]. The reader of a published
scientific publication should be able to re-execute part of the experiment. Figure
2 illustrates the interactions between the various entities in the EP scenario. SLAs
between readers and the publisher exist which define a certain QoS expected by the
reader, such as the maximum time for re-running experiments. Readers are often
affiliated with institutions, for which an SLA between the institution and the pub-
lisher could exist. The publisher manages a set of SLAs with service providers
for outsourcing the re-execution of the experiment. Since experiments vary in
complexity, the SLAs would define which provider is capable of executing the
experiment within the QoS parameters.
Fig. 2. Interaction of the various entities during the lifecycle of an EP. In (1) the author
creates an EP, (2) the reviewer reviews the paper and possibly re-runs the experiment,
(3) a reader reads the EP after publication and can also re-run the experiment,
(4) the publisher, upon request from the reviewer or reader, can outsource the
execution of the experiment. Depending on the SLA between the reviewer or reader and the
publisher, the publisher can choose amongst a set of SLAs to pick the best service
provider which can deliver the QoS requested by the reviewer or reader.
Table 1. Main actors involved in the realization and lifecycle of an executable paper
5 Discussions
Any solution for the EP has to be intuitive and should not add much further
burden on the actors involved in the EP lifecycle. A number of tools and services
the architecture; i.e., that the data, QoS, and outage requirements between the cus-
tomer and the service provider are fulfilled or the SLA consequence occurs. In
Table 2, we identify the steps needed to decide whether to publish the paper in an
executable form. These steps describe the interaction between two actors: the publisher and
the authors. The publisher initiates this use case after the paper has been ac-
cepted for publication. Not all papers can be published as executable papers,
because they are either very expensive to reproduce, need special hardware or
software (intellectual property issues), or require access to private data that is
not likely to be provided (privacy issues). In Table 3, we identify the steps needed
to execute an executable paper. These steps describe the interaction between two
actors: the publisher and the provider of the computing infrastructure. This use
case is initiated by a scientist who wants to re-execute a published experiment.
6 Conclusion
We have identified a number of challenges facing the implementation of the EP con-
cept and have classified them into three categories: technical, administrative, and
intellectual property. In this position paper we have described one approach
to address the technical challenges and identified the role of each actor in-
volved in the lifecycle of an EP. Among the other issues stressed in this paper is
the issue of provisioning the needed infrastructure when the EP is published;
we pointed out a technique that can help to solve this problem, namely the use
of new virtualization techniques to provide a working environment for the pub-
lished experiments. We discussed the feasibility of this technique and described
two scenarios related to the operational aspects associated with the deployment
of an executable paper service and the role of each actor throughout the exe-
cutable paper lifecycle.
We believe that the publisher can play a key role in implementing the EP concept;
in our proposal the publisher does not have to develop in-house the expertise to
References
1. Markup Languages, http://en.wikipedia.org/wiki/Markup_language
2. Jones, J.: Tracking Software Assets on Virtual Images Gains Momentum for Soft-
ware Asset Management Professionals,
http://blogs.flexerasoftware.com/elo/2010/07/
tracking-software-assets-on-virtual-images-gains-
momentum-for-software-asset-management-professional.html
3. Basant, N.S.: Top 10 Cloud Computing Service Providers of 2009 (2009),
http://www.techno-pulse.com/2009/12/
top-cloud-computing-service-providers.html
4. Hey, B.: Cloud Computing. Communications of the ACM (51) (2008)
5. Armbrust, M., et al.: A View of Cloud Computing. Communications of the ACM 53(4)
(2010)
6. Strijkers, R.J., Cushing, R., Vasyunin, D., Belloum, A.S.Z., de Laat, C.,
Meijer, R.J.: Toward Executable Scientific Publications. In: ICCS 2011, Singa-
pore’s Nanyang Technological University, June 1–3 (2011)
7. Limare, N., Morel, J.M.: The IPOL Initiative: Publishing and Testing Algorithms
on Line for Reproducible Research in Image Processing. In: ICCS 2011, Singapore’s
Nanyang Technological University, June 1–3 (2011)
8. Kauppinen, T.J., Mira de Espindola, G.: Linked Open Science - Communicating,
Sharing and Evaluating Data, Methods and Results for Executable Papers. In:
ICCS 2011, Singapore’s Nanyang Technological University, June 1–3 (2011)
9. McHenry, K., Ondrejcek, M., Marini, L., Kooper, R., Bajcsy, P.: Towards a Univer-
sal Viewer for Digital Content. In: ICCS 2011, Singapore’s Nanyang Technological
University, June 1–3 (2011)
10. Strijkers, R., et al.: AMOS: Using the Cloud for On-Demand Execution of e-Science
Applications. In: IEEE e-Science 2010 Conference, December 7–10 (2010)
11. Kertesz, A., et al.: An SLA-based resource virtualization approach for on-demand
service provision. In: Proceedings of the 3rd ACM International Workshop on Vir-
tualization Technologies in Distributed Computing (2009)
12. Belloum, A., Inda, M.A., Vasunin, D., Korkhov, V., Zhao, Z., Rauwerda, H., Breit,
T.M., Bubak, M., Hertzberger, L.O.: Collaborative e-Science Experiments and Sci-
entific Workflows. In: IEEE Internet Computing (August 2010)
Change Management in e-Infrastructures
to Support Service Level Agreements
[Figure: a collaborative process composed of the private processes of the participating organizations (start/end states A, A') – diagram labels only]
In this paper we will give an overview of our approach to address the chal-
lenges in the area of inter-organizational change management (ioCM). Our goal
was to design an ioCM process that can be adopted by all partners of PRACE,
a persistent European eIS. Thus, our concept incorporates extensions of well-
established best practices frameworks like ITIL for inter-organizational use and
adapts collaborative standards in the modeling field. In the following section
PRACE, a European eIS project, is introduced. Before presenting the major
concept areas of the proposed management process, we will give a brief overview
of related work in the area of ioCM. For the process design we have adapted the
UN/CEFACT Modeling Methodology (UMM), developed by the UN/CEFACT
(United Nations Center for Trade Facilitation and Electronic Business) to sup-
port the development of inter-organizational business processes [17]. We conclude
with an overview of our future plans in section 5.
[Figure: researchers and research groups served by an e-Infrastructure built on power, network, software and storage resources – diagram labels only]
3 Related Work
While many articles covering the area of change management are available,
hardly any related work addresses change management in eIS. In [16] there is a
discussion of inter-organizational change management in publicly funded projects,
but in this article the authors mainly focus on sociological aspects like the
need for communication between the public and the participating project part-
ners. Aspects of inter-organizational ITSM are not considered in this paper.
Also in [11] communication is identified as one vital concern in e-Government
projects. Within their analysis the authors concentrate on the structures that
[UMM model excerpts – diagram labels only: a use case diagram containing the «BusinessCollaboration» inter-organizational Change-Process; a class diagram of the BIE library with the «ABIE» classes SLA_Contract, IT_Service, Request-for-Change_Document, Availability_Parameter and Maintenance-Announcement_Service and their «BBIE» attributes (e.g. Identification, Description, Change-Category, Change-Creation, Change-Review-Status); and an activity diagram of the PRACE change process exchanging Change Announcement Messages.]
5 Summary
In this article we have presented a framework for inter-organizational change
management and described an application scenario based on an international
e-Infrastructure (eIS) project. The goal of change management is to establish
mechanisms for coordination of activities for maintenance of existing and imple-
mentation of new services in an eIS. Change management provides means for
exchange of information about planned, ongoing and completed changes that
affect availability of eIS components and thus is essential for successful Service
Level Management (SLM). In the majority of eIS providing services to the sci-
entific community the areas of SLM and change management still receive very
little attention. However, since eIS projects are becoming mature in their service
offering, the overall ITSM needs to be professionalized.
To address the challenge we have applied standards, both from the modeling
and the ITSM fields, to our problem domain. The selected standards include the
UMM modeling method, originally developed for B2B environments and adapted
to inter-organizational provider networks, and the ITIL process framework. This
methodology has a number of advantages. International, well-established stan-
dards can be applied to the design of both the intra- and inter-organizational
ITSM processes. Models that result from this approach can be easily shared and
applied by all partners within an eIS, which we will demonstrate in the future by
implementing a model repository accessible to all eIS partners. Having defined
the design concepts, we are going to implement them in the PRACE environment
described in our case study. In the following stages of our work we intend
to implement our framework in other eIS projects we are involved in. Within
this article we have focused on the operational process of change management.
Even though, at present, not every collaborating partner within the eIS
project has implemented basic ITIL processes, we think that there is a high
potential for standardization, which we will present in [8] based on an analysis
of the ITIL adoption rate of three different eIS projects.
References
1. Andonoff, E., Bouaziz, W., Hanachi, C.: Protocol management systems as a
middleware for inter-organizational workflow coordination. IJCSA 4(2), 23–41
(2007)
Michael Gerndt
Foreword
The PROPER workshop addresses the need for productivity and performance
in high performance computing. Productivity is an important objective during
the development phase of HPC applications and their later production phase.
Paying attention to the performance is important to achieve efficient usage of
HPC machines. At the same time it is needed for scalability, which is crucial in
two ways: Firstly, to use higher degrees of parallelism to reduce the wall clock
time. And secondly, to cope with the next bigger problem, which requires more
CPUs, memory, etc. to be able to compute it at all.
Tool support for the user is essential for productivity and performance. There-
fore, the workshop covers tools and approaches for parallel program development
and analysis, debugging and correctness checking, and for performance measure-
ment and evaluation. Furthermore, it provides an opportunity to report success-
ful optimization strategies with respect to scalability and performance.
This year’s contributions reflect this spectrum nicely. The invited presentation
by Mitsuhisa Sato about Challenges of programming environment and tools for
peta-scale computers (programming environment researches for the K computer)
takes place during the first session ”Programming Interfaces”, chaired by Felix
Wolf. The second session is about ”Performance Analysis Tools” and guided by
Michael Gerndt. The topic of the last session is ”Performance Tuning” and the
chair is Allen Malony.
We would like to thank all the authors for their very interesting contribu-
tions and their presentations during the workshop. In addition, we thank all
the reviewers for reading and evaluating the submitted papers.
Furthermore, we would like to thank the Euro-Par 2011 organizers for their
support and for the chance to offer the PROPER workshop in conjunction with
this attractive conference. We are most grateful for all the administrative work
of Petra Piochacz. Without her help the workshop would not have been possible.
The PROPER workshop was initiated and is supported by the Virtual In-
stitute - High Productivity Supercomputing (VI-HPS), an initiative to promote
the development and integration of HPC programming tools.
September 2011
Michael Gerndt, Workshop Chair
Scout: A Source-to-Source Transformator
for SIMD-Optimizations
1 Introduction
Most modern CPUs provide SIMD units in order to support data-level paral-
lelism. One important method of using that kind of parallelism is the vectoriza-
tion of loops. However, programming using SIMD instructions is not a simple
task. SIMD instructions are assembly-like low-level intrinsics and often steps like
finalization computations after a vectorized loop become necessary. Thus tools
are needed in order to efficiently exploit the data-level parallelism provided by
modern CPUs.
In the context of the HI-CFD project [4] we needed a means to comfortably
vectorize loops written in C. We are targeting various HPC platforms with
different instruction sets and different available compilers.
2 Related Tools
are available to vectorize various forms of codes, especially loops [7]. However in
practice it is not possible to always reason about the absence of dependencies
(e.g. in a loop with indirect indexing). Thus means are needed in order to provide
meta information about a particular piece of code. For instance the Intel compiler
allows a programmer to augment loop statements with pragmas to designate the
absence of inner-loop dependencies.
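For readers unfamiliar with such hints, the following sketch shows the typical usage pattern of the Intel compiler's ivdep pragma, which tells the compiler to ignore assumed (unproven) dependencies between iterations; the function itself is an illustrative example, not one of the loops discussed in this paper.

/* Illustrative example: the compiler cannot prove that x and y do not
   alias, so without a hint it may refuse to vectorize this loop. The
   Intel compiler's ivdep pragma asserts that the assumed dependencies
   between iterations do not occur. */
void saxpy(int n, float a, const float *x, float *y)
{
#pragma ivdep
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}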
We have tested some compilers with respect to their auto-vectorization ca-
pabilities. For some loops in our codes the available means to provide meta in-
formation were insufficient (see Sect. 3.3). Sometimes subtle issues arose around
compiler-generated vectorization. For instance in one case a compiler suddenly
rejected the vectorization of a particular loop just when we changed the type of
the loop index variable from unsigned int to signed int. A compiler expert
can often reason about such subtleties and can even dig into the documentation
for a solution. But an application programmer normally concentrates on the
algorithms and cannot put too much effort into the peculiarities of each compiler
used. The vectorization of certain (often more complex) loops was rejected by
all compilers regardless of inserted pragmas, given command-line options, and so on.
We have checked other tools specifically targeting loop vectorization. In [6]
a retargetable back-end extension of a compiler generation system is described.
Being retargetable is an interesting property (see also Sect. 3.2) but for our
project it did not come into consideration due to its tight coupling to a particular
compiler system. SWARP [9] seems to depend solely on a dependency analysis
– something we could not rely on.
3.1 Unroll-and-Jam
Various approaches to vectorize loops exist. Traditional loop-based vectorization
transforms a loop so that every statement processes a possibly variable-length
vector [5]. With the advent of the so-called multimedia extensions in commodity
processors the unroll-and-jam approach became more important [8]. In [7] this ap-
proach is described mainly as a means to resolve inner-loop dependencies. However,
we use this approach in a more general way. First, we partially unroll each state-
ment in the loop according to the vector size. Then we test whether the unrolled
statements can be merged into a vectorized statement. Unvectorizable statements
(e.g. if-statements including their bodies) remain unrolled. Only their memory ref-
erences to vectorized variables are adjusted accordingly. All other statements are
vectorized by decomposing them into vectorizable expressions. Scout allows the
user to vectorize arbitrarily complex expressions (see Sect. 3.2).
A nice consequence of using the unroll-and-jam approach is the possibility
to vectorize different data types (e.g. float and double) in one loop simulta-
neously. The vector sizes of vectorized data types may differ, but the largest
vector size has to be a multiple of all other used vector sizes. The loop is then
unrolled according to that largest vector size and vectorizeable statements of
other data types are then only partially merged together and remain partially
unrolled.
Listing 1 demonstrates the vectorization of different data types for a SSE
platform. The vector size for float is 4 and for double it is 2. Hence the loop is
unrolled four times. Then all operations for float values can be merged together
(in the example only the load/store operations). In contrast only two unrolled
consecutive operations for double values (one load and the division) are merged
to a vectorized operation leaving the double operations partially unrolled. Vec-
torized conversion operations are generated automatically whenever needed.
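Listing 1 is not reproduced in this extract. The following hand-written sketch merely illustrates the described effect for an SSE target: the source loop mixes a double division with a float store; after unrolling by the float vector size (four), the float store is merged into one four-wide operation while the double load and division remain partially unrolled as two two-wide operations, with conversions inserted in between. The sketch assumes n is a multiple of four and uses unaligned accesses; it is not Scout's actual output.

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Original scalar loop: a double division whose result is stored as float. */
void scale(int n, const double *d, double s, float *f)
{
    for (int i = 0; i < n; ++i)
        f[i] = (float)(d[i] / s);
}

/* Simplified hand-vectorized counterpart (assumes n % 4 == 0). */
void scale_vec(int n, const double *d, double s, float *f)
{
    __m128d vs = _mm_set1_pd(s);
    for (int i = 0; i < n; i += 4) {
        /* double part: two 2-wide loads and divisions (partially unrolled) */
        __m128d q0 = _mm_div_pd(_mm_loadu_pd(d + i),     vs);
        __m128d q1 = _mm_div_pd(_mm_loadu_pd(d + i + 2), vs);
        /* convert each 2-wide double result to float ... */
        __m128 lo = _mm_cvtpd_ps(q0);
        __m128 hi = _mm_cvtpd_ps(q1);
        /* ... and merge into a single 4-wide float store */
        _mm_storeu_ps(f + i, _mm_movelh_ps(lo, hi));
    }
}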
namespace scout {
} // namespace scout
For each supported data type the configuration provides a specialized class
template named config placed in the namespace scout. The first template
parameter denotes the underlying base type of the particular vector instruction
set. The second integral template parameter denotes the vector size of that set.
A set of predefined type names, value names and static member functions are
expected as class members of the specialization.
There are two general kinds of static member functions. If the function name
is predefined by Scout, then the function body consists of only one statement –
the string literal denoting the intrinsic. Load and store operations are defined in
this way.
If the function name of a static member function is not predefined, then
the string literal in the function body is preceded by an arbitrary number of
expressions and/or function declarations. In that case, expressions and function
calls in the original source code are matched against these configuration expres-
sions and functions and are vectorized according to the string literal if they fit.
This option adds great flexibility to Scout. Indeed, it is not only possible to use
the various instruction sets in their atomic shape but also to combine them into more
complex or idiomatic expressions a priori.
Listing 3 demonstrates the vectorization capabilities of Scout by using the
condition_lt and sqrt functions of Listing 2.
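Since Listings 2 and 3 are not reproduced in this extract, the fragment below sketches what a configuration specialization along the lines described above might look like for SSE single precision. Apart from condition_lt and sqrt, which are mentioned in the text, the member names and the exact syntax Scout expects are guesses and will differ from the real configuration files.

#include <xmmintrin.h>   // SSE intrinsic types
#include <math.h>

namespace scout {

// Illustrative sketch only: a specialization for base type float with
// vector size 4 (SSE). Predefined functions such as load/store contain a
// single string literal naming the intrinsic to emit; user-defined patterns
// such as sqrt and condition_lt first list the expressions to match in the
// original source and then the intrinsic emitted on a match.
template <typename Base, int Size> class config;

template <> class config<float, 4> {
public:
    typedef __m128 vector_type;                          // SIMD register type

    static void load()  { "_mm_loadu_ps"; }              // predefined: load
    static void store() { "_mm_storeu_ps"; }             // predefined: store

    static void sqrt(float x)                  { sqrtf(x); "_mm_sqrt_ps"; }
    static void condition_lt(float a, float b) { a < b;    "_mm_cmplt_ps"; }
};

} // namespace scout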
Most loops in our codes follow very basic schemes: they read data from several
arrays, do some heavy calculations and then either write or accumulate the re-
sult in a different array. Hence, under the reasonable assumption that there
are no pointer aliasing issues, pure writes normally do not introduce any depen-
dencies. Accumulation operations, however, involve a read and a write operation to
the same memory location and hence can introduce dependencies, especially if
indirect indexing is involved. Such dependencies could prevent whole loops
from being vectorized. But actually most of the calculation can be performed in
parallel; just the accumulation process itself needs to remain serial. Thus we in-
troduced a pragma directive forcing a statement to compute each vector element
separately (Listing 4).
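The pragma name below is hypothetical (the actual Scout directive and Listing 4 are not shown in this extract); the sketch only illustrates the situation described above: the arithmetic can be vectorized, while the indirectly indexed accumulation statement is marked so that its vector elements are computed one after another.

/* Illustrative only -- the directive name is made up for this sketch. */
void accumulate(int n, const int *idx, const double *a, const double *b,
                double *res)
{
    for (int i = 0; i < n; ++i) {
        double v = a[i] * a[i] + b[i] * b[i];   /* vectorizable computation */
#pragma scout expand                            /* hypothetical: element-wise */
        res[idx[i]] += v;                       /* accumulation stays serial  */
    }
}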
4 Practical Results
Beside the usual test cases we have applied Scout to two different CFD pro-
duction codes used in the German Aerospace Center. Both codes are written
in C using the usual array-of-structure approach. That approach is rather un-
friendly with respect to vectorization, because vector load and store operations
have to be composite. Nevertheless we did not change the data layout but used
the source code as is, only augmented with the necessary Scout pragmas. The
presented measurements were mainly done on an Intel Core 2 Duo P8600
processor with a clock rate of 2.4 GHz, operating under Windows 7 and using the
Intel compiler version 11.1. The AVX measurements were done on an Intel
Sandy Bridge processor, using the Intel compiler version 12.
The first code computes interior flows in order to simulate the behavior of jet
turbines. In the loops, direct indexing is used, meaning that array indices are linearly
transformed loop indices. We have split the code into four computation kernels
and present the split results for a better understanding of the overall and
detailed speedup in Fig. 1. It shows typical speedup factors of the vectorized
kernels produced by Scout compared to the originals.
As expected, we gain more speedup with more vector lanes, since more com-
putations can be executed in parallel. Kernel 2 even outperforms its theoretical
maximum speedup, which is a result of the other transformations (in particular
function inlining) performed by Scout implicitly.
Table 1 shows the effects of AVX on the performance of a complete run. The
first row shows the average time of one run including the computation kernels
and some framework activity. Naturally, this measurement method reduces the
overall speedup gained due to the vectorization but leads to very realistic
results. After all, the application of Scout automatically reduces the runtime by
about 10%. We expected a much better speedup when stepping up from SSE4 to
AVX, because the vector register size has doubled with AVX.
However, the additional gains were rather negligible. The second row shows
the main reason for this behavior. The CPI metric (Clockticks per Instructions
Fig. 1. Speedup of CFD kernels on Intel Core 2 Duo due to the vectorization (left side:
single precision, four vector lanes; right side: double precision, two vector lanes). The
curves show the speedup of kernels 1–4 and the total speedup over problem sizes from 30 to 48.
Retired) is an indication of how much latency affected the execution. Higher CPI
values mean there is more latency. In our case the latency is caused mainly by
cache misses. This comes as no surprise, because with a doubled vector size
a doubled amount of data also gets pumped through the processor during one
loop iteration. Even though this effect is well documented [2], a CPI value of 2.0 still
means there is a lot of room for improvement. In Sect. 6 we outline a possible
approach to address that issue.
The second CFD code computes flows around an airplane. Unlike the other
code it works over unstructured grids. That is, the loops mostly use indirect
indexing to access array data elements. Most loops in that kernel could only be
partially vectorized (see Sect. 3.3). Nevertheless we could achieve some speedup,
as shown in Table 2. We had two different grids at our disposal as input data.
Table 2. Speedup of a partially vectorized CFD kernel on Intel Core 2 Duo (double
precision, two vector lanes)
First we vectorized the original code. However, the gained speedup of about 1.1
was not satisfying. Then we merged some loops inside the kernel together to
remove the repeated traversal over the indirect data structures. This made the code
more compute-bound and resulted in a much better acceleration of about 1.4
just due to the vectorization. Eventually the overall speedup was nearly 1.5.
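The kind of loop merging mentioned above can be pictured as follows; this is an illustrative sketch, not the actual DLR kernel, and it assumes the temporary array is not needed elsewhere.

/* Before: two loops, each traversing the indirect index array. */
void before(int n, const int *idx, const double *a, const double *b,
            double s, double *tmp, double *out)
{
    for (int i = 0; i < n; ++i) tmp[i] = a[idx[i]] * s;
    for (int i = 0; i < n; ++i) out[i] += tmp[i] + b[idx[i]];
}

/* After fusion: one traversal of idx[], more arithmetic per loaded element,
   no temporary array -- the loop becomes more compute-bound. */
void fused(int n, const int *idx, const double *a, const double *b,
           double s, double *out)
{
    for (int i = 0; i < n; ++i) out[i] += a[idx[i]] * s + b[idx[i]];
}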
6 Future Work
While the achieved acceleration presented in this paper was already rather good,
it was not as exciting as one would expect given the number of available vector
lanes. Of course Amdahl's law plays a rather large role in our results. We did
not change the data layout and thus had to live with composite load and store
operations. That in turn leads to a smaller parallel portion of code and hence
less speedup.
But the presented AVX results, especially the rise of the CPI value, indicate
memory accesses as another major obstacle to performant SIMD code. Actu-
ally, compute-bound code often becomes memory-bound due to vectorization. Of
course, the cache pressure can be reduced by a carefully hand-crafted data lay-
out. But the cache size is a hard limit, and even hand-crafting is sometimes
not worth the rather huge effort. Thus, in order to regain a load balance be-
tween memory and computation, we will explore the energy-saving possibilities
of memory-bound computations.
Acknowledgments. This work has been funded by the German Federal Min-
istry of Education and Research within the national research project HI-CFD (01
IH 08012 C) [4].
References
1. clang: a C language family frontend for LLVM, http://clang.llvm.org (visited
on March 26, 2010)
2. Intel VTune Performance Analyzer Basics: What is CPI and how do I use it?
http://software.intel.com/en-us/articles/intel-
vtune-performance-analyzer-basics-what-is-cpi-and-how-do-i-use-it/
(visited on June 6, 2011)
3. Loop unswitching, http://en.wikipedia.org/wiki/Loop_unswitching (visited
on July 19, 2011)
4. HICFD - Highly Efficient Implementation of CFD Codes for HPC Many-Core
Architectures (2009), http://www.hicfd.de (visited on March 26, 2010)
5. Allen, R., Kennedy, K.: Automatic translation of fortran programs to vector form.
ACM Trans. Program. Lang. Syst. 9, 491–542 (1987),
http://doi.acm.org/10.1145/29873.29875
6. Hohenauer, M., Engel, F., Leupers, R., Ascheid, G., Meyr, H.: A SIMD optimiza-
tion framework for retargetable compilers. ACM Trans. Archit. Code Optim. 6(1),
1–27 (2009)
7. Kennedy, K., Allen, J.R.: Optimizing compilers for modern architectures: a
dependence-based approach. Morgan Kaufmann Publishers Inc., San Francisco
(2002)
8. Larsen, S., Amarasinghe, S.: Exploiting superword level parallelism with multi-
media instruction sets. In: Proceedings of the ACM SIGPLAN 2000 Conference
on Programming Language Design and Implementation, PLDI 2000, pp. 145–156.
ACM, New York (2000), http://doi.acm.org/10.1145/349299.349320
9. Pokam, G., Bihan, S., Simonnet, J., Bodin, F.: SWARP: a retargetable preprocessor
for multimedia instructions. Concurr. Comput.: Pract. Exper. 16(2-3), 303–318
(2004)
10. Schöne, R., Hackenberg, D.: On-line analysis of hardware performance events for
workload characterization and processor frequency scaling decisions. In: Proceed-
ing of the Second Joint WOSP/SIPEW International Conference on Performance
Engineering, ICPE 2011, pp. 481–486. ACM, New York (2011),
http://doi.acm.org/10.1145/1958746.1958819
Scalable Automatic Performance Analysis
on IBM BlueGene/P Systems
1 Introduction
Traditional supercomputer design, which relies on the high single-core perfor-
mance delivered by high clock frequencies, has a natural scalability limit coming from
unaffordable power consumption and cooling requirements. The BlueGene [1]
developers addressed this challenge from two aspects: by utilizing moderate-
frequency cores and by tightly coupling them at unprecedented scales, which
allows power consumption to grow linearly with the number of cores. This leads
to a high-density, low-power, massively parallel system design.
Unfortunately the peak performance offered by modern supercomputers can-
not be achieved by straightforward application porting; one has to in-
vest significant effort in achieving reasonable execution efficiency. In order to
make these efforts affordable, new instruments supporting application develop-
ment have to be developed. This is especially true for performance analysis tools.
On the one hand, the performance analysis results of small runs often cannot be
extrapolated to the desired number of cores due to new performance phenomena
This work is partially funded by BMBF under the ISAR project, grant 01IH08005A
and the SILC project, grant 6.
manifesting itself only at large scales. On the other hand the amount of raw per-
formance data which has to be recorded for large real-world applications running
on hundreds of thousands cores is simply too big for the commodity evaluation
approaches. The way performance analysis is done has to be rethought as well as
other aspects of extremely parallel computing. Among the challenges to be over-
come are efficient recording, storing, analysis and visualization of the discovered
results.
Periscope [4], being an automatic distributed on-line performance analysis tool, addresses the challenges of large-scale performance analysis from multiple angles. The distributed architecture of Periscope allows it to scale together with the application by relying on multiple agents, while on-line analysis of the profile-based raw performance data significantly reduces memory requirements. However, even then the amount of performance data collected for a large-scale run is big enough to overwhelm the user with too much information. Periscope addresses this issue in two ways. First, the automatic search for performance inefficiencies dramatically decreases the amount of presented results by reporting only important potential tuning opportunities. Second, a scalable reduction based on clustering algorithms keeps the amount of reported results constant, independent of the growing number of cores.
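To make the reduction idea concrete, the following is a minimal sketch, in Python, of how per-process results could be collapsed to a constant-size report with a simple one-dimensional clustering; the function names and the use of k-means are our illustration under stated assumptions, not Periscope's actual algorithm.

# Illustrative sketch (not Periscope's code): reduce per-process property
# severities to a fixed number of clusters so the report size stays constant
# regardless of how many processes contributed measurements.
def reduce_severities(severities, num_clusters=8, iterations=10):
    """Simple 1-D k-means over per-process severity values."""
    lo, hi = min(severities), max(severities)
    # initialize cluster centers evenly over the observed value range
    centers = [lo + (hi - lo) * i / max(num_clusters - 1, 1)
               for i in range(num_clusters)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for s in severities:
            nearest = min(range(len(centers)), key=lambda i: abs(s - centers[i]))
            groups[nearest].append(s)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    # report one representative entry (mean severity, member count) per cluster
    return [(sum(g) / len(g), len(g)) for g in groups if g]

# Example: 100,000 processes collapse to at most 8 reported entries.
report = reduce_severities([0.01 * (i % 37) for i in range(100000)])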
Historically, the development of Periscope was based on the architecture of commodity clusters, where the maximum scalability levels were considered to be on the order of tens of thousands of cores running standard Unix-like kernels. Porting Periscope to a new cluster was therefore a matter of minor adjustments, and the overall architecture was preserved. However, with the introduction of BlueGene/P systems it became clear that the straightforward porting approach would not work. The two main reasons were an order-of-magnitude increase in the number of cores and the limited operating system functionality. In order to adapt Periscope to the challenges posed by BlueGene/P, significant improvements to the tool's architecture were developed.
The rest of the paper is organized as follows: first we describe the architectural specifics of BG/P as well as the analysis model and architecture of Periscope. From the cross-analysis we derive three promising approaches, one of which was implemented and is discussed in more detail. Alternative tools are discussed in the related work section. In the evaluation section we apply Periscope to the NAS Parallel Benchmark running with 64k processors to demonstrate the achieved scalability levels.
2 BlueGene/P Architecture
The base component of BlueGene/P [1] is a PowerPC 450 quad-core 32-bit microprocessor with a frequency of 850 MHz. One quad-core chip together with 2 or 4 GB of shared memory forms the next building block of BlueGene, the compute node ASIC. The compute nodes run the IBM proprietary light-weight Compute Node Kernel (CNK) and are dedicated to running MPI/hybrid applications exclusively. CNK is, on one hand, stripped down in order to minimize the system overheads when executing an application and, on the other hand, appears
3 Periscope Design
Periscope [4] is an automatic distributed on-line performance analysis system
developed by Technische Universität München (TUM) at the Chair of Computer
processes publish their network address and identity-tag pairs and also look up the addresses of their communication partners at the registry service.
other agents and application processes, would allow us to drop the commodity registry service. However, this would not remove the bottleneck associated with the ports of thousands of application processes being published and then queried at the same time.
After successful registration, the FE computes the agent hierarchy; this is much simplified since all agents are started at once with a single MPI_Spawn command. The hierarchy, in this case, is determined according to the fan-out of the HL agents and the number of leaf agents, which is proportional to the number of application processes.
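As a rough illustration of how such a hierarchy could be sized, the sketch below derives the number of agents per level from the process count, the processes handled per leaf agent, and the HL fan-out; the concrete numbers and names are our assumptions for illustration only.

import math

# Illustrative sketch: size a Periscope-like agent tree from the number of
# application processes, the processes handled per leaf (analysis) agent,
# and the fan-out of the high-level (HL) agents.
def agent_tree(num_procs, procs_per_leaf=64, fanout=16):
    leaves = math.ceil(num_procs / procs_per_leaf)
    levels = [leaves]
    while levels[-1] > 1:
        levels.append(math.ceil(levels[-1] / fanout))  # one HL agent per 'fanout' children
    return levels  # [leaf agents, HL agents per level ..., topmost agent]

print(agent_tree(65536))  # 65536 processes -> [1024, 64, 4, 1]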
After setting up the agent hierarchy, the application would be started by the FE, and the agents connect to the application processes and execute the analysis as before. This design would also support the restart of the application, which Periscope performs if the application has terminated but additional search steps are required.
However, several severe drawbacks of this design were identified. First, the collective network of the BlueGene cannot be properly utilized when the application is running in a sub-communicator of MPI_COMM_WORLD, which is the case when the application is started by Periscope in MPMD mode. Second, as mentioned before, the bottleneck of publishing and querying every application process and agent still significantly impacts the efficiency of Periscope's analysis at large scale. Finally, the required complete reimplementation of Periscope's communication substrate on top of MPI, as well as the need to port the AA to run under CNK, would require a significant amount of programming effort.
The comparison table shows that the design relying on the MPMD functionality of the MPI 2.1 standard is the least preferable, failing to meet the majority of the selection criteria.
The design approach utilizing an HTC partition to run Periscope agents requires little porting and maintenance effort, but suffers from the fact that booting an HTC partition is a privileged operation. In addition, system utilization is worse, since additional cores are required to run the agents.
The best match with the selection criteria is the I/O node agent placement design, which was therefore chosen for implementation. This approach features the best system utilization, since it does not require any additional compute nodes to run Periscope's agents. Instead it runs them on the I/O nodes, which are not intended for computation by design. However, the effort to port Periscope following the described design is considered to be moderate. In order to prove the selected concept quickly and minimize the associated porting risks, it was decided to split the porting effort into two phases. Within the first phase, the idea of running the AAs on the I/O nodes and the application processes on the affiliated compute nodes was evaluated and found to be a low-effort task. The other agents are intended to run on the front-end node of BlueGene/P in this phase. The majority of the effort, though, comes from the task of merging the functionality of the AA and the HL agent in order to run them within the single user process allowed on the I/O nodes. This task was therefore assigned to phase two, which will deliver the optimal tool distribution and the capability to operate at the full-scale 72-rack BlueGene/P.
5 Evaluation
The phase-one porting task was implemented and Periscope was installed on the IBM BlueGene/P supercomputer operated by King Abdullah University of Science and Technology (KAUST). The machine consists of 16 racks containing in total 65536 IBM PowerPC 450 cores delivering 222 TFlops of peak performance.
In order to prove the scalability of the new Periscope design, a large-scale performance analysis run was carried out on the standard BT benchmark from the NAS Parallel Benchmark suite [9]. The benchmark is a block tridiagonal solver for a synthetic system of nonlinear PDEs. The benchmark was built to solve the class E problem size, which corresponds to a 1020x1020x1020 grid. The MPI call sites
6 Related Works
There are only a few performance analysis tools available on BlueGene/P, and even fewer of them are specifically designed for large scales. SCALASCA [8], being one of them, is an open-source performance analysis toolset specifically designed for the evaluation of codes running on hundreds of thousands of processors. The tool performs a parallel trace analysis searching for MPI bottlenecks, which allows it to scalably handle trace sizes that increase linearly with the number of cores. However, it was found that the time spent for the analysis as well as the report size grow linearly with the employed parallelism scale. In contrast, Periscope performs an on-line profile-based search, thus omitting tracing. Also, on-line reduction allows it to keep the report size independent of the number of processes.
References
1. IBM BlueGene team: Overview of the IBM BlueGene/P project. IBM Journal of Research and Development 52(1/2), 199–220 (2008)
2. Sosa, C., Knudson, B.: IBM System Blue Gene Solution: Blue Gene/P Application
Development. International Technical Support Organization, 4th edn. (August 2009)
3. DelSignore, J.: TotalView on Blue Gene/L. Presented at the Blue Gene/L: Applications, Architecture and Software Workshop,
http://www.llnl.gov/asci/platforms/bluegene/papers/26delsignore.pdf
4. Gerndt, M., Fürlinger, K., Kereku, E.: Advanced techniques for performance anal-
ysis. NIC, vol. 33, pp. 15–26 (2006)
5. Benedict, S., Brehm, M., Gerndt, M., Guillen, C., Hesse, W., Petkov, V.: Automatic
Performance Analysis of Large Scale Simulations. In: Lin, H.-X., Alexander, M.,
Forsell, M., Knüpfer, A., Prodan, R., Sousa, L., Streit, A. (eds.) Euro-Par 2009.
LNCS, vol. 6043, pp. 199–207. Springer, Heidelberg (2010)
6. Gerndt, M., Strohhäcker, S.: Distribution of Periscope analysis agents on ALTIX
4700. In: Proceedings of the International Conference on the Parallel Computing
(ParCo 2007). Advances in Parallel Computing, vol. 15, pp. 113–120. IOS Press
(2007)
7. Fahringer, T., Gerndt, M., Riley, G., Träff, J.: Knowledge specification for automatic
performance analysis. APART Technical Report (2001),
http://www.fz-juelich.de/apart
8. Wylie, B.J.N., Bohme, D., Mohr, B., Szebenyi, Z., Wolf, F.: Performance analysis of Sweep3D on Blue Gene/P with the Scalasca toolset. In: IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW 2010), pp. 1–8. IEEE (2010)
9. NPB: NAS Parallel Benchmarks, http://www.nas.nasa.gov/resources/software/npb.html
An Approach to Creating Performance
Visualizations in a Parallel Profile Analysis Tool
1 Introduction
The performance measurement and analysis of large-scale parallel applications
requires means for understanding the features within the multi-dimensional
performance datasets and their relation to the computational and operational
aspects of the application and its execution. While automatic analysis of perfor-
mance behavior and diagnosis of performance problems is desired, performance
tools today invariably involve the user in interpreting performance results. Presentation of performance information has therefore been regarded as an opportunity for visually conveying characteristics and traits in the data. However, it has always
been challenging to create new performance visualizations, for three reasons.
First, it requires a design process that integrates properties of the performance
data (as understood by the user) with the graphical aspects for good visual form.
This is not easy, if one wants effective outcomes. Second, unlike visualization of
physical phenomena, performance information does not have a natural semantic
visual basis. It could utilize a variety of graphical forms and visualization types
(e.g., statistical, informational, physical, abstract). Third, with increasing ap-
plication concurrency, performance visualization must deal with the problem of
scale. The use of interactive three-dimensional (3D) graphics clearly helps, but
the visualization design challenge is still present.
In addition to these challenges, there are also practical considerations. Because
of the richness of parallel performance information and the different relationships
2 Design Approach
For both visualizations, the UI allows the user to select how the performance
event/metric pair is displayed. Both visualizations are implemented in JOGL,
Java’s interface to OpenGL, and interactive rotation and zooming are provided.
Both of these 3D views were developed to target specific use cases. Without
any additional support, any new visualization would also be implemented that
way. Thus, when we wanted to develop a new visualization for a single event and
Visualization layout design is concerned with how the visualization will appear. Our approach allows the visual presentation to be specified with respect to the parallel profile data model (events, metrics, metadata) and possible analyses of this information. Two basic layout approaches we support are mapping to Cartesian coordinates provided by MPI and filling a space of user-defined dimensions in order of MPI rank. We have also worked to develop a specification language for describing more complex layouts of thread performance in a 3D space. In our initial implementation of these custom layouts, mathematical formulae define the coordinates and color value of each thread in the layout. The formulae are based on variables provided by the profile data model. These input variables include event and metric values for the current thread being processed as well as global values such as the total number of threads in the profile. The specification is applied successively to each thread in the profile to determine the X, Y and Z coordinate values and color values which are used to generate the visualization graphics. Our initial implementation for expression analysis uses the MESP expression parser library [5]. MESP provides a simple syntax for expressing mathematical formulae but is powerful enough to allow visualization layouts based on architecturally relevant geometries or the mathematical relationships of multiple performance variables.
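The following sketch illustrates the general idea of applying such a specification successively to each thread; it is a simplified stand-in (using Python's eval) for the MESP-based evaluator, and the variable names rank, value and maxRank are our own choices, not the tool's exact vocabulary.

import math

# Illustrative sketch (not ParaProf/MESP itself): apply a layout specification
# to every thread of a profile.  Each formula may use per-thread variables
# (here just 'rank' and one event value) and global ones such as 'maxRank'.
def apply_layout(spec, profile):
    env = {"pi": lambda: math.pi, "sqrt": math.sqrt, "floor": math.floor,
           "cos": math.cos, "sin": math.sin, "mod": lambda a, b: a % b,
           "maxRank": len(profile)}
    points = []
    for rank, value in enumerate(profile):
        local = dict(env, rank=rank, value=value)
        points.append(tuple(eval(spec[axis], {"__builtins__": {}}, local)
                            for axis in ("x", "y", "z", "color")))
    return points

# A simple 2-D grid layout colored by the per-thread event value.
spec = {"x": "mod(rank, 32)", "y": "floor(rank / 32)", "z": "0", "color": "value"}
points = apply_layout(spec, profile=[float(i % 7) for i in range(1024)])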
Visualization UI design is concerned with how the visualization will be con-
trolled. The key insight here is to have the UI play a role in “binding” data
model variables used in the layout specification. This approach implements the
functionality present in the current ParaProf views, where the user is free to select events and metrics to be applied in the visualization as inputs to layout formulae. However, for large performance profiles with many threads/processes, the specified layout can result in a dense visualization that obscures internal structures. The current ability to zoom and rotate the topology in the UI partially ameliorates this issue. Our model for visualization UI further allows more sophisticated filtering techniques.
3 Examples
The performance visualization design approach is being developed in the Para-
Prof profile analysis tool. Here we demonstrate our current prototype for three
applications: Sweep3D, S3D and GCRM/ZGrd. Our initial focus is on topol-
ogy visualization. In addition, we illustrate the flexibility of these techniques by
recreating ParaProf’s event correlation view.
3.1 Sweep3D
For development and testing of our 3D visualization approach we used data from
the Sweep3D[6] particle transport code. The Sweep3D performance data set we
used was generated from a 16k core run on an IBM Blue Gene/L system and
contains Cartesian coordinates of each MPI rank from the MPI system [19].
The most obvious topology mapping scheme is to take the rank-to-coordinate
mapping and use it to lay out the points representing the ranks in a 3D space.
Figure 2 shows this performance view for the exclusive time in MPI_Barrier.
The layout specification is defined with respect to MPI ranks while event and
metric variables are selected in the UI.
Fig. 2. Sweep3D BG/L 16k-core mapping as provided by MPI
Fig. 3. Sweep3D BG/L 16k-core user-defined mapping
the means of accessing topology mapping data is also variable. The relevant mapping information may not be available in performance data from which a topological display is desired. Even when coordinate data is available, there are potential issues which may render it inappropriate for the performance analysis task at hand. For example, the underlying, machine-level topology may not be the topology of interest. Higher-level topologies, relating to how work is allocated, may have different or no ready means of programmatically associating ranks with topological coordinates. Another issue that can arise is the need to incorporate another dimension, such as thread (core) ID, in the display. In such situations it is necessary to find other means of rank mapping. In general, greater flexibility in how ranks are visualized allows for more complete analysis of an application with respect to topology.
Figure 3 shows a user-specified visualization defining a topology with meaningful spatial context for performance data, based on a block-wise layout of MPI ranks, in this case in two dimensions. This shows that even a basic linear stacking with respect to rank ID can produce valuable interpretive effects.
More general topological renderings produced by mathematical expressions
can serve a number of purposes, including defining more complex hardware
topologies and other spatial representations of computational activity. To demon-
strate the power of the layout specification, Figure 4 illustrates a spherical visu-
alization of the same Sweep3D performance data.
Fig. 4. Sweep3D 16k-core mapping with spherical topology

Listing 1.1.
BEGIN VIZ=sphere
rootRanks=sqrt(maxRank)
theta=2*pi()/rootRanks*mod(rank,rootRanks)
phi=pi()/rootRanks*(floor(rank/rootRanks))
x=cos(theta)*sin(phi)*100
y=sin(theta)*sin(phi)*100
z=cos(phi)*100
END VIZ
Listing 1.1 shows the expressions mapping ranks to points on the surface of a sphere. The X, Y, and Z formulae are required, but additional helper functions may be provided to simplify the expressions. Several variables, such as maxRank and rank, are provided internally. The topology formulae are defined in a standard text file using MESP's syntax. The file may be loaded and refreshed from within ParaProf, allowing rapid development and adjustment of application- or purpose-specific topologies, as well as easy sharing of topological definitions between collaborators.
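As a worked example, the following lines evaluate the Listing 1.1 formulae by hand for one rank of a hypothetical 16k-core profile (maxRank = 16384); it is only meant to show what the specification computes, not how ParaProf evaluates it.

import math

# Worked example of the Listing 1.1 formulae for a single rank.
maxRank, rank = 16384, 200
rootRanks = math.sqrt(maxRank)                        # 128.0
theta = 2 * math.pi / rootRanks * (rank % rootRanks)
phi = math.pi / rootRanks * math.floor(rank / rootRanks)
x = math.cos(theta) * math.sin(phi) * 100
y = math.sin(theta) * math.sin(phi) * 100
z = math.cos(phi) * 100
print(x, y, z)   # the point for rank 200 on a sphere of radius 100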
Once a visual layout is defined, the UI can control selection and filtering. A particularly effective setting defines a displayable range based on minimum and maximum values. Thread points are not displayed if the value of the metric/event combination representing color exceeds the maximum or is less than the minimum. If the minimum is set above the maximum, only threads with values above the minimum or below the maximum are shown, meaning only high and low values are displayed. This is very important for identifying the topological patterns of performance outliers, and is shown for Sweep3D in Figure 5.
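A minimal sketch of this range filter, including the inverted-range case, could look as follows; the function name and the example thresholds are illustrative, not part of ParaProf.

# Illustrative sketch of the display-range filter described above: with
# lo <= hi only in-range points are shown; with lo > hi the range is treated
# as inverted and only the outliers (very high or very low values) remain.
def visible(value, lo, hi):
    if lo <= hi:
        return lo <= value <= hi
    return value >= lo or value <= hi

# e.g. lo=0.9, hi=0.1 keeps only values >= 0.9 or <= 0.1
shown = [v for v in (0.05, 0.2, 0.5, 0.8, 0.95) if visible(v, 0.9, 0.1)]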
Fig. 5. Sweep3D topology with middle values excised
Fig. 6. Sweep3D topology slice along the x axis
The UI also allows exclusion by locality. For example, if a value along the X
axis is selected, only points appearing along that “slice” will be visible. Excluding
by one, two or three axes results in the visualization of a plane, line or point
respectively. The average value of the metric for the ranks in a selected area will
be displayed in each case, with three selected axes displaying the actual value
for the single selected rank. This is demonstrated in Figure 6 for Sweep3D.
Fig. 7. Time in MPI_Allreduce for S3D 4K-core run on BGP with core-based topological layout

Listing 1.2.
BEGIN VIZ=4Px16Block
xdim=8,ydim=8,zdim=16
x=mod(rank,xdim)+16*floor(rank/1024)
y=mod(floor(rank/xdim),ydim)
z=mod(floor(rank/xdim/ydim),zdim)
END VIZ
We wanted to break down the visualization further to display core-level activity. We elected to use the mathematical expression topology definition system to display each of the four cores as its own point in a distinct node-level topology. The result is shown in Figure 7. Each of the four blocks represents the activity of one of the four core IDs laid out in the specified node-level topology.
Discussing the internal topologies of each block is outside the scope of this paper. However, an interesting high-level phenomenon is immediately visible. In each node, overall, the cores operate in pairs of high and low utilization. That is, for each chip, one core spends significantly more time in the routine under observation (MPI_Allreduce) than the other. This core-wise breakdown is likely related to the way individual cores are assigned to handle communication. The formulae used to distinguish the topology by core are defined in Listing 1.2. Note that this topology definition is only applicable to topologies in which the thread rank is the last to repeat.
There are numerous alternative thread-conscious, four dimensional layouts.
The ideal layout will vary by the application selected and the topological be-
havior being observed. For example, we have also had success with grouping the
threads or cores that comprise a single node and arranging each of these groups
in the context of the greater node-level system topology.
By opening the node and thread layout to formulaic definition we have ex-
panded the scope of topological performance visualization from machine dictated
layouts to arbitrary node configurations. This is especially useful for mapping
ranks to program domain decompositions which may have no direct relationship
with the hardware topology.
BEGIN VIZ=ScatterTest
restrictDim=1
x=(event0.val-event0.min)/(event0.max-event0.min)
y=(event1.val-event1.min)/(event1.max-event1.min)
z=(event2.val-event2.min)/(event2.max-event2.min)
END VIZ

Fig. 8. 3D correlation plot of 10240-core ZGrd run
4 Related Work
5 Conclusion
Parallel performance visualization can be a useful technique for better under-
standing performance phenomena. However, it is important to integrate the
capabilities within a performance analysis framework. This paper describes a
performance visualization design methodology and its incorporation in the TAU
ParaProf tool. Its initial implementation concentrates on topology-oriented lay-
out and examples are given for the Sweep3D and S3D applications.
However, the methods we present for visual layout and UI design are more
broadly applicable. To demonstrate their versatility, we have recently recreated
ParaProf’s event correlation view. In general, our goal is to allow the user the
full benefit of incorporating their concepts of visual presentation and semantics
to improve performance understanding.
References
1. Global cloud resolving model (gcrm), https://svn.pnl.gov/gcrm
2. Paraview, http://www.paraview.org/
3. Visit, https://wci.llnl.gov/codes/visit/
4. Visualization toolkit (vtk), http://www.vtk.org/
5. Math expression string parser (mesp) (2004),
http://expression-tree.sourceforge.net/
6. The ASCI Sweep3D code (October 2006), http://www.llnl.gov/ascibenchmarks/asci/limited/Sweep3D/asciSweep3D.html
7. Bell, R., Malony, A.D., Shende, S.: A portable, extensible, and scalable tool for
parallel performance profile analysis. In: Proc. EUROPAR 2003 Conference, pp.
17–26 (2003)
8. Bhatele, A., Kale, L.V., Chen, N., Johnson, R.E.: A Pattern Language for Topology
Aware Mapping. In: Workshop on Parallel Programming Patterns, ParaPLOP 2009
(June 2009)
9. Chen, J., et al.: Terascale direct numerical simulations of turbulent combustion
using S3D. Computational Science and Discovery 2(1), 15001 (2009)
10. Couch, A.: Categories and Context in Scalable Execution Visualization. Journal of
Parallel and Distributed Computing 18(2), 195–204 (1993)
11. De Rose, L., Pantano, M., Aydt, R., Shaffer, E., Schaeffer, B., Whitmore, S., Reed,
D.: An approach to immersive performance visualization of parallel and wide-area
distributed applications. In: Proceedings of the Eighth International Symposium
on High Performance Distributed Computing, 1999, pp. 247–254 (1999)
12. Hackstadt, S., Malony, A., Mohr, B.: Scalable Performance Visualization of Data-
Parallel Programs. In: Scalable High-Performance Computing Conference, pp. 342–
349 (May 1994)
13. Heath, M., Etheridge, J.: Visualizing the Performance of Parallel Programs. IEEE
Software 8(5), 29–39 (1991)
14. Heath, M., Malony, A., Rover, D.: Parallel Performance Visualization: From Prac-
tice to Theory. IEEE Parallel and Distributed Technology: Systems and Technol-
ogy 3(4), 44–60 (1995)
15. Heath, M., Malony, A., Rover, D.: The Visual Display of Parallel Performance
Data. Computer 28(4), 21–28 (1995)
16. Jagode, H., Dongarra, J., Alam, S., Vetter, J., Spear, W., Malony, A.D.: A Holistic
Approach for Performance Measurement and Analysis for Petascale Applications.
In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot,
P.M.A. (eds.) ICCS 2009. LNCS, vol. 5545, pp. 686–695. Springer, Heidelberg
(2009), http://dx.doi.org/10.1007/978-3-642-01973-9_77
17. Shende, S., Malony, A.D.: The TAU Parallel Performance System. SAGE Publica-
tions (2006)
18. Sistare, S., Allen, D., Bowker, R., Jourdenais, K., Simons, J., Title, R.: A scalable
debugger for massively parallel message-passing programs. IEEE Parallel and Distributed Technology: Systems and Applications 2(2), 50–56 (1994)
19. Traff, J.: Implementing the MPI process topology mechanism. In: SC Conference, p. 28 (2002)
20. Yanovich, J., Budden, R., Simmel, D.: Xt3dmon 3D visual system monitor for PSC's Cray XT3 (2006), http://www.psc.edu/~yanovich/xt3dmon
INAM - A Scalable InfiniBand Network Analysis
and Monitoring Tool
Abstract. As InfiniBand (IB) clusters grow in size and scale, predicting the behavior of the IB network in terms of link usage and performance becomes an increasingly challenging task. There currently exists no open source tool that allows users to dynamically analyze and visualize the communication pattern and link usage in the IB network. In this context, we design and develop a scalable InfiniBand Network Analysis and Monitoring tool, INAM. INAM monitors IB clusters in real time and queries the various subnet management entities in the IB network to gather the performance counters specified by the IB standard. We provide an easy-to-use web-based interface to visualize performance counters and subnet management attributes of a cluster on an on-demand basis. It is also capable of capturing the communication characteristics of a subset of links in the network. Our experimental results show that INAM is able to accurately visualize the link utilization as well as the communication pattern of target applications.
1 Introduction
Across various enterprise and scientific domains, users are constantly looking to push
the envelope of achievable performance. The need to achieve high resolution results
with smaller turn around times has been driving the evolution of enterprise and super-
computing systems over the last decade. Interconnection networks have also rapidly
evolved to offer low latencies and high bandwidths to meet the communication require-
ments of distributed computing applications. InfiniBand has emerged as a popular high
performance network interconnect and is being increasingly used to deploy some of
the top supercomputing installations around the world. According to the Top500 [13]
ratings of supercomputers done in June’11, 41.20% of the top 500 most powerful super-
computers in the world are based on the InfiniBand interconnects. Recently, InfiniBand
has also started to make in-roads into the world of enterprise computing.
Different factors can affect the performance of applications utilizing IB clusters. One of these factors is the routing of packets or messages. Since IB uses static routing, it is important to ensure that the routing table is correctly programmed. Hoefler et al. showed in [4] the possible degradation in performance if multiple messages traverse the same link at the same time. Unfortunately, there are no open-source tools that can provide information such as the communication matrix of a given target application or the usage of the various links in the network in a user-friendly way.
Most contemporary network monitoring tools for IB clusters carry an overhead caused by the execution of their respective daemons, which need to run on every monitored device on the subnet. The purpose of these daemons is to gather relevant data from their respective hosts and transmit it to a central daemon manager which renders this information to the user. Furthermore, the task of profiling an application at the IB level is difficult considering that most of the network monitoring tools are not highly responsive to the events occurring on the network. For example, to reduce the overhead caused by constant gathering of information at the node by the daemons, a common solution is to gather the information at some time interval, which could be anywhere between 30 seconds and 5 minutes. This is called the sampling frequency. The higher the sampling frequency, the higher the overhead created by the daemons, which creates a tradeoff with the responsiveness of the network monitoring tool. This method has the additional disadvantage that it does not allow monitoring of network devices such as switches and routers, on which user-specified daemon processes cannot be launched.
As IB clusters grow in size and scale, it becomes critical to understand the behavior of the InfiniBand network fabric at scale. While the Ethernet ecosystem has a wide variety of mature tools to monitor, analyze and visualize various elements of the Ethernet network, InfiniBand network management tools are still in their infancy. To the best of our knowledge, none of the available open source IB network management tools allow users to visualize and analyze the communication pattern and link usage in an IB network. This leads us to the following broad challenge: can a low overhead network monitoring tool be designed for IB clusters that is capable of depicting the communication matrix of target applications and the usage of the various links in the InfiniBand network?
In this paper we address this challenge by designing a scalable InfiniBand Network
Analysis and Monitoring tool - INAM. INAM monitors IB clusters in real time and
queries the various subnet management entities in the IB network to gather the vari-
ous performance counters specified by the IB standard. We provide an easy to use web
interface to visualize the performance counters and subnet management attributes of
the entire cluster or a subset of it on the fly. It is also capable of capturing the com-
munication characteristics of a subset of links in the network, thereby allowing users
to visualize and analyze the network communication characteristics of a job in a high
performance computing environment. Our experimental results show that INAM is able
to accurately visualize the link usage within a network as well as the communication
pattern of target applications.
The remainder of this paper is organized as follows. Section 2 gives a brief overview
of InfiniBand and the InfiniBand subnet management infrastructure. In Section 3 we
present the framework and design of INAM. We evaluate and analyze the correctness
and performance of INAM in various scenarios in Section 4, describe the currently
available related tools in Section 5, and summarize the conclusions and possible future
work in Section 6.
2 Background
2.1 InfiniBand
InfiniBand is a very popular switched interconnect standard, used by almost 41% of the Top500 supercomputing systems [13]. The InfiniBand Architecture (IBA) [5] defines a switched network fabric for interconnecting processing nodes and I/O nodes, using a queue-based model. The InfiniBand standard does not define a specific network topology or routing algorithm, leaving users the option to choose according to their requirements.
IB also provides link-layer Virtual Lanes (VL), which allow a physical link to be split into several virtual links, each with its own buffers and flow control mechanisms. This makes it possible to create virtual networks over the physical topology. However, current generation InfiniBand interfaces do not offer performance counters for different virtual lanes.
2.2 OFED
OFED, short for OpenFabrics Enterprise Distribution, is open-source software for RDMA and kernel-bypass applications. It is used by the HPC community for applications that need low latency, high efficiency and fast I/O. A detailed overview of OFED can be found in [11]. OFED provides performance monitoring utilities which present the port counters and subnet management attributes for all the device ports within the subnet. Some of the attributes which can be obtained from these utilities are shown in Table 1.
The subnet manager (SM) initiates the subnet and then monitors it. Subnet management agents (SMA) are deployed on every device port of the subnet to monitor their respective hosts. All management traffic, including the communication between the SMAs and the SM, is carried in subnet management packets (SMP). IBA allocates Virtual Lane (VL) 15 for subnet management traffic. General purpose traffic can use any of the other virtual lanes from 1 to 14, but the traffic on VL 15 is independent of the general purpose traffic.
We describe the design and implementation details of our InfiniBand Network Analysis and Monitoring tool (INAM) in this section. For modularity and ease of portability, we separate the functionality of INAM into two distinct modules: the InfiniBand Network Querying Service (INQS) and the Web-based Visualization Interface (WVI). INQS acts as a network data acquisition service. It retrieves the requested information regarding ports on all the devices of the subnet to obtain the performance counters and subnet management attributes. This information is then stored in a MySQL database [9]. The WVI module then communicates with the database to obtain the data pertaining to any user-requested port(s) on an on-demand basis. The WVI is designed as a standard web application which can be accessed using any contemporary web browser. The two modes of operation of the WVI are the live observation of the individual port counters of a particular device and the long-term storage of all the port counters of a subnet; the stored information can be queried by the user later. INQS can be ported to any platform, independent of the cluster size and the Linux distribution being used. INAM is initiated by the administrator, and individual users are served through a connection thread pool. As soon as a user exits the application, the connection is returned to the pool. If all the connections are taken up, the user has to wait. Currently the size of this connection pool is 50, and it can be increased.
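A much-simplified sketch of such an acquisition service is shown below; sqlite3 stands in for the MySQL backend, and read_port_counters() is a placeholder for the actual SMA/MAD query, so none of this is INAM's real code.

import sqlite3, time

# Minimal sketch of an INQS-style acquisition loop: periodically read port
# counters and store timestamped samples in a database for the web interface
# to query later.
def read_port_counters(lid, port):
    # placeholder: in INAM this data comes from subnet management queries
    return {"XmtData": 0, "RcvData": 0, "XmtWait": 0}

db = sqlite3.connect("counters.db")
db.execute("CREATE TABLE IF NOT EXISTS samples "
           "(ts REAL, lid INTEGER, port INTEGER, name TEXT, value INTEGER)")

def sample_once(ports):
    now = time.time()
    for lid, port in ports:
        for name, value in read_port_counters(lid, port).items():
            db.execute("INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
                       (now, lid, port, name, value))
    db.commit()

sample_once([(1, 1), (2, 1)])   # one acquisition sweep over two example ports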
As we saw in Section 1, a major challenge for contemporary IB network monitoring tools is the necessity to deploy daemon processes on every monitored device on the subnet. The overhead in terms of CPU utilization and network bandwidth caused by these daemons often causes considerable perturbations in the performance of real user applications that use these clusters. INAM overcomes this by utilizing the Subnet Management Agents (SMA), which are required to be present on each IB-enabled device on the subnet. The primary role of an SMA is to monitor and regulate all IB network related activity on its respective host node. The INQS queries these SMAs to obtain the performance counters and subnet management attributes of the IB device(s) on a particular host. The INQS uses Management Datagram (MAD) packets to query the SMAs. As MAD packets use a separate Virtual Lane (VL 15), they do not compete with application traffic for network bandwidth. Thus, compared to contemporary InfiniBand network management tools, INAM is more responsive and causes less overhead.
INAM is also capable of monitoring and visualizing the utilization of a link within a subnet. To obtain the link utilization, either the XmtWait attribute alone or the XmtData/RcvData and LinkActiveSpeed attributes in combination are used. The XmtWait attribute corresponds to the period of time a packet was waiting to be sent, but could not be sent
main categories, which are Link Attributes, Virtual Lane Attributes, MTU Attributes and Errors and Violations. INAM also provides dynamic updates regarding the status of the master Subnet Manager (SM) instance to the user. If there is a change in the priority of the SM, or if the master SM instance fails, or if a slave SM takes over as the master SM instance, the status is updated and the user is notified. This can help to understand the fail-over properties of OpenSM. Furthermore, a user can ask INAM to monitor the network for the time period of an MPI job; INAM then helps the user understand the communication pattern of that job using a color-coded link utilization diagram.
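To illustrate how a utilization figure can be derived from the counters discussed earlier (XmtData and the active link speed), here is a small sketch; the 4-bytes-per-count scaling and the example link speed are our assumptions, not values taken from the paper.

# Illustrative sketch of a link-utilization estimate from two successive
# counter samples.
def link_utilization(xmt_data_prev, xmt_data_now, interval_s, link_speed_gbps):
    bytes_sent = (xmt_data_now - xmt_data_prev) * 4            # assume XmtData counts 32-bit words
    capacity_bytes = link_speed_gbps * 1e9 / 8 * interval_s    # what the link could carry
    return bytes_sent / capacity_bytes

u = link_utilization(0, 10000000, interval_s=1.0, link_speed_gbps=16)  # ~2% for these example values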
4 Experimental Results
4.1 Experimental Setup
The experimental setup is a cluster of 71 nodes (8 cores per node, 568 cores in total), all with dual Intel Xeon E5345 processors, connected to an InfiniBand switch with an internal fat-tree topology. We use a part of this cluster to show the functionality of INAM. This setup comprises 6 leaf switches and 6 spine switches with 24 ports each and a total of 35 leaf nodes equipped with ConnectX cards. The functioning of INAM is presented using a series of benchmarks in varied scenarios. The first set of results is obtained using a bandwidth-sharing benchmark to create traffic patterns which are verified by visualizing the link usage with INAM. The second set of benchmarks shows similar network communication patterns with MPI_Bcast configured for diverse scenarios. The third set of experiments verifies the usage of INAM using the LU benchmark from the SpecMPI suite.
Fig. 2. INAM depiction of network traffic pattern for 16 processes
Fig. 3. INAM depiction of network traffic pattern for 64 processes
Test Pattern 1. The first test pattern is visualized in Figure 2. The process arrangement in this pattern is such that 8 processes, one on each of the 8 leaf nodes connected to leaf switch 84, communicate with one process on each of the four leaf nodes connected to each of the two switches 78 and 66. The thick green line indicates that multiple processes are using that link. In this case, it can be observed that the thick green line originating from switch 84 splits into 2 at switch 110. The normal green links symbolize that the links are not being over-utilized in this specific case.
Test Pattern 2. Figure 3 presents the network communication for test pattern 2. The process arrangement in this pattern is such that 32 processes, four on each of the 8 leaf nodes connected to leaf switch 84, communicate with two processes on each of the eight leaf nodes connected to each of the two switches 78 and 66. 32 processes send out messages from switch 84 and 16 processes on each of the switches 78 and 66 receive these messages. This increase in the number of processes per leaf node explains the large increase in the number of links being heavily utilized. Figure 3 also shows that all of the inter-switch links are marked with thick lines, showing that each link is being used by more than one process. The links depicted in red indicate that the link is over-utilized. Since each leaf node on switch 84 has four processes and each leaf node on the other switches has two processes, the links connecting the leaf nodes to the switch are depicted as thick red lines.
4.4 Link Utilization of Collective Operations: Case Study with MPI_Bcast Operation
In this set of experiments, we evaluate the visualization of the one-to-all broadcast algorithms typically used in MPI libraries, using INAM. MVAPICH2 [8] uses tree-based algorithms for small and medium-sized messages, and the scatter-allgather algorithm for larger messages. The tree-based algorithms are designed to achieve lower latency by minimizing the number of communication steps. However, due to the costs associated with the intermediate copy operations, the tree-based algorithms are not suitable for larger messages and the scatter-allgather algorithm is used in such cases. The scatter-allgather algorithm consists of two steps. In the first step, the root of the broadcast operation divides the data buffer and scatters it across all the processes using the binomial algorithm. In the next step, all the processes participate in an allgather operation which can be implemented using either the recursive doubling or the ring algorithm.
We designed a simple benchmark to study the link utilization pattern of the MPI_Bcast operation with different message lengths. For brevity, we compare the link utilization pattern of the binomial algorithm with a 16KB message length and study the scatter-allgather (ring) algorithm with a data buffer of size 1MB. We used six processes for these experiments, such that we have one process on each of the leaf switches, as shown in Figure 4. In our controlled experiments, we assign the process on switch 84 to be the root (rank 0) of the MPI_Bcast operation, the process on switch 126 to be rank 1, and so on until the process on switch 66 is rank 5. Figure 4 shows a binomial traffic pattern for a broadcast communication on 6 processes using a 16KB message size. The binomial communication pattern with 6 processes is as follows (a small sketch reproducing this schedule is given after the list):
– Step 1: Rank 0 → Rank 3
– Step 2: Rank 0 → Rank 1 and Rank 3 → Rank 4
– Step 3: Rank 1 → Rank 2 and Rank 4 → Rank 5
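The schedule above can be reproduced with a simple recursive-halving sketch; this is one way to generate the same step pattern, not MVAPICH2's actual implementation.

# Illustrative recursive-halving sketch of a binomial broadcast schedule.
def binomial_steps(ranks, step=0, steps=None):
    steps = steps if steps is not None else {}
    if len(ranks) > 1:
        mid = len(ranks) // 2
        steps.setdefault(step, []).append((ranks[0], ranks[mid]))  # root forwards to the middle rank
        binomial_steps(ranks[:mid], step + 1, steps)                # lower half, rooted at ranks[0]
        binomial_steps(ranks[mid:], step + 1, steps)                # upper half, rooted at ranks[mid]
    return steps

print(binomial_steps(list(range(6))))
# {0: [(0, 3)], 1: [(0, 1), (3, 4)], 2: [(1, 2), (4, 5)]}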
In Figure 4, a darker color is used to represent a link that has been used more than once during the broadcast operation. We can see that for processes with ranks 0 through 4, the links connecting the compute nodes to their immediate leaf-level switches are used more than once, because these processes participate in more than one send/receive operation. However, process P5 receives only one message and INAM demonstrates this
by choosing a lighter shade. We can also understand the routing algorithm used between the leaf and the spine switches by observing the link utilization pattern generated by INAM. We also observe that the process with rank 4 uses the same link between switches 90 and 110 for both its send and receive operations. Such a routing scheme is probably more prone to contention, particularly at scale when multiple data streams are competing for the same network link.
Figure 5 presents the link utilization pattern for the scatter-allgather (ring) algorithm with 6 processes. We can see that the effective link utilization for this algorithm is considerably higher than for the binomial exchange. This is because the scatter-allgather (ring) algorithm involves a higher number of communication steps than the binomial exchange algorithm. With 6 processes, the ring algorithm comprises 6 communication steps. In each step, process P_i communicates with its immediate logical neighbors, processes P_(i-1) and P_(i+1). This implies that each link between neighboring processes is utilized exactly 6 times during the allgather phase.
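A tiny sketch enumerating these neighbor exchanges, following the six-step description above rather than any particular MPI library's implementation:

# Each of the n steps uses every (P_i -> P_(i+1)) neighbor link once.
def ring_steps(n):
    return [[(i, (i + 1) % n) for i in range(n)] for _ in range(n)]

steps = ring_steps(6)   # 6 steps, each using the 6 neighbor links once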
In this experiment, we ran the LU benchmark (137.lu, medium size, mref) from the SpecMPI suite [10] on a system size of 128 processes using 16 leaf nodes, with 8 nodes on each of two leaf switches. The prominent communication used by LU consists of MPI_Send and MPI_Recv. The communication pattern is such that each process communicates with its nearest neighbors in either direction (p2 communicates with p1 and p3). In the next step, p0 communicates with p15, p1 communicates with p16, and so on. This pattern is visualized by INAM and is shown in Figure 6. It can be seen that a majority of the communication occurs at an intra-switch level.
Since we use the subnet management agents (SMA), which act like daemons monitoring all the devices of a subnet, we do not need any additional daemons installed on every device to obtain this data. This is a major advantage, as it avoids the overhead of the contemporary approach caused by daemons installed on every device. The user just needs to have the opensmd service started on the subnet. Since the queries used for data acquisition communicate through Virtual Lane 15, there is no interference with the generic cluster traffic. To verify this, we compared the performance of an IMB Alltoall benchmark while toggling the data collection service on and off, using messages of size 16 KB and 512 KB for system sizes varying from 16 cores to 512 cores. The results obtained are shown in Figure 7, which shows that the overhead is minimal even when the service is on, and that there is little difference even when the message size is increased.
System size            16 cores   32 cores   64 cores   128 cores   256 cores   512 cores
Message size 16 KB      0.13%      0.11%      0.15%      0.09%       0.07%       0.14%
Message size 512 KB     0.19%      0.21%      0.16%      0.08%       0.21%       0.15%
5 Related Tools
There is a plethora of free or commercial network monitoring tools that provide different kinds of information to system administrators or users, but only a few of them provide information specific to the IB network. We focus here on three popular network monitoring tools: Ganglia [6], Nagios [1] and FabricIT [7].
Ganglia is a widely used open-source scalable distributed monitoring system for high-performance computing systems developed by the University of California within the Berkeley Millennium Project. One of the best features of Ganglia is to offer an
overview of certain characteristics of all the nodes of a cluster, like memory, CPU, disk and network utilization. At the IB level, Ganglia can provide information through perfquery and smpquery. Nevertheless, Ganglia cannot show any information related to the network topology or link usage. Furthermore, to get all the data, Ganglia needs to run a daemon, called gmond, on each node, adding additional overhead.
Nagios is another common open-source network monitoring tool. Nagios offers almost the same information as Ganglia through a plug-in called "InfiniBand Performance Counters Check". But, like Ganglia, Nagios cannot provide any information related to the topology.
FabricIT is a proprietary network monitoring tool developed by Mellanox. Like
INAM, FabricIT is able to provide more information than Ganglia or Nagios, but the
free version of the tool does not give a graphical representation of the link usage or the
congestion.
INAM differs from the other existing tools in the richness of the information it provides and in its unique link usage information, giving users all the elements required to understand the performance of applications at the IB level.
References
1. Barth, W.: Nagios. System and Network Monitoring. No Starch Press, U.S. Ed edn. (2006)
2. Charts, H.: HighCharts JS - Interactive JavaScript Charting,
http://www.highcharts.com/
3. DWR: DWR - Direct Web Remoting, http://directwebremoting.org/dwr/
4. Hoefler, T., Schneider, T., Lumsdaine, A.: Multistage Switches are not Crossbars: Effects of
Static Routing in High-Performance Networks. In: Proceedings of the 2008 IEEE Cluster
Conference (September 2008)
5. InfiniBand Trade Association, http://www.infinibandta.org/
6. Massie, M.L., Chun, B.N., Culler, D.E.: The Ganglia Distributed Monitoring System: De-
sign, Implementation, and Experience. Parallel Computing 30(7) (July 2004)
7. Mellanox: FabricIT, http://www.mellanox.com/pdf/prod_ib_switch_systems/pb_FabricIT_EFM.pdf
8. MVAPICH2, http://mvapich.cse.ohio-state.edu/
9. MySQL: MySQL, http://www.mysql.com/
10. Müller, M.S., van Waveren, G.M., Lieberman, R., Whitney, B., Saito, H., Kumaran, K., Baron, J., Brantley, W.C., Parrott, C., Elken, T., Feng, H., Ponder, C.: SPEC MPI2007 - an application benchmark suite for parallel systems using MPI. Concurrency and Computation: Practice and Experience, 191–205 (2010)
11. Open Fabrics Alliance,
http://www.openfabrics.org/
12. SUN: Java 2 platform, enterprise edition (j2ee) overview,
http://java.sun.com/j2ee
13. Top500: Top500 Supercomputing systems (November 2010),
http://www.top500.org
14. Vienne, J., Martinasso, M., Vincent, J.M., Méhaut, J.F.: Predictive models for bandwidth
sharing in high performance clusters. In: Proceedings of the 2008 IEEE Cluster Conference
(September 2008)
15. W3C: HTML5 - Canvas Element,
https://developer.mozilla.org/en/HTML/Canvas
Auto-tuning for Energy Usage in Scientific
Applications
1 Introduction
As the HPC community prepares to enter the era of exascale systems, a key problem that the community is trying to address is the power wall. The power wall arises because as compute nodes (consisting of multi/many-core processors) become increasingly powerful and dense, they also become increasingly power hungry. The problems this creates are two-fold: it is more expensive to run compute nodes due to the energy they require, and it is difficult and expensive to cool them.
Going forward, power-aware computing research in the HPC community will focus on at least two main areas. The first is to develop descriptive and universal ways of characterizing power usage, either by direct measurement or through explanatory models. Inexpensive, commercially produced devices such as WattsUp? Pro [3] or more customized frameworks such as PowerMon2 [4] or PowerPack [13] can help measure power and energy consumption. Modeling energy usage through combinations of architectural parameters with performance counters [26] or other resource usage information [21] also falls into this category. The second thrust, which invariably depends on the first, is to attempt to minimize the amount of energy required to solve various scientific problems. This includes the use of
In this paper we take a first concrete step towards answering these questions. The study presented here takes a search-based offline auto-tuning approach. We start by identifying a set of tunable parameters for different potential performance bottlenecks in an application. The feedback-driven empirical auto-tuner monitors the application's performance and power consumption and adjusts the values of the tunable parameters in response. When the auto-tuner requires a new code variant in order to move from one set of parameter values to another, it invokes a code generation framework [8] to generate that code variant. The feedback metric values associated with different parameter configurations are measured by running the target application on the target platform. The methodology is thus offline, because the tuning adjustments are made between successive full application runs based on the observed power consumption of code variants.
2 Motivation
In this section, we demonstrate that there are opportunities for auto-tuning power and energy consumption. We use an implementation of the Poisson's equation
1. The prevalence of stencil computations in DARPA Ubiquitous High Performance Computing (UHPC) challenge applications is documented in [10].
2. Frequency scaling can be used to reduce energy consumption.
Fig. 1. Normalized energy consumption of the entire system for the 8-core experiment, plotted over the tuning parameters TI and TJ. Figure is easier to see in color.
3 Experiments
To drive the tuning process we use Active Harmony [9, 27], which is a search-
based auto-tuner. Active Harmony treats each tunable parameter as a variable
in an independent dimension in the search (or tuning) space. Parameter config-
urations (admissible values for tunable parameters) serve as points in the search
space. The objective function values (feedback metrics) associated with points
in the search space are gathered by running and measuring the application on
the target platform. The objective function values are consumed by the Active
Harmony server to make tuning decisions.
For tunable parameters that require new code (e.g. unroll factors), Active
Harmony utilizes code-transformation frameworks to generate code. The exper-
iments reported in this paper use CHiLL [8], a polyhedral loop transformation
and code generation framework. CHiLL provides a high-level script interface
that auto-tuners can leverage to describe a set of loop transformation strategies
for a given piece of code. More details on offline auto-tuning using Active Har-
mony and CHiLL are described in [27]. Both Active Harmony and CHiLL are
open-source projects.
We measure the energy consumption of a system using the WattsUp? Pro power
meter [3]. The power meter is a fairly inexpensive device and, at the time of
this writing, costs less than $150. This device measures the AC power being
consumed by the entire system. We have implemented a command line interface
on top of the wattsup driver to monitor and calculate the overall energy usage
of an application.
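The energy figure itself follows directly from the sampled power readings; a minimal sketch, assuming one power sample per second as reported by wall-plug meters of this class:

# Integrate sampled AC power over the run time (simple rectangle rule) to
# obtain whole-system energy in Joules.
def energy_joules(power_samples_watts, sample_interval_s=1.0):
    return sum(power_samples_watts) * sample_interval_s

e = energy_joules([210.0, 215.5, 220.1, 218.7])   # 4 s of samples -> Joules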
on the overall goal of the tuning exercise, and one could certainly consider a user-set delay penalty per job. We think these 4 are enough to characterize our methods and the optimization space.
search favors a lower clock frequency for better energy usage. On average, energy
conscious parameter configurations save 58% energy and run 2.16× faster com-
pared to the baseline compiler optimized code. The auto-tuning runs that use ED,
ED2 and T metrics all favor high clock frequency and show similar performance and
energy characteristics. In terms of the best runtime improvement, auto-tuning runs
done with delay as the feedback metric achieves 2.24× improvement along with en-
ergy saving of 55%. This confirms the popular belief that auto-tuning for runtime
in scientific applications leads to better system-wide energy usage.
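For reference, the energy-delay metrics referred to above are conventionally defined as follows, with E the measured energy and T the runtime; the paper introduces them on a page not reproduced here, so these are stated as standard definitions rather than quoted from it:

E = \bar{P} \times T, \qquad ED = E \times T, \qquad ED^{2} = E \times T^{2}, \qquad D = T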
Often the best performing code is nearly the most energy efficient, as short runtimes shorten one component of the Power × Time product (energy). So finally, we compared the performance and energy consumption measurements between the configurations that give the best timing and the best energy usage, respectively. The configuration that provides the best energy usage suffers a delay of 4.1%; however, its energy usage saving is 5.8%. The search heuristic used in these experiments does not guarantee a globally best configuration with respect to timing or energy consumption, which means there can be other configurations in the search space that possibly demonstrate different behavior. However, the result indicates that there are non-trivial interactions between compiler performance optimization strategies and energy usage.
Table 2 shows the results for the L2-Error function. The results for this function follow a similar pattern to those of the relaxation function. We then compared the performance and energy consumption measurements between the configurations that give the best timing (for this kernel, the ED2 tuning runs give the best timing, which is what we use for the comparison) and the best energy usage, respectively. The configuration that provides the best energy usage suffers a performance loss of 3.9%, and the energy usage saving is 5%. This result further strengthens our earlier argument about the need to investigate the interactions between compiler optimization strategies and energy consumption.
4 Future Work
5 Related Work
Reducing power consumption has long been of great interest to embedded and
mobile systems architects [11, 18, 22]. The architectural properties of these sys-
tems are fundamentally different from those of the HPC systems, so the strategies
proposed for them generally fail to translate into reducing energy consump-
tion for HPC systems and applications [16]. As such, power optimization has
received a fair amount of attention from the HPC community. Most previous re-
search on power optimization uses architectural simulation to estimate power or
energy usage by different components of the compute system [6]. More recently,
direct power and energy measurement hardware and software have been devel-
oped [4, 13, 15]. Bedard et al. [4] developed PowerMon2, a framework designed to
obtain fine-grained current and voltage measurements for different components
of a target platform such as CPU, memory subsystem, disk I/O etc. Power pro-
filing frameworks can be integrated within our power auto-tuning framework to
obtain greater understanding of the impact of different optimization techniques
on individual components of the target architecture.
Power or energy usage modeling and benchmarking is another relevant area.
The Energy-Aware Compilation (EAC) [17] framework uses a high-level en-
ergy estimation model to predict the energy consumption of a given piece of
code. The model utilizes architectural parameters and energy/performance con-
straints. The overall idea is to use the model to decide the profitability of different
3 Note that for this kernel, the ED2 tuning runs give the best timing, which is what we use for the comparison with the best energy-usage configuration.
compiler optimization techniques. Singh et al. [26] derive an analytic, workload-
independent piece-wise linear power model that maps performance counters and
temperature to power consumption. Laurenzano et al. [19] use a benchmark-based approach to determine how system power consumption and performance are affected by various demand regimens on the system, and then use this to select
processor clock frequency.
Seng et al. [25] examine the effect of compiler optimization levels and a few
specific compiler optimization flags on the energy usage and power consumption
of the Intel Pentium 4 processor. Rather than relying on compiler optimiza-
tion levels, we exercise greater control over how different code transformation strategies are applied. Moreover, our technique is general purpose and uses fairly inexpensive power measurement hardware to guide the exploration of the
parameter search space.
Rahman et al. [23] use a model-based approach to estimate power consump-
tion of chip multiprocessors and use that information to guide the application
of different compiler optimization techniques. This work is most closely related
to the work that we have presented here. Power estimations for different code-
variants are obtained using the model described by Singh et al. [26]. Our work
uses power measurements rather than models and we simultaneously treat clock
frequency as a tunable parameter alongside the generation and evaluation of
different code variants.
6 Conclusion
In this paper, we showed that there are non-trivial interactions between com-
piler performance optimization strategies and energy usage. We used a fairly
inexpensive power meter and leveraged open-source projects to explore the energy and performance optimization space for computation-intensive kernels.
References
1. CPU Frequency Scaling, https://wiki.archlinux.org/index.php/Cpufrequtils
2. KeLP, http://cseweb.ucsd.edu/groups/hpcl/scg/KeLP1.4/
3. WattsUp? Meters, https://www.wattsupmeters.com/secure/products.php?pn=0
4. Bedard, D., Lim, M.Y., Fowler, R., Porterfield, A.: PowerMon: Fine-grained and
integrated power monitoring for commodity computer systems. In: Proceedings of
the IEEE SoutheastCon 2010 (SoutheastCon), pp. 479–484 (2010)
5. Bekas, C., Curioni, A.: A new energy aware performance metric. Computer Science
- Research and Development 25, 187–195 (2010)
6. Brooks, D., Tiwari, V., Martonosi, M.: Wattch: a framework for architectural-level
power analysis and optimizations. In: Proceedings of the 27th Annual International
Symposium on Computer Architecture, ISCA 2000, pp. 83–94. ACM, New York
(2000)
7. Brooks, D.M., Bose, P., Schuster, S.E., Jacobson, H., Kudva, P.N., Buyukto-
sunoglu, A., Wellman, J.-D., Zyuban, V., Gupta, M., Cook, P.W.: Power-aware
microarchitecture: Design and modeling challenges for next-generation micropro-
cessors. IEEE Micro 20, 26–44 (2000)
8. Chen, C.: Model-Guided Empirical Optimization for Memory Hierarchy. PhD the-
sis, University of Southern California (2007)
9. Chung, I.-H., Hollingsworth, J.: A case study using automatic performance tuning
for large-scale scientific programs. In: 2006 15th IEEE International Symposium
on High Performance Distributed Computing, pp. 45–56 (2006)
10. Ciccotti, P., et al.: Characterization of the DARPA Ubiquitous High Performance
Computing (UHPC) Challenge Applications. Submission to International Sympo-
sium on Workload Characterization, IISWC (2011)
11. Flinn, J., Satyanarayanan, M.: Energy-aware adaptation for mobile applications.
In: Proceedings of the Seventeenth ACM Symposium on Operating Systems Prin-
ciples, SOSP 1999, pp. 48–63. ACM, New York (1999)
12. Freeh, V.W., Kappiah, N., Lowenthal, D.K., Bletsch, T.K.: Just-in-time dynamic
voltage scaling: Exploiting inter-node slack to save energy in mpi programs. J.
Parallel Distrib. Comput. 68, 1175–1185 (2008)
13. Ge, R., Feng, X., Song, S., Chang, H.-C., Li, D., Cameron, K.: PowerPack: En-
ergy Profiling and Analysis of High-Performance Systems and Applications. IEEE
Transactions on Parallel and Distributed Systems 21(5), 658–671 (2010)
14. Horowitz, M., Indermaur, T., Gonzalez, R.: Low-power digital design. In: IEEE
Symposium on Low Power Electronics, Digest of Technical Papers 1994, pp. 8–11
(October 1994)
15. Hotta, Y., Sato, M., Kimura, H., Matsuoka, S., Boku, T., Takahashi, D.: Profile-
based optimization of power performance by using dynamic voltage scaling on a
pc cluster. In: Proceedings of the 20th International Conference on Parallel and
Distributed Processing, IPDPS 2006, p. 298. IEEE Computer Society, Washington,
DC (2006)
16. Hsu, C.-H., Feng, W.-C.: A Power-Aware Run-Time System for High-Performance
Computing. In: Proceedings of the 2005 ACM/IEEE Conference on Supercomput-
ing, SC 2005, p. 1. IEEE Computer Society, Washington, DC (2005)
17. Kadayif, I., Kandemir, M., Vijaykrishnan, N., Irwin, M., Sivasubramaniam, A.:
Eac: a compiler framework for high-level energy estimation and optimization. In:
Proceedings of Design, Automation and Test in Europe Conference and Exhibition,
2002, pp. 436–442 (2002)
18. Kandemir, M., Vijaykrishnan, N., Irwin, M.J., Ye, W.: Influence of compiler opti-
mizations on system power. IEEE Trans. Very Large Scale Integr. Syst. 9, 801–804
(2001)
19. Laurenzano, M.A., Meswani, M., Carrington, L., Snavely, A., Tikir, M.M., Poole,
S.: Reducing Energy Usage with Memory and Computation-Aware Dynamic Fre-
quency Scaling. In: Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011, Part
I. LNCS, vol. 6852, pp. 79–90. Springer, Heidelberg (2011)
20. Li, D., de Supinski, B., Schulz, M., Cameron, K., Nikolopoulos, D.: Hybrid
MPI/OpenMP power-aware computing. In: 2010 IEEE International Symposium
on Parallel Distributed Processing (IPDPS), pp. 1–12 (April 2010)
21. Olschanowsky, C., Carrington, L., Tikir, M., Laurenzano, M., Rosing, T.S.,
Snavely, A.: Fine-grained energy consumption characterization and modeling. In:
DOD High Performance Computing Modernization Program User Group Confer-
ence (2010)
22. Pillai, P., Shin, K.G.: Real-time dynamic voltage scaling for low-power embedded
operating systems. SIGOPS Oper. Syst. Rev. 35, 89–102 (2001)
23. Rahman, S.F., Guo, J., Yi, Q.: Automated empirical tuning of scientific codes
for performance and power consumption. In: Proceedings of the 6th International
Conference on High Performance and Embedded Architectures and Compilers,
HiPEAC 2011, pp. 107–116. ACM, New York (2011)
24. Rivera, G., Tseng, C.-W.: Tiling optimizations for 3D scientific computations. In:
Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CDROM),
Supercomputing 2000. IEEE Computer Society, Washington, DC (2000)
25. Seng, J.S., Tullsen, D.M.: The Effect of Compiler Optimizations on Pentium 4
Power Consumption. In: Proceedings of the Seventh Workshop on Interaction be-
tween Compilers and Computer Architectures, INTERACT 2003, p. 51. IEEE
Computer Society, Washington, DC (2003)
26. Singh, K., Bhadauria, M., McKee, S.A.: Prediction-based power estimation and
scheduling for cmps. In: Proceedings of the 23rd International Conference on Su-
percomputing, ICS 2009, pp. 501–502. ACM, New York (2009)
27. Tiwari, A., Chen, C., Chame, J., Hall, M., Hollingsworth, J.: A Scalable Auto-
Tuning Framework for Compiler Optimization. In: 23rd IEEE International Par-
allel & Distributed Processing Symposium, Rome, Italy (May 2009)
28. Vuduc, R., Demmel, J.W., Yelick, K.A.: Oski: A library of automatically tuned
sparse matrix kernels. Journal of Physics: Conference Series 16, 521–530 (2005)
Automatic Source Code Transformation
for GPUs Based on Program Comprehension
1 Introduction
The development of software for scientific applications has gone through different seasons over the years. Continuously growing performance demands from specific computational needs drove the birth of parallel machines and the related concurrent programming models. In the nineties, a lot of effort was spent on developing parallelization techniques to port applications to parallel, vector, or super-scalar architectures.
Subsequently, continuous improvements in hardware, and mainly in processors' clock speeds, caused interest in parallelization research to wane, since application performance grew naturally. In the last few years, however, the growth in processor clock speed has stopped due to physical limits on junction dimensions and on dissipated power. Processor improvements now follow a different path, multiplying the number of processing units on a chip (multi-core systems). Chip producers nowadays announce systems not with higher frequencies but with an increased number of cores. Special-purpose devices such as GPUs, designed for graphics applications, can be used to perform parallel computations. Multi-core CPUs and GPUs are now found not only in systems for scientific applications but also in common personal computers.
Writing parallel code is hard and requires skilled developers. Great effort is needed to port existing software, and the manual parallelization of applications with very large code bases is a critical and error-prone process.
2 Related Works
the transformation has been made on OpenMP code by analyzing parallel constructs and work-sharing constructs to extract candidate kernels and transform them to CUDA code. A framework for the optimization of affine loop nests, based on a polyhedral compiler model, is described in [1]. All these works start from code that is already parallel, or from user-annotated code that drives the transformation, rather than from sequential code as ours does.
(Figure: architecture of the tool, with Front-End and Back-End modules between the input and output source, a Basic Concepts Extractor feeding a Reasoner driven by Algorithmic Rules, an Algorithm repository, and a Transformer.)
For example, the analysis of the statement int i = 0; produces the facts in listing 1.1.
scalar_var_def(i, def_list_1, elem_update_r, main).
scalar_var_inst(stp_1, i, elem_update_r, main).
val_inst(stp_2, 0, elem_update_r, main).
assign_r((def_list_1, stp_1), stp_1, stp_2, elem_update_r, main).
We can see that four facts are generated: a) the definition of a scalar variable; b) the usage of a scalar variable; c) the usage of a constant; d) the assignment statement. In detail, the second fact above indicates a basic concept named scalar_var_inst. Its instance number is 1 (stp_1), its parameter is i (the variable name), the rule by which it is recognized is elem_update_r, and the function in which it is present is named main. Similarly, in the last fact, we see the composition of the previous concepts in a tree.
Another example is the loop statement for (i = 0; i < 10; i++), which produces the facts in listing 1.2.
for_r(15, for(15, exit_115), init_6, exit_115, incr_7, elem_update_r, main).
scalar_var_inst(stp_11, i, elem_update_r, main).
val_inst(stp_12, 0, elem_update_r, main).
assign_r(init_6, assign(init_6, stp_11), stp_11, stp_12, elem_update_r, main).
scalar_var_inst(stp_13, i, elem_update_r, main).
val_inst(stp_14, 10, elem_update_r, main).
less(exit_115, stp_13, stp_14, elem_update_r, main).
scalar_var_inst(stp_15, i, elem_update_r, main).
post_incr(incr_7, stp_15, elem_update_r, main).
In this case the numbers are the pointers to the nodes of the AST.
Control dependence facts generated have a syntax like:
control_dep(dependant_id, depend_from_id, type, class, method).
Data dependence facts have a syntax like:
data_dep(type,dependant_id,depend_from_id,variable,class,method).
The concept recognition rules are the production rules of the parsing process;
they describe the feature set that permits the identification of an instance of
an algorithmic concept in the code. This feature set can be named algorithmic
pattern. The rules can be defined as the way in which abstract concepts, realized as groups of statements in the code, are organized under an abstract control structure. With this definition we include structural relationships such as Control and Data Flow, Control and Data Dependence, and function calling.
The information obtained after the recognition of the algorithm drives the trans-
former module. The source code region that implements the algorithm can be
replaced by optimized parallel code or by call to optimized libraries. The al-
gorithm repository contains, for each target architecture, one or more possible
– The sub-tree corresponding to the code region is pruned from the AST and,
if desired, a comment block with the original code is inserted.
– A new sub-tree is generated with transformed code. If needed (as in GPUs),
it also contains: memory allocation on the device, memory transfer from CPU
to device, library invocation, memory transfer from the device back to the
CPU and memory deallocation.
– This tree is appended to the AST at the removal point, just after the com-
ment block.
After all the transformations on the AST are done, an unparsing operation generates the code, ready to be compiled on the target platform.
4 Prototype Tool
To test the technique a prototype tool has been built. The reasoner has been
implemented with SWI-Prolog [19] as a stand-alone module with a shell interface.
The rest of the work has been done by using Rose Compiler [18]. This is a
complete compiler infrastructure, tailored for source-to-source transformations.
It uses two front-end modules, one that can parse C/C++ and the other for
Fortran 2003 and earlier. The intermediate representation used by Rose is very rich and preserves all the information from the source code (including source file references, code comments, macros and templates for C++). This is valuable in the unparsing process to produce source code that is still readable by humans. The programming interface of Rose Compiler is C++, so our work was
done in this language.
Starting from the intermediate representation obtained by the front-end, the
AST should be traversed in order to find basic concepts. We built a class that im-
plements the Visitor Design Pattern [11] by extending the ROSE_VisitorPattern
class and overriding the visit() methods for each node type we need to pro-
cess. The AST is thus traversed and the series of facts corresponding to the basic concepts is produced in a text file. Similarly, control-dependence and data-dependence facts are produced by using the related Rose library functions. The reasoner is then invoked with a series of goals, each corresponding to a known algorithm present in the repository. If a goal is satisfied, the reasoner replies with the name of the algorithm, the references to the code region that implements it, and the data involved. Since multiple queries can be issued to search for different algorithms and the reasoning is a time-consuming process, it can be done separately from the transformation and the results saved in intermediate files.
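A minimal sketch of such a visitor is shown below. It follows the description above (a class extending ROSE_VisitorPattern with one visit() overload per node type of interest); the exact ROSE class names, node types, and the fact-emission format are assumptions and may differ from the actual implementation.

#include "rose.h"      // assumed ROSE header
#include <fstream>
#include <string>

// Sketch of a basic-concept extractor: each visited node type emits a
// Prolog fact into a text file that is later consumed by the reasoner.
class BasicConceptExtractor : public ROSE_VisitorPattern {
public:
    explicit BasicConceptExtractor(const std::string& path) : facts_(path) {}

    void visit(SgVariableDeclaration* decl) {
        // Emission format is illustrative, not the tool's actual output.
        facts_ << "scalar_var_def(...)  % from: " << decl->unparseToString() << "\n";
    }

    void visit(SgForStatement* loop) {
        facts_ << "for_r(...)  % from a for statement\n";
    }

private:
    std::ofstream facts_;  // text file of facts for the Prolog reasoner
};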
The transformer, starting with that information, cuts the original code (or simply encloses it in comments, depending on the preferences of the user), builds new code from the templates for the platform the user has chosen, and modifies the AST accordingly. Before the code removal, however, a test is done to prove the legality of the transformation: all the code that is enclosed in the AST sub-tree of the algorithm but is not mapped to basic concepts of the algorithm (e.g., extra added lines) is checked for data dependencies with the data involved in the algorithm.
To add new code and comments to the AST, Rose Compiler provides the so-called Rewrite mechanism. It offers three simple functions, insert(), replace() and remove(), which can be used at different levels of abstraction: two low levels that interact directly with the nodes of the tree and permit fine-grained control over the generated nodes but are extremely verbose, an intermediate level that lets the user express the transformation with strings, and a higher level that can be used during the traversal operations. We used the mid level since it gave us the best compromise between complexity and ease of use.
After all the transformations, a final call to the backend() function generates the source code from the AST in a new file.
5 Case Study
As a case study we used the source code for a sequential C implementation
of a matrix-matrix multiplication. This contains one of the algorithms we can
recognize at present.
Listing 1.3 shows a fragment of the code that is given as input to the tool.
double x[10][10];
double y[10][10];
double z[10][10];
double temp = 0;
int i = 0;
int j = 0;
int k = 0;
Listing 1.4 shows a small excerpt of the Prolog facts, with the basic concepts and dependence information produced for the code.
In listing 1.5 we can see the response of the reasoner for a query of the goal
matrix_matrix_r.
% hierarchy of concepts: references omitted
matrix_matrix_product(
    simple_scan(...),
    matrix_vector_product(
        simple_scan(...),
        dot_product(...),
        simple_scan(...)
    ),
    simple_scan(...)
).
After that recognition, listing 1.6 shows the added source code with the calls to the CUBLAS library, assuming the user has chosen that implementation. We have omitted the commented code block.
// .... omitted commented code ...
// ---> Added by Transformer ---
void* d_ptr_x;
void* d_ptr_y;
void* d_ptr_z;
// Memory allocation
cudaMalloc((void**)&d_ptr_x, 10*10*sizeof(double));
cudaMalloc((void**)&d_ptr_y, 10*10*sizeof(double));
cudaMalloc((void**)&d_ptr_z, 10*10*sizeof(double));
cublasCreate(&handle);
// Data transfer CPU->GPU
Listing 1.6. Code region added for Matrix multiplication with CUBLAS
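The listing is truncated by the page break. For orientation only, a hand-written sketch of how the remaining generated steps (host-to-device transfer, CUBLAS call, transfer back, cleanup) could look follows; it mirrors the variable names of the excerpt above but is not the tool's actual output.

// Sketch of the remaining generated region (not the tool's actual output).
// Data transfer CPU -> GPU
cublasSetMatrix(10, 10, sizeof(double), x, 10, d_ptr_x, 10);
cublasSetMatrix(10, 10, sizeof(double), y, 10, d_ptr_y, 10);
// Library invocation: z = x * y in double precision
// (CUBLAS assumes column-major storage; generated code must account for it).
const double alpha = 1.0, beta = 0.0;
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 10, 10, 10,
            &alpha, (const double*)d_ptr_x, 10,
                    (const double*)d_ptr_y, 10,
            &beta,  (double*)d_ptr_z, 10);
// Data transfer GPU -> CPU and cleanup
cublasGetMatrix(10, 10, sizeof(double), d_ptr_z, 10, z, 10);
cudaFree(d_ptr_x);
cudaFree(d_ptr_y);
cudaFree(d_ptr_z);
cublasDestroy(handle);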
6 Conclusion
In this work we have seen how to perform source code analysis in order to recognize basic algorithmic concepts, to reason on them, and to drive a source-to-source transformation of the code so that it can execute on new parallel architectures such as GPUs. A prototype tool has been presented to validate the technique, and a test on a case study has been shown.
The work is intended as a starting point for future investigation. At present the rules can recognize basic linear algebra algorithms such as matrix and vector multiplication, dot product, maximum and minimum search, and reduction. One direction on which we are now working is the extension of the set of recognized algorithms and their implementation variants (e.g., variants that use pointers and dynamic memory allocation). At the same time, since the reasoning is a time-consuming process, the recognition does not scale well with the increase in the number of recognized algorithms. We are studying techniques for finding code clones that may be adapted to extract basic concepts. Another research path is to add performance investigation of the transformed code; at present the transformation is done with no regard to the size of the problem. We know that, for small problems, the overhead added by memory transfers can cancel out the improvements obtained by the use of the parallel device. Conversely, large problems may not fit the device memory. We are working on adding test points to the code so that they can be used to select, at runtime, different implementation variants depending on the size of the data involved. In this direction, an extension of the transformation to produce OpenCL code can be used to target heterogeneous architectures such as many-core systems.
References
1. Baskaran, M.M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev,
A., Sadayappan, P.: A compiler framework for optimization of affine loop nests for
gpgpus. In: Proceedings of the 22nd Annual International Conference on Super-
computing, ICS 2008, pp. 225–234. ACM, New York (2008)
2. Baxter, I.D., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using
abstract syntax trees. In: IEEE International Conference on Software Maintenance (ICSM
1998), p. 368 (1998)
3. Benkner, S.: Vfc: The vienna fortran compiler. Scientific Programming 7, 67–81
(1999)
4. Di Martino, B.: Algorithmic concept recognition to support high performance code
reengineering. Special Issue on Hardware/Software Support for High Performance
Scientific and Engineering Computing of IEICE Transaction on Information and
Systems E87-D, 1743–1750 (2004)
5. Di Martino, B., Iannello, G.: Pap recognizer: A tool for automatic recognition of
parallelizable patterns. In: International Workshop on Program Comprehension, p.
164 (1996)
6. Di Martino, B., Kessler, C.W.: Two program comprehension tools for automatic
parallelization. IEEE Concurrency 8, 37–47 (2000)
7. Di Martino, B., Zima, H.P.: Support of automatic parallelization with concept
comprehension. Journal of Systems Architecture 45(6-7), 427–439 (1999)
8. Ducasse, S., Rieger, M., Demeyer, S.: A language independent approach for detect-
ing duplicated code. In: IEEE International Conference on Software Maintenance
(ICSM 1999), p. 109 (1999)
9. Ferrante, J., Ottenstein, K.J., Warren, J.D.: The program dependence graph and its
use in optimization. ACM Transactions on Programming Languages and Systems
(TOPLAS) 9(3) (1987)
10. Gabel, M., Jiang, L., Su, Z.: Scalable detection of semantic clones. In: Proceedings
of the 30th International Conference on Software Engineering, ICSE 2008, pp.
321–330. ACM, New York (2008)
11. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design patterns: elements of
reusable object-oriented software. Addison-Wesley Longman Publishing Co., Inc.,
Boston (1995)
12. Hall, M., Padua, D., Pingali, K.: Compiler research: the next 50 years. Commu-
nunications of the ACM 52, 60–67 (2009)
13. Jiang, L., Misherghi, G., Su, Z., Glondu, S.: Deckard: Scalable and accurate tree-
based detection of code clones. In: Proceedings of the 29th International Confer-
ence on Software Engineering, ICSE 2007, pp. 96–105. IEEE Computer Society,
Washington, DC (2007)
14. Komondoor, R., Horwitz, S.: Using Slicing to Identify Duplication in Source Code.
In: Cousot, P. (ed.) SAS 2001. LNCS, vol. 2126, pp. 40–56. Springer, Heidelberg
(2001)
15. Lee, S., Min, S.-J., Eigenmann, R.: Openmp to gpgpu: a compiler framework for
automatic translation and optimization. SIGPLAN Not. 44, 101–110 (2009)
16. Liao, C., Quinlan, D., Willcock, J., Panas, T.: Extending Automatic Parallelization
to Optimize High-Level Abstractions for Multicore. In: Müller, M.S., de Supinski,
B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 28–41. Springer,
Heidelberg (2009)
17. NVIDIA. Cuda: Compute unified device architecture,
http://www.nvidia.com/cuda/
18. Quinlan, D.: Rose compiler, http://www.rosecompiler.org/
19. Wielemaker, J.: Swi-prolog, http://www.swi-prolog.org/
Enhancing Brainware Productivity
through a Performance Tuning Workflow
1 Introduction
For this reason many HPC-sites employ tuning specialists to provide the nec-
essary expertise. This can be most prominently seen in the Tier-0 and Tier-
1 HPC-sites, like the UK National Supercomputing Service (HECToR) with its Distributed Computational Science and Engineering support1 and the HPC Simulation Labs in Germany2. These projects provide this essential performance tuning expertise to enable "users", i.e. code developers and program users, to scale to tens of thousands of cores or more. The smaller Tier-2 sites (like ours), which focus more on the productivity and throughput of compute jobs, especially benefit from such support as well.
Whilst funding for such experts is typically difficult for Tier-2 sites, we argued
in previous work (Brainware for Green HPC [1]), summarized in section 2, that
savings in the Total Cost of Ownership (TCO) obtained by the tuning activity
could be used to fund these experts. This claim holds true, as long as these
experts achieve sufficient improvement for a project in a definite time.
Under this assumption, we propose a Tuning Workflow, described in detail in
section 3, to guide and monitor the tuning service effort of a performance tuning
expert. This workflow aims to maintain the necessary balance between tuning
investment and the obtained savings. To our knowledge such a process has not
been published, though tuning-experts may already follow this process.
We discuss our work and, as this is our first implementation of the workflow, further improvement possibilities in section 4.
verify the results. However, if one performs a change in algorithms or other exten-
sive code modifications, more time is necessary for these changes but the potential
outcome is larger. In our experience a tuning expert invests on average about two months' worth of work to achieve the aforementioned conservative 10 % to 20 %
performance improvements.
Assuming a constant load, these funds are initially freed up by shutting down the respective surplus hardware, starting with older, less power-efficient systems, thus saving energy. In the long run, a portion of the money can be diverted from procurements towards tuning, as the new machinery could be considered 10 % more productive. For a non-constant load, i.e. when users always use all available resources, the system behaves as if one had obtained 10 % more hardware, which would cost more in terms of running costs and procurement. Either way, the tuning conserves funds in the long run.
(Fig. 1. Accumulated Cluster Usage: CPU-time usage over the number of users responsible.)
With this in mind and looking at the top users who consume 10 % of our system (approx. 550 k€), it is easy to see that by improving such a user's project by just 10 %, a full-time employee at 60 k€ per year could be funded. Looking at the total usage, a mere average improvement of 5 % on all projects would be sufficient to continuously support 3 HPC experts, if those experts tune one of the top projects every 2 months. For further detail and information please refer to "Brainware for Green HPC" [1].
3.2 Base-Line
The base-line is intended to be information that is quick and easy to obtain. Great emphasis should be put on disturbing the target application as little as possible
(Figure: the tuning workflow. An initial meeting covering goals, expectations, code transfer and a brief introduction, followed by an algorithmic review, leads into a baseline phase whose metrics are runtime, CPI, LLC misses (bandwidth) and Mflops, gathered with tools such as time, gprof, Oracle Analyzer, Intel Amplifier and Vampir. If the project continues, serial, OpenMP, MPI or hybrid analyses follow, using tools such as Acumem, Intel Thread Profiler, Vampir, Scalasca, Oracle Analyzer and Intel Amplifier/PTU. Tune/modify-code iterations are checked against the baseline until the goal is achieved or the effort ends, concluding with a Modification and Improvement Report, code-modification recommendations, and a meeting for code transfer and acceptance.)
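To make the CPI metric from the baseline concrete, the following minimal C++ sketch uses PAPI's classic high-level counter interface [10] to count total cycles and instructions around a region of interest and derive CPI; the measured region is a placeholder and error handling is omitted.

#include <papi.h>
#include <cstdio>

int main() {
    int events[2] = {PAPI_TOT_CYC, PAPI_TOT_INS};
    long long counts[2];

    PAPI_start_counters(events, 2);      // begin counting cycles/instructions

    volatile double acc = 0.0;           // placeholder for the target region
    for (int i = 0; i < 1000000; ++i)
        acc += i * 0.5;

    PAPI_stop_counters(counts, 2);       // read and stop the counters

    double cpi = static_cast<double>(counts[0]) / static_cast<double>(counts[1]);
    std::printf("cycles=%lld instructions=%lld CPI=%.3f\n", counts[0], counts[1], cpi);
    return 0;
}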
3.3 Pathology
In the Pathology phase the expert uses the information from the initial baseline to spot potential problems that the project code may have. The expert should only survey the program for general common issues, i.e. whether the application is bound by Algorithmic Issues, Communication Issues, Computation Issues, Memory Access Issues or I/O Issues, without spending a lot of time on detailed analysis. Each of these issues may require different tool-sets for observation and measurement. Whilst it is tempting in this phase to identify the cause of a specific problem, we recommend against it, as an early focus on a specific issue may lead to wrong conclusions. It is, in our experience, much more important to get an overall picture. In addition, the expert should gather information about the hardware capabilities of the target platform, in order to later correlate any observed issues. For example, if an application is spending most of its time
5 Floating Point Operations per Second.
6 Integer Operations per Second.
7 Cycles Per Instruction.
3.5 Results/Handover
The tuning cycle ends in a review meeting to discuss the modified code in detail in conjunction with the final Modification and Improvement Report. This provides the user with an opportunity to raise concerns regarding code changes or the results of the last tuning cycle. These concerns, and the general acceptance or the reasons for dismissal, should be attached to the report. Lastly, the invested tuning effort should be weighed against the achieved improvement to decide whether further tuning is worthwhile or desirable. If further open issues from the pathology exist, a new iteration of the tuning cycle is started with a new specific issue (see the second part of section 3.3); otherwise the tuning project concludes at this point.
8 i.e. using only the wrapped MPI routines.
4 Conclusion
In this work we proposed a Tuning Workflow to improve the productivity of
tuning experts. We argued that the increasing complexity and diversity of parallelism in clusters requires specialized expertise to efficiently use current hard-
ware. As domain experts typically do not have this expertise, it must be pro-
vided, in particular at University installations, where a large diversity of scientific
applications typically is supported by a single HPC installation. Previous work
showed that such a service pays for itself, if top user’s projects are targeted in a
systematic fashion by performance tuning experts. We called this the "brainware"
component of an HPC operation. However, to maximize gains from brainware,
we need to develop standards and processes to govern the performance tuning
process to maximize its efficiency. If performance tuning is only left to "gurus"
there will just not be sufficient staff available for this task.
Our tuning process itself is based on the notion of a service and distinguishes
the roles of users and experts. In reality this role separation is not so clear as
both sides may interact in every phase to assist each other with details when
tuning and adapting the code. Nevertheless, we still see a formal tuning process
description as necessary to ensure the quality of a tuning service: clear goals,
deadlines, avoidance of exaggerated expectations, limitation of wasted effort and
an indication when to stop. A particularly important fact is the documentation
of the tuning effort in a Modification and Improvement Report. Such a workflow
together with the entailed documentation can also provide better arguments for funding to the management, as the costs and benefits of tuning become
evident. Furthermore, this workflow can be used to train additional experts and
even integrate non-scientific staff in the tuning process.
The workflow itself is based so far solely on the collective experience of the
HPC-group of the RWTH Aachen University. We recognize that this workflow has yet to see a thorough study and that additional input from other
HPC-sites must still be incorporated. At the time of this work, the workflow was
only partially applied to one ongoing tuning effort, using only the Baseline and
Reporting with good acceptance by the users.
In its current form there are some additional conceivable steps. For example
the topics of version control, data management and verification remain unan-
swered. However, we consider the workflow in its current form to be already
quite complex, such that we plan to gain further use-case experience and feed-
back, before revising and adding additional steps.
Whilst we did not cover any performance tools specifically, we would like to raise the question of to what extent tools could generate external documentation
of performance issues.
The workflow we described as a guide for tuning processes is of course not
meant to be the last word, but rather to serve as a rough guide and incentive
for implementation and improvement of such tuning processes. It is also clear
that depending on local characteristics these processes may need to be modified.
Nonetheless, we believe that defined processes, "cook-books" for specific tasks
and the requirement of modification and improvement reports are important
ingredients that should be part of such a structured process. In the long run, we
hope that, similar in spirit to ITIL9 for general IT operations, also for HPC
code development and tuning a structured body of best-practice knowledge
will develop to structure the increasingly complex task of ensuring good HPC
performance.
References
1. Bischof, C., an Mey, D., Iwainsky, C.: Brainware for Green HPC. In: Ludwig, T.
(ed.) Proceedings EnA-HPC 2011 (2011) (to appear)
2. Behr, M., Arora, D., Benedict, N.A., O’Neill, J.J.: Intel compilers on linux clusters.
Intel Developer Services online publication (October 2002)
3. Zeng, P., Sarholz, S., Iwainsky, C., Binninger, B., Peters, N., Herrmann, M.: Sim-
ulation of Primary Breakup for Diesel Spray with Phase Transition. In: Ropo, M.,
Westerholm, J., Dongarra, J. (eds.) PVM/MPI. LNCS, vol. 5759, pp. 313–320.
Springer, Heidelberg (2009)
4. Altenfeld, R., Apel, M., an Mey, D., Böttger, B., Benke, S., Bischof, C.: Parallelising
Computational Microstructure Simulations for Metallic Materials with OpenMP.
In: Chapman, B.M., Gropp, W.D., Kumaran, K., Müller, M.S. (eds.) IWOMP
2011. LNCS, vol. 6665, pp. 1–11. Springer, Heidelberg (2011)
5. Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The
Scalasca performance toolset architecture. Concurrency and Computation: Practice
and Experience 22(6), 702–719 (2010)
6. Knüpfer, A., Brunst, H., Doleschal, J., Jurenz, M., Lieber, M., Mickler, H., Müller,
M.S., Nagel, W.E.: The vampir performance analysis tool-set. In: Proceedings of
the 2nd HLRS Parallel Tools Workshop, Stuttgart, Germany (July 2008)
7. Shende, S.S., Malony, A.D.: The tau parallel performance system. The Interna-
tional Journal of High Performance Computing Applications 20, 287–331 (2006)
8. GNU: gprof, http://sourceware.org/binutils/docs/gprof/
9. Intel: Intel Parallel Amplifier (2011), http://software.intel.com/en-us/articles/intel-parallel-amplifier/
10. London, K., Moore, S., Mucci, P., Seymour, K., Luczak, R.: The papi cross-
platform interface to hardware performance counters. In: Department of Defense
Users Group Conference Proceedings, pp. 18–21 (2001)
11. Iwainsky, C., an Mey, D.: Comparing the Usability of Performance Analysis Tools.
In: Cèsar, E., Alexander, M., Streit, A., Träff, J., Cèrin, C., Knüpfer, A., Kran-
zlmüller, D., Jha, S. (eds.) Euro-Par 2008 Workshops. LNCS, vol. 5415, pp. 315–
325. Springer, Heidelberg (2009)
9 www.itil.org
Workshop on Resiliency in High Performance
Computing (Resilience) in Clusters, Clouds,
and Grids
Clusters, Clouds, and Grids are three different computational paradigms with
the intent or potential to support High Performance Computing (HPC). Cur-
rently, they consist of hardware, management, and usage models particular to
different computational regimes, e.g., high performance systems designed to sup-
port tightly coupled scientific simulation codes and commercial cloud systems
designed to support software as a service (SaaS). However, in order to support
HPC, all must at least utilize large numbers of resources and hence effective HPC
in any of these paradigms must address the issue of resiliency at large-scale.
Recent trends in HPC systems have clearly indicated that future increases in
performance, in excess of those resulting from improvements in single-processor
performance, will be achieved through corresponding increases in system scale, i.e.,
using a significantly larger component count. As the raw computational perfor-
mance of these HPC systems increases from today’s tera- and peta-scale to next-
generation multi peta-scale capability and beyond, their number of computational,
networking, and storage components will grow from the ten-to-one-hundred thou-
sand compute nodes of today’s systems to several hundreds of thousands of com-
pute nodes and more in the foreseeable future. This substantial growth in system
scale, and the resulting component count, poses a challenge for HPC system and
application software with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with non-
recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic, added
another major source of concern. The probability of such errors not only grows
with system size, but also with increasing architectural vulnerability caused by
employing accelerators, such as FPGAs and GPUs, and by shrinking nanometer
technology. Reactive fault tolerance technologies, such as checkpoint/restart, are
unable to handle high failure rates due to associated overheads, while proactive
resiliency technologies, such as migration, simply fail as random soft errors can’t
be predicted. Moreover, soft errors may even remain undetected resulting in
silent data corruption.
The goal of this workshop is to bring together experts in the area of fault
tolerance and resiliency for HPC to present the latest achievements and to discuss
the challenges ahead.
The Malthusian Catastrophe Is Upon Us!
Are the Largest HPC Machines Ever Up?
1 Introduction
Conventional wisdom dictates that as supercomputers become larger, they also
become more complex with an increased number of parts. Each individual part
might have a long Mean Time Between Failure, but when many parts are com-
bined together, the chance for any one part to fail is great. The failure of specific
parts might cause an entire machine to fail, and even more likely it will cause
one or more running applications to fail. This seemingly makes it impossible
for large-scale applications to run to completion without interruption. This sit-
uation reminded us of the Malthusian Catastrophe: the idea that populations
grow geometrically, but the food supply only grows linearly, with population
size being limited by starvation. We decided to explore the parallels between
supercomputing and food supply.
There are four main reasons why population growth hasn’t been limited by
food supply:
1. economies of scale
2. national and world-wide markets mitigate shortages/problems
3. more efficient technologies, and
4. best practices.
Over time, agricultural resources became more concentrated, with companies
making larger infrastructure investments. In this way, more food could be pro-
duced more efficiently, taking advantage of economies of scale. This has also
happened with supercomputing, with fewer, larger machines becoming more
efficient at delivering cycles. This has been done with more shared physical in-
frastructure, for instance, at Oak Ridge National Laboratory (ORNL), where
three supercomputers (one for the Department of Energy, one for the National
Science Foundation, and one for the National Oceanic and Atmospheric Ad-
ministration) share the same computer room. These centers are more efficient
are delivering power and cooling, and concentrate and take advantage of the
intellectual expertise.
To cope with local shortages, agriculture developed wider markets and better
distribution networks. A drought in one area did not cause local people to starve
because they could get food from another location. This is also true within su-
percomputing, where users have access to different machines–for instance, the
TeraGrid [1] offers multiple supercomputers, and users have allocations at dif-
ferent sites, to help mitigate the situation when one site is down.
In agriculture, more efficient technologies were developed that took advantage
of new equipment and techniques. For instance, farmers started using tractors
instead of horses and plows. In supercomputing, vendors have used one large in-
dustrial strength fan per cabinet, instead of many PC-quality fans. This shared
physical infrastructure for the cabinet reduces the overall total number of com-
ponents, and makes the individual nodes and overall machine more reliable.
And lastly, farmers developed better practices, like canning, and storing food
in case of hard times. Supercomputing has taken similar action: implementing
application checkpointing in case of failure.
The University of Tennessee’s National Institute for Computational Sciences
(NICS) has employed methods to keep their machine available as much as pos-
sible. They have redundant power in the facilities, put the machines through a
rigorous acceptance test to ferret out as many bad parts as possible, perform
maintenance regularly to fix all the down nodes, and run regression tests to verify
the system after planned and unplanned outages. Cray, the vendor for their XT4
and XT5 machines, has incorporated power redundancy, shared cooling with
multiple liquid cooling cabinets, error correcting memory, reduced components
per cabinet (fan, for instance), and support for Berkeley checkpoint/restart. Cray
is also working on an MPI implementation to survive link failures. To see if these
strategies are working well for production supercomputing, this paper examines
if the largest machines “stay up” for reasonable amounts of time–long enough
for full machine jobs to be run routinely.
This paper explains some of the difficulties in defining and collecting resiliency
statistics, and presents initial findings. Failure rates for individual components
are examined over the lifetime of a machine, and conclusions are made based
on the data. Failure rate and mean time between failure data from several large
systems with similar architectures are examined and patterns and trends are
extracted from the data. Is the HPC industry on the cusp of the intersection of
system complexity and job length (that is, the inability to get through a single
day without an interrupt)?
Different people have different definitions of the qualities that make a system
resilient. Do you examine resiliency at the full system level, cabinet level, node
level, or processor level? Do you have to examine all of the above? Modern high
performance computing systems are often complex and hierarchal in nature,
meaning that failure of a single component may or may not affect the availability
of additional components.
Although it would be nice to have HPC systems that are 100% reliable, "unbreakable" systems are not practical. System design is often based on a number of factors, balanced according to the "project triangle" shown in Figure 1. The wisdom of the project triangle states that you can build something fast (rapid engineering design), you can build something good (high quality), and you can build something cheap (low cost), but you only get to pick two. In HPC, a goal is often to place well on the Top500 [4] list. Every dollar spent on high-reliability parts takes away from the system's peak performance. Designing a high-performance reliable system becomes a complex balancing act.
(Fig. 1. The Project Triangle [2])
In [3], Stearley provides definitions and equations for reliability, availabil-
ity, serviceability, and utilization based largely on the semiconductor indus-
try’s SEMI-E10 specification. Of particular interest, mean time between failure
(MTBF) for an entire system or node is defined as:
\mathrm{MTBF}_{\mathrm{System}} = \frac{\text{production time}}{\text{number of system failures}} \qquad (1)

\mathrm{MTBF}_{\mathrm{Node}} = \frac{\text{production time}}{\text{number of node failures}} \qquad (2)
where

\lambda = \frac{1}{\mathrm{MTBF}} \qquad (4)
is a constant failure rate, but Stearley and others [6] suggest a time-varying
failure rate may be more appropriate.
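A direct transcription of Equations (1), (2) and (4), applied to a hypothetical month of observations, looks as follows (all numbers are made up for illustration):

#include <iostream>

int main() {
    // Hypothetical observation window and failure counts.
    const double production_hours = 30.0 * 24.0;  // one month of production time
    const int system_failures = 2;
    const int node_failures = 25;

    const double mtbf_system = production_hours / system_failures;  // Eq. (1)
    const double mtbf_node   = production_hours / node_failures;    // Eq. (2)
    const double lambda      = 1.0 / mtbf_node;                     // Eq. (4)

    std::cout << "MTBF(system) = " << mtbf_system << " h\n"
              << "MTBF(node)   = " << mtbf_node << " h\n"
              << "lambda       = " << lambda << " node failures per hour\n";
    return 0;
}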
into a commercial offering called the Cray XT3. Each Cray XT cabinet contains
24 blades, and the machine contains a mixture of compute and service blades.
Each compute blade holds 4 compute nodes, while each SIO blade has two ser-
vice nodes. SeaStar network chips (one for each node) live on a mezzanine card
and provide access to the three dimensional torus network.
(Fig. 2. Kraken Utilization, by month over 2010–2011.)
central file, and events that require manual intervention from the Cray hardware
engineers are automatically entered into Cray’s case tracking system. Additional
records are added as necessary for any maintenance performed on the machine.
Cray uses this tool to record every hardware failure at every site with Cray
machines. They keep an extensive database of each failure, and are thus able to
calculate a variety of metrics.
"
!
"
"
"
"
"
"
"
"!
""
"
"
"
!
"
Fig. 3. Kraken XT5 Node Failure Causes by Month
are a more complicated part, the authors expected to see more failures due to
memory than attributed to CPUs, as was observed in most systems in [10]. One
possible explanation is more recent ECC memory technologies (Chipkill, SDDC)
and improved memory controllers may have significantly reduced the memory
failure rates. Upon analyzing the Opteron failure data, it was discovered that
most of the errors were attributed to ECC errors in the on-processor cache hier-
archy. Investigating this with Cray and AMD, it was discovered that a recently
released BIOS update should significantly reduce the number of these failures.
Similar to the improvements seen in DRAM over recent years, it is expected
that processor reliability will also improve. Improved error correction algorithms
are being developed and will be built into the processors at various levels, in-
cluding the cache hierarchies. We expect that the processor failure rates will see
drops similar to those observed with the memory. This should more than make
up for the potential increase in failures as processor cache sizes grow.
Fig. 4. Kraken Unscheduled Reboots per Month
computed by simply averaging all the data points provided. Assuming 30 days per
month, the mean time between failure can be computed by Equation (5).
\mathrm{MTBF}_{\mathrm{days}} = \frac{30 \times \text{months of data}}{\text{all failures}} \qquad (5)
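For example, with 12 months of data and 18 recorded failures (hypothetical numbers), Equation (5) gives MTBF_days = (30 × 12) / 18 = 20 days.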
Figure 5 shows the relationship between system size and failure rate. As one
might expect, increasing the size and complexity of a machine linearly increases
the likelihood of failure. The data also suggests that failure rate is more highly
correlated to number of components than peak performance rating; as individual
components get more powerful, their failure rate does not significantly increase.
"
!
!
!
!
6 Future Work
There are several areas that warrant further investigation. The authors have ob-
tained the component-level “Failures per Month” data for Jaguar, and it appears
to match the rough shape of the Kraken data. Comparing it more closely to the
Kraken data would be interesting. A study comparing the individual compo-
nent (CPU, memory) MTBF numbers with what we’ve seen in the field at scale
would be interesting, but it requires obtaining data from vendors. Unfortunately,
vendors do not like to share this. Future work will hopefully include both node
and system failure data between Cray and other machine types, including IBM
BlueGenes, IBM Power series and conventional clusters. This would show if a
particular design or machine is more reliable, at scale, than others. Future work
might examine in more depth why some machines of the same kind that are
similar in size exhibit almost a 30% difference in failure rates (see Figure 5 for
Jaguar XT4 and Franklin XT4 failure rates).
7 Conclusions
In this paper, the authors have shared node and system-wide failure data from
the largest systems in the world and shown that although the number of com-
ponents in the systems is quite large, applications can regularly run full ma-
chine jobs. Vendors and centers have developed techniques, technologies and
approaches to help mitigate the geometrically growing number of parts in the
largest machines.
Acknowledgements. The authors would like to thank XTreme, the Cray sys-
tem administrators user group who facilitated the sharing of site specific machine
uptime data, and Jim Craw, its President during the time of data collection.
Thanks go out to the representatives from each site along with their manage-
ment: Tina Butler and Nick Cardo (NERSC), Joni Viranen (CSC), Hank Kuehn,
Don Maxwell and Buddy Bland (ORNL), Steve Andrews (HECToR/STFC
Daresbury), Lloyd Slonaker (AFRL/RCMT), and Alexander Oltu (Uni). The
authors would also like to thank Steve Johnson and Pete Ungaro from Cray for
their data, support and assistance. Several folks at ORNL helped analyze and ex-
plain the data, including Stephen L. Scott and Robert Harrison, and Rick Mohr
and Troy Baer from The University of Tennessee. Lastly, the authors would like
to thank Phil Andrews for recognizing and suggesting the comparison with the
ideas of Thomas Malthus.
References
1. TeraGrid, http://www.teragrid.org/
2. Piazzalunga, D.: Project Triangle. Figure in public domain, downloaded from,
http://en.wikipedia.org/wiki/File:Project_Triangle.svg
3. Stearley, J.: Defining and Measuring Supercomputer Reliability, Availability, and
Serviceability (RAS). In: 6th LCI Conference on Linux Clusters (April 2005)
4. Top500 Supercomputer Sites, http://top500.org/
5. The Computer Failure Data Repository, http://cfdr.usenix.org/
6. Gottumukkala, N., Nassar, R., Paun, M., Leangsuksun, C., Scott, S.: Reliability
of a System of k Nodes for High Performance Computing Applications. IEEE
Transactions on Reliability 59(1), 162–169 (2010)
7. Johnson, S.: Cray Inc. Personal Communication
8. Andrews, P., Kovatch, P., Hazlewood, V., Baer, T.: Scheduling a 100,000 core
Supercomputer for Maximum Utilization and Capability. In: 39th International
Conference on Parallel Processing Workshops (2010)
9. Becklehimer, J., Willis, C., Lothian, J., Maxwell, D., Vasil, D.: Real Time Health
Monitoring of the Cray XT3/XT4 Using the Simple Event Correlator (SEC). Cray
Users Group (2007)
10. Schroeder, B., Gibson, G.: A Large-Scale Study of Failures in High-Performance
Computing Systems
Simulating Application Resilience at Exascale
1 Introduction
Parallel scientific applications frequently use coordinated checkpoint and restart
(CCR) to recover from system failures. Failures can be anything from loss of
power, human error, hardware component faults, to software bugs. For an ap-
plication using CCR, all of these failures force it to abort and, at a later time,
to restart from a previous checkpoint. Several studies have shown that this will
not scale much beyond the machines currently in existence [4,8,3,11].
For exascale systems, even if per-component reliability remains the same, the
sheer number of components will lead to frequent faults. Therefore, alternative
methods are needed to enable computational progress of large-scale applications.
Many alternative resilience algorithms have been proposed to replace CCR,
but few have been evaluated thoroughly at large scale, with differently behaving
applications, strong scrutiny of their cost – especially for recovery – and the
impact on application throughput. Recovery is often assumed to be infrequent
and neglected in performance studies. In exascale systems we expect failures to be common and that cascading failures during recovery might change the performance characteristics of resilience algorithms substantially.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Another aspect that is sometimes overlooked is that a given resilience algo-
rithm may not be suitable for all types of applications. For example, CCR works
well for self-synchronizing applications, since they already bear the synchroniza-
tion cost necessary to achieve coordination. Other applications do better without
introducing additional synchronization steps.
While exascale systems will not be radically different from today’s supercom-
puters, there are features such as massive multicore CPUs, Solid State Disks
(SSD), and non-volatile random access memory (NVRAM) that have an impact on
the performance of resilience algorithms. Application characteristics may also
become different when they adapt to the larger scale and new programming
models. Yet, self-synchronizing legacy applications need to be supported as well.
To evaluate proposed and existing resilience algorithms at scale, simulation
and modeling is needed. In this paper we analyze the requirements for an evalu-
ation framework that lets us measure the performance and overhead of various
resilience algorithms with different application characteristics.
This perspective paper is meant to explore future exascale systems in terms of
modeling and other relevant aspects that have to be considered when studying
application recovery after failures. A goal is to generate a discussion that will
help define the taxonomy of future exascale systems and the tools that will
enable us to study them even before they become available.
We list the requirements we have identified in Section 2, describe our design
in Section 3 and 4, and report on the status of our implementation in Section 5.
2 Requirements
The compromises and restrictions we will have to put into our simulation will
prevent us from being able to make absolute and precise performance predictions.
However, the goal is to make relative performance comparisons among resilience
algorithms under various conditions. For that we need a somewhat accurate
model of data movement within the system, but not the data itself nor the
computations necessary to generate that data.
Before we can design an experiment, we need to get an idea of what a future
exascale system might look like [2,3]. Since we cannot simulate a complete system
at scale in full fidelity, we then need to identify the aspects of a system that have
a measurable impact on the performance of resilience algorithms.
We have about five more iterations of Moore’s law ahead of us and can expect
to see about 512 to 1,024 cores per socket in such a system. If the current trend
continues, each core will have relatively weak performance, in order to limit power consumption and enable the placement of that many cores onto a single die.
The current number one system on the top500 list employs 548,352 cores to
achieve 8 petaflops. The number of cores per CPU, as well as their total number,
will increase to reach an exaflop. These cores will be connected with each other
through a Network on Chip (NoC). Most likely there will be a complex hierarchy
of caches where some cores in the same “neighborhood” share lower-level caches
such as an L2, and groups of cores share L3 caches, and the memories shared by
these cores may not be shared coherently.
The memory hierarchy will be further complicated by some or all of main
memory becoming non-volatile (NVRAM). SSDs with faster access times than
spinning media will also be prevalent. Some of that storage will be local, in the
same rack for example, while more of it will be farther away in a dedicated storage
server. Compute accelerators, such as Graphical Processing Units (GPUs), on
the same motherboard or integrated into CPUs will most likely also play a role
in achieving exascale performance by providing additional compute cycles and
processing stream-oriented application kernels.
With the above assumptions, it is not possible to simulate such a system with
high fidelity. There are simply too many components and not enough technolog-
ical certainty for a fully detailed simulation in a reasonable amount of time. In
order to make evaluation of different resiliency algorithms possible, we have to
make some compromises. We can leave out some less important aspects and still
arrive at results that are valid when comparing two different resiliency algorithms
for a given type of application.
The first thing we will abandon is an application’s computation. Obviously,
this will save a lot of simulation time by allowing us to dispense with a detailed
processor model or emulation framework. Furthermore, resilience algorithms depend on only two aspects of the computation itself: how much data it
touches and changes over time, and the duration of compute phases between
data exchanges with other cores and nodes; i.e., externally visible state changes.
While we can dispense with computation, we cannot be quite so cavalier with
communication. Cores on a single die will communicate with each other over the NoC, and nodes will communicate with each other over a system-wide network.
The exact form of communication is less important. Some of that data will be
transferred using MPI, while other data will be written directly into memory.
Because these are externally visible state changing events, resilience algorithms
depend on the timing of these transfers and the amount of data being moved.
Performance, frequency, and location of saving and restoring state depend on
data traffic. However, the actual content of these messages does not matter.
Because moving data is a large overhead and influences resilience algorithm
performance, a fairly accurate simulation of data flowing through a system is
necessary. The simulation needs enough resolution to detect congestion and mea-
sure its impact. The same applies to I/O. State needs to be saved into remote
memory, NVRAM, and SSD devices. While access times to individual memory
banks are too fine grained to track in a simulation of this scale, access competi-
tion to these devices and transfer times do need to be tracked.
Last, but not least, the simulation needs to provide a method to inject
faults into the system. A form of notifying a resilience algorithm that a node,
socket, core, or link has failed, with the corresponding data loss, is necessary.
But the exact type of failure notification is not that important.
Incoming messages are queued in FIFO order to preserve message ordering. Messages arrive in the form of events. The events themselves could contain message data, but since we are not generating that data, the events only contain the number of bytes the message would contain. That message length, a configurable router latency, and the link bandwidth are used to calculate how long an output port will be occupied. For that duration, further messages destined for that output port are queued.
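As a concrete illustration of this occupancy rule, the following sketch computes how long an output port stays busy from the event's byte count, a configured hop latency, and the link bandwidth. The names and the simple double-based time representation are ours; this is not the actual SST router component code.

#include <algorithm>
#include <cstdint>

// Sketch of the port-occupancy rule described above. Parameter and type
// names are illustrative; they are not the SST router model's actual code.
struct PortState {
    double busy_until = 0.0;  // simulation time (seconds) at which the port frees up
};

// hop_delay_s: configurable router latency in seconds
// bandwidth_Bps: link bandwidth in bytes per second
// Returns the time at which the message has left the router; the port stays
// occupied until then, so later messages for the same port wait in the queue.
double forward_message(PortState& port, double now_s, uint64_t msg_bytes,
                       double hop_delay_s, double bandwidth_Bps) {
    double start = std::max(now_s, port.busy_until);       // wait if the port is busy
    double occupancy = hop_delay_s + msg_bytes / bandwidth_Bps;
    port.busy_until = start + occupancy;                    // hold the port for the transfer
    return port.busy_until;
}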
Omitting modeling of flow control between routers reduces synchronization overhead. However, since incoming messages are queued when an output port is busy, a message traveling through two or more routers cannot move faster than the bandwidth and latency limitations – as well as other traffic in the network – allow. Messages can be delayed at the input or output port. A quick stream of short messages on an input port can be held up by a larger message using the same output port. This mechanism gives the router model a crude approximation of flow control. If a message is delayed due to a busy port, congestion statistics in the delayed event are updated.

Fig. 1. The router model
The router model accepts several parameters via the SST configuration file. A hop delay specifies the minimum amount of time a message (event) is delayed when passing through a router; the actual delay may be much larger if there is congestion on the input or output ports. The bandwidth parameter, together with the incoming message length, dictates how long a message occupies a port. The
number of ports is also a configurable parameter. Two more parameters are used
for power and thermal modeling. One is the hypothetical frequency this router
runs at, and another dictates which power model to use; SST supports several [5].
The router model can use the amount of traffic and the above parameters to
compute power dissipation.
Note that links configured between components, for example between two
routers in the SST configuration file, also have a delay assigned to them. SST
uses that when partitioning the graph of components between processes in a
parallel simulation and to compute event lookahead.
We use the same router model component described so far to also create the
NoC within each socket of our simulation. To keep things simple, we assume the
NoC is also a torus. However, instead of using five-port routers, we allow for addi-
tional ports to connect more than one CPU core to each router in a NoC. A bit in
events traveling between cores attached to the same router indicates local traffic
that moves at higher bandwidth than off-CPU traffic. We assume that these cores
will be communicating through a shared cache instead of making use of the NoC.
Additionally, when used for a NoC, the router model does not use wormhole
routing. This is a more realistic mode of operation when the NoC also connects
to random access memory, where multiple streams of data can overlap
and be destined for the same device.
Figure 2 is a diagram that shows one possible configuration of a node in our
exascale simulation. The router model is used to build the NoC as well as the
main network that connects the nodes in a system. Each node consists of multiple
SST components described in this section.
Fig. 2. Combining components for a node

The router model is also used as an aggregator to coordinate data traffic to a single resource. In that configuration, one port connects to the resource, and all remaining ports connect to users of the resource. We use aggregators to control access to the main network from each node. Each core can access the main network, but has to compete with any other core on that node for that resource. This is akin to multiple cores and CPUs on a node sharing a single NIC. We also use aggregators to gate access to on-board NVRAM,
which is a shared resource for the cores on that board. Each core also has ac-
cess to a storage network to access a nearby SSD. We assume that access to a
parallel file system will happen through the main network, as it does on most
of today’s machines. But, we also envision that each rack has some SSD devices
for scratch storage and that nodes in the same rack have access to that storage
via a separate, but local, storage network.
For resilience methods that access remote memory, data has to be transferred
using the main network first. Then, one of the cores on a node with direct access
to the local NVRAM has to handle these remote requests.
The second type of storage we provide in our simulation is a “nearby” SSD.
We assume that each rack has some amount of SSD storage that can be used for
temporary data. A rack-wide, local network provides access to that storage. We
use aggregators to build a two-level tree storage network for each rack.
We assume that rack SSD storage has more capacity and is more reliable
than the individual local NVRAMs. A rack will have multiple, redundant power
supplies, and the SSDs will be RAID devices. The NVRAM on a node is quicker
to access and has less contention. But, it also has a smaller capacity and may
become inaccessible if a node, or the network connection to a node, fails.
At this time we have no plans to simulate a remote parallel file server. We
assume that data from the rack SSD can be trickled off to such a server in the
background, if desired. Other research teams are working on full disk simulation
components for SST, including an SSD device, that we will be able to integrate
at a later time, if necessary.
patterns that originate from various parallel graph algorithms; e.g., [9].

Fig. 3. Communication patterns for NAS MG, class C, 64 nodes [10] (number of messages by source and destination node)

3.5 Implementing Pattern Generators

SST is an event-driven parallel simulator. Each component that is integrated into the SST framework needs to process events it receives and then relinquish control back to SST so that the overall simulation can proceed. A natural way of expressing and implementing the communication patterns we need to drive our simulations is as state machines. Event processing is an integral part of state machines. Therefore, that choice is simple.
What makes this choice a little bit difficult is the state explosion when we combine a communication pattern generator, a resilience algorithm, collective operations, and the handling of asynchronous I/O events and faults. What starts out with a handful of states to express a nearest-neighbor data exchange becomes much more complicated when events from a collective operation that has already started on another rank arrive early. Even more states are needed to process the requirements of the resilience algorithm under evaluation. The algorithm will generate I/O, and completion events will arrive asynchronously. A state machine describing all these possibilities will quickly grow very complex.
In addition, we want to use the same communication pattern with different
resilience algorithms and, perhaps, different implementations of collective oper-
ations. The solution we have chosen is that of a gate keeper. It is a C++ SST
component from which all communication patterns inherit. Each instantiated
communication pattern component specifies what other services it needs; e.g.
which collective operations it will perform. The resilience algorithm is chosen through the configuration file. All of these individual, relatively simple state machines register an event handler with the gate keeper.
In some respects, the individual state machines are all subroutines of the communication pattern generator. At any given time, only one of those state machines
is active. When a new event arrives at the gate keeper, it determines which state
machine needs to receive that event. If that state machine is currently running,
then the event is delivered right away. For currently inactive state machines,
(early) events are queued. The gate keeper component provides functions for
state machines to call each other and to return to a previous caller. Whenever a
state machine change occurs, pending events for the newly active state machine
are delivered by the gate keeper.
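The dispatch rule just described can be sketched as follows: only the active state machine receives events immediately, early events for inactive machines are queued, and queued events are drained whenever control passes to a machine. All names here are illustrative; this is a minimal sketch of the idea, not the actual gate keeper component.

#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Minimal sketch of the gate-keeper idea: each simple state machine registers
// an event handler; only the active machine receives events right away, and
// events for inactive machines are queued until control passes to them.
struct Event { int dest_sm; uint64_t bytes; };

class GateKeeper {
public:
    using Handler = std::function<void(const Event&)>;

    int register_machine(Handler h) {              // called by each state machine
        handlers_.push_back(std::move(h));
        pending_.emplace_back();
        return static_cast<int>(handlers_.size()) - 1;
    }

    void deliver(const Event& ev) {                // called when SST hands us an event
        if (ev.dest_sm == active_)
            handlers_[ev.dest_sm](ev);             // active machine: deliver immediately
        else
            pending_[ev.dest_sm].push(ev);         // inactive machine: queue the early event
    }

    void activate(int sm) {                        // "call" another state machine
        call_stack_.push_back(active_);
        active_ = sm;
        drain(sm);                                 // flush events that arrived early
    }

    void return_to_caller() {                      // return to the previous state machine
        active_ = call_stack_.back();
        call_stack_.pop_back();
        drain(active_);
    }

private:
    void drain(int sm) {
        auto& q = pending_[sm];
        while (!q.empty()) { handlers_[sm](q.front()); q.pop(); }
    }

    std::vector<Handler> handlers_;
    std::vector<std::queue<Event>> pending_;
    std::vector<int> call_stack_;
    int active_ = 0;
};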
In the XML configuration file for SST, every component to be used, every link
between any two components, and the parameters for each component need to
be specified. The file structure allows a parameter shared among a set of components to be specified only once. Nevertheless, for a large simulation these files are too big to be created manually. Therefore, we wrote a separate
program to create configuration files specific to the subset of SST components
used for our experiments.
The configuration generator takes command line arguments, for main network bandwidth for example, and inserts them into the appropriate places in the XML
file. The choice of which communication pattern to use is also a command line
option, while several other things are hard coded into the generator.
For example, the generator takes as command line parameters the X and Y
dimension of the main network and a separate pair of parameters for the NoC on
each node. But, it is currently hard coded to generate tori. This makes several
things a lot simpler, including source route generation, and should suffice for our
initial experiments.
The simulation design described in this paper has several limitations. Some
stem from the core design. These include the inability to create communication
patterns that are data dependent, computation delays between communications
that vary with the results of computation, and the inaccuracies introduced by
using models instead of more fine-grained simulation.
Other limitations are caused by our implementation choices and the configu-
ration generator. These include things like the fixed topologies and the architec-
ture of the I/O and memory subsystem. These are more easily corrected than
our design choices by adapting our code to the new requirements.
5 Work in Progress
Work in progress includes the study of simple resilience algorithms for exascale
systems, beyond the 256k cores we have already tested, with the support of the
framework proposed in this paper. Future work includes integrating a larger set
of resilience algorithms and a broader range of applications with their communi-
cation patterns in the framework.
For resilience algorithms and methods we plan to look at uncoordinated check-
point restart with message logging, log-based rollback-recovery mechanisms [6],
the RAID-like approach taken by SCR [7], and communication induced check-
pointing [1].
Validation of a complex simulation tool like ours is of course made extremely
difficult by the lack of existing exascale systems. Nevertheless, we plan to use
micro-benchmarks to calibrate various parameters and models built into our sim-
ulation by comparing them against existing systems. Then we will run bench-
marks and applications that our communication patterns are meant to mimic
on existing, large-scale systems, and compare the results with our simulations.
We will be able to do this using individual multicore CPUs, large clusters, and
clusters containing multicore CPUs. Viewing and comparing the actual results
with our simulations from these different angles will provide us with an indication
of the validity of our approach. Scaling experiments within the range of systems
available to us will further assist with validation and provide us with error bars
for simulations at exascale.
We are building the simulation infrastructure to evaluate resilience algorithms.
However, the same infrastructure will be suitable for evaluation of many different
aspects of exascale computing. We have started to investigate projects in the
area of programming models and application performance on a heterogeneous
network where not all components are (virtually) fully connected.
SST, including the components described in this paper, is open source and
freely available.
References
1. Alvisi, L., Elnozahy, E.N., Rao, S., Husain, S.A., Mel, A.D.: An analysis of com-
munication induced checkpointing. In: FTCS (1999)
2. Bergman, K., et al.: Exascale computing study: Technology challenges in achieving
exascale systems (2008)
3. Bianchini, R., et al.: System resiliency at extreme scale (2009)
4. Elnozahy, E., Plank, J.: Checkpointing for peta-scale systems: a look into the fu-
ture of practical rollback-recovery. IEEE Transactions on Dependable and Secure
Computing 1(2) (2004)
5. Hsieh, M., Thompson, K., Song, W., Rodrigues, A., Riesen, R.: A framework
for architecture-level power, area and thermal simulation and its application to
network-on-chip design exploration. In: 1st Intl. Workshop on Performance Mod-
eling, Benchmarking and Simulation of High Performance Computing Systems,
PMBS 2010 (November 2010)
6. Maloney, A., Goscinski, A.: A survey and review of the current state of rollback-
recovery for cluster systems. Concurrency and Computation: Practice and Experi-
ence (April 2009)
7. Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling, and
evaluation of a scalable multi-level checkpointing system. In: SC (2010)
8. Oldfield, R.A., Arunagiri, S., Teller, P.J., Seelam, S., Varela, M.R., Riesen, R.,
Roth, P.C.: Modeling the impact of checkpoints on next-generation systems. In:
24th IEEE Conference on Mass Storage Systems and Technologies (September
2007)
9. Ribeiro, P., Silva, F., Lopes, L.: Efficient parallel subgraph counting using g-tries.
In: Cluster Computing (2010)
10. Riesen, R.: Communication patterns. In: Workshop on Communication Architec-
ture for Clusters CAC 2006 (April 2006)
11. Riesen, R., Ferreira, K., Stearley, J.: See applications run and throughput jump:
The case for redundant computing in HPC. In: 1st Intl. Workshop on Fault-
Tolerance for HPC at Extreme Scale, FTXS 2010 (June 2010)
12. Rodrigues, A., Cook, J., Cooper-Balis, E., Hemmert, K.S., Kersey, C., Riesen,
R., Rosenfield, P., Oldfield, R., Weston, M., Barrett, B., Jacob, B.: The structural
simulation toolkit. In: 1st Intl. Workshop on Performance Modeling, Benchmarking
and Simulation of High Performance Computing Systems, PMBS 2010 (November
2010)
Framework for Enabling System Understanding
1 Introduction
Resilience has become one of the top concerns as we move to ever larger high
performance computing (HPC) platforms. While traditional checkpoint/restart mechanisms have served the application community well, it is accepted that they are high-overhead solutions that will not scale much further. Research into alternative approaches to resilience includes redundant computing, saving checkpoint
state in memory across platform resources, fault tolerant programming models,
calculation of optimal checkpoint frequencies based on measured failure rates,
and failure prediction combined with process migration strategies. In order to
predict the probability of success of these methods on current and future sys-
tems we need to understand the increasingly complex system interactions and
how they relate to failures.
These authors were supported by the United States Department of Energy, Office of
Defense Programs. Sandia is a multiprogram laboratory operated by Sandia Corpo-
ration, a Lockheed-Martin Company, for the United States Department of Energy
under contract DE-AC04-94AL85000.
2 Approaches
markers for the entry and exit of a particular part of a code and then use that
information to determine the times for an analysis.
While these samplers currently run a database client on the nodes, we have
recently implemented a distributed metric service which propagates information
to the database nodes itself for insertion.
Metric Generator. OVIS provides a utility called the Metric Generator that
enables a user to dynamically create new data tables and insert data. The in-
terface allows the user to specify optional input data, a user-defined script to
generate the additional data, and a new output data table. The script can oper-
ate on both metrics in the database and on external sources. Examples include
a) scripts that grep for error messages in log files and insert that occurrence into
the database, thus converting text to numeric representations and b) more so-
phisticated analyses, such as gradients of CPU temperature. This enables rapid
development of prototype analyses. These scripts can be run at run time or after the fact, still with the ability to inject data timestamped to be concurrent with
earlier data in the system.
Resource Manager. Resource Manager (RM) data includes information about
job start and end times, job success/failure, user names, job names etc. OVIS
does not collect RM data, but rather interfaces to a variety of RMs' native representations. Currently, OVIS interfaces natively to databases
produced by SLURM [15] (and associated tools) and enables search capabilities
upon that data as described in Section 2.4.
parallel analysis engines we adopted the visualization tool kit (VTK) statistical
analysis interface. Current analysis engines available in OVIS are: descriptive
statistics, multi-variate correlation, contingency statistics, principal component
analysis, k-means, and wear rate analysis. These analyses allow the user to un-
derstand statistical properties and relative behaviors of collected and derived
numeric information which can in turn provide insight into normal vs. abnormal
interactions between applications and hardware/OS. These analysis engines can be accessed via a Python script interface or a high-performance C++ interface, either directly or through OVIS's GUI. Additionally, the user can write scripts,
described in Section 2.1, that manipulate numeric information from any source.
3 Applications
This section describes use of OVIS integrated, interactive capabilities to enable
system understanding.
Fig. 1. OVIS-Baler integration user interface. The OVIS-Baler integration provides in-
teracting capabilities for log pattern analysis and visualization, numerical data analysis
and visualization, and job log search.
Fig. 2. OVIS Resource Manager (RM) view. Job information is searchable and is
shown. Selecting a job automatically populates an analysis pane and the 3D view
with job-relevant data (Figure 4).
views are supported as in this figure where the highlighted (completed) job in
the job-search display is dropped upon the physical display, limiting the col-
ored nodes to only those participating in the job. It is seen that one of the nodes
(Glory 234, colored red and circled) has significantly higher Active Memory than
any of the other nodes participating in the job. Scrolling through time indicates
that the node has high Active Memory, even during idle times on the node, and
during the subsequent failed job.
Note that any one data source is insufficient to understand the entire situation.
The Resource Manager data shows that there may be a problem on Glory 234,
but it does not elucidate the cause of the problem. The log data shows that
a possible cause of job failure is an out-of-memory condition on Glory 234, but it does not indicate the onset of the problem, nor whether it is due to a naturally occurring large demand for memory. The physical visualization with outlier indication shows the onset and duration of the abnormal behavior on the node, but does not directly tie it to a failure condition. The combination of all three pieces, each providing a different perspective and each working upon a different data source, is necessary for overall system understanding. (This is in contrast to the condition detection itself, which can be done purely by statistical means.)
Fig. 3. Baler Pattern view (top) provides search capabilities and drill down view-
ing of wild card values. Out-of-memory-related meta-patterns, as determined by Baler, are shown. Baler Event view (bottom) shows color-coded events in time and space.
Mouseover highlights and displays message patterns. Some messages for the out of
memory condition on a node are shown.
the syslog files. In the system file representation, a separate file is written for each row and channel, which can be mapped to slots for DIMMs and/or physical CPU sockets. Each of these files (e.g., .../edac/mc/mcX/csrowY/ue_count) contains counters of errors that have occurred. The same information can be extracted via command line calls. In the syslog output, such errors are reported as: Feb 20 12:41:22 glory259 EDAC MC1: CE page 0x4340a1, offset 0x270, grain 8, syndrome 0x44, row 1, channel 0, label DIMMB 2A: amd64 edac. This presents the row, channel, DIMM, and error categorization, but in a different format.
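For illustration, the counters in the system file representation could be harvested with a small program along the following lines. The sysfs prefix, the number of memory controllers, and the number of csrows are assumptions for this sketch; only the mcX/csrowY/ue_count layout comes from the example above.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Sketch of harvesting EDAC uncorrectable-error counters from the per-row
// sysfs files mentioned above. Error handling and discovery of the actual
// number of memory controllers and csrows are simplified for illustration.
long read_counter(const std::string& path) {
    std::ifstream f(path);
    long value = -1;               // -1 signals a missing or unreadable counter
    if (f) f >> value;
    return value;
}

int main() {
    for (int mc = 0; mc < 2; ++mc) {            // assumed two memory controllers
        for (int row = 0; row < 8; ++row) {     // assumed up to eight csrows
            std::ostringstream p;
            p << "/sys/devices/system/edac/mc/mc" << mc
              << "/csrow" << row << "/ue_count";
            long ue = read_counter(p.str());
            if (ue >= 0)
                std::cout << "mc" << mc << " csrow" << row
                          << " uncorrectable errors: " << ue << "\n";
        }
    }
    return 0;
}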
In OVIS the same innate information is harvested from the two different
sources, and it is processed in complementary and different fashions. Baler reduces
Fig. 4. OVIS physical cluster display. An outlier in Active Memory is seen (red, circled)
across nodes in this job (colored nodes). Job selection in the RM view (Figure 2)
automatically populates a) the analysis pane with relevant nodes and time and b) the
3D view with nodes.
4 Related Work
There has been much work done and various tools built within the HPC community with respect to information collection, analysis, and visualization, some of which we describe here. In each case, however, only a portion of the wealth of available information has been harvested, and hence the understanding that can be realized is limited. By contrast, through the OVIS project we seek to create an integration framework for the available information and tools in order to build more resilient HPC platforms through knowledge extraction and system interaction.
Both Ganglia [9] and VMware’s Hyperic [10] provide a scalable solution to
capturing and visualizing numeric data on a per-host basis using processes run-
ning on each host and retrieving information. While Ganglia uses a round-robin database [14] to provide fine-grained historic information over a limited time window and coarser historic information over a longer time, Hyperic uses a server-hosted database. Each retains minimal information long term. While Hyperic,
unlike Ganglia, supports job based context in order to present more job centric
analysis or views, neither has support for complex statistical analysis. Ganglia
is released under a BSD license, making it an attractive development platform, while Hyperic releases a stripped-down, GPL-licensed version for free and a full-featured version under a commercial license.
Nagios [12] is a system for monitoring critical components and their metrics and for triggering alerts and actions based on threshold-based directives. It provides no advanced statistical analysis capability nor the infrastruc-
ture for performing detailed analysis of logs, jobs, and host based numeric metrics
in conjunction. Nagios Core is GPL licensed.
Splunk [16] is an information indexing and analysis system that enables ef-
ficient storage, sorting, correlating, graphing, and plotting of both historic and
real-time information in any format. Stearley et al. give a variety of examples of how Splunk can be used to pull information from RMs, log files, and numeric metrics, and present a nice summary including descriptive statistics about
numeric metrics [4]. Some missing elements though are numeric data collection
mechanisms, spatially relevant display, and a user interface that facilitates drag
and drop type exploration. Like Hyperic, Splunk provides a free version with
limited data handling capability and limited features as well as a full featured
commercial version where the cost of the license is tied directly to how much
data is processed.
Analysis capabilities outside of frameworks exist. Related work on algorithms
for log file analysis can be found in [5]; an algorithmic comparison is not directly relevant to this work. Lan et al. have explored the use of both principal component analysis and independent component analysis [6,7] as methods for
identifying anomalous behaviors of compute nodes in large scale clusters. This
work shows promise and would be more useful if the analyses were incorporated
into a plug and play framework such as OVIS where these and other analysis
methods could be easily compared using the same data.
5 Conclusions
While there are many efforts under way to mitigate the effects of failures in large-scale HPC systems, none have built the infrastructure necessary to explore and understand the complex interactions of components under both non-failure and failure scenarios, nor to evaluate the effects of new schemes with respect to these interactions. OVIS provides such an infrastructure, with the flexibility to let researchers add new failure detection/prediction schemes and visualize the interactions and effects of utilizing new schemes in the context of real systems, either from the perspective of finding when prediction/detection would have happened and validating that it is correct, or by comparing operational parameters both with and without such mechanisms in place.
References
1. Brandt, J., Gentile, A., Mayo, J., Pebay, P., Roe, D., Thompson, D., Wong, M.:
Methodologies for Advance Warning of Compute Cluster Problems via Statistical
Analysis: A Case Study. In: Proc. 18th ACM Int’l Symp. on High Performance
Distributed Computing, Workshop on Resiliency in HPC, Munich, Germany (2009)
2. Brandt, J., Gentile, A., Houf, C., Mayo, J., Pebay, P., Roe, D., Thompson, D.,
Wong, M.: OVIS-3 User’s Guide. Sandia National Laboratories Report, SAND2010-
7109 (2010)
3. Brandt, J., Debusschere, B., Gentile, A., Mayo, J., Pebay, P., Thompson, D., Wong,
M.: OVIS-2 A Robust Distributed Architecture for Scalable RAS. In: Proc. 22nd
IEEE Int’l Parallel and Distributed Processing Symp., 4th Workshop on System
Management Techniques, Processes, and Services, Miami, FL (2008)
4. Stearley, J., Corwell, S., Lord, K.: Bridging the gaps: joining information sources
with Splunk. In: Proc. Workshop on Managing Systems Via Log Analysis and
Machine Learning Techniques, Vancouver, BC, Canada (2010)
5. Taerat, N., Brandt, J., Gentile, A., Wong, M., Leangsuksun, C.: Baler: Determin-
istic, Lossless Log Message Clustering Tool. In: Proc. Int’l Supercomputing Conf.,
Hamburg, Germany (2011)
6. Lan, Z., Zheng, Z., Li, Y.: Toward Automated Anomaly Identification in Large-
Scale Systems. IEEE Trans. on Parallel and Distributed Systems 21, 174–187 (2010)
7. Zheng, Z., Li, Y., Lan, Z.: Anomaly localization in large-scale clusters. In: Proc.
IEEE Int’l Conf. on Cluster Computing (2007)
8. EDAC. Error Detection and Reporting Tool. See, for example, documentation in the Linux kernel (linux/kernel/git/torvalds/linux-2.6.git), Documentation/edac.txt
9. Ganglia, http://ganglia.info
10. Hyperic. VMWare, http://www.hyperic.com
11. lm-sensors, http://www.lm-sensors.org/
12. Nagios, http://www.nagios.org
13. OVIS. Sandia National Laboratories, http://ovis.ca.sandia.gov
14. RRDtool, http://www.rrdtool.org
15. SLURM. Simple Linux Utility for Resource Management,
http://www.schedmd.com
16. Splunk, http://www.splunk.com
Cooperative Application/OS DRAM Fault
Recovery
1 Introduction
Proposed exascale systems will present extreme fault tolerance challenges to
applications and system software. In particular, these systems are expected to
suffer soft or hard errors at least several times a day. Such errors include un-
correctable DRAM failures, I/O system failures, and CPU logic failures. Un-
fortunately, fault-tolerance methods currently in use by large-scale applications,
such as roll-back recovery from a checkpoint, may be unsuitable to address the
challenges of exascale computing. As a result, applications will need to address
resilience challenges previously handled only by the hardware, OS, and run-time system if they want to utilize future systems efficiently. Unfortunately, few interfaces and mechanisms exist to provide applications with useful information on the faults and failures that affect them. This is particularly true of DRAM failures, one of the most common failures in current large-scale systems [19], but for which only low-level models of failure and recovery currently exist.

This work was supported in part by a faculty sabbatical appointment from Sandia National Laboratories and a grant from the U.S. Department of Energy Office of Science, Advanced Scientific Computing Research, under award number DE-SC0005050, program manager Sonia Sachs.
In this paper, we describe work on a collaborative application / OS system to
handle uncorrected memory errors. We begin by reviewing the basics of DRAM
memory failures and how they are handled in current systems. We then discuss
specific models of memory failures that we are examining for application design
and system implementation purposes, how the application, OS, and hardware
can interact under these failure models, and how applications recover in this
scenario. Based on this, we then present a simple OS and hardware interface
to provide the information to applications necessary to handle these errors, and
we outline an implementation of this interface. We illustrate the use of this in-
terface through its integration with a new fault-tolerant iterative linear solver
implemented with components from the Trilinos solvers library [9], and present
initial convergence results showing the viability of this recovery approach. Fi-
nally, we discuss related and future work.
2 DRAM Failures
DRAM memory modules are one of the most plentiful hardware items in modern
HPC systems. Each node may have dozens of DRAM chips, and large systems
may have tens or hundreds of thousands of DRAM modules. The combination
of the quantity and the density of the information they store makes them par-
ticularly susceptible to faults. As a result, most HPC systems include some
built-in hardware fault tolerance for DRAM. The most common hardware mem-
ory resilience scheme has the CPU memory controller write additional checksum
bits on each block of data written (128-bit blocks are used on modern AMD
processors, for example). The controller uses these bits to detect and correct
errors reading these blocks of data back into the CPU. Most modern codes
use Single-symbol¹ Error Correction and Double-symbol Error Detection (SEC-
DED) schemes, allowing them to recover from the simplest memory failures and
at least detect more complex (and less frequent) ones.
Recent research has shown that uncorrectable errors (e.g., double-symbol
errors) are increasingly common in systems with SEC-DED memory protec-
tion [19], with uncorrectable DRAM errors occurring in up to 8% of DIMMs per
year. Such errors result in a machine check exception being delivered to the oper-
ating system, which then generally logs the error and either kills the application
or reboots the system depending upon the location of the error in memory. Some
systems exist for recovering from such errors, for example in Linux when they
occur in memory used for caching data or owned by a virtual machine [13], but
these systems are much too low-level to be useful to application developers.
¹ A symbol in modern DRAM systems typically comprises 4 or 8 bits of data.
4 Application / OS Interface
4.1 Design
We have designed an application / OS interface to support the fault and re-
covery models described in Section 3, and implemented a library to provide this
interface. Our key design goals were to provide a simple interface for applications
and algorithmic libraries, and to support existing OS-level interfaces for handling memory errors, such as those provided by Linux.
The application level of this interface, shown in Figure 1, focuses on run-time
memory allocation. In particular, the interface provides the application with
separate calls for allocating failable memory—memory in which failures will
cause notifications to be sent to the application. These calls work like malloc()
and free(). In addition, the application also registers a callback with the library.
² For example, a solver may replace corrupted matrix entries with averages of their uncorrupted neighbors.
The callback is called once for every active allocation when the library is notified
by the OS of a detected but uncorrected memory fault in that allocation.
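Figure 1 is not reproduced here, so the following header-style sketch only illustrates the shape such an interface could take. The function and type names are invented; the real library's calls may differ.

#include <cstddef>

// Hypothetical shape of the application-level interface described above. The
// real library's names and signatures are not shown in this excerpt, so this
// sketch only mirrors the described behavior: malloc()/free()-like calls for
// failable memory plus a per-allocation failure callback.

// Called once for every active failable allocation that overlaps a detected,
// uncorrected memory fault; offset and length locate the damaged region
// within the allocation so the application can repair or recompute it.
typedef void (*fail_callback_t)(void* allocation, size_t offset, size_t length);

void* failable_malloc(size_t size);                    // allocate failable memory
void  failable_free(void* ptr);                        // release a failable allocation
int   failable_register_callback(fail_callback_t cb);  // register the recovery callback

In such a scheme, an application would place large, recomputable structures in failable allocations and repair them in the registered callback, keeping small or hard-to-recover state in ordinary, reliable allocations.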
In addition to this interface, we also provide a simple producer-consumer
bounded ring buffer that the application can use to queue up a sequence of
failed allocations when signaled by the library. This ring buffer is non-blocking
and atomic to allow asynchronous callbacks from the library to enqueue failed
allocations that will be fully recovered at the end of an iteration. The application
determines the size of this buffer when it is allocated; the number of entries
needed must be sufficient to cover all of the allocations that could plausibly fail
during a single iteration. For applications with relatively few failable allocations,
this should be a minimal number of entries.
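A bounded, non-blocking ring buffer of the kind described above could be sketched as follows for a single producer (the asynchronous callback) and a single consumer (the application draining it at the end of an iteration). The record layout and names are assumptions, not the library's actual types.

#include <atomic>
#include <cstddef>

// Sketch of a fixed-size, lock-free, single-producer/single-consumer ring
// buffer holding records of failed allocations. Names are illustrative.
struct FailedRegion { void* allocation; size_t offset; size_t length; };

template <size_t N>                      // N is chosen by the application at setup time
class FailedAllocRing {
public:
    bool push(const FailedRegion& r) {   // called from the failure callback
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                // buffer full: N must be sized generously
        buf_[head] = r;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(FailedRegion& out) {        // drained by the application after an iteration
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                // empty
        out = buf_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);
        return true;
    }
private:
    FailedRegion buf_[N];
    std::atomic<size_t> head_{0}, tail_{0};
};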
At the OS level, the library first notifies the operating system that it wishes
to receive notifications of DRAM failures, either in general or in specific areas of
its virtual address space depending upon the interface provided by the operating
system. Second, the library keeps track of the list of failable memory allocated
by the application so that it can call the application callback for each failed allo-
cation when necessary. Finally, the library handles any error notifications from
the operating system (e.g., using a Linux SIGBUS signal handler) and performs
OS-specific actions to clear a memory error from a page of memory if necessary
prior to notifying the application of the error.
4.2 Implementation
We added support for handling signaled memory failures as described in the
previous section to an existing incremental checkpointing library for Linux, lib-
hashckpt [8]. We chose this library because it helps track application memory
usage, and provides checkpointing functionality to recover from memory failures
for applications that cannot. Its ability to trap specific memory accesses eases
the testing of simulated memory failures, as described later in Section 4.3.
The modified version of the library adds the application API calls listed previ-
ously in Figure 1, with the failable memory allocator using malloc() to allocate
and free memory. This allocator also keeps a data structure sorted by allocation
address of failable memory allocations.
Linux notifies the library of DRAM memory failures, particularly failures caught by the memory scrubber, using a SIGBUS signal that indicates the address of the memory page that failed. The library then unmaps this failed
page using munmap(), maps in a new physical page using mmap(), and calls the
application-registered callback with appropriate offset and length arguments for
every failable application allocation that overlapped with the page that included
the failure.
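A sketch of this recovery path is shown below: a SIGBUS handler rounds the reported address down to a page boundary, swaps in a fresh anonymous page, and notifies the affected allocations. The helper that walks the allocation table is only declared here, and async-signal-safety concerns are glossed over; this is an illustration of the described steps, not the library's code.

#include <csignal>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Illustration of the described recovery steps, not the library's code. The
// table walk that finds overlapping failable allocations is assumed to exist
// elsewhere and is only declared here.
extern void notify_overlapping_allocations(void* page, size_t page_size);

static void sigbus_handler(int, siginfo_t* info, void*) {
    long page_size = sysconf(_SC_PAGESIZE);
    // Round the faulting address down to the start of its page.
    uintptr_t addr = reinterpret_cast<uintptr_t>(info->si_addr);
    void* page = reinterpret_cast<void*>(addr & ~static_cast<uintptr_t>(page_size - 1));

    // Drop the failed physical page and map a fresh, zeroed one in its place.
    munmap(page, page_size);
    mmap(page, page_size, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    // Invoke the application callback for every failable allocation that
    // overlapped the lost page.
    notify_overlapping_allocations(page, static_cast<size_t>(page_size));
}

void install_sigbus_handler() {
    struct sigaction sa = {};
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, nullptr);
}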
Note that Linux currently only notifies the application of DRAM failures de-
tected by the memory scrubber. When the memory controller raises an exception
caused by the application attempting to consume faulty data, Linux currently
kills the faulting application. In addition, Linux only notifies applications of the
page that failed and expects the application to discard the entire failed page.
This approach is overly restrictive in some cases, as the hardware notifies the
246 P.G. Bridges et al.
kernel of the memory bus line that failed, and some memory errors are soft and
could be corrected simply by rewriting the failed memory line.
To support testing of DRAM memory failures, we added the ability to simulate memory failures to the incremental checkpointing library. In particular,
we added code that randomly injects errors at a configurable rate into the appli-
cation address space and uses page protection mechanisms, i.e., mprotect(), to
signal the application with a SIGSEGV when it touches a page to which a simu-
lated failure has been injected. The library then catches SIGSEGV and proceeds as
if it had received a memory failure on the protected page. We also implemented
a simulated memory scrubber in the library which can asynchronously inject
memory failures into the application by signaling the library when it scrubs a
memory location at which a failure has been simulated.
5 Fault-Tolerant GMRES
be transient. However, each outer iteration of FT-GMRES must run reliably, and
requires a correct copy of the matrix A, right-hand side b, and additional outer
solve data (the same that FGMRES would use). Since FT-GMRES expects only
a small number of outer iterations, interspersed by longer-running inner solves,
we need not store two copies (unreliable and reliable) of A and b in memory.
Instead, we can save them to a reliable backing store, or even recompute them.
With fault detection, we can avoid recovering or recomputing these data if no
faults occurred, or even selectively recover or recompute just the corrupted parts
of the critical data.
6 Initial Results
Fig. 2. FT-GMRES (10 outer iterations, 50 inner iterations each), 500 iterations of non-restarted GMRES, and 10 restart cycles (50 iterations each) of restarted GMRES, plotted against outer iteration number for the Ill_Stokes problem. (Down is good.)
preconditioner from reliable storage before every restart cycle. (We optimized
by not refreshing if no memory faults were detected.)
Figure 2 shows our convergence results. FT-GMRES’ reliable outer itera-
tion makes it able to roll forward through faults and continue convergence. The
fault-detection capabilities discussed earlier in this work let FT-GMRES refresh
unreliable data only when necessary, so that memory faults appear transient to
the solver.
7 Related Work
There has been a wide range of research on application and OS techniques for
recovering from faults in HPC systems. This includes both algorithmic work
on allowing specific numeric codes to run failures, and OS-level work on han-
dling memory faults both transparently and directing to the application. In the
remainder of this section, we describe related work in both areas.
potentially invoking higher-level recovery systems based on, for example, check-
pointing or redundancy. Some systems have attempted to provide additional
protection against memory faults both on CPUs [5] and GPUs [15], though with
substantial cost.
References
1. Bronevetsky, G., de Supinski, B.: Soft error vulnerability of iterative linear al-
gebra methods. In: Proceedings of the 22nd Annual International Conference on
Supercomputing, ICS 2008, pp. 155–164. ACM, New York (2008)
2. Buttari, A., Dongarra, J., Kurzak, J., Luszczek, P., Tomov, S.: Computations to en-
hance the performance while achieving the 64-bit accuracy. Tech. Rep. UT-CS-06-
584, University of Tennessee Knoxville, LAPACK Working Note #180 (November
2006)
3. Chen, Z., Dongarra, J.: Algorithm-based checkpoint-free fault tolerance for paral-
lel matrix computations on volatile resources. In: 20th International Parallel and
Distributed Processing Symposium, IPDPS 2006 (April 2006)
4. Davis, T.A., Hu, Y.: The University of Florida Sparse Matrix Collection. ACM
Trans. Math. Softw. (2011) (to appear),
http://www.cise.ufl.edu/research/sparse/matrices
5. Dopson, D.: SoftECC: A System for Software Memory Integrity Checking. Master’s
thesis, Massachusetts Institute of Technology (September 2005)
6. Elnozahy, E.N., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-
recovery protocols in message-passing systems. ACM Computing Surveys 34(3),
375–408 (2002)
7. van den Eshof, J., Sleijpen, G.L.G.: Inexact Krylov subspace methods for linear
systems. SIAM J. Matrix Anal. Appl. 26(1), 125–153 (2004)
8. Ferreira, K.B., Riesen, R., Brightwell, R., Bridges, P., Arnold, D.: libhashckpt:
Hash-Based Incremental Checkpointing Using GPU’s. In: Cotronis, Y., Danalis,
A., Nikolopoulos, D.S., Dongarra, J. (eds.) EuroMPI 2011. LNCS, vol. 6960, pp.
272–281. Springer, Heidelberg (2011)
9. Heroux, M.A., Bartlett, R.A., Howle, V.E., Hoekstra, R.J., Hu, J.J., Kolda, T.G.,
Lehoucq, R.B., Long, K.R., Pawlowski, R.P., Phipps, E.T., Salinger, A.G., Thorn-
quist, H.K., Tuminaro, R.S., Willenbring, J.M., Williams, A., Stanley, K.S.: An
overview of the Trilinos project. ACM Trans. Math. Softw. 31(3), 397–423 (2005)
10. Heroux, M.A., Hoemmen, M.: Fault-tolerant iterative methods via selective relia-
bility. Tech. Rep. SAND2011-3915 C, Sandia National Laboratories (2011),
http://www.sandia.gov/~maherou/
11. Howle, V.E.: Soft errors in linear solvers as integrated components of a simula-
tion. Presented at the Copper Mountain Conference on Iterative Methods, Copper
Mountain, CO, April 9 (2010)
12. Huang, K.H., Abraham, J.A.: Algorithm-based fault tolerance for matrix opera-
tions. IEEE Transactions on Computers C-33(6) (June 1984)
13. Kleen, A.: mcelog: memory error handling in user space. In: Proceedings of Linux
Kongress 2010, Nuremburg, Germany (September 2010)
14. Li, X., Huang, M.C., Shen, K., Chu, L.: A realistic evaluation of memory hardware
errors and software system susceptibility. In: Proceedings of the 2010 USENIX
Annual Technical Conference (USENIX 2010), Boston, MA (June 2010)
15. Maruyama, N., Nukada, A., Matsuoka, S.: A high-performance fault-tolerant soft-
ware framework for memory on commodity GPUs. In: 2010 IEEE International
Symposium on Parallel Distributed Processing (IPDPS), pp. 1–12 (April 2010)
16. Saad, Y.: A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci.
Comput. 14, 461–469 (1993)
17. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. SIAM, Philadel-
phia (2003)
18. Saad, Y., Schultz, M.H.: GMRES: A generalized minimal residual algorithm for
solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput. 7, 856–869
(1986)
19. Schroeder, B., Pinheiro, E., Weber, W.D.: DRAM errors in the wild: a large-scale
field study. Communications of the ACM 54, 100–107 (2011)
20. Simoncini, V., Szyld, D.B.: Theory of inexact Krylov subspace methods and appli-
cations to scientific computing. SIAM J. Sci. Comput. 25(2), 454–477 (2003)
A Tunable, Software-Based DRAM Error
Detection and Correction Library for HPC
1 Introduction
With the increased density of modern computing chips, components are shrink-
ing, heat is increasing, and hardware sensitivity to outside events is growing.
These variables, combined with the extreme number of components expected to make their way into computing centers as our computational demands expand, pose a strong challenge to the HPC community. Of particular interest are
soft errors in memory that manifest themselves as silent data corruption (SDC).
Research sponsored in part by the Laboratory Directed Research and Development
Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC
for the U.S. Department of Energy under Contract No. De-AC05-00OR22725. The
United States Government retains and the publisher, by accepting the article for pub-
lication, acknowledges that the United States Government retains a non-exclusive,
paid-up, irrevocable, world-wide license to publish or reproduce the published form
of this manuscript, or allow others to do so, for United States Government purposes.
SDCs are of great importance due to their ability to render invalid results in
scientific applications.
Silent data corruption can occur in many components of a computer includ-
ing the processor, cache, and memory due to radiation, faulty hardware, and/or
lower hardware tolerances. While cosmic particles are one source of concern,
another growing issue resides within the circuits themselves, due to miniatur-
ization of components. As components shrink, heat becomes a design concern
which in turn leads to lower voltages in order to sustain the growing chip den-
sity. Lower component voltages result in a lower safety threshold for the bits
that they contain, which increases the likelihood of an SDC occurring. Further,
as the densities continue to grow, any event that upsets chips (i.e., radiation) is
more likely to both interact with and be successful at flipping bits in memory.
Currently, servers that use memory with hardware-based ECC are capable of
correcting single bit errors and detecting double bit errors [1], but errors that re-
sult in three or more bit flips will produce undefined results including silent data
corruption which may produce invalid results without warning. Today, research
has been performed on the frequency and occurrence of single and double bit er-
rors [9], but data on the frequency of triple bit errors remains inconclusive even
though up to 8% of DIMMs will incur correctable errors while 2%-4% will incur
uncorrectable errors. Nonetheless, the overall occurrence of bit flips is expected
to increase as chip densities increase and computing centers move to millions of
cores.
To combat this growing problem, new methods to both detect and correct
faults that result in data corruption are essential. Specifically, it is critical to de-
velop a fault resilient framework that provides for SDC detection and continuous
execution in the face of faults. As applications increase in run time and scale
out, it is no longer feasible to rely on traditional checkpoint-restart solutions to
protect an application. Even setting aside the I/O bottlenecks of checkpoint/restart, we cannot guarantee that an execution will run to completion fault-free without interruption, given a low average time between failures. Following this thought, we may not be able to reliably verify application results by simply running an application twice if there is a very high probability that a fault will render the results of both runs incorrect.
One method to address silent data corruption is in the field of algorithmic
fault tolerance where researchers have proposed methods to protect matrices
from SDCs that corrupt elements within a matrix [3]. While it is possible for this
work to protect some matrix operations such as multiplication, this form of fault
tolerance may not be able to protect all types of possible matrix operations even
if we disregard the fact that matrices are only one of numerous important types
of structures. Although promising in some regards, fault tolerant algorithms
can be incredibly difficult or simply impossible to design for any arbitrary data
structure or operation on data. Worse, this type of protection does not provide
comprehensive coverage of the entire application, which leaves anything outside
of the algorithm such as other data and instructions entirely vulnerable to SDCs.
For these reasons, there is a dire need to develop generic fault tolerance op-
tions that provide wide coverage to an application and its data while remaining
agnostic to the actual algorithms that applications utilize.
This paper outlines a generic memory protection library that increases the
resilience of all applications that it guards by protecting data at the page level
using a transparent, tunable on-demand verification system. The library pre-
sented within provides the following contributions:
– Provides transparent protection against SDC for all applications without the
need for any program modifications.
– Our solution is tunable to best match the data access patterns of an appli-
cation.
– Extensibility within the library provides for easy addition of new features
such as adding software-based ECC which can not only detect, but also
correct SDC that evades hardware ECC.
2 Design
In this paper we present LIBSDC, a transparent library that is capable of detect-
ing and optionally correcting soft-errors in system memory that cause corruption
in program data during execution. LIBSDC protects against SDCs by tracking
memory accesses at the virtual memory page level and verifies that the contents
of each accessed page have not unexpectedly been altered.
To ensure memory has not become corrupted, LIBSDC is responsible for mon-
itoring all read and write requests that an application incurs during execution
while simultaneously verifying these data accesses. Each memory access is hence-
forth assumed to be at the granularity of an entire page of virtual memory in-
stead of individual bytes. At a high level, each memory access that an application
makes will be intercepted by LIBSDC and the contents of the page in which the
memory address resides are verified against a previously known-good hash of
that page. If during execution an unexpected hash mismatch occurs between the
page and its last known value, then LIBSDC will terminate the process or roll back to a previous checkpoint, if available, to ensure that the application does not continue to compute and report invalid results. After a page's integrity has been successfully verified, the application is allowed to proceed with the memory access and continue making forward progress.
Once a memory access completes verification, the entire page in which the
access resides will become available for use without further interception by LIB-
SDC. A page in this state will be referred to as unlocked. Likewise, all other pages
that have not yet been verified by LIBSDC will be considered locked. For each
additional locked memory access that occurs, LIBSDC will intercept the request
and verify the locked memory before unlocking it and allowing the application
to progress.
On page lock:
Calculate new hash of entire page
Store hash in separate location
Mark page as locked
Return control to application
Managing locked and unlocked pages internally requires LIBSDC to hook mem-
ory allocation functions such as malloc, realloc, and memalign to learn of new
memory addresses that should receive protection. When a new memory range
has been allocated for an application, LIBSDC automatically locks all pages in
the range of the new memory so that all future accesses to that memory are
within the scope of protection that LIBSDC provides.
As the amount of allocated memory per application as well as the working-set
of pages required varies, LIBSDC allows the user to tune the maximum number
of pages to allow in the unlocked state. This tunable parameter, known as max-
unlocked, is set prior to invoking an application and permanently defines the
maximum number of pages to allow unlocked at any given time during execution.
When the max-unlocked limit of unlocked pages is reached, any further accesses
to pages in the locked state will require LIBSDC to lock some other unlocked
page to accommodate for the new page of memory.
Tuning the max-unlocked parameter requires consideration, as its value is directly related to both application performance and the effectiveness of SDC protection. Providing a relatively low max-unlocked value will force LIBSDC to lock and unlock pages more frequently, resulting in unnecessary verifications. In
this case, the overhead of intercepting page accesses combined with frequent
rehashing will quickly diminish application performance. The effect of a max-
unlocked value much less than the application’s work-set of pages will result in a
Throughout the design section of this paper we have referred to LIBSDC storing a hash of the pages under its protection. When a page is hashed, the hash may be compared against a future hash taken over the same page to determine whether any changes have occurred, but this information alone is not suitable for correcting the errors that a hash may detect. To provide additional SDC correction capabilities on top of the detection mechanisms, it is possible to additionally compute and store error correcting codes (ECC), such as Hamming codes, that may be used to fix bit flips in memory. For example, the 72/64 Hamming codes frequently used in hardware may be employed inside LIBSDC to provide single error correct, double error detect (SECDED) protection at the expense of the additional storage required for the ECC codes. Combining LIBSDC with hardware ECC can provide not only the ability to detect triple bit errors or greater, but also correction capabilities, since the software-layered protection in LIBSDC may still retain viable error correcting codes. If LIBSDC is extended with hashing plus ECC, it is possible to enjoy the protection and speed of hashing while limiting ECC recalculation to the times when a page has been modified during execution, as indicated by a changed hash.
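As a quick back-of-the-envelope estimate of that storage cost (our own arithmetic, not a figure from the paper, assuming 4 KiB pages): a 72/64 Hamming code adds 8 check bits per 64 data bits,

\mathrm{ECC\ overhead} = \frac{72 - 64}{64} = 12.5\%, \qquad 4096 \times 0.125 = 512\ \mathrm{bytes\ of\ ECC\ per\ 4\,KiB\ page},

in addition to the per-page hash itself (20 bytes for a SHA-1 digest).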
Any application that depends on DMA with devices such as network interconnects must ensure that its buffers are in an unlocked state before DMA begins. This requirement arises because DMA bypasses the MMU, so LIBSDC is never notified of page accesses to the buffers. Data written through DMA would otherwise appear as corruption to LIBSDC because the changes were made while the written pages were in a locked state.
3 Implementation
LIBSDC protects memory from SDCs by comparing last known good hashes of
virtual memory pages with a hash of their current data upon page access by an
application. Therefore it is critical that LIBSDC be able to receive notification
when a page is being accessed by an application. To achieve this, LIBSDC uses
the mprotect system call to modify page permissions and take away read and
write access. By installing a signal handler for SIGSEGV (segmentation fault),
LIBSDC is notified by the operating system any time a locked page (one without
read/write permissions) is accessed. Upon notification, LIBSDC uses an internal
table to verify that the page being accessed is one that it intends to protect.
If it is, then verification is performed by taking a hash of the current page
and comparing it to the last known good hash which is stored in LIBSDC’s
table. After verification, the page’s read and write permissions are restored using
mprotect before returning control to the user application upon exiting the signal
handler.
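The following self-contained C sketch illustrates this mprotect/SIGSEGV mechanism on a single page (again our illustration rather than LIBSDC code: it uses a trivial XOR checksum where LIBSDC would use a real hash such as SHA-1, and it omits the page table and FIFO management described below).

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long PAGE;
static unsigned char *page;          /* one protected page                 */
static uint64_t stored_hash;         /* last known-good hash of that page  */

/* Stand-in for SHA-1: XOR the page together as 64-bit words. */
static uint64_t hash_page(const unsigned char *p)
{
    uint64_t h = 0;
    for (long i = 0; i < PAGE; i += 8) { uint64_t w; memcpy(&w, p + i, 8); h ^= w; }
    return h;
}

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    unsigned char *fault_page =
        (unsigned char *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
    /* Restore permissions so the page can be read, then verify its hash.
       (mprotect is not formally async-signal-safe; this is a sketch.)      */
    mprotect(fault_page, PAGE, PROT_READ | PROT_WRITE);
    if (hash_page(fault_page) != stored_hash) {
        static const char msg[] = "SDC detected: hash mismatch\n";
        write(2, msg, sizeof msg - 1);
        _exit(1);                     /* or roll back to a checkpoint        */
    }
    /* Page verified: leave it unlocked; the faulting access will retry.    */
}

int main(void)
{
    PAGE = sysconf(_SC_PAGESIZE);
    page = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(page, 0xAB, PAGE);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    stored_hash = hash_page(page);    /* hash first, then lock the page     */
    mprotect(page, PAGE, PROT_NONE);

    printf("first byte after verified unlock: 0x%02x\n", page[0]);
    return 0;
}

In the sketch, the first read of the protected page faults, the handler unlocks and verifies it, and the faulting access then retries successfully; a mismatch would instead terminate the process, mirroring LIBSDC's detection path.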
Internally, the table that LIBSDC uses to store information on pages is composed of several fields:
– A status flag to indicate locked, unlocked, or not managed by LIBSDC
– Storage for the page’s last known good hash
– Pointers to indicate which pages were accessed for use as a first-in-first-out
queue
Of particular interest among LIBSDC's table fields are the FIFO pointers. In order to maintain a fair policy for evicting unlocked pages when the application needs to access a page that is not currently available, LIBSDC maintains FIFO ordering so that the oldest pages in the table are evicted first. Unfortunately, once a page is in the unlocked state it is not possible to track accesses to the page until it is locked again. For this reason, the FIFO queue is based on the order of unlocking, and while it may not exactly mirror an application's data access patterns, it should be similar.
Each locked page’s hash storage is tunable to accommodate the size of
whichever hashing algorithm is used. Additional fields can also be added to
accommodate storage for other needs such as ECC codes.
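A possible shape for such a table entry, sketched in C with hypothetical names (LIBSDC's actual layout may differ):

#include <stdint.h>

#define SDC_HASH_BYTES 20                 /* e.g., SHA-1; tunable per algorithm */

enum sdc_state { SDC_UNMANAGED, SDC_LOCKED, SDC_UNLOCKED };

struct sdc_page_entry {
    enum sdc_state state;                 /* locked, unlocked, or not managed   */
    unsigned char  hash[SDC_HASH_BYTES];  /* last known-good hash of the page   */
    /* Optional: storage for software ECC over the page could be added here.   */
    struct sdc_page_entry *fifo_prev;     /* FIFO of unlocked pages; the oldest */
    struct sdc_page_entry *fifo_next;     /*   entries are re-locked first      */
};

int main(void)
{
    struct sdc_page_entry e = { SDC_LOCKED, {0}, 0, 0 };
    (void)e;
    return 0;
}

Evicting the oldest unlocked page when max-unlocked is reached then amounts to popping the FIFO head, re-hashing and re-protecting that page, and pushing the newly unlocked page at the tail.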
For ordinary user-space accesses, page faults on locked pages are delivered to LIBSDC's SIGSEGV handler and handled transparently during execution. Unfortunately, system calls executed in kernel space do not enjoy this luxury, as the kernel does not invoke the SIGSEGV handler when a page fault occurs inside a system call.
pointers will fail unpredictably if proper page permissions are not applied prior
to the system call occurring. Therefore all system calls that accept user space
pointers require hooking in order to unlock memory regions that the kernel is
likely to access during the system call.
While in many cases it is possible to override GLIBC calls at application link/load time and replace them with wrappers that unlock any pointers present, the GLIBC implementation may make system calls directly within itself instead of going through the wrapper. For this reason it is essential that all system calls are wrapped regardless of their source. For simplicity, our LIBSDC prototype
makes a clone process of the original using the clone system call with CLONE_VM
as a parameter to share address spaces, and then uses the ptrace system call to
trace the application as it executes in order to receive notification of all system
calls occurring. The ptrace interface is provided as part of the Linux kernel and
allows a process to intercept all system calls and signals that another process
generates.
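A minimal sketch of that tracing loop follows (our code, not the LIBSDC prototype; for brevity it traces a child created with fork/exec instead of a clone(CLONE_VM) sibling, assumes Linux on x86-64, and merely counts system-call entries where LIBSDC would unlock the user-space buffers each call references).

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s command [args]\n", argv[0]); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                                   /* tracee                      */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);                          /* stopped after exec          */
    long nsyscalls = 0;
    int entering = 1;

    while (1) {
        ptrace(PTRACE_SYSCALL, pid, NULL, NULL);       /* run to next syscall boundary */
        waitpid(pid, &status, 0);
        if (WIFEXITED(status)) break;

        if (entering) {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, pid, NULL, &regs);
            /* regs.orig_rax holds the syscall number; a real hook would inspect
               its pointer arguments and unlock the pages they reference.          */
            nsyscalls++;
        }
        entering = !entering;
    }
    printf("traced %ld system-call entries\n", nsyscalls);
    return 0;
}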
It should be noted that there are other less portable solutions that may ac-
complish system call hooking, but would require extensive per-platform work
such as binary rewriting to hook system calls or specialized kernel modules that
wrap system calls. Our prototype’s goal was to provide a platform for gauging
the viability and costs of SDC protection through hashing and page protection
while avoiding writing a complex platform specific system call hooking scheme
that would not add to the research contributions.
4 Results
To gauge the overheads and demonstrate the effects of tuning the max-unlocked
value, the HPCCG Mantevo Miniapp [8] was run with a matrix size of 768x8x8 scaled over 256 processes.
Fig. 1. Normalized run-time vs. the max-unlocked value (number of pages): run-time without hashing, run-time with hashing, and the double modular redundancy baseline
The compute nodes used consisted of 2-way SMPs
with AMD Opteron 6128 (Magny-Cours), 32GB of memory per node, and a
40Gb/s Infiniband interconnect.
In Figure 1 we compared normalized execution time vs. the max-unlocked value
to demonstrate the effects of LIBSDC on an application. The baseline execution
time was taken by running HPCCG without LIBSDC performing any mprotect
calls and by default leaving all memory in an unlocked state. As a comparison,
the dashed line with a constant normalized time of 2 demonstrates the overhead
of double modular redundancy. LIBSDC’s overheads are shown with the dashed
line indicating the run-times without hashing and the solid line indicating the
run-times with hashing.
The choice of a range for max-unlocked between 4096 and 5120 is due to the
maximum working-set of pages residing near the middle of that range at around
4672 unlocked pages. As depicted in Figure 1, there is a dramatic drop in the
normalized run-time when we tune LIBSDC to use a max-unlocked value that
corresponds well to the active number of working pages. From the max-unlocked
range of 4672 to 5120, the normalized execution time falls from 1.79 to 1.53
respectively, which shows good improvement over even double modular redun-
dancy. Although not shown, in the poorly tuned ranges below 4096 a normalized
run-time of 21 or greater was observed.
For the results reported above, the average time spent calculating hashes
during execution is 15%.
It is important to note that the performance of LIBSDC’s hashing is highly
dependent on both the hashing algorithm used and on the way it is computed.
Although we chose to use SHA-1 computed on the CPU, research on computing hashes of pages using GPUs [2] has demonstrated that GPUs greatly outperform CPUs at this task. That research indicates that applications requiring page hashes should not consider the hashing itself to be a bottleneck.
We also find that the substantial overhead incurred with LIBSDC for a max-unlocked value less than the working set of pages is due to our use of the ptrace system call. ptrace is known to carry performance penalties because of frequent context switching on each system call and each received signal, as well as the OS noise it generates. This is worsened because each page unlock is intercepted by ptrace during execution. While our prototype shows good performance for a well-tuned max-unlocked value, we expect that a production version of LIBSDC would not use ptrace to intercept system calls; this would further improve performance even for applications running with a well-tuned max-unlocked value.
5 Related Work
Similar to LIBSDC, another approach [10] that is transparent to the application
achieves software-implemented error detection and correction using background
scrubbing combined with software calculated ECC to periodically validate all
memory and correct errors if possible. While this approach and LIBSDC are
both entirely transparent to the application, LIBSDC differentiates itself by
providing on-demand page-level checking based on the application’s data access
verification does not necessarily protect against SDCs that only alter data with-
out affecting the execution path of an application.
References
1. Chen, C.L., Hsiao, M.Y.: Error-correcting codes for semiconductor memory ap-
plications: A state-of-the-art review. IBM Journal of Research and Develop-
ment 28(2), 124–134 (1984)
2. Ferreira, K.B., Riesen, R., Brightwell, R., Bridges, P., Arnold, D.: libhashckpt:
Hash-Based Incremental Checkpointing Using GPU’s. In: Cotronis, Y., Danalis,
A., Nikolopoulos, D.S., Dongarra, J. (eds.) EuroMPI 2011. LNCS, vol. 6960, pp.
272–281. Springer, Heidelberg (2011)
3. Huang, K.H., Abraham, J.: Algorithm-based fault tolerance for matrix operations.
IEEE Transactions on Computers C-33(6), 518–528 (1984)
4. Oh, N., Shirvani, P., McCluskey, E.J.: Error detection by duplicated instructions
in super-scalar processors. IEEE Transactions on Reliability 51(1), 63–75 (2002)
5. Oh, N., Shirvani, P., McCluskey, E.: Control-flow checking by software signatures.
IEEE Transactions on Reliability 51(1), 111–122 (2002)
Reducing the Impact of Soft Errors on Fabric-Based Collectives
1 Introduction
Large-scale HPC clusters containing thousands of nodes are usually interconnected using a commodity interconnect such as InfiniBand [1] and arranged in a cost-effective slimmed fat-tree topology [2]. One popular example of these systems is the Roadrunner supercomputer [3], built at Los Alamos National Laboratory, which was the first supercomputer to achieve one petaflop of peak performance.
On these systems, MPI is the de facto standard for communication. MPI pro-
vides both point-to-point communications as well as collective communications.
Collective communications are group communications among many nodes, used for different purposes such as combining partial results of computations (e.g., Gather and Reduce), synchronization of nodes (Barrier), and data distribution (Broadcast). However, because a collective involves the participation of all the members of the group before it can conclude, any variance in communication responsiveness or performance for any member of the group has a big impact on the completion time. One major cause of such variability is OS jitter. Recently, Fabric-based collective communications [4] have been proposed to address this scalability problem by moving the collective's calculation from the nodes onto the switches, which do not suffer from OS jitter.
Another important factor that introduces high variability and performance degradation in collectives is soft errors. These errors correspond to alterations
in the bit stream received over a communication channel. They are caused by various factors such as channel noise, interference, distortion, bit synchronization and attenuation problems. The frequency of these errors is measured by network manufacturers using the Bit Error Rate (BER): the number of bit errors divided by the total number of transferred bits during a studied time interval. Typical BER values found on transmission channels range from 10^-12 down to 10^-15 for high-end optical cables. Although the probability of an error happening on a single channel is small, the large number of communication channels found in clusters results in a system-wide error rate that is far from negligible. Note that the next generation of supercomputers will contain on the order of millions of channels.
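As a rough illustration of this scaling (our own numbers, not taken from the paper): with N channels each carrying B bits per second at a bit error rate p, the expected number of bit errors per second across the whole system is approximately

E[\mathrm{errors/s}] \approx N \cdot B \cdot p .

For instance, N = 10^6 channels at B = 10 Gb/s with p = 10^{-15} already give on the order of 10^6 \cdot 10^{10} \cdot 10^{-15} = 10 bit errors per second system-wide, and p = 10^{-12} gives on the order of 10^4 per second.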
Unfortunately, Fabric-based collectives can suffer from soft errors. The general approach to dealing with these errors in current interconnection networks is to detect errors at the receiver side with a CRC code and then request a re-transmission if an error is detected. However, these re-transmissions inevitably add delays to the individual messages involved in each step of the collective's calculation, resulting in an overall performance loss.
nique to avoid these re-transmissions or ameliorate their negative impact would
be of great interest. One possible technique would be the use of error-correcting
codes (ECC), so errors can be detected and also corrected. However, since some
of the collective’s messages are not fully stored in switches, ECC is not viable
on Fabric-based collectives.
In this paper, we propose the use of message replication in order to reduce the degradation caused by soft errors on Fabric-based collectives. Two different replication techniques have been evaluated: spatial and temporal replication. Results on a 1,728-node InfiniBand cluster arranged in a slimmed fat-tree show that temporal replication is the most effective solution to mitigate the negative effects of soft errors on Fabric-based collectives. Performance improvements of up to 50% can be achieved with respect to spatial replication.
The rest of this paper is organized as follows. Section 2 briefly describes the
operation of Fabric-based collectives on slimmed fat-tree topologies. Section 3
shows how InfiniBand detects and handles soft errors. Section 4 describes two
approaches based on spatial and temporal replications to mitigate the negative
effects of soft errors. Section 5 characterizes the impact on collective performance
when soft errors are present in the network for both proposed techniques. Section
6 summarizes recent approaches to deal with network errors. Conclusions from
this work are given in Section 7.
2 Fabric-Based Collectives
Fabric-based collectives are an approach to accelerating the calculation of collective communications. The approach uses the switch CPU to perform the collective steps and required calculations instead of the host CPU, as in the traditional approach. Recently, it has been integrated with the popular OpenMPI and Platform MPI message passing libraries, and it is fully supported on InfiniBand networks.
Basically, this scheme is composed of a manager that orchestrates the initialization of the collective communication tree and an SDK that offloads the
computation of the collective onto the switches. Today's switches have been redesigned and optimized to support a superscalar FPU hardware engine that performs single and double precision operations in a single cycle. This technology has been specifically targeted at the MPI_Barrier, MPI_Reduce, and MPI_Allreduce operations, which are among the most frequent operations found in scientific applications.
In essence, the collective calculation is composed of two phases: a reduction phase and a broadcast phase. In the first phase, switches aggregate the collective values from all the computing nodes and switches attached to them, calculate the resulting value, and forward it to higher-level switches. The root of the collective tree calculates the final reduction, and in the second phase the result is broadcast to the computing nodes using multicast operations. Notice that calculating the reduction implies that messages have to be fully received at the switches before the partial result can be sent to upper-level switches in the reduction phase. In the final broadcast phase, however, there is no need to wait until the full message is received to start transmitting it down to the computing nodes.
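The following self-contained C sketch (ours, purely illustrative; it ignores packet formats, timing, replication, and multicast) mimics the two phases on a complete binary tree of switches: leaf switches aggregate the contributions of their attached nodes, partial sums flow up to the root, and the final value flows back down.

#include <stdio.h>

#define LEVELS 3                        /* root + 2 levels below           */
#define NSW    ((1 << LEVELS) - 1)      /* 7 switches in a binary tree     */
#define NLEAF  (1 << (LEVELS - 1))      /* 4 leaf switches                 */

/* Reduction phase: each switch waits for the full values of its children,
   adds them, and forwards the partial sum to its parent (heap indexing:
   the children of switch i are 2i+1 and 2i+2).                            */
static double reduce_up(const double *leaf_val, int i)
{
    if (i >= NSW - NLEAF)                              /* leaf switch       */
        return leaf_val[i - (NSW - NLEAF)];
    return reduce_up(leaf_val, 2 * i + 1) + reduce_up(leaf_val, 2 * i + 2);
}

/* Broadcast phase: the final result is pushed back down the same tree.    */
static void broadcast_down(double result, int i)
{
    if (i >= NSW - NLEAF) { printf("leaf switch %d delivers %g\n", i, result); return; }
    broadcast_down(result, 2 * i + 1);
    broadcast_down(result, 2 * i + 2);
}

int main(void)
{
    double contrib[NLEAF] = { 1.0, 2.0, 3.0, 4.0 };    /* per-leaf aggregated node values   */
    double total = reduce_up(contrib, 0);              /* root computes the final reduction */
    broadcast_down(total, 0);
    return 0;
}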
InfiniBand switches detect several types of errors on incoming packets, including the following:
– Physical errors. Errors indicative of bit errors at the attached physical link.
These are detected by CRC checks.
– Malformed packet errors. Errors indicative of packets transmitted with in-
consistent content.
– Switch routing errors. Errors indicative of an error in switch routing.
– Buffer overrun. Error indicative of an error in the state of the flow control
machine.
When one of these errors is detected on any single packet at the switches, the immediate action is to discard the packet, record the type of error for further processing, and notify the sender side of the transmission that the packet was corrupted. This last action is performed by the hardware-level ACK messages. Originally, InfiniBand's host channel adapters (HCAs) were the ones sending these notifications back to the sender HCAs. However, these notifications also had to be supported at the switches for Fabric-based collectives, because the switches now become the originators of packets.
In the case of a corrupted packet, it may simply be dropped if it has not yet been forwarded to the next switch; if its transmission has already started, switches append a bad CRC value and the End Bad Packet (EBP) delimiter as an alternative to dropping the packet.
InfiniBand implements two different CRC checks in every packet: the invariant CRC (ICRC) and the variant CRC (VCRC). The ICRC is 4 bytes long and covers only the fields of the packet that are invariant from end to end across the network.
Figure (collective reduction example): leaf switches compute R5 = R1 + R2 and R6 = R3 + R4, the root computes R7 = R5 + R6 across switches S0-S11, and a soft error on one channel affects one of the replicated R5 packets.
If the soft error affects the second R5 instead of the first R5, then there will be no delay at all, because the first packet will still carry a valid value and the second packet will simply be discarded when it arrives at S12.
5 Experiments
In this section we evaluate both schemes, spatial and temporal replication, for Fabric-based collectives in the presence of one or multiple soft errors in the network. The evaluation is performed by simulation using the Venus network simulator [6], in which we implemented the Fabric-based collective technology and both fault-tolerant schemes.
A large network containing 1,728 computing nodes is used in the evaluation. It is arranged in a 3-level slimmed fat-tree topology, XGFT(3;24,12,6;1,12,6). An InfiniBand network is modeled using 36-port switches and single-port HCAs, with a 100 ns delay for each HCA and switch. A 4X SDR (10 Gb/s) port configuration is used in the evaluation. In the spatial approach the HCA's bandwidth is reduced proportionally to the number of output port replications. Replications of two, four, and six are considered for both the temporal and spatial techniques.
We used the MPI_Allreduce collective operation in our evaluations. This is the most common collective operation found in scientific applications because it is used in Conjugate Gradient solvers. We assume a collective operation with few operands that fit within InfiniBand's minimum transfer unit of 256 bytes.
One or multiple soft errors are injected into the network. We consider the cases of one and two soft errors on specific channels, as well as multiple soft errors randomly affecting multiple channels. In the latter case the errors follow an exponential distribution with mean values ranging from 1, 10, and 100 up to 1,000 µs. In this case, the average time to complete the collective in the presence of soft errors is reported over one thousand collective operations.
Fig. 3. Non-failure and all-channels-failure scenarios
Fig. 4. One and two soft errors on selected network channels
5.1 Results
Figure 3 shows the performance of Fabric-based collectives for spatial and temporal replications in two extreme cases: a failure-free scenario and a scenario in which every network link experiences a failure once. In both techniques the number of replications used is two. As can be seen, temporal significantly outperforms spatial, by 22%, in the non-failure case. The reason is that the HCA's bandwidth has to be divided up to support multiple output channels in spatial. For the other extreme case of experiencing a soft error on every link, temporal still significantly outperforms spatial because spatial is not able to provide a failure-free path, and it is therefore heavily penalized by multiple re-transmissions of collective packets once soft errors have been detected. In particular, there is a difference in performance of almost a 2X factor.
Figure 4 shows various cases with one and two soft errors occurring in specific network channels. In the first case, a single soft error in a channel significantly degrades temporal, by 22% with respect to the non-failure case shown before, because of the time spent waiting for the second collective packet. Spatial does not suffer additional degradation because another collective packet is still transmitted through another channel. In this scenario both approaches achieve the same performance. Similarly, the case of two simultaneous soft errors in two different network channels and in two different switches, but at the same tree level, does not further degrade the performance of either technique. However, the interesting case arises when the soft errors occur in the same switch at level 1 for spatial. In this particular case, both output ports experience soft errors, so spatial cannot provide a fault-free path and suffers a 55% degradation. This does not happen at higher levels of the fat tree, as can be seen in the next set of results. The reason lies in the fat-tree topology: replications on different output ports of a switch at level i make the collective go to different switches at the upper level i + 1. Hence, if one of these switches at level i + 1 experiences failures, the other switch at the same level can still deliver the collective to upper levels. The last case shows a worse scenario for temporal, where the soft errors are actually
Fig. 5. Collective time (µs) for spatial and temporal replications under random soft errors (mean inter-error times of 1,000, 100, 10, and 1 µs)
Fig. 6. Collective time (µs) for spatial and temporal replications with 2, 4, and 6 replications
occurring in two connected switches, each at a different level. In this failure scenario temporal suffers higher degradation than spatial because the first collective packet is consecutively delayed in both switches, thereby paying twice the penalty of waiting for the second collective packet to arrive.
Figure 5 shows the scenario of random soft errors experienced over multiple collectives. As can be seen, temporal significantly outperforms spatial. Specifically, at 1 ms, 100 µs, and 10 µs the collective time is reduced by 18%, 30%, and 14%, respectively. The margins are reduced at the less frequent failure rate (1,000 µs) because, when there is a small number of failures, both techniques perform similarly, as shown before. Also, at a very high failure rate (1 µs) both techniques perform the same. The reason is that there are so many soft errors that almost every collective packet suffers one, and thus both techniques incur multiple re-transmissions. In order to provide more resiliency in this environment we increased the number of collective replications up to six, as shown in Figure 6. As can be seen, the collective time significantly decreases as we increase the number of replications, especially for temporal. In particular, the collective time drops by 30% and 46% when going to 4 and 6 replications. Note that spatial does not decrease the collective time significantly from 2 to 4 replications because it also reduces the available HCA bandwidth proportionally. Overall, temporal still outperforms spatial in these cases. Performance improvements of 50% and 22% are seen for 4 and 6 replications.
6 Related Work
The MPI Forum's run-through stabilization effort [7] aims to let applications continue executing even when some processes in the MPI universe fail. In this context, various fault-tolerant algorithms to deal with hard failures for collectives have been analyzed in [8] and [9]. In [8] these algorithms are based on a new MPI function, MPI_Comm_validate, that can check for process failures at any time. If a process failure is detected, the collective tree is rebuilt accordingly, so that for the next collective the tree is already working and optimized. In [9], the collective tree is only rebuilt when the failure is detected during the collective operation. However, rebuilding the collective tree is too expensive for handling soft errors on Fabric-based collectives.
Additionally, an enhanced resilient protocol for Eager and Rendezvous point-to-point communications has been proposed in [10], covering fabric end-to-end hard and soft failures, including HCAs. Unlike our approach, the basic idea is to act only once a failure has been detected, but this may lead to higher degradation. We believe that a pro-active approach, one that also acts before a failure occurs, is better suited to reducing the potentially harmful degradation caused by soft errors.
7 Conclusions
Soft errors are having a big impact on the performance of collective communi-
cation operations. For these communication operations, solely acting when soft
errors occur is not efficient enough, and thus pro-active solutions are highly rec-
ommended. We have evaluated two of these pro-active solutions called spatial
and temporal replications.
Evaluations show that temporal replications deliver higher performance than spatial replications. In particular, 50% lower degradation is observed for temporal with respect to spatial in the presence of soft errors. Therefore, temporal replications effectively diminish the impact of soft errors. In addition, temporal replications can be seamlessly deployed in current production systems because they do not require special hardware; note that spatial would require at least 4X HCAs.
We understand that the only additional benefit of spatial would come from its potential to also mitigate hard failures. However, this work demonstrates that spatial achieves very poor performance with respect to temporal, which makes it less attractive as a stand-alone solution.
Acknowledgments. We thankfully acknowledge the support of the Spanish
Ministry of Science and Innovation under grant RYC2009-03989, the European
Commission through the HiPEAC-2 Network of Excellence (FP7/ICT 217068),
the Spanish Ministry of Education (TIN2007-60625, and CSD2007- 00050), and
the Generalitat de Catalunya (2009-SGR-980).
References
1. InfiniBand Trade Association: official website, http://www.infinibandta.org
2. Öhring, S.R., Ibel, M., Das, S.K., Kumar, M.J.: On generalized fat trees. In: Pro-
ceedings of the 9th International Parallel Processing Symposium, p. 37. IEEE Com-
puter Society, Washington, DC (1995)
3. Barker, K.J., Davis, K., Hoisie, A., Kerbyson, D.J., Lang, M., Pakin, S., Sancho,
J.C.: Entering the petaflop era: the architecture and performance of roadrunner.
In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC 2008,
pp. 1:1–1:11 (2008)
4. Mellanox: Fabric Collective Accelerator (2011),
http://www.mellanox.com/related-docs/prod_acceleration_software/fca.pdf
5. InfiniBand specification: Infiniband trade association, infiniband architecture spec-
ification, vol. 1, release 1.0.a (2001)
6. Minkenberg, C., Rodriguez, G.: Trace-driven co-simulation of high-performance
computing systems using OMNeT++. In: Proceedings of the 2nd International
Conference on Simulation Tools and Techniques, Simutools 2009 (2009)
7. Fault Tolerance Working Group: Run-through stabilization interfaces and semantics,
svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/runthroughstabilization
8. Hursey, J., Graham, R.: Preserving collective performance across process failure for a fault tolerant MPI. In: 16th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), held in conjunction with the 25th International Parallel and Distributed Processing Symposium (IPDPS), Anchorage, Alaska (May 2011)
9. Jaros, J.: Evolutionary Design of Fault Tolerant Collective Communications. In:
Hornby, G.S., Sekanina, L., Haddow, P.C. (eds.) ICES 2008. LNCS, vol. 5216, pp.
261–272. Springer, Heidelberg (2008)
10. Koop, M.J., Shamis, P., Rabinovitz, I., Panda, D.K.: Designing high-performance
and resilient message passing on infiniband. In: Communication Architecture for
Scalable Systems Workshop held in conjunction with the 25th International Parallel
and Distributed Processing Symposium (IPDPS), Atlanta, Georgia USA (April
2010)
Evaluating Application Vulnerability to Soft
Errors in Multi-level Cache Hierarchy
1 Introduction
Two trends are observed in the ongoing development of future generations of high performance computing systems: 1) processors are fabricated with CMOS processing technology that is constantly scaling down, and 2) commodity high-end processors, rather than customized processors, are more widely employed to reduce the total cost. The combination of these two trends makes reliable operation of the hardware much more difficult [16]. Unreliable hardware behaviors can be roughly split into hard errors and soft errors. While a hard error is a persistent hardware failure, a soft error is a transient failure and hence harder to detect and analyze.
This work is funded by Intel and by the Institute for the Promotion of Innovation
through Science and Technology in Flanders (IWT).
Soft errors are a well-known reliability concern. However, it is known that applications have intrinsic masking of soft errors (error rate derating) [12,14], so many bit errors are filtered out and are not visible at the application level. To find a cost-effective soft error mitigation strategy, system designers need fault injection tests in order to obtain a good estimate of the application-level soft error rate. How to perform error injection is an important topic in computer system reliability analysis. Different approaches have been developed and can be roughly grouped into the following categories:
In the rest of this paper we first describe the processor simulator that we used
for fault injection and how we can conduct fault injections in the cache hierarchy
(Section 2); then we present our motivation for various bit error patterns used
in the fault injection (Section 3); next, we describe the experimental process
(Section 4) and present the experimental results (Section 5); we finally discuss
the collected results and draw some conclusions (Section 6).
We want to obtain the application's response to cache bit errors by directly observing the simulated results after the fault injection. The simulation should faithfully execute the target application's instructions, and the error bits are injected only for instructions that access the specific cache we select for injection, during the specified time period. We have chosen a processor architecture simulator called Graphite, developed at MIT [11], and used the extensions made by the University of Ghent [3].
...
[fault_injection_model/L3]
start_cycle = 12022450
total_faults_nr = 1
err_bit_nr = 2
multi_byte_upset = false
...
A random configuration generator has been made to generate a large number of fault injection configuration files. While the injection location and time are randomized by the generator, the bit error pattern (see Section 3) in the configuration files is given as an input to the generator. Because a soft error is a rare event and is unlikely to hit the same application more than once during its execution with the input size used in our simulation, we only inject one soft error per injection configuration file. One simulation is launched for each individual fault injection configuration file. During every simulation, the Graphite simulator flips the error bits specified in this configuration file, provided that a cache access at the selected cache level takes place during the time period specified in the configuration file. The injected error bits persist as long as they are not overwritten or flushed out of the cache.
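A toy C sketch of the injection check performed on each cache access (our illustration of the behaviour described above, not Graphite code; the names fi_config and maybe_inject are invented):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct fi_config {
    int      level;            /* cache level selected for injection (1, 2, or 3) */
    uint64_t start_cycle;      /* beginning of the injection time window          */
    int      err_bit_nr;       /* number of bits to flip, e.g. 1 (SBU) or 2 (MBU) */
    bool     done;             /* only one soft error is injected per simulation  */
};

/* Called on every access to a cache line; flips the configured bits the first
   time an access to the selected level falls inside the time window.           */
static void maybe_inject(struct fi_config *c, int level, uint64_t cycle,
                         uint8_t *line, int line_bits)
{
    if (c->done || level != c->level || cycle < c->start_cycle) return;
    for (int i = 0; i < c->err_bit_nr; i++) {
        int bit = (17 * (int)cycle + i) % line_bits;   /* stand-in for a random bit choice */
        line[bit / 8] ^= (uint8_t)(1u << (bit % 8));   /* the flip persists until the line
                                                          is overwritten or evicted        */
    }
    c->done = true;
}

int main(void)
{
    uint8_t line[64] = {0};
    struct fi_config cfg = { 3, 12022450, 2, false };  /* mirrors the example config above */
    maybe_inject(&cfg, 3, 12022460, line, 64 * 8);
    printf("injection performed: %s\n", cfg.done ? "yes" : "no");
    return 0;
}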
4 Simulation Setup
4.1 Applications
We use the SPLASH-2 benchmarks [18] as our applications for the fault injec-
tion simulation. SPLASH-2 benchmarks have a variety of scientific computing
programs that are widely used with processor simulators. Because computational kernels usually account for the main execution time of scientific computations, we only present results from three computational kernels of the SPLASH-2 suite in this paper. The selected kernels are a sparse matrix factorization (Cholesky), a fast Fourier transform (FFT) and an integer radix sort
(Radix). The problem sizes used for each benchmark are listed in Table 1. All
benchmarks are compiled by GCC in 64-bit mode, with the -O3 optimization.
5 Simulation Results
We first profile our target applications by running them on the simulator without
fault injection. These profiling results are called baseline results. We use baseline
results 1) to establish the normal execution time for each application and 2) to collect the correct output if applicable. We repeat each simulation 10 times and obtain consistent profiling results, as shown in Table 2. Note that since a Dunnington processor has six L1 I/D caches and three L2 caches, the access numbers in the table are averages over the individual L1 and L2 caches. The total execution cycle count is the largest among the six cores.
In the second step we simulate the applications with the randomly generated
fault injection configuration files. We have observed different responses from the
simulations with injections. In the rest of this section, we first describe four
different responses caused by fault injections; then we compare the responses
from different benchmarks for fault injections from each cache level.
Table 4. Applications responses percentages for L1 data cache fault injection simula-
tions: 1-bit upset in single byte (SBU1) and 2-bit upset in consecutive bytes (MBU2)
6 Conclusions
We present a cache fault injection framework based on a fast processor simulator.
Running several scientific computing programs on this simulator with injected
cache bit errors, we have observed various responses from the simulated programs with different probabilities. All programs show that a large percentage of errors are filtered out and hence invisible at the application level. For the errors that do cause an application failure, an application crash is the most likely type of failure (4.8%–16.0%), while silent data corruption, though relatively rare, is still not negligible (up to 5.1% for FFT). Moreover, our results indicate that different programs have different levels of vulnerability to bit errors injected into different caches (e.g., 6.4% application failures for Cholesky vs. 17.6% for FFT in L3 cache fault injection simulations). These results suggest that the benefit of protecting an individual cache depends on the application program that is running on the processor.
References
1. Baumann, R.: Soft errors in advanced computer systems. IEEE Design & Test of
Computers 22(3), 258–266 (2005)
2. Bronevetsky, G., de Supinski, B.R.: Soft error vulnerability of iterative linear al-
gebra methods. In: SELSE (2007)
3. Carlson, T.E., Heirman, W., Eeckhout, L.: Exploring the level of abstraction for
scalable and accurate parallel multicore simulation. In: SC (2011)
4. Cochran, W.G.: Sampling Techniques, 3rd edn. John Wiley (1977)
5. da Lu, C., Reed, D.A.: Assessing fault sensitivity in MPI applications. In: SC, p.
37. IEEE Computer Society (2004)
6. Daveau, J.-M., Blampey, A., Gasiot, G., Bulone, J., Roche, P.: An industrial fault
injection platform for soft-error dependability analysis and hardening of complex
system-on-a-chip. In: IRPS, pp. 212–220 (2009)
7. Heidel, D., Marchal, P., et al.: Single-event upsets and multiple-bit upsets on a
45nm SOI SRAM. IEEE Transactions on Nuclear Science 56(6), 3499–3504 (2009)
8. Kim, J., Hardavellas, N., Mai, K., Falsafi, B., Hoe, J.C.: Multi-bit error tolerant
caches using two-dimensional error coding. In: MICRO, pp. 197–209 (2007)
9. Luk, C.-K., Cohn, R.S., Muth, R., Patil, H., Klauser, A., Geoffrey Lowney, P., Wal-
lace, S., Reddi, V.J., Hazelwood, K.M.: Pin: building customized program analysis
tools with dynamic instrumentation. In: PLDI, pp. 190–200 (2005)
10. Mak, T.M., Mitra, S., Zhang, M.: DFT assisted built-in soft error resilience. In:
IOLTS, p. 69 (2005)
11. Miller, J.E., Kasture, H., Kurian, G., Gruenwald III, C., Beckmann, N., Celio, C.,
Eastep, J., Agarwal, A.: Graphite: A distributed parallel simulator for multicores.
In: HPCA, pp. 1–12 (2010)
12. Mukherjee, S.S., Weaver, C.T., Emer, J.S., Reinhardt, S.K., Austin, T.M.: A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In: MICRO, pp. 29–42. ACM/IEEE (2003)
13. Ramachandran, P., Kudva, P., Kellington, J.W., Schumann, J., Sanda, P.: Statis-
tical fault injection. In: DSN, pp. 122–127. IEEE Computer Society (2008)
14. Rao, S., Sanda, P., Ackaret, J., Barrera, A., Yanez, J., Mitra, S.: Examining workload dependence of soft error rates. In: SELSE (2008)
15. Ruckerbauer, F.X., Georgakos, G.: Soft error rates in 65nm SRAMs: analysis of new phenomena. In: IOLTS, pp. 203–204 (2007)
16. Schroeder, B., Gibson, G.A.: A large-scale study of failures in high performance
computing systems. In: DSN, pp. 249–258 (2006)
17. Wang, N.J., Fertig, M., Patel, S.J.: Y-branches: When you come to a fork in the
road, take it. In: IEEE PACT, pp. 56–66 (2003)
18. Woo, S.C., Ohara, M., Torrie, E., Singh, J.P., Gupta, A.: The SPLASH-2 programs:
Characterization and methodological considerations. In: ISCA, pp. 24–36 (1995)
Experimental Framework for Injecting Logic
Errors in a Virtual Machine to Profile
Applications for Soft Error Resilience
1 Introduction
have found that major undertakings would be required to create resilient next-
generation systems[6,5]. High performance computing (HPC) systems of today
already struggle with reliability and these concerns are expected to only amplify
as systems are pushed to even larger scales.
The high performance computing (HPC) field of resilience aims to find ways to run applications on often unreliable hardware, with an emphasis on making timely progress toward a correct solution. The goal of resilience is to move beyond merely tolerating faults toward coexisting with failure, to a point where failure is recognized as the norm and not the exception.
One of the more daunting areas of resilience research is soft errors - those errors
which are generally transient in nature and difficult or impossible to reproduce.
Often these errors cause incorrect data values to be present in the system. While
soft errors are generally rare, there is evidence that their rate is increasing as feature sizes and voltages decrease [10]. Not only will these increasingly common errors negatively impact performance while hardware corrects some of them; we believe these errors will occur not only in the more familiar memory but also in logic circuits, where traditional techniques will neither detect nor be able to correct the error. This leads us to believe that next-generation systems will either have to be hardened to get around these errors, or application programmers will have to learn to design for systems that give incorrect answers with some noticeable probability.
In this work we present SEFI, the Soft Error Fault Injection framework, a tool
aimed at quantifying just how resilient an application is to soft errors. While our
goal is to look at both corrupted data in memory and corrupted logic circuits, we
start our research by examining the latter. We choose to focus on logic errors as
faults in memory have been studied in the past and, to a large extent, hardware
to detect and correct such errors exists. Our software tools inject soft errors in the
logic operations at known locations in an application which allows us to observe
how the application responds to faulty behavior of the simulated hardware.
The rest of this paper is organized as follows: Section 2 presents an overview
of the logic soft error injection framework and then Section 3 outlines an initial
experiment and discusses the results. In Section 4 we discuss the importance of
this work and its intended uses. Section 5 compares our approach with other
work in the field. Finally, Section 6 discusses the future work and we conclude
with our findings in Section 7.
2 Overview of Methodology
SEFI’s logic soft error injection operational flow is roughly depicted in Figure 1.
First, the guest environment is booted and the application to inject faults into
is started. Next, we probe the guest operating system for information related to
the code region of the target application and notify the VM which code regions
to watch. Then the application is released, allowing it to run. The VM observes
the instructions occurring on the machine and augments those of interest. A more detailed explanation of these techniques follows.
2.1 Startup
Initial startup of SEFI begins by simply booting a debug enabled Linux kernel
within a standard QEMU virtual machine. QEMU allows us to start a gdbserver
within the QEMU monitor such that we can attach to the running Linux kernel
with an external gdb instance. This allows us to set breakpoints and extract
kernel data structures from outside the guest operating system as well as from
outside QEMU itself. This is a fairly standard technique used by many Linux
kernel developers. Figure 2 depicts the startup phase.
2.2 Probe
Once the guest Linux operating system is fully booted and sitting idle, we use the attached external gdb to set a breakpoint at the end of the sys_exec call tree but before an application is sent to a CPU to be executed. We are currently focused only on ELF binaries and have therefore set our breakpoint at the end of the load_elf_binary routine. This is trivial to generalize to other binary formats in future work. With the breakpoint set we are free to issue a continue via gdb to allow the Linux kernel to operate. The application of interest can now be started and will almost immediately hit our breakpoint and bring the kernel back to a stopped state. By this point in the exec procedure the kernel has already loaded the application's text section into physical memory, in a region denoted by the start_code and end_code elements of the task's mm_struct memory structure. We can now extract the location in memory assigned to our application by walking the task list in the kernel. Starting with the symbol init_task, we can find the application of interest either by comparing a binary name to the task_struct's comm field or by searching for a known pid, which is also contained in the task_struct. The physical addresses
within the VM of the application’s text region can now be fed into our fault
injection code in the modified QEMU virtual machine. Currently this is done
by hand but we have plans to automate this discovery and transfer using scripts
and hypervisor calls.
Figure 3 depicts the probe phase of SEFI.
Fig. 3. The probe phase: an external gdb attached to the guest Linux kernel running under QEMU sets a breakpoint at the end of load_elf_binary and extracts task->mm->start_code and task->mm->end_code for the fault injection setup
In Figure 4 we see that once QEMU has the code segment range of the target application, the application is resumed. Next, when any opcode that we are interested in injecting faults into is executed by the guest hardware, QEMU checks the current instruction pointer register (EIP). If that instruction pointer address is within the range of the target application (obtained during the probe phase), QEMU knows that the application we are targeting is running this particular instruction. At this point we are able to inject any number of faults with confidence that we are affecting only the desired application.
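A self-contained toy in C (our illustration of the idea, not QEMU code; the structure and helper names are invented, and a 64-bit IEEE double is assumed) showing the check SEFI performs inside an instrumented opcode helper:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Minimal stand-in for the emulated CPU state and the probed code range. */
struct cpu_state { uint64_t eip; };
static uint64_t target_start = 0x400000, target_end = 0x401000;   /* from the probe phase */

/* Instrumented "multiply" helper: emulate the opcode, then corrupt the
   result only if the instruction belongs to the target application.     */
static double helper_fmul(struct cpu_state *cpu, double a, double b)
{
    double r = a * b;
    if (cpu->eip >= target_start && cpu->eip < target_end) {
        uint64_t bits;
        memcpy(&bits, &r, sizeof bits);     /* assumes 64-bit double          */
        bits ^= 1ull << 52;                 /* flip the least-significant exponent bit */
        memcpy(&r, &bits, sizeof bits);
    }
    return r;
}

int main(void)
{
    struct cpu_state cpu = { 0x400123 };    /* inside the target's text range */
    printf("faulty 1.0 * 0.9 = %.17g\n", helper_fmul(&cpu, 1.0, 0.9));
    cpu.eip = 0x500000;                     /* outside: result is untouched   */
    printf("clean  1.0 * 0.9 = %.17g\n", helper_fmul(&cpu, 1.0, 0.9));
    return 0;
}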
Fig. 4. The injection phase: the probe breakpoint is removed and execution continues, with QEMU injecting faults into instructions whose EIP falls within the target application's code range
The opcode fault injection code has several capabilities. Firstly, it can simply flip a bit in the inputs of the operation; flipping a bit in the input simulates a soft error in the input registers used for this operation. Secondly, it can flip a bit in the output of the operation; this simulates either a soft error in the actual operation of the logic unit (such as a faulty multiplier) or a soft error in the register where the data value is stored. Currently the bit flipping is random but can be seeded to produce errors in a specified bit range. Thirdly, the opcode fault injection can perform complicated changes to the output of operations by flipping multiple bits in a pattern consistent with an error in part, but not all, of an opcode's physical circuitry. For example, consider the difference in the output of adding two floating point numbers of differing exponents if a transient error occurs for one of the numbers while its significant digits are being aligned so that they can be added. By carefully considering the elements of such an operation, we can alter its output to reflect all the different incorrect outputs that might occur.
The fault injector also has the ability to let some calls to the opcode go unmodified. It is possible to cause the faults to occur after a certain number of calls or with some probability. In this way the fault can occur on every call, which closely emulates permanently damaged hardware, or a single call can be made faulty to emulate a transient soft error.
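That call-count/probability policy can be captured in a few lines of C (again our own sketch with invented names, not SEFI source):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct inject_policy {
    long   trigger_call;      /* inject on this call number (0 = disabled)      */
    double probability;       /* ...or inject with this probability per call    */
    bool   permanent;         /* true: keep injecting (stuck-at-like behaviour) */
    int    lo_bit, hi_bit;    /* restrict flips to this bit range of the output */
    long   calls;             /* internal call counter                          */
    bool   fired;
};

static bool should_inject(struct inject_policy *p)
{
    p->calls++;
    if (p->fired) return p->permanent;      /* permanent fault keeps firing; transient fires once */
    bool hit = (p->trigger_call && p->calls == p->trigger_call) ||
               ((double)rand() / RAND_MAX < p->probability);
    if (hit) p->fired = true;
    return hit;
}

static uint64_t flip_in_range(const struct inject_policy *p, uint64_t value)
{
    int span = p->hi_bit - p->lo_bit + 1;
    return value ^ (1ull << (p->lo_bit + rand() % span));
}

int main(void)
{
    srand(12345);                            /* seeded for repeatability, as described above */
    struct inject_policy p = { 10, 0.0, false, 0, 63, 0, false };
    for (long i = 1; i <= 12; i++) {
        uint64_t out = 0xABCD;               /* pretend opcode result           */
        if (should_inject(&p)) out = flip_in_range(&p, out);
        printf("call %2ld: 0x%llx\n", i, (unsigned long long)out);
    }
    return 0;
}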
3 Experiments
y = y ∗ 0.9 (1)
Fig. 5. The multiplication experiment uses the floating point multiply instruction
where a variable initially is set to 1.0 and is repeatedly multiplied by 0.9. For five
different experiments a random bit was flipped in the output of the multiply at itera-
tion 10, simulating a soft error in the logic unit or output register.
(a) Multiply experiment, area of interest: injected faults. (b) Multiply experiment, area of interest: final results.
Fig. 6. Experiment #1 with the focus on the injection point (a) and the effects on the
final solution (b). In (a) it can be seen that each of the five separately injected faults
all cause the value of y to change - once radically, the other times slightly. In (b) it can
be seen that the final output of the algorithm differs due to these injected faults.
Run   A      B       C      D      E     F     G                H
      30.0   30.0    30.0   30.0   30.0  30.0  30.0             30.0
      31.0   31.125  32.0   481.0  23.0  8.5   128849018881.0   1966081.0
      32.0   32.125  33.0   482.0  24.0  9.5   128849018882.0   1966082.0
y = y + 1.0 (2)
These experiments were crafted to demonstrate the capability of SEFI to inject
errors into specific instructions and clearly do not represent interesting applica-
tions. The next steps will be to inject faults into benchmark applications (such as
BLAS and LAPACK) to study the soft error vulnerability of those applications.
4 Intended Uses
It is our intention to use SEFI to study the susceptibility of applications to soft errors (logic initially, later followed by memory). We expect to be able to produce reports on the vulnerability of applications at a fine-grained level: at least at the function level and perhaps at the instruction level. We have demonstrated that we can inject logic faults at specific assembly instructions, but translating those instructions back to the original higher-level language statements will likely prove complex.
Hardware designers expend a great deal of resources to prevent soft errors from propagating into the software stack. While current wisdom is that these
protections are necessary, there are a variety of applications that could survive
with a great deal less protection and would willingly trade resilience for increases
in performance or decreases in power or cost. We believe SEFI begins to present
a way to experiment with and quantify the level of resilience of an application
to soft errors and might be useful in co-design of future systems.
5 Related Work
The work presented in this paper builds on years of open source research on
QEMU [1], a processor emulator and virtual machine. The work of Bronevetsky et al. [3,4,2] is probably the closest related work to SEFI in the high performance computing field. In [2] they create a fault injection tool for MPI that simulates MPI faults
that are often seen on HPC systems, such as stalls and dropped messages. In
[3,4] they performed random bit flips of application memories and observed how
the application responded.
It is important to understand the difference between our approach and that
presented in the memory bit flipping work of Bronevetsky. Bronevetsky’s ap-
proach most likely closely simulates a bit flip caused by a transient soft error in
that the bit flip happens randomly in memory. While they target these bit flips
at a target application, there appears to be no correlation with whether the memory region will actually be used by the application. As stated, this closely approximates a real transient soft error. Our work, on the other hand, directly targets specific instructions and forces corruption to appear at those points. This approach is aimed more directly at hardening a code against soft errors. It is our intention
to add functionality similar to Bronevetsky’s approach as a plug-in to SEFI in
future work.
Naughton et al. [9] developed a fault injection framework that either uses ptrace or the Linux kernel's built-in fault injection framework. The kernel approach allows injection of three different types of errors: slab errors, page allocation errors, and disk I/O errors. While both approaches in this work are similar to SEFI, our technique allows us to probe a wider range of possible faults.
TEMU [11] is a tool built upon QEMU, like SEFI. The TEMU BitBlaze infrastructure is used to analyze applications for "taint" in a security context. This tool does binary analysis using the tracecap software. We have not yet had the time to determine whether this suite of tools is usable for our interests, but it does appear promising that we can build upon TEMU.
NFTAPE[12] is a tool which is similar to SEFI in that it provides a fault
injection framework for conducting experiments on a variety of types of faults.
NFTAPE is a commercial tool, however, and therefore we have not had the
luxury of experimenting with it to this point.
6 Future Work
In order to validate our simulation of soft errors in logic we plan to test the same
applications we use in the VM on actual hardware subjected to high neutron
fluxes. Neutrons are well known to be the component of cosmic ray showers
that causes the greatest damage to computer circuits[13]. Neutrons are known
to cause both transient errors due to charge deposition and hard failures due to
permanent damage. We will use the neutron beam at the Los Alamos Neutron
Science Center (LANSCE) to approximate the cosmic ray induced events in a
logic circuit over the lifetime of a piece of computational hardware. Previous
work using the LANSCE beam has shown its usefulness in inducing silent data
corruption (SDC) in applications of interest.
Future versions of SEFI will include plugins to simulate more sophisticated
types of faults. Logic errors are unlikely to consist of simple random bit flips.
290 N. DeBardeleben et al.
We believe the combination of SEFI testing and neutron beam validation will
allow us to build realistic models of specific types of logic failures. We also plan
on extending SEFI to model multi-bit memory errors which are undetectable by
current memory correction techniques.
7 Conclusion
In this paper we have demonstrated the capability to inject simulated soft errors
into a virtual machine’s instruction emulation facilities. More importantly, we
have demonstrated how to target these errors so as to be able to reasonably
conduct experiments on the soft error vulnerability of a target application. This
type of experimentation is usually complicated because the faults that are introduced cause errors in other portions of the system, especially the operating system, and often result in outright crashes. This makes getting meaningful data about the
injected faults difficult. The approach presented in this paper gets around these
limitations and provides quite a bit of control.
References
1. Bellard, F.: Qemu, a fast and portable dynamic translator. In: Proceedings of the
Annual Conference on USENIX Annual Technical Conference, ATEC 2005, p. 41.
USENIX Association, Berkeley (2005)
2. Bronevetsky, G., Laguna, I., Bagchi, S., de Supinski, B., Schulz, M., Anh, D.: Statistical fault detection for parallel applications with AutomaDeD. In: IEEE Workshop on Silicon Errors in Logic - System Effects, SELSE (March 2010)
3. Bronevetsky, G., de Supinski, B.: Soft error vulnerability of iterative linear algebra
methods. In: Workshop on Silicon Errors in Logic - System Effects, SELSE (April
2007)
4. Bronevetsky, G., de Supinski, B.R., Schulz, M.: A foundation for the accurate prediction of the soft error vulnerability of scientific applications. In: IEEE Workshop on Silicon Errors in Logic - System Effects (March 2009)
5. Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., Snir, M.: Toward exascale
resilience. International Journal of High Performance Computing Applications 23,
374–388 (2009)
6. DeBardeleben, N., Laros, J., Daly, J., Scott, S., Engelmann, C., Harrod, B.: High-
end computing resilience: Analysis of issues facing the hec community and path-
forward for research and development (December 2009),
http://institute.lanl.gov/resilience/docs/HECResilience.pdf
7. Dongarra, J., et al.: The international exascale software project roadmap. Interna-
tional Journal of High Performance Computing Applications 25, 3–60 (2011)
8. Kogge, P., et al.: Exascale computing study: Technology challenges in achieving
exascale systems (2008)
9. Naughton, T., Bland, W., Vallee, G., Engelmann, C., Scott, S.L.: Fault injection
framework for system resilience evaluation: fake faults for finding future failures. In:
Proceedings of the 2009 Workshop on Resiliency in High Performance, Resilience
2009, pp. 23–28. ACM, New York (2009)
10. Quinn, H., Graham, P.: Terrestrial-based radiation upsets: A cautionary tale. In:
Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Cus-
tom Computing Machines, pp. 193–202. IEEE Computer Society, Washington, DC
(2005)
11. Song, D., Brumley, D., Yin, H., Caballero, J., Jager, I., Kang, M.G., Liang, Z., Newsome, J., Poosankam, P., Saxena, P.: A high-level overview covering Vine, TEMU, and Rudder. In: Proceedings of the 4th International Conference on Information Systems Security (December 2008)
12. Stott, D., Floering, B., Burke, D., Kalbarczpk, Z., Iyer, R.: Nftape: a framework
for assessing dependability in distributed systems with lightweight fault injectors.
In: Proceedings of IEEE International Computer Performance and Dependability
Symposium, IPDS 2000, pp. 91–100 (2000)
13. Ziegler, J.F., Lanford, W.A.: The effect of sea level cosmic rays on electric devices.
Journal Applied Physics 528 (1981)
High Availability on Cloud with HA-OSCAR
1 Introduction
Cloud computing refers to a service-oriented paradigm in which service providers offer computing resources such as hardware, software, storage and platforms as services, according to the demands of the user. The benefits of cloud computing are higher utilization of the available computing resources and a reduced burden on end-users, who rent resources rather than administer them, which in turn increases economic efficiency [1]. Cloud computing pools computing resources and manages them automatically through dynamic provisioning, often using virtualized resources. Users and client companies do not deal with software and hardware administration, since they can buy these virtual resources from cloud service providers as needed [2]. The focus of cloud computing is to provide easy, secure, fast, convenient and inexpensive computing and data storage services centered on the Internet. This, however, transfers the responsibility for ensuring QoS to the service providers.
HA systems are increasingly vital because of their ability to sustain critical services for users. HA services are especially important in clouds because
companies and users depend on cloud providers for their critical data. For cloud computing to be effective in business, scientific research and other domains, high availability is a must. We therefore consider it critically important to enable cloud infrastructures with HA.
The HA-OSCAR [7] project grew out of the OSCAR (Open Source Cluster Application Resources) project. OSCAR is a cluster software stack that provides a high-performance computing runtime and tools for cluster computing [5]. The HA-OSCAR project was formed to leverage existing OSCAR technology and provide high-availability capabilities for OSCAR clusters; it introduces several enhancements and new features to OSCAR, mainly in the areas of availability, scalability and security [5],[11]. Initially, HA-OSCAR [8] supported only OSCAR clusters; the current version, however, supports most Linux-based IT infrastructures, not just OSCAR clusters. HA-OSCAR is therefore a capable cloud platform that provides not only scalability, through its cluster computing capability with OSCAR or the like, but also HA solutions.
In this paper we introduce a new system, HA-OSCAR 2.0 [14] (an open-solution HA-enabling framework for mission-critical systems), capable of enhancing HA in cloud platforms by adopting component redundancy to eliminate single points of failure. System-critical resources are replicated so that the failure of any one resource does not take down the entire system, thereby making the cloud infrastructure highly available, especially at the head node. Among the new and improved features of this version are self-healing mechanisms, failure detection, automatic synchronization, and fail-over and fail-back functionality [7], [14].
2 Related Work
OSCAR-V [6] provides an enhanced set of tools and packages for the creation, deployment and management of virtual machines and host operating systems within a physical cluster. Virtualization in OSCAR clusters is needed to decouple the operating system, customize the execution environment, and provision computing according to the needs of the user. Combined with increased HA, this makes it well suited to HPC cloud computation. OSCAR-V uses Xen as the virtualization solution and provides V2M for virtual machine management on physical clusters.
Another interesting and highly scalable cluster solution in cloud computing is Rocks+ [9]. It can be used to run a public cloud or to set up an internal private cloud. Rocks+ is based on the well-known Rocks software, which contains all the software components required to easily build and maintain a cluster or cloud. Rocks+ can manage an entire data center, running all the computational resources and services necessary to operate a cloud infrastructure from a single management point. With its pre-packaged Rolls software, Rocks+ allows users to build web servers, database servers, and compute servers in the cloud, and it also provides a framework for user-specific needs on clouds. Rocks+ provides CPU and GPU cluster management in less time and at lower cost through software automation [10].
3 HA-OSCAR 2.0
HA-OSCAR 2.0 is an open source framework that provides HA for mission critical
applications and ease of system provisioning and administration. The main goal of the
new HA-OSCAR Project is to provide an open solution, with improved flexibility,
which seeks to combine the power of HA and High Performance Computing (HPC)
solutions. It is capable of enhancing HA for potential cloud computing infrastructure
such as web services, and HPC clouds by providing the much-needed redundancy for
mission critical applications. To achieve HA, HA-OSCAR 2.0 uses HATCI (High
Availability Tools Configuration and Installation). HATCI is composed of three
components: Node Redundancy, Service Redundancy and Data Replication Services.
The installation process requires just a few steps with minimum user input. HA-
OSCAR 2.0 incorporates a feature to clone the system in the installation step to make
the data and software stacks consistent on the standby head node or a cloud gateway.
If the primary component fails, the cloned node takes over the responsibilities. HA-
OSCAR also features monitoring services with a flexible event-driven rule-based
system. Moreover, it provides data synchronization between the primary and
secondary system. All of these features are enabled during the installation process. Data synchronization is handled by HA-filemon, a new module that was not available in earlier versions of HA-OSCAR; it monitors the replicated files and triggers synchronization accordingly. During a fail-back event, data is synchronized from the standby to the primary server. This backwards synchronization propagates the changes made on the secondary server while it acted as the head node. By default, HA-filemon invokes rsync 2 minutes after it detects the first change in the files, so that groups of changes are transmitted together; users can adjust this delay to the needs of their applications. SystemImager is used to clone the primary node during the installation process: it creates a standby head-node image from the primary server.
Finally, HA-OSCAR 2.0 supports virtual machine management via integration with
OSCAR-V.
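To make the synchronization behaviour concrete, the sketch below shows one way such a delayed, batched rsync could be driven. It is a hypothetical illustration only (the watched directory, rsync destination and polling interval are assumptions), not HA-OSCAR's actual HA-filemon implementation.

```python
# Hypothetical file monitor: after the first change is detected, wait a grace
# period (2 minutes, as in HA-OSCAR's default) so that a group of changes is
# transferred together, then invoke rsync to the standby head node.
import os
import subprocess
import time

WATCHED_DIR = "/var/ha-oscar/sync"              # assumed directory to keep consistent
STANDBY_TARGET = "standby:/var/ha-oscar/sync"   # assumed rsync destination
GRACE_PERIOD = 120                              # seconds to wait after the first change
POLL_INTERVAL = 5

def snapshot(path):
    """Map every file to its modification time so changes can be detected."""
    state = {}
    for root, _, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                state[full] = os.stat(full).st_mtime
            except FileNotFoundError:
                pass
    return state

def main():
    last = snapshot(WATCHED_DIR)
    while True:
        time.sleep(POLL_INTERVAL)
        if snapshot(WATCHED_DIR) != last:
            time.sleep(GRACE_PERIOD)   # let a burst of changes accumulate
            subprocess.run(["rsync", "-az", "--delete",
                            WATCHED_DIR + "/", STANDBY_TARGET], check=False)
            last = snapshot(WATCHED_DIR)

if __name__ == "__main__":
    main()
```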
The incorporation of all the above services endows HA-OSCAR 2.0 with the
ability to provide true HA and high performance for cloud users, for whom critical
services must be guaranteed. Thus HA-OSCAR 2.0 is a potentially viable open source
solution for achieving HA in cloud computing.
A mission-critical web service is an example where HA-OSCAR 2.0 can be applied to a cloud. Hardware or software failures, as well as routine maintenance, are potential causes of service unavailability. Installing HA-OSCAR to provide component redundancy can alleviate this problem. During the installation of HA-OSCAR 2.0 for web services, a clone of the primary web server is made that acts as a standby server, and data synchronization between the primary and standby servers maintains data consistency. The primary web server receives requests from clients and either serves them directly or reroutes them to a web farm via LVS [15] or the like. When a failure occurs on the primary web server, the standby web server takes over as the primary. It is automatically configured with the same IP address as the primary web server, so all requests are redirected to the standby server at the same advertised cloud address, making the web service highly available. When the primary web server is available again and the repair is completed, by default this server becomes the standby. If users want the repaired server to act as the primary again, they have to run the fail-back script.
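The failover step just described can be illustrated with a small monitoring script. The sketch below is hypothetical (the service URL, virtual IP, interface name and thresholds are assumptions) and only mirrors the idea of detecting a dead primary and taking over its advertised address; it is not HA-OSCAR's actual rule-based monitor.

```python
# Hypothetical failover monitor run on the standby web server: detect that the
# primary is unreachable, then bind the advertised service IP locally so client
# requests are redirected to the standby.
import subprocess
import time
import urllib.request

PRIMARY_URL = "http://10.0.0.10/"   # assumed address of the primary web server
VIRTUAL_IP = "10.0.0.10/24"         # assumed service address to take over
INTERFACE = "eth0"                  # assumed public interface of the standby
CHECK_INTERVAL = 5                  # seconds between health checks
FAILURES_BEFORE_TAKEOVER = 3

def primary_alive():
    try:
        urllib.request.urlopen(PRIMARY_URL, timeout=2)
        return True
    except Exception:
        return False

def take_over():
    # Bind the advertised service IP on the standby node.
    subprocess.run(["ip", "addr", "add", VIRTUAL_IP, "dev", INTERFACE], check=True)
    # Announce the new IP-to-MAC mapping with gratuitous ARP so switches update.
    subprocess.run(["arping", "-U", "-c", "3", "-I", INTERFACE,
                    VIRTUAL_IP.split("/")[0]], check=False)

def main():
    misses = 0
    while True:
        misses = 0 if primary_alive() else misses + 1
        if misses >= FAILURES_BEFORE_TAKEOVER:
            take_over()
            break
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```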
4 System Architecture
In this section, we first examine the OSCAR-V architecture as a potential HPC cloud and its anatomy, in order to identify single-point-of-failure components. This provides an opportunity to introduce system-level redundancy as an HA improvement over the existing OSCAR-V cluster framework. Only a brief description of the proposed architecture is given here; additional HA-OSCAR details may be found in [14].
An OSCAR cluster is composed of two types of nodes: head nodes and compute nodes. The head node handles service requests and routes the appropriate tasks to compute nodes, which are primarily dedicated to computation. The present OSCAR-V cluster architecture consists of a single server node and a number of client nodes, where all the client nodes can be virtualized by the Xen virtualization solution.
5 System Model
In this section, we model the availability of an HA-enabled cloud that will not suffer from a single point of failure. We made several assumptions for the state-space model, as follows:
• The time to failure of both the virtualization system servers and the switches is exponentially distributed, with parameters λv for the servers and λw for the switches, respectively. We treat a virtualization system server, i.e. the virtualization layer together with its physical server, as a single component.
• Failed components can be repaired.
• The times to repair a server and a switch are exponentially distributed with parameters μ and β, respectively.
• When the system is down, no further failure can take place. Hence, for the
OSCAR-V cluster, when the server is down, no further failure can take place
on the switch. Similarly, when the switch is down, no further failure can take
place on the server. For HA-OSCAR-V clusters, when both servers are down,
no further failure can take place on the switches. Similarly, when both
switches are down, no further failure can take place on the HA-OSCAR-V
cluster.
Figure 4 shows the CTMC [12], [13] model corresponding to the HA-OSCAR-V cluster system. Table 1 shows the states, the number of operating components, and the corresponding system status. The system is available for service in states 1, 2, 4 and 5, and is unavailable in states 3, 6, 7 and 8. The system moves from one state to another at the rates shown on the arrows in Figure 4.
6 Availability Analysis
Let πi be the steady-state probability of state i of the CTMC. These probabilities satisfy

Σi πi = 1 and πQ = 0,

where Q is the infinitesimal generator matrix [13]. Let U be the set of up states; the availability of the system is then

A = Σi∈U πi .   (1)
Closed-form availability expressions for the OSCAR-V and HA-OSCAR-V cluster systems (Equations 2 and 3) follow by solving these balance equations for the corresponding generator matrices.
6.3 HA Comparison
We assume that λv = 0.001 hr−1, λw = 0.0005 hr−1, μ = 0.5 hr−1, and β = 1.0 hr−1. With Equations 1, 2, and 3 [13], we can calculate the availability of each system. The availability of the OSCAR-V server cluster is 0.996, while the availability of the HA-OSCAR-V cluster system is 0.99999; the corresponding yearly downtimes are 39.2 hours and 4.45 minutes, respectively. Typically, an HA system is one whose availability is at least "five nines" (99.999%), i.e., whose downtime does not exceed roughly five minutes per year.
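The steady-state computation behind such figures can be carried out numerically. The sketch below solves πQ = 0 with Σ πi = 1 and sums the up-state probabilities; since the eight-state generator of Figure 4 is not reproduced in this excerpt, it uses a deliberately simplified three-state chain (a single server and switch, no redundancy) with the rates above, so its output does not match the paper's numbers.

```python
# Solve pi Q = 0 subject to sum(pi) = 1 for a small CTMC, then sum the
# probabilities of the "up" states to obtain availability.
# States of this simplified chain: 0 = up, 1 = server down, 2 = switch down.
# This is NOT the 8-state HA-OSCAR-V model of Figure 4; it only shows the method.
import numpy as np

lam_v, lam_w = 0.001, 0.0005   # failure rates (per hour) for server and switch
mu, beta = 0.5, 1.0            # repair rates (per hour) for server and switch

Q = np.array([
    [-(lam_v + lam_w), lam_v, lam_w],   # from "up": server or switch fails
    [mu, -mu, 0.0],                     # server repaired -> up
    [beta, 0.0, -beta],                 # switch repaired -> up
])

# Replace one balance equation by the normalization constraint sum(pi) = 1.
A = np.vstack([Q.T[:-1], np.ones(3)])
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)

availability = pi[[0]].sum()            # state 0 is the only up state here
downtime_hours_per_year = (1.0 - availability) * 8760
print(f"availability = {availability:.6f}, downtime = {downtime_hours_per_year:.1f} h/year")
```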
References
1. Zhang, S., Zhang, S., Chen, X., Huo, X.: Cloud Computing Research and Development
Trend. In: Second International Conference on Future Networks, ICFN 2010, January 22-
24, pp. 93–97 (2010)
2. Zhang, S., Zhang, S., Chen, X., Wu, S.: Analysis and Research of Cloud Computing
System Instance. In: 2010 Second International Conference on Future Networks, ICFN
2010, pp. 88–92 (2010)
3. Jung, G., Joshi, K.R., Hiltunen, M.A.: Performance and Availability Aware Regeneration
for Cloud Based Multitier Application. In: Dependable Systems and Networks (DSN), pp.
497–506 (2010)
4. Cully, B., Lefebvre, G., Meyer, D., Feeley, M., Hutchinson, N., Warfield, A.: Remus:
High Availability via Asynchronous Virtual Machine Replication. In: 5th USENIX
Symposium on Networked Systems Design and Implementation (2008)
5. Brim, M.J., Mattson, T.G., Scott, S.L.: OSCAR: Open Source Cluster Application
Resources. In: Ottawa Linux Symposium 2001, Ottawa, Canada (2001)
6. OSCAR-V, http://www.csm.ornl.gov/srt/oscarv/
7. Leangsuksun, C., Liu, T., Scott, S.L., Libby, R., Haddad, I., et al.: HA-OSCAR Release
1.0: Unleashing HABeowulf. In: International Symposium on High Performance
Computing Systems (HPCS), Canada (May 2004)
8. Haddad, I., Leangsuksun, C., Scott, S.L.: HA-OSCAR: the birth of highly available
OSCAR. Linux J. 2003(115), 1 (2003)
9. Rocks+, http://www.clustercorp.com/
10. http://www.hpcwire.com/offthewire/Clustercorp-Brings-Rocks-to-the-Cloud-108706864.html
11. Leangsuksun, C.B., Shen, L., Liu, T., Scott, S.L.: Achieving HA and performance computing with an HA-OSCAR cluster. Future Generation Computer Systems 21(4), 597–606 (2005)
12. Leangsuksun, C., Shen, L., Song, H., Scott, S.L., Haddad, I.: The Modeling and
Dependability Analysis of High Availability OSCAR Cluster. In: The 17th Annual
International Symposium on High Performance Computing Systems and Applications,
Quebec, Canada, pp. 11–14 (May 2003)
13. Trivedi, K.S.: Probability and Statistics with Reliability, Queuing, and Computer Science
Applications. John Wiley and Sons, New York (2001)
14. HA-OSCAR 2.0, http://hpci.latech.edu/blog/?page_id=45
15. Linux Virtual Server (LVS), http://www.linuxvirtualserver.org/
16. Nimbus, http://www.nimbusproject.org
17. Bonvin, N., Papaioannou, T.G., Aberer, K.: An Economic Approach for Scalable and Highly-Available Distributed Applications. In: IEEE International Conference on Cloud Computing (2010)
On the Viability of Checkpoint Compression
for Extreme Scale Fault Tolerance
1 Introduction
Over the past few decades, high-performance computing (HPC) systems have increased dramatically in size, and these trends are expected to continue. On the most recent Top 500 list [27], 223 (44.6%) of the 500 entries have more than 8,192 cores, compared to 15 (3.0%) just 5 years ago. Also on this most recent list, four of the systems are larger than 200K cores, an additional six are larger than 128K cores, and another six are larger than 64K cores. Lawrence Livermore National Laboratory is scheduled to receive its 1.6-million-core system, Sequoia [2], this year. Furthermore, future extreme-scale systems are projected to have on the order of tens to hundreds of millions of cores by 2020 [14].
It also is expected that future high-end systems will increase in complexity;
for example, heterogeneous systems like CPU/GPU-based systems are expected
to become much more prominent. Increased complexity generally suggests that
individual components will likely be more failure prone. Increased system sizes will also contribute to extremely low mean times between failures (MTBF), since system MTBF is inversely proportional to system size. Recent studies indeed conclude that system failure rates depend mostly on system size, in particular the number of processor chips in the system. These studies also conclude that if the current HPC system growth trend continues, the expected MTBF of the biggest systems on the Top 500 list will fall below 10 minutes within the next few years [10,26].
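As a back-of-the-envelope illustration of this inverse relationship (assuming independent node failures with a common, assumed per-node MTBF, so that the system failure rate is simply the sum of the node failure rates):

```python
# If node failures are independent with a common MTBF, the system failure rate is
# N times the node failure rate, so MTBF_system = MTBF_node / N. The per-node MTBF
# below (25 years) is an illustrative assumption, not a figure from this paper.
node_mtbf_hours = 25 * 365 * 24          # assumed per-node MTBF: 25 years

for nodes in (10_000, 100_000, 1_000_000):
    system_mtbf_hours = node_mtbf_hours / nodes
    print(f"{nodes:>9} nodes -> system MTBF ~ {system_mtbf_hours * 60:.1f} minutes")
```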
Checkpoint/restart [5] is perhaps the most commonly used HPC fault-tolerance
mechanism. During normal operation, checkpoint/restart protocols periodically
record process (and communication) state to storage devices that survive tol-
erated failures. Process state comprises all the state necessary to run a process
correctly including its memory and register states. When a process fails, a new
incarnation of the failed process is resumed from the intermediate state in the
failed process’ most recent checkpoint – thereby reducing the amount of lost com-
putation. Rollback recovery is a well studied, general fault tolerance mechanism.
However, recent studies [7,10] predict poor utilizations (approaching 0%) for ap-
plications running on imminent systems and the need for resources dedicated to
reliability.
If checkpoint/restart protocols are to be employed for future extreme scale
systems, checkpoint/restart overhead must be reduced. For the checkpoint com-
mit problem, saving an application checkpoint to stable storage, we can consider
two sets of strategies. The first set of strategies hide or reduce commit laten-
cies without actually reducing the amount of data to commit. These strategies
include concurrent checkpointing [17,18], diskless checkpointing [22] and check-
pointing filesystems [3]. The second set of strategies reduce commit latencies
by reducing checkpoint sizes. These strategies include memory exclusion [23],
incremental checkpointing [6] and multi-level checkpointing [19].
This work falls under the second set of strategies. We focus on reducing the
amount of checkpoint data, particularly via checkpoint compression. We have
one fundamental goal: to understand the viability of checkpoint compression
for the types of scientific applications expected to run at large scale on future
generation HPC systems. Using several mini-applications or mini apps from the
Mantevo Project [12] and the Berkeley Lab Checkpoint/Restart (BLCR) frame-
work [11], we explore the feasibility of state-of-the-field compression techniques
for efficiently reducing checkpoint sizes. We use a simple checkpoint compression
viability model to determine when checkpoint compression is a sensible choice,
that is, when the benefits of data reduction outweigh the drawbacks of compres-
sion latency.
In the next section, we present a general background of checkpoint/restart
methods, after which we describe previous work in checkpoint compression and
our checkpoint compression viability model. In Section 3, we describe the ap-
plications, compression algorithms and the checkpoint library that comprise our
evaluation framework as well as our experimental results. We conclude with a
discussion of the implications of our experimental results for future checkpoint
compression research.
2 Checkpoint Compression
Our simple checkpoint compression viability model states that compressing checkpoint data pays off when

commit-speed / compression-speed < compression-factor.   (1)
In other words, if the ratio of the checkpoint commit speed to the checkpoint compression speed is less than the compression factor, checkpoint data compression provides an overall reduction in checkpoint commit time (and space). Our model assumes that checkpoint commit is synchronous; that is, the primary application process is paused during the commit operation and is not resumed until the checkpoint commit is complete. In Section 4, we discuss the implications of this assumption.
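A direct encoding of the viability condition in Equation (1) is shown below; the numbers in the example are placeholders, not measurements from this study.

```python
# Checkpoint compression is worthwhile when
#   commit_speed / compression_speed < compression_factor,
# i.e. when the time saved committing a smaller checkpoint exceeds the time spent
# compressing it (synchronous commit assumed, as in the model above).
def compression_viable(commit_speed_mbs, compression_speed_mbs, compression_factor):
    return commit_speed_mbs / compression_speed_mbs < compression_factor

# Placeholder numbers for illustration only.
print(compression_viable(commit_speed_mbs=300.0,      # shared-filesystem write bandwidth
                         compression_speed_mbs=500.0,  # effective compression throughput
                         compression_factor=0.7))      # fraction of data eliminated
```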
¹ Mini apps are small, self-contained programs that embody essential performance characteristics of key applications.
² We do not present results for several other algorithms, for example gzip, that did not perform well.
We vary two pbzip2 parameters. The first parameter is the same block size
parameter as in bzip2. The second parameter defines the file block size into
which the original input file is partitioned. This is labeled as pbzip2(x, y),
where x is the value of the first parameter and y is the value of the second
parameter.
– rzip: Rzip uses a very large buffer to take advantage of redundancies that span very long distances. It finds and encodes large chunks of duplicate data and then uses bzip2 as a backend to compress the encoding.
We vary rzip's parameter, which controls the tradeoff between compression factor and compression latency. As was the case for zip, this integer parameter ranges from zero to nine, where lower values mean faster compression and nine means the best compression factor. In our charts we use the label rzip(x), where x is the value of this parameter.
For each application, the average uncompressed checkpoint size ranged from 311 MB to 393 MB. Our first set of results, presented in Figure 1, demonstrates how effective the various algorithms are at compressing checkpoint data. With the exception of rzip(-0), all the algorithms achieve a very high compression factor of about 70% or higher, where the compression factor is computed as 1 − (compressed size / uncompressed size). The primary distinguishing factor therefore becomes the compression speed, that is, how quickly the algorithms can compress the checkpoint data.
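For illustration, compression factor and speed can be measured for a single checkpoint file along the following lines (Python's built-in bz2 is used here as a stand-in for the evaluated compressors, and the file path is an assumption).

```python
# Measure compression factor (1 - compressed/uncompressed) and compression speed
# (MB/s) for one file, using bzip2 compression at a chosen level as an example.
import bz2
import time

def measure(path, level=9):
    with open(path, "rb") as f:
        data = f.read()
    start = time.perf_counter()
    compressed = bz2.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    factor = 1.0 - len(compressed) / len(data)
    speed_mbs = len(data) / (1 << 20) / elapsed
    return factor, speed_mbs

# Example usage with a hypothetical BLCR checkpoint file name:
# factor, speed = measure("context.12345")
# print(f"compression factor {factor:.2f}, speed {speed:.1f} MB/s")
```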
Figure 2 shows how long the algorithms take to compress the checkpoints. Not surprisingly, the parallel implementation of bzip2, pbzip2, generally outperforms all the other algorithms.
³ For each algorithm, a different set of parameter values constitutes a different test.
Fig. 1. Checkpoint compression ratios for the various algorithms and applications
4 Discussion
While the results of this preliminary study are promising, we observe several
shortcomings that we plan to address. These shortcomings include:
Fig. 2. Checkpoint compression times for the various algorithms and applications
Fig. 3. Checkpoint compression viability: unless the checkpoint commit rate exceeds the compression speed × compression factor product (y-axis), checkpoint compression is a good solution
References
11. Hargrove, P.H., Duell, J.C.: Berkeley lab checkpoint/restart (blcr) for linux clus-
ters. Journal of Physics: Conference Series 46(1) (2006)
12. Heroux, M.A., Doerfler, D.W., Crozier, P.S., Willenbring, J.M., Edwards, H.C.,
Williams, A., Rajan, M., Keiter, E.R., Thornquist, H.K., Numrich, R.W.: Improv-
ing performance via mini-applications. Technical Report SAND2009-5574, Sandia
National Laboratory (2009)
13. Morse Jr., K.G.: Compression tools compared (137) (September 2005)
14. Kogge, P.: ExaScale Computing Study: Technology Challenges in Achieving Ex-
ascale Systems. Technical report, Defense Advanced Research Projects Agency
Information Processing Techniques Office (DARPA IPTO) (September 2008)
15. Lee, J., Winslett, M., Ma, X., Yu, S.: Enhancing data migration performance
via parallel data compression. In: Proceedings International on Parallel and Dis-
tributed Processing Symposium, IPDPS 2002, Abstracts and CD-ROM, pp. 444–
451 (2002)
16. Li, C.-C., Fuchs, W.: Catch-compiler-assisted techniques for checkpointing. In: 20th
International Symposium on Fault-Tolerant Computing, FTCS-20, Digest of Pa-
pers, pp. 74–81 ( June 1990)
17. Li, K., Naughton, J.F., Plank, J.S.: Real-time, concurrent checkpoint for paral-
lel programs. In: 2nd ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPOPP 1990), pp. 79–88. ACM, Seattle (1990)
18. Li, K., Naughton, J.F., Plank, J.S.: Low-latency, concurrent checkpointing for par-
allel programs. IEEE Transactions on Parallel and Distributed Systems 5(8), 874–
879 (1994)
19. Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling,
and evaluation of a scalable multi-level checkpointing system. In: Proceedings of
the 2010 ACM/IEEE International Conference for High Performance Computing,
Networking, Storage and Analysis, SC 2010, pp. 1–11. IEEE Computer Society,
Washington, DC (2010)
20. Moshovos, A., Kostopoulos, A.: Cost-effective, high-performance giga-scale check-
point/restore. Technical report, University of Toronto (November 2004)
21. Pavlov, I.: LZMA SDK (Software Development Kit) (2007)
22. Plank, J., Li, K., Puening, M.: Diskless checkpointing. IEEE Transactions on Par-
allel and Distributed Systems 9(10), 972–986 (1998)
23. Plank, J.S., Chen, Y., Li, K., Beck, M., Kingsley, G.: Memory exclusion: Optimizing
the performance of checkpointing systems. Software – Practice & Experience 29(2),
125–142 (1999)
24. Plank, J.S., Li, K.: ickp: A consistent checkpointer for multicomputers. IEEE Par-
allel & Distributed Technology: Systems & Applications 2(2), 62–67 (1994)
25. Plank, J.S., Xu, J., Netzer, R.H.B.: Compressed differences: An algorithm for fast
incremental checkpointing. Technical Report CS-95-302, University of Tennessee
(August 1995)
26. Schroeder, B., Gibson, G.A.: A large-scale study of failures in high-performance
computing systems. In: Dependable Systems and Networks (DSN 2006), Philadel-
phia, PA (June 2006)
27. Top 500 Supercomputer Sites, http://www.top500.org/ (visited September 2011)
28. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE
Transactions on Information Theory 23(3), 337–343 (1977)
Can Checkpoint/Restart Mechanisms Benefit
from Hierarchical Data Staging?
1 Introduction
each component has only a very small chance of failure, the combination of all
components has a much higher chance of failure. The Mean Time Between Fail-
ures (MTBF) for typical HEC installations is currently estimated to be between
eight hours and fifteen days [19,7]. In order to continue computing past the
MTBF of the system, fault tolerance has become a necessity. The most common form of fault-tolerance solution on current-generation systems is checkpointing: an application or library periodically generates a checkpoint that encapsulates its state and saves it to stable storage (usually a central parallel filesystem). Upon a failure, the application can be rolled back to the last checkpoint.
Checkpoint/Restart support is provided by most of the commonly used MPI stacks [8,12,6]. Checkpointing mechanisms are notorious for the heavy I/O overhead of simultaneously dumping the images of many parallel processes to a shared filesystem, and many studies have been carried out to tackle this I/O bottleneck [16,5]. For example, SCR [15] proposes a multi-level checkpoint system that stores data on the local storage of each compute node and relies on redundant data copies to tolerate node failures. It requires a local disk or RAM disk to be present on each compute node to store checkpoint data. However, many clusters are diskless, and a memory-intensive application can effectively disable a RAM disk by using up most of the system memory; hence its applicability is constrained.
With the rapid advances in technology, many clusters are being built with high
performance commercial components such as high-speed low-latency networks
and advanced storage devices such as Solid State Drives (SSDs). These advanced
technologies provide an opportunity to redesign existing solutions to tackle the
I/O challenges imposed by Checkpoint/Restart. In this paper, we propose a
hierarchical data staging architecture to address the I/O bottleneck caused by
Checkpoint/Restart. Specifically we want to answer several questions:
1. How to design a hierarchical data staging architecture that can relieve com-
pute nodes from the relatively slow checkpoint writing, so that applications
can quickly resume execution?
2. How to leverage high speed network and new storage media such as SSD to
accelerate staging I/O performance?
3. How much of a performance penalty will the application have to pay to adopt
such a strategy?
We have designed a hierarchical data staging architecture that uses a dedicated set of staging server nodes to offload checkpoint writing. Experimental results show that the checkpoint time, as seen by the application, can be 8.3 times lower than with the basic approach in which each application process writes its checkpoint directly to a shared Lustre filesystem.
The rest of the paper is organized as follows. In Section 2, we give background on the key components involved in our design. In Section 3, we propose our hierarchical staging design. In Section 4, we present our experiments and evaluation. Related work is discussed in Section 5, and in Section 6 we present the conclusion and future work.
Fig. 1. Comparison between the direct checkpoint and the checkpoint staging approaches
2 Background
Filesystem in Userspace (FUSE). Filesystem in Userspace (FUSE) [1] is software that allows virtual filesystems to be created at user level. It relies on a kernel module to perform privileged operations at the kernel level, and provides a userspace library to communicate with this kernel module. FUSE is widely used to create filesystems that do not store the data themselves but rely on other resources to do so.
3 Detailed Design
The central principle of our Hierarchical Data Staging Framework is to provide
a fast and temporary storage area in order to absorb the I/O load burst induced
by a checkpointing operation. This fast staging area is governed by, what we
call, a Staging server. In addition to what a generic compute-node is configured
with, staging servers are over-provisioned with high-throughput SSDs and high-
bandwidth links. Given the fact that such hardware is expensive, this design
avoids the need to install them on every compute-node.
Figure 1 shows a comparison between the classic direct-checkpointing and our
checkpoint-staging approaches. On the left, with the classic approach, the check-
point files are directly written on the shared filesystem. Due to the heavy I/O
burden imposed on the shared filesystem by the checkpointing operation, the
parallel writes get multiplexed, and the aggregate throughput is reduced. This
increases the time for which the application blocks, waiting for the checkpoint-
ing operation to complete. On the right, with the staging approach, the staging
nodes are able to quickly absorb the large amount of data thrust upon them
by the client nodes, with the help of the scratch space provided by the staging
servers. Once the checkpoint data has been written to the staging nodes, the
application can resume. Then, the data transfer between the staging servers and the shared filesystem takes place in the background and overlaps with the computation. Hence, this approach reduces the idling time of the application due to the checkpoint. Regardless of which approach is chosen to write the checkpoint data, it eventually has to reach the same media.
We have designed and developed an efficient software subsystem which can
handle large, concurrent snapshot writes from typical rollback recovery protocols
and can leverage the fast storage services provided by the staging server. We use
this software subsystem to study the benefits of hierarchical data staging in
Checkpointing mechanisms.
Figure 2 shows a global overview of our Hierarchical Data Staging Framework
which has been designed for use with these staging nodes. A group of clients,
governed by a single staging server, represents a staging group. These staging
groups are building blocks of the entire architecture. Our design imposes no
restriction on the number of blocks that can be used in a system. The internal
interactions between the compute nodes and a staging server are illustrated for
one staging group in the figure.
With the proposed design, neither the application nor the MPI stack needs
to be modified to utilize the staging service. We have developed a virtual filesys-
tem based on FUSE [1] to provide this convenience. The applications that run
on compute nodes can access this staging filesystem just like any other local
filesystem. FUSE provides the ability to intercept standard filesystem calls such
as open(), read(), write(), close() etc., and manipulate the data as needed at
user-level, before forwarding the call and the data to the kernel. This ability is
exploited to transparently send the data to the staging area, rather than writing
to the local or shared filesystem.
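A minimal sketch of this interception idea, written against the fusepy bindings, is shown below. It is a hypothetical illustration (most filesystem operations, error handling, and the actual network transfer are omitted), not the staging filesystem described in this paper.

```python
# Hypothetical user-level staging filesystem: intercept write() calls made through
# the mount point and keep the data in a staging buffer instead of the local disk.
import errno
import stat
import time
from fuse import FUSE, Operations, FuseOSError   # provided by the fusepy package

class StagingFS(Operations):
    def __init__(self):
        self.files = {}   # path -> bytearray buffered before the transfer to the staging server

    def getattr(self, path, fh=None):
        if path == '/':
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path in self.files:
            return dict(st_mode=stat.S_IFREG | 0o644, st_nlink=1,
                        st_size=len(self.files[path]), st_mtime=time.time())
        raise FuseOSError(errno.ENOENT)

    def create(self, path, mode, fi=None):
        self.files[path] = bytearray()
        return 0

    def write(self, path, data, offset, fh):
        # Buffer the data for the staging server instead of writing it locally.
        buf = self.files.setdefault(path, bytearray())
        buf[offset:offset + len(data)] = data
        return len(data)

    def release(self, path, fh):
        # On close, hand the aggregated buffer to the transfer machinery (placeholder).
        enqueue_for_staging(path, bytes(self.files[path]))
        return 0

def enqueue_for_staging(path, data):
    # Placeholder for the network/RDMA transfer to the staging server.
    print(f"staging {len(data)} bytes for {path}")

if __name__ == '__main__':
    FUSE(StagingFS(), '/mnt/staging', nothreads=True, foreground=True)
```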
One of the major concerns with checkpointing is the high degree of concur-
rency with which multiple client nodes write process snapshots to a shared stable
storage subsystem. These concurrent write streams introduce severe contention
at the Virtual Filesystem Switch (VFS), which impairs the total throughput. To avoid this contention, caused by the small and medium-sized writes that are common in checkpointing, we use the write-aggregation method proposed and studied in [17]. It coalesces the write requests from the application/checkpointing library and groups them into fewer large writes, which in turn reduces the number of pages allocated to them from the page cache. After aggregating the data buffers, instead of writing them to the local disk, the buffers are enqueued in a work queue which is serviced by a separate thread that handles the network transfers.
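A simplified sketch of this write-aggregation scheme is shown below; the chunk size and the transfer callback are assumptions, and the code only illustrates the coalescing idea from [17], not the CRFS implementation.

```python
# Coalesce many small checkpoint writes into large chunks, then let a dedicated
# thread drain the queue and ship each chunk to the staging server.
import queue
import threading

CHUNK_SIZE = 4 * 1024 * 1024   # assumed aggregation buffer size: 4 MB

class WriteAggregator:
    def __init__(self, send_chunk):
        self.buffer = bytearray()
        self.outbox = queue.Queue()
        self.send_chunk = send_chunk
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, data: bytes):
        # Append the small write; emit a full chunk once enough data has accumulated.
        self.buffer.extend(data)
        while len(self.buffer) >= CHUNK_SIZE:
            self.outbox.put(bytes(self.buffer[:CHUNK_SIZE]))
            del self.buffer[:CHUNK_SIZE]

    def flush(self):
        if self.buffer:
            self.outbox.put(bytes(self.buffer))
            self.buffer = bytearray()
        self.outbox.put(None)          # sentinel: no more data

    def _drain(self):
        while True:
            chunk = self.outbox.get()
            if chunk is None:
                break
            self.send_chunk(chunk)     # e.g. network/RDMA transfer to the staging node
```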
The primary goal of this staging framework is to let the application which is
being checkpointed proceed with its computation as early as possible, without
penalizing it for the shortcomings of the underlying storage system. The Infini-
Band network fabric has RDMA capability which allows for direct reads/writes
to/from host memory without involving the host processor. This capability has
been exploited to directly read the data that is aggregated in the client’s mem-
ory, which then gets transferred to the staging node which governs it. The stag-
ing node writes the data to a high-throughput node-local SSD while it receives
chunks of data from the client node (step A1 in Fig. 2). Once the data has been
persisted in these Staging servers, the application can be certain that the check-
point has been safely stored, and can proceed with its computation phase. The
data from the SSDs on individual servers are then moved to a stable distributed
filesystem in a lazy manner (step A2 in Fig. 2).
Concerning the reliability of our staging approach, note that after a checkpoint all the checkpoint files are eventually stored in the same shared filesystem as with the direct-checkpointing approach, so both approaches provide the same reliability for the saved data. However, with the staging approach, the checkpointing operation is faster, which reduces the odds of losing checkpoint data due to a compute node failure. During a checkpoint, the staging servers do introduce additional points of failure. To counter the effects of such a failure, we ensure that the previous set of checkpoint files is not deleted before all the new ones have been safely transferred to the shared filesystem.
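That retention rule can be stated in a few lines; the paths and the transfer command in this sketch are assumptions, and the real framework performs the transfer asynchronously.

```python
# Keep the previous checkpoint generation until the new one has been fully copied
# from the staging SSD to the shared (Lustre) filesystem, so a staging-server
# failure during the background transfer never leaves the job without a valid checkpoint.
import shutil
import subprocess
from pathlib import Path

STAGING_DIR = Path("/ssd/staging/ckpt_0042")   # assumed staging location of the new checkpoint
SHARED_NEW = Path("/lustre/app/ckpt_0042")     # assumed destination on the shared filesystem
SHARED_PREV = Path("/lustre/app/ckpt_0041")    # previous, known-good checkpoint set

def commit_checkpoint():
    # 1. Background transfer: copy the staged checkpoint to the shared filesystem.
    subprocess.run(["rsync", "-a", f"{STAGING_DIR}/", f"{SHARED_NEW}/"], check=True)
    # 2. Only once the copy has succeeded is the older generation deleted.
    if SHARED_PREV.exists():
        shutil.rmtree(SHARED_PREV)
```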
4 Experimental Evaluation
4.1 Experimental Testbed
A 64-node InfiniBand Linux cluster was used for the experiments. Each client node
has eight processor cores on two Intel Xeon 2.33 GHz Quad-core CPUs. Each node
has 6 GB main memory and a 250 GB ext3 disk drive. The nodes are connected
with Mellanox MT25208 DDR InfiniBand HCAs for low-latency communication.
The nodes are also connected with a 1 GigE network for interactive logging and
maintenance purposes. Each node runs Linux 2.6.30 with FUSE library 2.8.5.
The primary shared storage partition is backed by Lustre. Lustre 1.8.3 is con-
figured using 1 MetaData Server (MDS) and 1 Object Storage Server (OSS), and
is set to use InfiniBand transport. The OSS uses a 12-disk RAID-0 configuration
which can provide a 300 MB/s write throughput.
Fig. 3. Throughput of a single staging server with varying number of clients and processes per client (higher is better)
The cluster also has 8 storage nodes, 4 of which have been configured as the "staging nodes" (as described in Fig. 2) for these experiments. Each of these 4 nodes has a PCI-Express based SSD card with 80 GB capacity; two of them are Fusion-io ioXtreme cards (350 MB/s write throughput) and the two others are Fusion-io ioDrive cards (600 MB/s write throughput).
Fig. 4. Throughput scalability analysis, with increasing number of staging groups and 8 clients per group (higher is better)
Figure 4 shows that the proposed architecture scales even as we increase the number of groups. This is expected, because the design adds I/O resources in proportion to the number of computing resources. Conversely, the Lustre configuration does not offer such a possibility, so the Lustre throughput stays constant. The maximal aggregated throughput observed for all the staging nodes is 1,834 MB/s, which is close to the sum of the write throughputs of the SSDs on these nodes (1,900 MB/s).
[Fig. 5. Checkpoint times measured for the direct Lustre approach and the staging approach, for both test cases]
Figure 5 reports the checkpointing time that we measured for the considered
application. For the proposed approach, two values are distinctly shown: the
checkpoint staging time (step A1 in Figure 2) and the background transfer time
(step A2 in Figure 2). The staging time is the checkpointing time as seen by the
application, i.e. the time during which the computation is suspended. The back-
ground transfer time is the time to transfer the checkpoint files from the staging
area to the Lustre filesystem, which takes place in parallel to the application
execution once the computation resumes.
For the classic approach, the checkpoint is written directly to the Lustre filesystem, so we show only the checkpoint time (step B in Figure 2). The application is blocked on the checkpointing operation for the entire duration shown.
The direct checkpoint and the background transfer both write the same amount of data to the same Lustre filesystem. The large difference between these transfer times (a factor of two or more) arises because, thanks to our hierarchical architecture, the contention on the shared filesystem is reduced. With the direct-checkpointing approach, 128 or 144 processes write their checkpoints simultaneously to the shared filesystem; with our staging approach, only 4 staging servers write to it simultaneously.
It is interesting to compare only the direct checkpoint time to the checkpoint
staging time because they correspond to the time which is seen by the application
(for classic approach and staging approach, respectively). Indeed, the background
transfer is overlapped by the computation.
Our results show the benefit of the staging approach, which considerably reduces the time during which the application is suspended. For both our test cases, the checkpoint time, as seen by the application, is 8.3 times lower. The time gained can then be used to make progress in the computation.
5 Related Work
Checkpoint/Restart is supported by several MPI stacks [8,12,6] to achieve fault tolerance. Many of these stacks use FTB [9] as a back-plane to propagate fault information.
staging. As part of future work, we would like to extend this framework to offload several other fault-tolerance protocols to the staging server and relieve the client of additional overhead.
References
1. Filesystem in userspace, http://fuse.sourceforge.net
2. IOzone filesystem benchmark, http://www.iozone.org
3. Top 500 supercomputers, http://www.top500.org
4. Abbasi, H., Wolf, M., Eisenhauer, G., Klasky, S., Schwan, K., Zheng, F.:
Datastager: Scalable data staging services for petascale applications. In: HPDC
(2009)
5. Bent, J., Gibson, G., Grider, G., McClelland, B., Nowoczynski, P., Nunez, J., Polte,
M., Wingate, M.: PLFS: a checkpoint filesystem for parallel applications. In: SC
(2009)
6. Buntinas, D., Coti, C., Herault, T., Lemarinier, P., Pilard, L., Rezmerita, A.,
Rodriguez, E., Cappello, F.: Blocking vs. non-blocking coordinated checkpointing
for large-scale fault tolerant mpi protocols. Future Generation Computer Systems
(2008)
7. Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., Snir, M.: Toward exascale
resilience. IJHPCA (2009)
8. Gao, Q., Yu, W., Huang, W., Panda, D.K.: Application-transparent check-
point/restart for mpi programs over infiniband. In: ICPP (2006)
9. Gupta, R., Beckman, P., Park, B.H., Lusk, E., Hargrove, P., Geist, A., Panda, D.K.,
Lumsdaine, A., Dongarra, J.: Cifts: A coordinated infrastructure for fault-tolerant
systems. In: ICPP (2009)
10. Hargrove, P.H., Duell, J.C.: Berkeley Lab Checkpoint/Restart (BLCR) for Linux
Clusters. In: SciDAC (2006)
11. Hursey, J., Lumsdaine, A.: A composable runtime recovery policy framework sup-
porting resilient hpc applications. Tech. rep., University of Tennessee (2010)
12. Hursey, J., Squyres, J.M., Mattox, T.I., Lumsdaine, A.: The design and imple-
mentation of checkpoint/restart process fault tolerance for Open MPI. In: IPDPS
(2007)
13. InfiniBand Trade Association: The InfiniBand Architecture,
http://www.infinibandta.org
14. Isaila, F., Garcia Blas, J., Carretero, J., Latham, R., Ross, R.: Design and evalua-
tion of multiple-level data staging for blue gene systems. TPDS (2011)
15. Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling, and
evaluation of a scalable multi-level checkpointing system. In: SC (2010)
16. Ouyang, X., Gopalakrishnan, K., Gangadharappa, T., Panda, D.K.: Fast Check-
pointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore
Architecture. HiPC (2009)
17. Ouyang, X., Rajachandrasekhar, R., Besseron, X., Wang, H., Huang, J., Panda,
D.K.: CRFS: A lightweight user-level filesystem for generic checkpoint/restart. In:
ICPP (2011) (to appear)
18. Plank, J.S., Chen, Y., Li, K., Beck, M., Kingsley, G.: Memory exclusion: Optimizing
the performance of checkpointing systems. In: Software: Practice and Experience
(1999)
19. Schroeder, B., Gibson, G.A.: Understanding failures in petascale computers. Jour-
nal of Physics: Conference Series (2007)
Impact of Over-Decomposition on Coordinated
Checkpoint/Rollback Protocol
Abstract. Failure-free execution will become rare on future exascale computers; fault tolerance is therefore now an active field of research. In this paper, we study the impact of decomposing an application into much more parallelism than the physical parallelism on the rollback step of fault-tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance the workload after a failure, without the need for spare nodes, while preserving performance. We show that the overhead on normal execution remains low for relevant over-decomposition factors. With over-decomposition, restarting execution on the remaining nodes after failures shows very good performance compared to the classic decomposition approach: our experiments show that the execution time after restart can be reduced by 42%. We also consider a partial restart protocol that reduces the amount of lost work in case of failure by tracking the task dependencies inside processes. In some cases, and thanks to over-decomposition, the partial restart time can represent only 54% of the global restart time.
1 Introduction
1. Lack of flexibility after restart. Coordinated checkpoint implies that the ap-
plication will be recovered in the same configuration. Three approaches exist
to restart the failed processes. The first one is to wait for free nodes or for
2 Background
This work was motivated by parallel domain decomposition applications. In the remainder of the paper, we consider an iterative application called Poisson3D, which solves Poisson's partial differential equation with a 7-point stencil over a 3D domain using a finite difference method. The simulation domain is decomposed into d sub-domains, and the sub-domains are then assigned to processes for the computation (classically one sub-domain per MPI process).
Kaapi and Data Flow Model. Kaapi¹ [11] is a task-based model for parallel computing inherited from Athapascan [9]. Using the access mode specifications (read, write) of the function-task parameters, the runtime is able to dynamically compute the data-flow dependencies between tasks from the sequence of function-task calls, see [9,11]. These dependencies are used to execute independent tasks concurrently on idle resources using work-stealing scheduling [11]. Furthermore, this data flow graph is used to capture the application state for several original checkpoint/rollback protocols [14,2].
¹ http://kaapi.gforge.inria.fr
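A toy sketch of how read/write access modes let a runtime derive data-flow dependencies from a sequence of task calls is given below; it only mimics the idea behind Athapascan/Kaapi and is not their actual API (the task names and data items are made up).

```python
# Build a data-flow DAG from a sequence of tasks annotated with the data they
# read and write: a task depends on the last writer of everything it accesses.
# Tasks with disjoint dependencies can then be executed concurrently.
def build_dataflow(tasks):
    last_writer = {}                      # data item -> index of the task that last wrote it
    deps = {i: set() for i in range(len(tasks))}
    for i, (name, reads, writes) in enumerate(tasks):
        for item in reads | writes:
            if item in last_writer:
                deps[i].add(last_writer[item])
        for item in writes:
            last_writer[item] = i
    return deps

# Two iterations of a stencil over two sub-domains: boundary exchange creates the
# only cross-sub-domain dependencies, while first-iteration updates stay independent.
tasks = [
    ("compute dom0 it0", {"dom0"}, {"dom0"}),
    ("compute dom1 it0", {"dom1"}, {"dom1"}),
    ("compute dom0 it1", {"dom0", "dom1"}, {"dom0"}),
    ("compute dom1 it1", {"dom0", "dom1"}, {"dom1"}),
]
print(build_dataflow(tasks))
# {0: set(), 1: set(), 2: {0, 1}, 3: {1, 2}}
```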
Fig. 1. Example of over-decomposition: the same data flow graph, generated for 6 sub-domains, is scheduled on 2 or 3 processors
Fig. 2. Lost work and time to re-execute the lost work for global restart and partial restart
to restart many processes on the same core (over-subscription), which would lead to poor execution performance after restart.
With Kaapi, an application checkpoint is made of its data flow graph [11,14]. It is then possible to balance the workload after restart over the remaining processes, without requiring new processes or new nodes. Over-decomposition allows the scheduler to freely re-map tasks and data among processors in order to keep the workload well balanced. Experimental results on actual executions of the Poisson3D application are presented in Section 4.1.
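A minimal sketch of the kind of re-mapping that over-decomposition enables is shown below; it uses a simple round-robin assignment (Kaapi's actual scheduler is work-stealing based and more elaborate), and the node counts are illustrative.

```python
# Re-map sub-domains over the remaining processes after a failure. With
# over-decomposition (many more sub-domains than processes) the load can be
# balanced without spare nodes; with one sub-domain per process it cannot.
def rebalance(num_subdomains, surviving_procs):
    mapping = {p: [] for p in surviving_procs}
    for sd in range(num_subdomains):
        target = surviving_procs[sd % len(surviving_procs)]
        mapping[target].append(sd)
    return mapping

# Illustration: 600 sub-domains initially run on 100 nodes (6 per node); one node fails.
print({p: len(sds) for p, sds in rebalance(600, list(range(99))).items()})
# every surviving node receives 6 or 7 sub-domains -> nearly balanced load
```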
The partial restart for CCK [2] assumes that the application is checkpointed periodically using a coordinated checkpoint. However, instead of restarting all the processes from their last checkpoint, as global restart does, partial restart only needs to re-execute a subset of the work executed since the last checkpoint to recover the application. We call lost work the computation that was executed before the failure but needs to be re-executed in order to restart the application properly. W_lost^global is the lost work for global restart in Figure 2a, and W_lost^partial is the lost work for partial restart in Figure 2b.
To allow the execution to resume properly, and similarly to message logging protocols [8], the non-failed processes have to replay the messages that were sent to the failed processes since their last checkpoint. Since these messages have not been logged during execution, they are regenerated by re-executing a subset of tasks on the non-failed processes. This strictly required set of tasks is extracted from the last checkpoint by tracking the dependencies inside the data flow graph [2]. This technique is possible because the result of the execution is determined by a data flow graph in which the reception order of the messages does not impact the computed values [9,11]. As a result, the restarted processes are guaranteed to reach exactly the same state as the failed processes had before the failure [2].
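The dependency-tracking idea can be sketched as a backward traversal of the data-flow graph saved in the checkpoint; the example below illustrates the principle from [2] with made-up tasks and is not the CCK implementation.

```python
# Given the data-flow graph saved in the last coordinated checkpoint, compute the
# subset of already-executed tasks that must be replayed: all tasks on failed
# processes, plus (transitively) every task whose output they consume.
from collections import deque

def tasks_to_reexecute(deps, owner, failed_procs):
    # deps:  task -> set of tasks it depends on (producers of its inputs)
    # owner: task -> process that executed it since the last checkpoint
    lost = {t for t, p in owner.items() if p in failed_procs}
    work, frontier = set(lost), deque(lost)
    while frontier:
        t = frontier.popleft()
        for producer in deps.get(t, ()):
            if producer not in work:
                work.add(producer)
                frontier.append(producer)
    return work

# Illustrative graph: task 4 ran on the failed process P2.
deps = {0: set(), 1: set(), 2: {0}, 3: {1}, 4: {2}, 5: {3}}
owner = {0: "P0", 1: "P1", 2: "P0", 3: "P1", 4: "P2", 5: "P0"}
print(sorted(tasks_to_reexecute(deps, owner, failed_procs={"P2"})))
# [0, 2, 4]: tasks 1, 3 and 5 need not be replayed, which is the saving of partial restart
```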
Fig. 3. Iteration time as a function of the number of sub-domains per node, with a constant domain size per node (lower is better)
4 Experimental Results
We evaluate the techniques proposed in this paper experimentally with the Poisson3D application sketched in Section 2. The amount of computation in each iteration is constant, so the iteration time remains approximately constant between steps. The following experiments report the average iteration time.
Our experimental testbed is composed of the Griffon and Grelon clusters located at Nancy, part of the Grid'5000 platform². The Griffon cluster has 92 nodes, each with 16 GB of memory and two 4-core Intel Xeon L5420 processors. The Grelon cluster is composed of 120 nodes, each with 2 GB of memory and two 4-core Intel Xeon 5110 processors. All nodes of both clusters are connected with a Gigabit Ethernet network and two levels of switches.
Over-decomposition overhead. Over-decomposition may introduce an overhead at runtime due to the management of the extra parallelism. The purpose of this first experiment is to measure this overhead. We use a constant domain size per core of 10^7 double-precision reals, i.e. 76 MB, and we vary the decomposition d, i.e. the number of sub-domains per core. We run this on 1 and 100 nodes; in both cases, we use only 1 core per node for computation to simplify the analysis of the results.
Figure 3 shows the results of this experiment on the Grelon cluster. Each point is the average over one hundred iterations and the error bars show the standard deviation. With one node, the iteration time for a decomposition into 1 or 2 sub-domains is about 0.4 s. For 3 sub-domains, the execution time drops by 35% due to better cache use with smaller blocks. For higher decomposition factors, the iteration time increases slowly and linearly; this is the overhead due to the
² http://www.grid5000.fr
Fig. 4. Slow-down after restart on 100 − p nodes compared to the execution before the
failures for different decompositions d (lower is better)
management of this higher level of parallelism. Compared to the best value (i.e. for 3 sub-domains per node), the overhead is around 3% for 10 sub-domains per node and 25% for 100 sub-domains. The shape of the curve with 100 nodes is similar but shifted up, by between 0.05 and 0.1 seconds, due to the communication overhead.
We measure the gain in iteration time due to the capacity to reschedule the workload after the failure of p processors. We consider the following scenario: the application is executed on n nodes with periodic coordinated checkpointing; then p nodes fail; the application is restarted on the n − p remaining nodes using the global restart approach, and the load-balancing algorithm is applied.
Fig. 5. Proportion of lost work for the partial restart in comparison to the classic global restart approach (lower is better)
initial node number. When only one node fails, our results show that the execution time after restart with over-decomposition is reduced by 42%. We want to emphasize that this improvement applies to all the iterations after the restart, and is thus beneficial for all the rest of the execution.
Load-balancing cost. This experiment evaluates the cost of the restart and load-balancing steps. We measure this cost on the Griffon cluster using 80 nodes with 8 cores each, i.e. 640 cores, with a domain size of 10^6 doubles per core, i.e. 7.6 MB. One node fails and the application is restarted on 79 nodes. The 7 s global restart time breaks down as follows: 2.1 s for the process coordination and the checkpoint loading; 1.7 s to compute and apply the new schedule; and finally, 3.2 s for the data redistribution between processes.
Fig. 6. Comparison of the restart time between global restart and partial restart, for different checkpoint periods and different computation grains (lower is better)
Using two different sub-domain computation grains allows us to see the influence of the data redistribution (the data size and the communication volume are kept identical).
The restart time for partial restart includes the time to re-execute the lost work, i.e. the strictly required set of tasks, and also the time to redistribute the data, which can be costly. It is difficult to measure these two steps independently because they overlap in the Kaapi implementation. For global restart, this time corresponds only to the time to re-execute the lost work: in this experiment, there is no need to redistribute the data because the workload remains the same as before the failure.
For a small computation grain, i.e. a sub-domain computation time of 2 ms, the performance of partial restart is worse than that of global restart, because the data redistribution represents most of the cost of the partial restart, mainly because the load-balancing algorithm used does not take data locality into account. For a coarser grain, i.e. a sub-domain computation time of 50 ms, partial restart achieves better performance. For a 100-iteration checkpoint period, the partial restart time represents only 54% of the global restart time (for a lost work which corresponds to 51%).
5 Related Works
On the fault tolerance side, most works focus on the message-passing model. Many protocols, such as checkpoint/rollback protocols and message logging protocols, have been designed [8] and are widely used [5,10,13,23,6]. In [12], communication determinism is leveraged to propose optimized approaches for certain application classes.
Charm++ can use over-decomposition to restart an application on only the remaining nodes after a failure [17], with coordinated checkpoint/restart and message logging approaches. It relies on a dynamic load-balancing algorithm which periodically collects load information from all the nodes and redistributes the Charm++ objects if required.
Similarly to Charm++, our work in Kaapi allows an application to be restarted on the remaining nodes using over-decomposition. Additionally, we leverage over-decomposition to reduce the restart time of the application thanks to the original partial restart approach. In our work we also consider a data flow model, which allows a finer representation of the application state. Furthermore, our load-balancing algorithm is based on the data flow graph of the application and is executed only after the restart.
References
1. Badia, R.M., Herrero, J.R., Labarta, J., Pérez, J.M., Quintana-Ortı́, E.S.,
Quintana-Ortı́, G.: Parallelizing dense and banded linear algebra libraries using
smpss. Concurr. Comput. : Pract. Exper. (2009)
2. Besseron, X., Gautier, T.: Optimised recovery with a coordinated check-
point/rollback protocol for domain decomposition applications. In: MCO 2008
(2008)
3. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou,
Y.: Cilk: An efficient multithreaded runtime system. Parallel and Distributed Com-
puting (1996)
4. Bongo, L.A., Vinter, B., Anshus, O.J., Larsen, T., Bjorndalen, J.M.: Using overde-
composition to overlap communication latencies with computation and take advan-
tage of smt processors. In: ICPP Workshops (2006)
5. Bouteiller, A., Hérault, T., Krawezik, G., Lemarinier, P., Cappello, F.: MPICH-V
project: a multiprotocol automatic fault tolerant MPI. High Performance Comput-
ing Applications (2006)
6. Chakravorty, S., Kale, L.V.: A fault tolerant protocol for massively parallel systems.
In: IPDPS (2004)
7. Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of
distributed systems. ACM Transactions on Computer Systems (1985)
8. Elnozahy, E.N., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-
recovery protocols in message-passing systems. ACM Computing Surveys (2002)
9. Galilée, F., Roch, J.L., Cavalheiro, G., Doreille, M.: Athapascan-1: On-line building
data flow graph in a parallel language. In: PACT 1998 (1998)
10. Gao, Q., Yu, W., Huang, W., Panda, D.K.: Application-transparent checkpoint/restart for MPI programs over InfiniBand. In: ICPP 2006 (2006)
11. Gautier, T., Besseron, X., Pigeon, L.: Kaapi: a thread scheduling runtime system
for data flow computations on cluster of multi-processors. In: PASCO 2007 (2007)
12. Guermouche, A., Ropars, T., Brunet, E., Snir, M., Cappello, F.: Uncoordinated
checkpointing without domino effect for send-deterministic message passing appli-
cations. In: IPDPS (2011)
13. Hursey, J., Squyres, J.M., Mattox, T.I., Lumsdaine, A.: The design and imple-
mentation of checkpoint/restart process fault tolerance for Open MPI. In: IPDPS
(2007)
14. Jafar, S., Krings, A.W., Gautier, T.: Flexible rollback recovery in dynamic hetero-
geneous grid computing. IEEE Transactions on Dependable and Secure Computing
(2008)
15. Jafar, S., Pigeon, L., Gautier, T., Roch, J.L.: Self-adaptation of parallel applications
in heterogeneous and dynamic architectures. In: ICTTA 2006 (2006)
16. Jose, J., Luo, M., Sur, S., Panda, D.K.: Unifying UPC and MPI Runtimes: Expe-
rience with MVAPICH. In: PGAS 2010 (2010)
17. Kale, L.V., Mendes, C., Meneses, E.: Adaptive runtime support for fault tolerance.
Talk at Los Alamos Computer Science Symposium 2009 (2009)
18. Kale, L.V., Zheng, G.: Charm++ and AMPI: Adaptive runtime strategies via
migratable objects. In: Advanced Computational Infrastructures for Parallel and
Distributed Applications. Wiley-Interscience (2009)
19. Naik, V.K., Setia, S.K., Squillante, M.S.: Processor allocation in multiprogrammed
distributed-memory parallel computer systems. Parallel Distributed Computing
(1997)
20. Rabenseifner, R., Hager, G., Jost, G.: Hybrid MPI/OpenMP parallel programming
on clusters of multi-core SMP nodes. In: PDP 2009 (2009)
21. Song, F., YarKhan, A., Dongarra, J.: Dynamic task scheduling for linear algebra
algorithms on distributed-memory multicore systems. In: SC 2009 (2009)
22. Tamir, Y., Séquin, C.H.: Error recovery in multicomputers using global checkpoints.
In: ICPP 1984 (1984)
23. Zheng, G., Shi, L., Kale, L.V.: FTC-Charm++: an in-memory checkpoint-based
fault tolerant runtime for Charm++ and MPI. Cluster Computing (2004)
UCHPC 2011: Fourth Workshop
on UnConventional
High Performance Computing
Foreword
As the word “UnConventional” in the title suggests, the workshop focuses on hardware or platforms used for HPC that were not originally intended for HPC, whether for their raw computing power, good performance per watt, or low cost in general. Thus, UCHPC tries to capture HPC solutions that are unconventional today but may become conventional tomorrow. For example, the computing power of gaming platforms has risen rapidly in recent years. This motivated the use of GPUs for computing (GPGPU), and even the building of computational grids from game consoles. The recent trend of integrating GPUs on processor chips seems to be very beneficial for using both parts for HPC. Other examples of “unconventional” hardware are embedded low-power processors, upcoming many-core architectures, FPGAs, and DSPs. Thus, interesting devices for research in unconventional HPC are not only standard server or desktop systems, but also devices that are relatively cheap because they are mass-market products, such as smartphones, netbooks, tablets, and small NAS servers. Smartphones, for example, seem to become more performance hungry every day. Only imagination sets the limit to using such devices for HPC. The goal of the workshop is to present the latest research on how hardware and software that are (still) unconventional for HPC are or can be used to reach goals such as the best performance per watt. UCHPC also covers corresponding programming models, compiler techniques, and tools.
This was the fourth edition of the UCHPC workshop; previous editions were held in 2008 in conjunction with the International Conference on Computational Science and Its Applications, in 2009 with the ACM International Conference on Computing Frontiers, and in 2010 with Euro-Par 2010. This year, the organizers were able to accept five submissions (out of ten). In addition, we were proud to present two invited talks. Both the invited talks and the papers were grouped around three topics, which also formed the structure of the workshop sessions and made for a very exciting half-day program:
– Heterogeneous Systems, starting with an invited talk by Raymond Namyst
about ”Programming Heterogeneous, Accelerator-based Multicore Machines:
a Runtime System’s Perspective”, followed by two regular talks on efficient
processor allocation and workload balancing on heterogeneous systems,
September 2011
Anders Hast
Josef Weidendorfer
Jan-Philipp Weiss
PACUE: Processor Allocator Considering User
Experience
1 Introduction
Graphics Processing Unit (GPU) use has been extended to a wider range of computing purposes on the PC platform. GPU utilization on PCs can be classified into four categories. The first is 3D graphics computation, such as 3D games and 3D-graphics-based GUI shells (e.g., Windows Aero). The second is 2D graphics acceleration, such as font rendering in modern web browsers. The third is video decoding and encoding acceleration: video player applications use the GPU's video decoding acceleration to reduce CPU load and to increase video quality, and some GPUs also have video encoding acceleration units on the die. The last category is general-purpose computing, called General-Purpose computing on GPU (GPGPU). On PCs, GPGPU is often used by video encoding applications and by physics simulation applications, including 3D games¹.
¹ Some 3D games utilize the GPU for general-purpose computing besides 3D graphics rendering.
In today's PCs, GPUs are utilized efficiently because only a few applications are accelerated at the same time; these applications do not compete with each other for the same GPU. Applications therefore choose compute devices statically, for example through user selection in the application's GUI configuration menu.
However, we expect more and more applications to utilize GPUs. For example, the Open Computing Language (OpenCL) [2] allows applications to explicitly select the compute device on which to execute parts of their work. Therefore, efficient load balancing between compute devices consisting of CPUs and GPUs is essential for future consumer PCs.
There are three technical challenges to achieving efficient compute device assignment on the heterogeneous processors of a PC. First, GPU acceleration is utilized for various purposes, whereas in supercomputers GPUs are utilized mainly for general-purpose computing. In addition, some tasks running on PCs strongly require a specific processor. For example, 3D rendering is normally processed by GPUs, and some 3D graphics operations cannot be processed by CPUs, whereas other applications can run on both CPUs and GPUs. When the GPU load is high, the latter applications could be run explicitly on CPUs.
Second, we must not modify applications. Typically, most applications installed on major OSes such as Windows and Mac OS cannot be modified by third parties because of their software distribution policies. Application vendors may not be willing to modify their applications either, because the benefit to them is not straightforward. For these reasons, existing runtimes or libraries for distributing tasks between compute devices [6, 10, 7], which were proposed for HPC, are not deployable on consumer PCs.
Third, the performance metric for consumer PCs is more complicated, because user preference is one of the most important criteria for assigning compute devices to applications. This clearly differs from HPC, where the task distribution policy is usually static and aims, for example, at maximizing task throughput or performance per watt. On PCs, task distribution policies and their merits change easily depending on the current use. For example, when the user wants to play a 3D game smoothly, other GPGPU tasks should not be assigned to the GPU; at other times, the user might rather transcode videos quickly than play a casual game smoothly. The compute device selection method must therefore recognize user preferences in order to decide which compute device to assign. This is hard, and user preference recognition cannot be fully automated. Therefore, the resource manager has to infer how the PC is being utilized, and the users have to be able to tell it how they are using the PC at that time.
In this paper, we propose PACUE, which allocates compute devices to applications efficiently. PACUE has two features: dynamic compute device redirection and system-wide optimal device selection. We strongly focus on solving the real problems that will occur when we distribute our system world-wide via the web. Therefore, we prefer a politically safer method over a technically better one. Thus, the first advantage of PACUE is its deployability. The second advantage is that PACUE is designed to maximize the PC user's experience: we bring a new metric to the use of accelerators, which will also be beneficial for other computers such as smartphones or game consoles.
Our experimental results show that PACUE can switch compute devices in 1 out of 2 applications, and in all of 20 sample codes built with OpenCL. The remainder of this paper is organized as follows: In Sec. 2, we describe the design of PACUE, consisting of the dynamic compute device redirector and the system resource manager. In Sec. 3, we evaluate our prototype implementation. The paper concludes with Sec. 4.
2 Designing PACUE
PACUE consists of two components: the Dynamic Compute Device Redirector and the Resource Manager. We focus on applications built with OpenCL, a widely used framework which supports many types of compute devices such as CPUs and GPUs.
OpenCL API Hooking. OpenCL abstracts compute devices and the memory hierarchy to utilize heterogeneous processors within its programming model. To utilize a compute device, applications call OpenCL APIs and specify a compute device. The assignment process is as follows: first, enumerate the available platforms and devices; second, select the candidate devices and create an OpenCL context; third, select one device to use and create a command queue; finally, put tasks into the queue created above. In the second and third steps, the application specifies a concrete device because the OpenCL APIs need a device ID as a parameter, which makes system-wide optimal device selection impossible. For optimal device selection, we remove the restriction that applications must choose the device by themselves, because this decision is hard for applications and users, and decisions made by applications or users are rarely optimal (see Sec. 2.2). PACUE hooks the subset of OpenCL APIs that concern device selection and adds a query that asks which device to utilize.
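For illustration, the following host-side sketch shows the standard OpenCL call sequence just described; the calls that take a cl_device_id (context and command-queue creation) are the ones PACUE hooks. It is a generic sketch, not PACUE's code.

```cpp
// Generic OpenCL 1.1 host-side assignment sequence (sketch, not PACUE code).
#include <CL/cl.h>
#include <vector>

int main() {
    // Step 1: enumerate platforms and devices.
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_uint numDevices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, nullptr, &numDevices);
    std::vector<cl_device_id> devices(numDevices);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, numDevices, devices.data(), nullptr);

    // Step 2: select candidate devices and create a context (hooked by PACUE).
    cl_int err;
    cl_context ctx = clCreateContext(nullptr, numDevices, devices.data(),
                                     nullptr, nullptr, &err);

    // Step 3: pick one concrete device and create a command queue (hooked by PACUE).
    cl_command_queue queue = clCreateCommandQueue(ctx, devices[0], 0, &err);

    // Step 4: enqueue kernels / tasks on the chosen device ...
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}
```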
There are several methods to hook APIs in Windows 7, where PACUE is implemented. The first possibility is to create a thread in the target application by calling the Windows API CreateRemoteThread() [12]. With this method, we would implement an application that creates a thread in other applications and maps an external DLL containing the overridden target APIs. However, such applications and DLLs are hard to implement because of the complicated procedures involved, and they risk being treated as malware by anti-malware software. The second possibility is a Global Hook: a user application hooks specific APIs of all applications by calling the Windows API SetWindowsHookEx() [13]. This method is unsafe, because it risks hooking unknown applications and causing unexpected effects in them. The third possibility is a Wrapper DLL, i.e., a DLL with the same file name as the original DLL that exports all of the original DLL's APIs. The wrapper DLL is almost a shell around the original: most APIs simply forward to the original DLL, except for the APIs that actually need to behave differently. This method has the best chance of hooking APIs, because a wrapper DLL located in the application directory is by default loaded prior to other copies, such as DLLs located in the system directories. In addition, a wrapper DLL placed in the directory of the target EXE only affects applications whose binaries are located in that same directory. Therefore, this is a really safe way to hook APIs. The last possibility is the use of API hook libraries, such as [14]. These libraries are easy to use; however, they have a lower probability of successfully hooking APIs than a wrapper DLL, and they also risk being treated as malware. From this comparison, we adopt the Wrapper DLL method. Fig. 1 illustrates the architecture for hooking the OpenCL APIs with this method. Other major PC OSes such as Mac OS or Linux do not provide a wrapper-DLL mechanism of exactly this kind, but we can implement a similar system by using the API hooking functions offered by those OSes.
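The following is a minimal, hypothetical sketch of the wrapper-DLL idea on Windows. It assumes the original library has been relocated under a name such as OpenCL_org.dll (an assumption made only for illustration) and shows one export that simply forwards to the original; exports would normally be declared in a module-definition file.

```cpp
// Minimal wrapper-DLL sketch (Windows); not PACUE's actual implementation.
// "OpenCL_org.dll" is an assumed name for the relocated original library.
// Types are simplified stand-ins for the OpenCL typedefs.
#include <windows.h>
#include <cstdint>

using cl_int         = int32_t;
using cl_uint        = uint32_t;
using cl_platform_id = void*;

static HMODULE g_orig = LoadLibraryA("OpenCL_org.dll");

// A purely forwarded API: behaves exactly like the original clGetPlatformIDs.
extern "C" __declspec(dllexport)
cl_int __stdcall clGetPlatformIDs(cl_uint num_entries,
                                  cl_platform_id* platforms,
                                  cl_uint* num_platforms) {
    using Fn = cl_int (__stdcall*)(cl_uint, cl_platform_id*, cl_uint*);
    auto orig = reinterpret_cast<Fn>(GetProcAddress(g_orig, "clGetPlatformIDs"));
    return orig(num_entries, platforms, num_platforms);
}
```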
Another method to switch devices is to provide a virtual device [5]. With this method, applications are assigned the virtual device and the resource management system chooses a real device. This method has the significant advantage that it can switch real devices at any time; however, it may conflict with the Installable Client Driver (ICD) system of OpenCL. Installers of OpenCL runtime libraries distributed by hardware vendors sometimes overwrite the “OpenCL.dll” file, so installing a virtual device, or exposing only the virtual device to applications, is difficult on PCs.
device. Occasionally, applications cannot execute their OpenCL code on some device types. In this case, PACUE sets the cl_device_type value to the desired type, such as CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU.
2. Context level camouflage
When creating an OpenCL context, PACUE overrides the cl_device_id values and forces the OpenCL framework to build OpenCL binaries for each compute device. If PACUE recognizes that the target application supports only a specific type of compute device, PACUE overwrites the cl_device_id values and limits the device types of the context. In addition, PACUE overrides the cl_device_id value when the application requests detailed device information, so the application sees the information of the device PACUE selected. This contributes to application stability, because the acquired device information, such as the memory size, corresponds to that of the device that will actually be used.
3. Command queue level camouflage
When the application calls the clCreateCommandQueue() API, this is the last chance to change the device. Because of the stability issue described above, PACUE tries not to change the device at this point, but if necessary it changes the cl_device_id in the arguments of this API. In this situation the device is camouflaged completely, so the application takes the camouflaged device for the device it specified. This is a risky way to change the device, yet it improves application compatibility: it is dangerous with respect to device-dependent characteristics, such as the memory size, but it lets us switch the processor in more applications. Hence, this method is our last resort (a minimal sketch of such an override is given after this list).
As shown in Table 1, the combination of these steps yields several ways of overriding the device assignment. Because they trade application compatibility against application stability, we have to define a rule for applying these methods; some hints are worked out in Sec. 3.
inconvenient that they have to select the compute device every time the application runs. Some advanced PC users can choose the proper compute device manually, but this is terribly inconvenient. Besides, many PC users do not know the detailed configuration of the PC they are using. These users cannot accurately choose the compute device that satisfies their preference, even if the application lets them select the compute device in its GUI configuration menu. To achieve a high user experience, the resource manager should select a compute device automatically according to the user's preferences.
There are many studies in the HPC area that build a resource manager to select compute devices automatically [7, 8]. They present task distribution algorithms for heterogeneous processor environments that are optimized for specific purposes, such as maximizing performance or performance per watt. However, they cannot be applied to resource management on PCs because the requirements of PCs and HPC differ. Another approach to differentiating tasks, such as a device-driver level approach [9], could also serve our goal; however, we would still need a system-wide resource manager that considers the heterogeneous processors and applications. These are the three requirements of a resource manager specifically for PCs.
This resource manager has three features to satisfy the requirements explained above. The first feature is information gathering: PACUE collects information about how the PC is being utilized, such as whether an AC adapter is connected, the temperatures and voltages of components, and utilization data such as processor loads and the list of running applications. The second feature is user preference inference: the user describes his or her requirements by creating several requirement patterns, and PACUE infers which pattern fits the present situation best by using the information acquired in the first step. The third feature is compute device selection, which decides the OpenCL device to be assigned to each application. We plan to implement a few compute device selection algorithms for several user preference patterns; PACUE will then assign compute devices to each application based on the algorithm that matches the inferred preference pattern. The resource manager cycles through the following steps (a minimal sketch follows the list):
1. Collect PC utilization information.
2. Guess which profile is the best for the present condition.
3. Wait for an inquiry from an application and answer which device should be used.
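A minimal sketch of such a cycle is shown below; all helper names are hypothetical placeholders rather than PACUE's actual interfaces.

```cpp
// Sketch of the resource manager cycle; every helper is a hypothetical stub.
#include <chrono>
#include <thread>

struct UtilizationInfo { bool onAC = true; double cpuLoad = 0, gpuLoad = 0; };
struct Profile         { int id = 0; };     // one user-defined requirement pattern
struct DeviceRequest   { int appId = 0; };  // an application's device inquiry

UtilizationInfo CollectUtilizationInfo() { return {}; }          // 1. AC state, loads, ...
Profile InferBestProfile(const UtilizationInfo&) { return {}; }  // 2. guess the pattern
bool PollDeviceRequest(DeviceRequest*) { return false; }         // 3a. pending inquiry?
void AnswerWithDevice(const DeviceRequest&, const Profile&) {}   // 3b. reply with a device

void ResourceManagerLoop() {
    for (;;) {
        UtilizationInfo info = CollectUtilizationInfo();   // step 1
        Profile profile = InferBestProfile(info);          // step 2
        DeviceRequest req;
        while (PollDeviceRequest(&req))                    // step 3
            AnswerWithDevice(req, profile);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```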
For evaluation purposes, we built a basic resource manager with a communication function that instructs applications to utilize a specific compute device. Because user-preference-based device selection algorithms are not yet implemented, the current PACUE can only select the compute device through manual selection in the resource manager GUI. Still, it can receive an inquiry for compute device selection and answer with the compute device to utilize.
3 Evaluation
In this section we confirm that PACUE provides compute device redirection for widely used applications without modifying them. We first state the policy of the evaluation, then present and analyze the results.
3.2 Results
DirectCompute & OpenCL Benchmark. Table 2 shows the results. PACUE can
redirect compute device perfectly on DirectCompute & OpenCL Benchmark, but only
with method D.
SiSoftware Sandra 2011. Device switching failed. When PACUE tried to switch the
device, Sandra 2011 exhibited strange behavior, such as showing the same device twice
in the GUI. Because Sandra 2011 is an information & diagnostic utility for PC, it gathers
device information by various APIs. Thus, the failure may be caused by the lack of
integrity between device information gathered by PACUE hooked OpenCL API and
information gathered by other APIs. However, PACUE do not make Sandra crashed.
Sample Codes of the “OpenCL Introduction” Book. These codes are a set of 20 sample applications exercising the OpenCL APIs. Device switching succeeded for all of them. However, one sample uses the device memory information to choose an optimized array size, so its result might depend on the device. Completely camouflaging the device information might thus be incompatible with the information the sample expects; this could cause crashes or errors, although the sample seemed to work correctly during the experiment.
3.3 Analysis
The results show that PACUE can switch the compute devices of real applications. However, it fails for device-dependent applications, which use detailed information about the particular device, such as the device memory size, and may therefore crash or behave strangely because of the information camouflaged by PACUE.
Among the combinations of device information overriding, we identified a proper order in which to apply them. As shown in Table 1, these methods trade application stability against application compatibility. In our evaluation, we found that the complete camouflaging method significantly increases compatibility with real applications such as DirectCompute & OpenCL Benchmark. However, it achieves this by giving applications the information of the device they specified instead of the information of the device actually being used. Only the original application developer knows whether the application still works correctly under complete camouflaging, so we should avoid this risky method if possible. In general, we suggest applying the methods in the following order:
1. Override the device type to ALL and override the device ID when creating the context. (Table 1 B)
2. Override the device type to ALL and override the device ID when creating the command queue. (Table 1 D)
3. Keep the original device type and override the device ID when creating the command queue. (Table 1 C)
4. Override the device type to CPU or GPU when the application requests the list of available devices. (Table 1 A)
The first three methods all realize dynamic device selection; the earlier ones are safer, the later ones more compatible. Applications that cannot switch devices with the first method should use the second or the third. The last method has the highest compatibility, but it only provides static and restricted device switching, so it should be applied only when all other methods fail.
Increase Compatibility for Applications. We will address the problem that PACUE cannot switch compute devices in some applications. We will also run application stability tests on these applications.
applications will use internally optimized assembly code to execute their work, which is often much faster than executing OpenCL code on CPUs. However, this has the disadvantage that the compute device cannot be changed until the application is restarted, because the application will never call the OpenCL APIs again. Therefore, we will investigate each application's behavior in detail to decide how to make applications use CPUs.
References
1. DirectCompute & OpenCL Benchmark, http://www.ngohq.com/graphic-cards/
16920-directcompute-and-opencl-benchmark.html (accessed on August
21, 2011)
2. OpenCL 1.1 Specification,
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
3. Fixstars Corporation: OpenCL Introduction - Parallel Programming for Multicore CPUs and GPUs. Impress Japan (January 2010) (in Japanese)
4. AMD. ATI Stream Technology,
http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/
STREAM-TECHNOLOGY/Pages/stream-technology.aspx (accessed on Au-
gust 21, 2011)
5. Aoki, R., Oikawa, S., Tsuchiyama, R., Nakamura, T.: Hybrid OpenCL: Connecting different OpenCL implementations over network. In: Proc. IEEE CIT 2010, pp. 2729–2735 (2010)
6. Brodman, J.C., Fraguela, B.B., Garzarán, M.J., Padua, D.: New abstractions for data parallel
programming. In: Proc. USENIX HotPar, p. 16 (2009)
7. Diamos, G.F., Yalamanchili, S.: Harmony: an execution model and runtime for heteroge-
neous many core systems. In: Proc. ACM HPDC, pp. 197–200 (2008)
8. Gupta, V., Schwan, K., Tolia, N., Talwar, V., Ranganathan, P.: Pegasus: Coordinated Schedul-
ing for Virtualized Accelerator-based Systems. In: Proc. USENIX ATC, pp. 31–44 (2011)
9. Kato, S., Lakshmanan, K., Rajkumar, R., Ishikawa, Y.: TimeGraph: GPU Scheduling for
Real-Time Multi-Tasking Environments. In: Proc. USENIX ATC, pp. 17–30 (2011)
10. Liu, W., Lewis, B., Zhou, X., Chen, H., Gao, Y., Yan, S., Luo, S., Saha, B.: A balanced pro-
gramming model for emerging heterogeneous multicore systems. In: Proc. USENIX HotPar,
p. 3 (2010)
11. Lucidlogix. Lucidlogix virtu,
http://www.lucidlogix.com/product-virtu.html (accessed on August 21,
2011)
12. Microsoft. CreateRemoteThread Function (Windows),
http://msdn.microsoft.com/en-us/library/ms682437.aspx (accessed
on August 21, 2011)
13. Microsoft. SetWindowsHookEx Function (Windows),
http://msdn.microsoft.com/en-us/library/ms644990.aspx (accessed
on August 21, 2011)
14. Microsoft Research. Detours - microsoft research,
http://research.microsoft.com/en-us/projects/detours/ (accessed
on August 21, 2011)
15. SiSoftware. Sisoftware zone, http://www.sisoftware.net/ (accessed on August
21, 2011)
Workload Balancing on Heterogeneous Systems:
A Case Study of Sparse Grid Interpolation
1 Introduction
Heterogeneous systems containing CPUs and accelerators allow us to reach
higher computational speeds while keeping power consumption at acceptable
levels. The most common accelerators nowadays, GPUs, are very different com-
pared to state-of-the-art general-purpose CPUs. While CPUs incorporate large
caches and complex logic for out-of-order execution, branch prediction, and spec-
ulation, GPUs contain significantly more floating point units. They have in-order
cores which hide pipeline stalls through interleaved multithreading, e.g. allowing up to 1536 concurrent threads per core¹. Garland et al. [1] refer to CPUs
as latency oriented processors with complex techniques used for extracting In-
struction Level Parallelism (ILP) from sequential programs. In contrast, GPUs
are throughput oriented, containing a large number of cores (e.g. 16) with wide
SIMD units (e.g. 32 lanes), making them ideal architectures for vectorizable
codes. All applications can be run on CPUs but only a subset can be ported to
or deliver good performance on GPUs, making them special purpose processors.
In the following, we refer to GPUs and CPUs as processors, but of different type.
To support all kinds of heterogeneous systems in a portable way, we need to
make sure that even for GPU-friendly code parts, there is a fallback to execute
on CPU, as we also want to best exploit systems with powerful CPU parts. For
that, multiple code versions of the same function have to be provided. For multi-
core CPUs, OpenMP [2] is the de facto programming model. Nvidia GPUs on
¹ In Nvidia terminology, a core is called a Streaming Multi-Processor.
the other hand are best programmed using CUDA [3]. OpenCL [4] targets both CPUs and GPUs. Still, for optimal performance, multiple code versions are essential to target the different hardware characteristics. Another crucial aspect of efficiently programming heterogeneous systems is adequate workload distribution.
The main contribution of this paper consists of proposed solutions for load
balancing in the context of the decompression of high-dimensional data com-
pressed using the sparse grid technique [5]. This technique allows for an efficient
storage of high-dimensional functions. Sparse grid interpolation (or decompres-
sion) is the performance critical part. For realizing load balancing, we employ a
dynamic strategy in which the computation is decomposed at runtime into tasks
of a given size (the grain size) which are grabbed for execution by the CPU
and the GPU. We compare this strategy to a static approach, where the load
distribution is done at the beginning of the computation, according to the com-
putational power of the heterogeneous components. By this, we show that our
interpolation runs efficiently on heterogeneous systems. To the best of our knowl-
edge, this is the first implementation of sparse grid interpolation that optimally
combines code tuned for multi-core CPUs and Nvidia GPUs.
2 Related Work
Our work is complementary to the one described in [6]. There, space and time
efficient algorithms for the sparse grid technique are proposed. We use these
algorithms as basis for our implementation of sparse grid interpolation for CPU
and GPU. It is worth mentioning that in [6] the focus is on porting the sparse grid
technique to GPUs. While the GPU code is executed, the CPUs are idle. Instead
our goal is to avoid having idle processors and to further improve performance.
Similar to our approach, MAGMA [7] exploits heterogeneous systems by pro-
viding efficient routines for linear algebra. StarPU [8] is a framework that simpli-
fies the programming of heterogeneous systems. Programs are decomposed into
StarPU tasks (bundles of multi-version functions for every processor type) with
according task dependencies, and automatically mapped to available processors
(CPU / GPU). StarPU implements a distributed shared memory (DSM) over
the CPU and the GPU memory via software controlled coherence. This allows
for automatic data transfers to / from the GPU memory. Parameters exposed
by StarPU to programmers are e.g. task size, task priority, and schedulers.
Avoiding memory bank conflicts, minimizing the number of branches, and utilizing the various memories appropriately (global, shared, texture, constant) are important GPU optimizations. In contrast, CPU optimizations include cache blocking and vectorization.
When programming heterogeneous systems with CPUs and GPUs, we can
use an off-loading approach, as used in systems with co-processors for specific
tasks. We determine a mapping between each function and the type of processor
on which its execution time is minimal. As each function is executed by one type of processor, there is a risk of idle compute resources². The solution is to
move from off-loading to full function distribution. For this, we provide multi-
version functions. We design them such that the CPU and the GPU cooperate
for computing each function. Since this approach allows for a full utilization of
a heterogeneous system, we focus on it in the rest of the paper.
Multiple versions of the same function must be orchestrated by an upper layer
responsible for balancing the workload, either statically or dynamically. A static
approach distributes the workload according to the computational speed of the
processors. An initially determined distribution does not change during the ex-
ecution of the function. In contrast, dynamic load balancing allows for changing
the workload distribution after the computation has been started. It can be
triggered by overloaded (sender initiated) or underloaded (receiver initiated) re-
sources, can be executed in a centralized or decentralized manner, and results in
direct rebalancing (e.g. work stealing) or in repartitioning the data mapped to
compute resources for the next iteration of the computation on that data. [9]
provides a good overview of dynamic load balancing strategies. A typical dy-
namic strategy is receiver initiated load balancing of pieces of work which are
not pre-mapped to given compute resources, but only distributed shortly before
execution (also known as self-scheduling). This is also found in the OpenMP
dynamic scheduling strategy for parallel for-loops. We call this the dynamic task
based approach. The computation is decomposed into tasks which are inserted
into a global queue. From there, the tasks are extracted by worker threads. Of-
ten, the tasks have dependencies, making the extraction more time-consuming.
Variations use multiple queues or scheduling strategies based on work stealing,
on greedy algorithms or algorithms that predict distribution costs. For hetero-
geneous systems, the worker threads invoke according versions of a function on
the CPU or the GPU.
While the dynamic task based approach adapts implicitly to different ma-
chines, different input parameters, and external system load, there is an overhead
for task queue management and distribution. Especially, the task size, called
grain size in the following, influences that overhead. If it is too large, load bal-
ancing may not be achievable. If it is too small, the overhead may dominate and
destroy any speedup. In contrast, the overhead of static balancing is minimal.
Obviously, there is no grain size problem, but it has to adapt to function input
parameters and machine type. If the workload depends not only on parameters
such as data size, but on data values, static balancing is not feasible.
² Note that our objective is minimal execution time, not minimal energy consumption.
Fig. 1. Grain size impact. D/L/N = 6/12/5 × 10^5 (left), 20/6/3 × 10^6 (right)
We now focus on the importance of the grain size in the dynamic task based
approach. In addition to the previous general remarks, a highly tuned CPU
version of a function performs the best for a task size that matches or is a multiple
of the tile size used for cache blocking. On the GPU, the task size should match
or be a multiple of the maximum number of active threads. This would ensure
full utilization of the GPU cores, of the SIMD units, and of multithreading.
For sparse grid interpolation, we developed a corresponding first-come first-served scheduling strategy using OpenMP and CUDA (OMP + CUDA). Moreover, we
implemented our application with StarPU, using various schedulers available
there. Fig. 1 shows the performance of interpolation for different grain sizes with
different input parameters: number of dimensions (D), refinement level (L), and
number of interpolations (N). The measurements are done using a Quad-core
Nehalem and an Nvidia GTX480. Note that the optimal grain size depends on
these parameters, especially for StarPU eager and our OMP + CUDA scheduler.
The dmda scheduler assigns tasks based on a performance model that considers
execution history and PCIe transfer overheads. For more details we refer to [8].
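As an illustration of the dynamic task based approach, the following sketch shows a first-come first-served scheme of the kind described: CPU worker threads and one GPU-feeding thread repeatedly grab chunks of "grain" interpolation points from a shared counter. The interpolateOnCPU/interpolateOnGPU functions are placeholders for the tuned multi-version kernels, not the authors' implementation.

```cpp
// First-come first-served self-scheduling sketch (illustrative only).
#include <algorithm>
#include <atomic>
#include <omp.h>

static void interpolateOnCPU(int first, int count) { /* tuned CPU version */ }
static void interpolateOnGPU(int first, int count) { /* CUDA-backed version */ }

void runInterpolations(int N, int grain, int cpuWorkers) {
    std::atomic<int> next(0);
    #pragma omp parallel num_threads(cpuWorkers + 1)
    {
        const bool gpuWorker = (omp_get_thread_num() == 0); // thread 0 feeds the GPU
        for (;;) {
            int first = next.fetch_add(grain);              // grab the next task
            if (first >= N) break;
            int count = std::min(grain, N - first);
            gpuWorker ? interpolateOnGPU(first, count)
                      : interpolateOnCPU(first, count);
        }
    }
}
```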
T̃cpu(w), T̃gpu(w), ncpu, and ngpu are the unknowns. The first equation builds the approximation T̃cpu of the execution time on the CPU, Tcpu, as the product of the number of tasks grabbed by a worker thread (ncpu) and the duration of a task as a function of the workload (tcpu(w)); here the workload is the number of points at which we interpolate. Similarly, the approximation of the execution time on the GPU, T̃gpu, is the sum of the duration of all tasks executed on the GPU (ngpu · tgpu(w)) and the one-time overhead (tpcie) caused by transferring the compressed data over PCIe. The third equation states that the total workload equals the sum of the workload handled by the CPUs and the workload handled by the GPUs; ccpu is the number of CPU cores (CPU worker threads) and ncpu is the number of interpolations allocated to a core, while cgpu is the number of GPUs and ngpu is the number of interpolations per GPU. Finally, the fourth equation expresses that the CPU and the GPU finish at the same time.
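A reconstruction of the four equations from the description above, with N denoting the total number of interpolations, reads:

```latex
\begin{aligned}
\widetilde{T}_{\mathrm{cpu}} &= n_{\mathrm{cpu}} \cdot t_{\mathrm{cpu}}(w), \\
\widetilde{T}_{\mathrm{gpu}} &= n_{\mathrm{gpu}} \cdot t_{\mathrm{gpu}}(w) + t_{\mathrm{pcie}}, \\
N &= c_{\mathrm{cpu}} \cdot n_{\mathrm{cpu}} + c_{\mathrm{gpu}} \cdot n_{\mathrm{gpu}}, \\
\widetilde{T}_{\mathrm{cpu}} &= \widetilde{T}_{\mathrm{gpu}}.
\end{aligned}
```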
We now have to find good approximations (linear or piecewise) for the tcpu(w) and tgpu(w) functions depicted in Fig. 4. These can be considered cheap operations since the definition domain of these functions is relatively small, i.e. from 1 to 35000, compared to the common values for N, i.e. 10^6 or more. The approximations are computed once for each combination of values for D and L. We can subsequently reuse these functions for determining the total execution time, T̃cpu(w) or T̃gpu(w), for any value of N. It is worth mentioning that in the case of the CPU, for D/L/N = 6/12/5 × 10^5, the optimal performance is reached for a grain size of 4096. At the opposite end, a grain size of 1 makes the execution up to 6 times slower. The optimal grain size changes with the input parameters, e.g. for D/L/N = 10/10/5 × 10^5 it is 1024. Now it is trivial to discover the optimal grain size, g, that minimizes T̃cpu(w). Note that without our optimization we would have to search for the grain size that minimizes the execution time for each tuple (D, L, N) we get as input. This means that for every value of the grain size considered in the search we interpolate at a potentially large set of points (e.g. 3 × 10^6), which can be very time-consuming for a large D or L.
6 Evaluation
We now describe our experimental setup and results. The tested hardware is:
– a system containing a Quad-core Intel Nehalem i7-920 (2.67 GHz) and an
Nvidia GTX480 (1.4 GHz, 15 cores, 32-lane SIMD)
– a system with 8 Intel Xeon L5630 cores (2.13 GHz) arranged in two sockets
and an Nvidia Tesla x2050 (1.15 GHz, 14 cores, 32-lane SIMD).
Fig. 5. GFlops rate vs. number of dimensions for StarPU dmda, StarPU eager, OMP + CUDA, Static, and Max. Left: 2 × Intel Xeon Quad-core + Nvidia Tesla x2050. Right: Nehalem Quad-core + Nvidia GTX480
7 Conclusion
References
1. Garland, M., Kirk, D.B.: Understanding Throughput-oriented Architectures. Com-
mun. ACM 53, 58–66 (2010)
2. OpenMP Application Programming Interface (2008)
3. NVIDIA. CUDA Programming Guide 4.0 (2011)
4. Khronos. The OpenCL Specification 1.1 (2010)
5. Bungartz, H.-J., Griebel, M.: Sparse Grids. Acta Numerica 13, 147–269 (2004)
6. Murarasu, A.F., Weidendorfer, J., Buse, G., Butnaru, D., Pflüger, D.: Com-
pact Data Structure and Scalable Algorithms for the Sparse Grid Technique. In:
PPOPP, pp. 25–34 (2011)
7. MAGMA, Matrix Algebra on GPU and Multicore Architectures,
http://icl.cs.utk.edu/magma/index.html
8. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.-A.: StarPU: A Unified
Platform for Task Scheduling on Heterogeneous Multicore Architectures. In: Sips,
H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009. LNCS, vol. 5704, pp. 863–874.
Springer, Heidelberg (2009)
9. Osman, A., Ammar, H.: Dynamic Load Balancing Strategies for Parallel Comput-
ers. In: ISPDC, Romania (July 2002)
10. Butnaru, D., Pflüger, D., Bungartz, H.-J.: Towards High-Dimensional Computa-
tional Steering of Precomputed Simulation Data using Sparse Grids. Procedia CS 4,
56–65 (2011)
11. Intel. Intel Advanced Vector Extensions Programming Reference (2011)
Performance Evaluation of a Multi-GPU
Enabled Finite Element Method for
Computational Electromagnetics
1 Introduction
Efforts to exploit GPUs for non-graphical applications have been underway since 2003, as GPUs have evolved into programmable, massively parallel computational units with very high memory bandwidth. A review of the research conducted since then on harnessing GPUs to accelerate scientific computing applications would hardly fit into one page. In particular, the development of GPU-enabled high order numerical methods for the solution of partial differential equations is a rapidly growing field. Focusing on contributions dealing with wave propagation problems, GPUs were first considered for computational electromagnetics and computational geoseismics applications by Klöckner et al. [3] and by Komatitsch et al. [5]-[4], respectively. The present work shares several concerns with [3], which describes the de-
[5]-[4]. The present work shares several concerns with [3] which describes the de-
velopment of a GPU enabled discontinuous Galerkin (DG) method formulated
on an unstructured tetrahedral mesh for the discretization of hyperbolic systems
of conservation laws. As it is the case with the DG method considered in [3], the
approximation of the unknown field in a tetrahedron relies on a high order nodal
where the symbol ∂t denotes a time derivative and J (x, t) is a current source
term. These equations are set on a bounded polyhedral domain Ω of R³. The electric permittivity ε(x) and the magnetic permeability μ(x) are varying in space, time-invariant, and both positive functions. The current source term J is the sum of the conductive current J_σ = σE (where σ(x) denotes the electric conductivity of the medium) and of an applied current J_s associated with a localized source for the incident electromagnetic field. Our goal is to solve system (1) in a domain Ω with boundary ∂Ω = Γ_a ∪ Γ_m, where we impose the following boundary conditions: n × E = 0 on Γ_m, and L(E, H) = L(E^inc, H^inc) on Γ_a, where L(E, H) = n × E − √(μ/ε) n × (H × n). Here n denotes the unit outward normal to ∂Ω and (E^inc, H^inc) is a given incident field. The first boundary
condition is called metallic (referring to a perfectly conducting surface) while
the second condition is called absorbing and takes here the form of the Silver-
Müller condition which is a first order approximation of the exact absorbing
boundary condition. This absorbing condition is applied on Γa which represents
an artificial truncation of the computational domain.
For the numerical treatment of system (1), the domain Ω is triangulated into
a set Th of tetrahedra τi . We denote by Vi the set of indices of the elements which
are neighbors of τi (i.e. sharing a face). In the following, to simplify the presenta-
tion, we set J = 0. For a given partition T_h, we seek approximate solutions to (1) in the finite element space V_pi(T_h) = { v ∈ L²(Ω)³ : v|_τi ∈ (P_pi[τi])³, ∀τi ∈ T_h }, where P_pi[τi] denotes the space of nodal polynomial functions of degree at most p_i inside τi. Following the discontinuous Galerkin approach, the electric and magnetic fields (E_i, H_i) are locally approximated as combinations of linearly independent basis vector fields ϕ_ij. Let P_i = span(ϕ_ij, 1 ≤ j ≤ d_i), where d_i denotes the number of degrees of freedom inside τi. The approximate fields (E_h, H_h), defined by (∀i, E_h|_τi = E_i, H_h|_τi = H_i), are thus allowed to be
completely discontinuous across element boundaries. For such a discontinuous
field Uh , we define its average {Uh }ik through any internal interface aik , as
{Uh }ik = (Ui|aik + Uk|aik )/2. Because of this discontinuity, a global variational
formulation cannot be obtained. However, dot-multiplying (1) by ϕ ∈ Pi , inte-
grating over each single element τi and integrating by parts, yields a local weak
formulation involving volume integrals over τi and surface integrals over ∂τi .
While the numerical treatment of volume integrals is rather straightforward, a
specific procedure must be introduced for the surface integrals, leading to the
definition of a numerical flux. In this study, we choose to use a fully centered
numerical flux, i.e., ∀i, ∀k ∈ V_i: E|_{a_ik} ≈ {E_h}_ik and H|_{a_ik} ≈ {H_h}_ik. The local weak formulation can be written as:

\int_{\tau_i} \varphi \cdot \varepsilon_i \, \partial_t E_i = \frac{1}{2} \int_{\tau_i} \left( \mathrm{curl}\,\varphi \cdot H_i + \mathrm{curl}\,H_i \cdot \varphi \right) - \frac{1}{2} \sum_{k \in V_i} \int_{a_{ik}} \varphi \cdot (H_k \times n_{ik}),
\qquad (2)
\int_{\tau_i} \varphi \cdot \mu_i \, \partial_t H_i = -\frac{1}{2} \int_{\tau_i} \left( \mathrm{curl}\,\varphi \cdot E_i + \mathrm{curl}\,E_i \cdot \varphi \right) + \frac{1}{2} \sum_{k \in V_i} \int_{a_{ik}} \varphi \cdot (E_k \times n_{ik}).
where the symmetric positive definite mass matrices M_i^η (η stands for ε or μ), the symmetric stiffness matrix K_i (both of size d_i × d_i) and the symmetric interface matrix S_ik (of size d_i × d_k) are given by:

(M_i^\eta)_{jl} = \eta_i \int_{\tau_i} {}^{t}\varphi_{ij} \cdot \varphi_{il}, \qquad (S_{ik})_{jl} = \frac{1}{2} \int_{a_{ik}} {}^{t}\varphi_{ij} \cdot (\varphi_{kl} \times n_{ik}),
(K_i)_{jl} = \frac{1}{2} \int_{\tau_i} \left( {}^{t}\varphi_{ij} \cdot \mathrm{curl}\,\varphi_{il} + {}^{t}\varphi_{il} \cdot \mathrm{curl}\,\varphi_{ij} \right).
The set of local systems of ordinary differential equations (3) for each τ_i can be formally transformed into a global system. To this end, we suppose that all electric (resp. magnetic) unknowns are gathered in a column vector E (resp. H) of size d_g = \sum_{i=1}^{N_t} d_i, where N_t stands for the number of elements in T_h. Then system (3) can be rewritten as:

M_\varepsilon \frac{dE}{dt} = K H - A H - B H + C_E E, \qquad M_\mu \frac{dH}{dt} = -K E + A E - B E + C_H H, \qquad (4)

where we emphasize that M_ε and M_μ are d_g × d_g block diagonal matrices. If we set S = K − A − B, then system (4) can be rewritten as:

M_\varepsilon \frac{dE}{dt} = S H + C_E E, \qquad M_\mu \frac{dH}{dt} = -{}^{t}S\, E + C_H H. \qquad (5)
Finally, system (5) is time-integrated using a second-order leap-frog scheme:

M_\varepsilon \frac{E^{n+1} - E^{n}}{\Delta t} = S H^{n+\frac{1}{2}} + C_E E^{n}, \qquad M_\mu \frac{H^{n+\frac{3}{2}} - H^{n+\frac{1}{2}}}{\Delta t} = -{}^{t}S\, E^{n+1} + C_H H^{n+\frac{1}{2}}. \qquad (6)
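Schematically, the leap-frog update (6) corresponds to a host-side time loop of the following form; the apply*/solve* helpers are placeholders standing in for the operator products and block-diagonal mass matrix solves, not the actual DGTD implementation.

```cpp
// Schematic host-side loop for the leap-frog scheme (6); all helpers are stubs.
#include <cstddef>
#include <vector>
using Vec = std::vector<float>;

static Vec applyS  (const Vec& v) { return v; } // placeholder for S v
static Vec applySt (const Vec& v) { return v; } // placeholder for S^T v
static Vec applyCE (const Vec& v) { return v; } // placeholder for C_E v
static Vec applyCH (const Vec& v) { return v; } // placeholder for C_H v
static Vec solveMeps(const Vec& v) { return v; } // placeholder for M_eps^{-1} v
static Vec solveMmu (const Vec& v) { return v; } // placeholder for M_mu^{-1} v

void timeLoop(Vec& E, Vec& H, float dt, int nSteps) {
    for (int n = 0; n < nSteps; ++n) {
        // E^{n+1} = E^n + dt * M_eps^{-1} ( S H^{n+1/2} + C_E E^n )
        Vec rhsE = applyS(H), ce = applyCE(E);
        for (std::size_t i = 0; i < rhsE.size(); ++i) rhsE[i] += ce[i];
        Vec dE = solveMeps(rhsE);
        for (std::size_t i = 0; i < E.size(); ++i) E[i] += dt * dE[i];
        // H^{n+3/2} = H^{n+1/2} + dt * M_mu^{-1} ( -S^T E^{n+1} + C_H H^{n+1/2} )
        Vec rhsH = applySt(E), ch = applyCH(H);
        for (std::size_t i = 0; i < rhsH.size(); ++i) rhsH[i] = ch[i] - rhsH[i];
        Vec dH = solveMmu(rhsH);
        for (std::size_t i = 0; i < H.size(); ++i) H[i] += dt * dH[i];
    }
}
```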
3 Implementation Aspects
3.1 DGTD CUDA Kernels
We describe here the implementation strategy adopted for the GT200 gener-
ation of NVIDIA GPUs and for calculations in single precision floating point
arithmetic. We first note that the main computational kernels of the DGTD-Ppi
method considered in this study are the volume and surface integrals over τi and
∂τi appearing in (2). Moreover, we limit ourselves to a uniform order method
i.e. p ≡ pi is the same for all the elements of the mesh, and we present ex-
perimental results for the values p = 1, 2, 3, 4. At the discrete level, these local
computations translate into the matrix-vector products appearing in (3). The
discrete equations for updating the electric and magnetic fields are composed
of the same steps and only differ by the fields they are applied to. They both
involve the same kernels that we will refer to in the sequel as intVolume (com-
putation of volume integrals), intSurface (computation of surface integrals)
and updateField (update of field components). All these kernels stick to the
following paradigm: (1) load data from device memory to shared memory, (2)
synchronize with all the other threads of the block so that each thread can safely
read shared memory locations that were populated by different threads, (3) pro-
cess the data in shared memory, (4) synchronize again to make sure that shared
memory has been updated with the results, (5) write the results back to device
memory. This paradigm ensures that almost all the operations on data allocated
in global memory are performed in a coalesced way.
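A generic CUDA kernel skeleton following this five-step paradigm is sketched below; it is illustrative only and is not one of the actual intVolume, intSurface, or updateField kernels.

```cpp
// Generic CUDA C++ kernel skeleton for the five-step paradigm
// (assumes a launch configuration of 256 threads per block).
__global__ void paradigmKernel(const float* __restrict__ in,
                               float* __restrict__ out, int n) {
    __shared__ float buf[256];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // (1) coalesced load from device (global) memory into shared memory
    buf[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    // (2) make the loaded values visible to all threads of the block
    __syncthreads();
    // (3) process the data held in shared memory (here: a trivial neighbour sum)
    float v = buf[threadIdx.x];
    if (threadIdx.x + 1 < blockDim.x) v += buf[threadIdx.x + 1];
    // (4) synchronize again before shared memory is overwritten with results
    __syncthreads();
    buf[threadIdx.x] = v;
    __syncthreads();
    // (5) coalesced write of the results back to device memory
    if (gid < n) out[gid] = buf[threadIdx.x];
}
```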
We outline below the main characteristics of these kernels and refer to [6]
for a more detailed description. In our implementation, some useful elementary
matrices, such as the mass matrix computed on the reference element, are stored
in constant memory because they are small and are accessed following constant
memory patterns. In the sequel, we use the following notation: NBTET is the number of tetrahedra treated by a block of threads; it depends on the chosen interpolation order and is taken to be a multiple of 16 because of the way data are loaded from and written to device memory; NDL is the
number of degrees of freedom (d.o.f) in an element τi for each field component,
for a given interpolation order; finally, NDF is the number of d.o.f on a face aik
for each field component, for a given interpolation order.
tetrahedron which allows a block to compute all the d.o.f for NBTET tetrahedra.
This approach is less efficient for the lower interpolation orders. The two versions
of the electric field update kernels need only one shared memory table. Indeed,
in the first step, the flux computed by the previous kernels is loaded in this
table, used to do some computations and then stored in a register. Therefore,
the shared memory table is no longer used at the end of this part. In the second
step, we load the previous values of the electric field in it in a coalesced way. In
a third step, we update the value of the field in the shared memory, and in the
last step, we write the new value of the field in the global memory. The update
of the magnetic field follows the same pattern as the update of the electric field.
4 Performance Results
We first note that GPU timings (for all the performance results presented here
and in the following subsections) are for single precision arithmetic computations
and include the data structures copy operations from the CPU memory to the
GPU device memory prior to the time stepping loop, and vice versa at the
end of the time stepping loop. Numerical experiments have been performed on
a hybrid CPU-GPU cluster with 1068 Intel CPU nodes and 48 Tesla S1070
GPU systems. Each Tesla S1070 has four GT200 GPUs and two PCI Express-
2 buses. The Tesla systems are connected to BULL Novascale R422 E1 nodes
with two quad-core Intel Xeon X5570 Nehalem processors operating at 2.93 GHz, the nodes themselves being connected by an InfiniBand network.
Fig. 1. Geometrical model of head tissues and computed contour lines of the amplitude
of the electric field on the skin
simulation time using one GPU. Although the number of elements of this mesh is
well below the size of the mesh considered for the weak scalability analysis
(i.e. 3,072,000 elements for the DGTD-P1 and DGTD-P2 methods), superlinear
speedups are obtained. However, not surprisingly, the single GPU GFlops rates
are lower than the corresponding ones reported in Table 1 (32 instead of 63 for
the DGTD-P1 method, and 60 instead of 92 for the DGTD-P2 method). For the
two other meshes (i.e. M2 and M3), as expected the DGTD-P2 method is always
more scalable than the DGTD-P1 method because of a more favorable computa-
tion to communication ratio. Overall, acceleration factors ranging from 15 to 25
are observed between the multiple CPU and multiple GPU simulations. We note
however that this comparison is made with a CPU version whose parallel imple-
mentation relies on MPI only. In particular, we have not considered a possible op-
timization for hybrid shared-memory multi-core systems combining the OpenMP and MPI programming models. Besides, a CPU version with better simulation times can be obtained by computing the surface integrals over ∂τi in (2) through a loop over element faces and updating the flux balance of both elements τi and τj, since the numerical flux from τj to τi is just the opposite of that from τi to τj. Such an optimization would lower the simulation times of the
CPU version by approximately 30%. In the present implementation, each elemen-
tary numerical flux is computed twice (respectively for flux balances of τi and τj )
for maximizing the floating point performance in the CUDA SIMD framework.
5 Conclusion
References
1. Fezoui, L., Lanteri, S., Lohrengel, S., Piperno, S.: Convergence and stability of
a discontinuous Galerkin time-domain method for the 3D heterogeneous Maxwell
equations on unstructured meshes. ESAIM: Math. Model. Num. Anal. 39(6),
1149–1176 (2005)
2. Gödel, N., Nunn, N., Warburton, T., Clemens, M.: Scalability of higher-order discon-
tinuous Galerkin FEM computations for solving electromagnetic wave propagation
problems on GPU clusters. IEEE. Trans. Magn. 46(8), 3469–3472 (2010)
3. Klöckner, A., Warburton, T., Bridge, J., Hesthaven, J.: Nodal discontinuous
Galerkin methods on graphic processors. J. Comput. Phys. 228, 7863–7882 (2009)
4. Komatitsch, D., Erlebacher, G., Göddeke, D., Michéa, D.: High-order finite-element
seismic wave propagation modeling with MPI on a large GPU cluster. J. Comput.
Phys. 229(20), 7692–7714 (2010)
5. Komatitsch, D., Göddeke, D., Erlebacher, G., Michéa, D.: Modeling the propagation
of elastic waves using spectral elements on a cluster of 192 GPUs. Comput. Sci. Res.
Dev. 25, 75–82 (2010)
6. Cabel, T., Charles, J., Lanteri, S.: Multi-GPU acceleration of a DGTD method for
modeling human exposure to electromagnetic waves. INRIA Research Report RR-7592 (2011), http://hal.inria.fr/inria-00583617
Study of Hierarchical N-Body Methods
for Network-on-Chip Architectures
Turku Center for Computer Science, Joukahaisenkatu 3-5 B, 20520, Turku, Finland
Department of Information Technology, University of Turku, 20014, Turku, Finland
{canxu,pasi.liljeberg,hannu.tenhunen}@utu.fi
1 Introduction
It is predictable that in the near future, hundreds or even more cores on a chip
will appear on the market. The number of circuits integrated on a chip have
been increasing continuously which leads to an exponential rise in the complex-
ity of their interaction. Traditional digital system design methods, e.g. bus-based
architectures will suffer from high communication delay and low scalability. To
address these problems, NoC communication backbone was proposed for future
multicore systems [1]. Network communication methodologies are brought into
on-chip communication. More transactions can occur simultaneously and thus
the delay of the packets is reduced and the throughput of the system is in-
creased. Moreover, as the links in NoC are based on point-to-point mechanism,
the communication among cores can be pipelined to further improve the system
This work is supported by Academy of Finland and Nokia Foundation. The authors
would like to thank the anonymous reviewers for their feedback and suggestions.
performance. Figure 1 shows a NoC with 4×4 mesh (16 nodes). The underly-
ing network is comprised of network links and routers (R), each of which is
connected to a processing element (PE) via a network interface (NI). The ba-
sic architectural unit of a NoC is the tile/node (N) which consists of a router,
its attached NI and PE, and the corresponding links. Communication among
PEs is achieved via network packets. Intel¹ has demonstrated an 80-tile, 100M-transistor, 275 mm² 2D NoC in 65 nm technology [2]. An experimental microprocessor containing 48 x86 cores on a chip has also been created, using a 4×6 2D mesh topology with 2 cores per tile [2]. The TILE-Gx processor from Tilera, containing 16 to 100 general-purpose processors on a single chip, is available for commercial use [3].
The N-Body problem is a classical problem of approximating the motion of
bodies that interact with each other continuously. The bodies are usually galax-
ies and stars in an astrophysical system. The gravitational force of bodies is
calculated according to Newton’s Principia [4]. The N-Body problem is used in
other computations and simulations as well, e.g. the interference of wireless cells
and protein folding [5]. Several algorithms have been developed for N-Body sim-
ulation. In principle, an exact simulation requires the calculation of all pairs, since the gravitational force is a long-range force; however, the computational complexity of this method is O(n²) [6]. J. Barnes et al. and L. Greengard introduced two fast hierarchical methods [7,8]. A tree is first built according to the positions of the bodies in physical space, and the interactions are calculated by traversing this tree. The computational complexity of these algorithms is reduced to O(n log n), or even O(n) in some cases.
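To make the O(n²) baseline concrete, a direct all-pairs force summation (Newtonian gravity with a small softening term) can be sketched as follows; this is a generic illustration, not the benchmark code used later.

```cpp
// Direct O(n^2) force summation: the all-pairs baseline discussed above.
#include <cmath>
#include <vector>

struct Body { double x, y, z, mass, fx, fy, fz; };

void computeForcesDirect(std::vector<Body>& bodies, double G, double eps) {
    for (auto& b : bodies) { b.fx = b.fy = b.fz = 0.0; }
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        for (std::size_t j = i + 1; j < bodies.size(); ++j) {
            double dx = bodies[j].x - bodies[i].x;
            double dy = bodies[j].y - bodies[i].y;
            double dz = bodies[j].z - bodies[i].z;
            double r2 = dx*dx + dy*dy + dz*dz + eps*eps;   // softened distance
            double inv_r3 = 1.0 / (std::sqrt(r2) * r2);
            double f = G * bodies[i].mass * bodies[j].mass * inv_r3;
            bodies[i].fx += f * dx; bodies[i].fy += f * dy; bodies[i].fz += f * dz;
            bodies[j].fx -= f * dx; bodies[j].fy -= f * dy; bodies[j].fz -= f * dz;
        }
    }
}
```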
The performance of these two algorithms has been studied on traditional cache-coherent shared address space multiprocessors, e.g. Stanford DASH, KSR-1 and SGI Challenge [9]. A simulator has been used to examine the implications of the two algorithms for a multiprocessor architecture [10]. However, the previous works are based on conventional architectures, e.g. bus-based multiprocessors,
¹ Intel is a trademark or registered trademark of Intel or its subsidiaries. Other names and brands may be claimed as the property of others.
The Tilera TILE processor family includes the TILE64, TILEPro and TILE-Gx members. The basic architecture of these processors is the same: an array of 16 to 100 general-purpose RISC processor cores (tiles) in an on-chip mesh interconnect. Each tile consists of a core with its L1 and L2 caches. The memory controllers are integrated on the chip as well.
Figure 2 shows the architecture diagram of the TILE-Gx processor [3]. Each tile consists of a 64-bit VLIW core with a private L1 cache (32KB instruction and 32KB data) and a shared L2 cache (256KB per tile). Four 64-bit DDR3 memory controllers, duplexed to multiple ports, connect the tiles to the main memory.
Fig. 3. An 8×8 mesh-based NoC with memory controllers (MC0–MC3) attached to the top and bottom sides; each tile contains a core with L1 and L2 caches, a network interface (NI), and a router (R)
The L2 caches and the memory are shared by all processors. The processor
operates at 1.0 to 1.5GHz, with typical power consumption of 10 to 55W. The
I/O controllers are integrated on chip to save costs of north and south bridges.
The mesh network provides bandwidth up to 200Tbps.
To analyze the low-level behavior of an application, we model a NoC similar to the Tilera TILE architecture. The processing core of the NoC is a Sun SPARC RISC core [14] with an area of 14 mm² in 65 nm fabrication technology; scaled to 32 nm technology, each core has an area of 3.4 mm². We simulate the characteristics of a 16MB, 64-bank, 64-bit line size, 4-way associative cache in 32 nm with CACTI [15]. The results show that the total area of the cache banks is 64.61 mm²; each cache bank, including data and tag, occupies 1 mm². Routers are quite small compared with processors and caches; e.g. we calculate a 5-port router to occupy only 0.054 mm² in 32 nm. The number of transistors required for a memory controller is also quite small compared with a whole chip (usually billions): it has been reported that a DDR3 memory controller takes about 2,000 LCs on a Xilinx Virtex-5 Field-Programmable Gate Array (FPGA) [16]. The total area of the chip is estimated to be around 300 mm², comparable to the TILE-Gx. Figure 3 illustrates the architecture of the aforementioned NoC.
In this section, we describe the two most important hierarchical N-Body algorithms that we use for our analysis: the Barnes-Hut method [7] and the Fast Multipole Method (FMM) [8]. Both hierarchical methods first build a structured tree by subdividing space cells until a certain condition is met, e.g. a maximum number of particles per leaf cell. The physical space is thus represented by a hierarchical tree, and the computation of interactions is done by
traversing this tree. The two algorithms differ in the steps they use to calculate the interactions of particles.
In the Barnes-Hut method, the tree is traversed for each particle to compute the forces. The traversal starts at the root of the tree and visits every cell. To reduce the computational complexity of long-range interactions, a subtree is approximated by its center of mass if the corresponding cell is far away from the particle. The accuracy of the method therefore depends on this approximation criterion. The Barnes-Hut method only computes particle-particle and particle-cell interactions.
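To make the traversal concrete, the following is a minimal C sketch of a Barnes-Hut force walk using the common opening criterion s/d < θ (cell size over distance). It is an illustration under these assumptions, not the implementation studied in this paper, and all type and function names are ours.

#include <math.h>

typedef struct cell {
    double mass;          /* total mass of the bodies in this cell      */
    double com[3];        /* center of mass of the cell                 */
    double size;          /* edge length of the cubic cell              */
    int    nchild;        /* 0 for a leaf holding a single body         */
    struct cell *child[8];
} cell_t;

/* Accumulate the gravitational acceleration on a particle at position p.
 * A subtree is approximated by its center of mass when size/dist < theta. */
static void bh_force(const cell_t *c, const double p[3],
                     double theta, double eps, double acc[3])
{
    if (c == NULL || c->mass == 0.0)
        return;

    double dx = c->com[0] - p[0];
    double dy = c->com[1] - p[1];
    double dz = c->com[2] - p[2];
    double dist = sqrt(dx * dx + dy * dy + dz * dz + eps * eps);

    if (c->nchild == 0 || c->size / dist < theta) {
        /* Far enough away (or a leaf): use the monopole approximation.  */
        double f = c->mass / (dist * dist * dist);   /* G omitted        */
        acc[0] += f * dx;
        acc[1] += f * dy;
        acc[2] += f * dz;
    } else {
        /* Too close: descend into the children.                         */
        for (int i = 0; i < 8; i++)
            bh_force(c->child[i], p, theta, eps, acc);
    }
}

Larger values of θ open fewer cells and hence trade accuracy for speed, which is the approximation criterion referred to above.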
Compared with Barnes-Hut, the FMM computes cell-cell interactions as well. If two cells are far away from each other, the interaction between them is computed via the multipole expansions of the cells, which further reduces the computational complexity. For uniform distributions, the complexity of FMM is O(n), compared with O(n log n) for Barnes-Hut. To develop a multithreaded program for both algorithms, the space is divided into several regions and each core is assigned a region. Each core builds and computes on the tree for its own region, and most of the calculation time is spent traversing the tree to compute the forces. On a NoC platform, the performance of the algorithms is affected by (a) long-distance communication between nodes; (b) the initial distribution of particles; (c) the dynamically changing positions of particles; and (d) hot-spot traffic.
4 Experimental Evaluation
4.1 Experiment Setup
The simulation platform is based on a cycle-accurate NoC simulator which produces detailed evaluation results. The platform models the routers and links accurately. The state-of-the-art router in our platform includes a routing computation unit, a virtual channel allocator, a switch allocator, a crossbar switch and four input buffers. A deterministic XY routing algorithm has been selected to avoid deadlocks.
We use a 64-core network which models a single-chip NoC for our experiments. A full system simulation environment with 64 nodes, each with a core and its related cache, has been implemented. The simulations run the Solaris 9 operating system on cores based on the UltraSPARC III+ instruction set with an in-order issue structure. Each processor core runs at 2GHz, is attached to a wormhole router, and has a private write-back L1 cache (split I+D, each 32KB, 4-way, 64-bit line, 3-cycle). The 16MB L2 cache shared by all processors is split into banks (64 banks, each 256KB, 64-bit line, 6-cycle). The simulated memory/cache architecture mimics SNUCA [17]. A two-level distributed directory cache coherence protocol, MOESI, based on MESI [18], has been implemented in our memory hierarchy, in which each L2 bank has its own directory. The protocol has five types of cache line status: Modified (M), Owned (O), Exclusive (E), Shared (S) and Invalid (I). We use the Simics [19] full system simulator as our simulation platform. For both methods, we use the Plummer model [20] for particle generation, instead
[Figure 4 (3D plot): network request rate per node over time; axes: Time, Node ID, Packets.]
The time spent on force calculation in the Fast Multipole method is lower than in Barnes-Hut (Table 2), e.g. 58.8% for the 4K input up to 70.3% for the 64K input. Nearly 10% of the time is spent on tree building, and about 15% on barriers. The Fast Multipole method scales worse than Barnes-Hut: the speedups for 64 processors are 36.6x and 53.3x for total execution time and force calculation time, respectively. This is primarily due to the higher number of barriers in the Fast Multipole method. It is noteworthy that, in spite of the poorer scaling, the Fast Multipole method spends less time on calculation; for example, it spends 54.3% of the total execution time in the 64p/64K case, compared with Barnes-Hut. Considering its better scalability, the Barnes-Hut method could require less time on systems with thousands of cores.
Figure 5 shows the network request rate of each processing core when running FMM in a 64-core NoC. The horizontal axis is time, segmented into fragments of 1.69M cycles (one percent of the run each). The traffic trace contains 57.4M packets. It is revealed that,
[Figure 5 (3D plot): network request rate per node over time for FMM on the 64-core NoC; axes: Time, Node ID, Packets.]
several nodes (e.g. N0 7.6%, N46 4.15%, N13 2.72% and N7 2.71%) generate more data traffic than others. The network traffic is relatively low in the starting phase (before 30% of the time slices). After that point, FMM shows traffic patterns similar to Barnes-Hut; however, the hot-spot traffic in FMM is not as significant as in Barnes-Hut. We note that, in terms of point-to-point traffic, a small portion of the source-destination pairs generates a sizable portion of the traffic. For example, only 4 of the pairs (19-60, 13-44, 60-19 and 0-29, out of 64² = 4,096 in total) generated 1.42% of the traffic.
We evaluate other performance metrics of the two algorithms in terms of L2 cache miss rate (L2MR), misses per thousand instructions (MissPKI), Average Link Utilization (ALU) and Average Network Latency (ANL). ALU is calculated
[Figure 6 (bar chart): normalized values of L2MR, MissPKI, ALU and ANL for Barnes-Hut and Fast Multipole.]
from the number of packets transferred between NoC resources per cycle. ANL represents the average number of cycles required for the transmission of all messages; the number of cycles for each message is counted from the injection of the message header into the network at the source node to the reception of the tail flit at the destination node. Under the same configuration and workload, lower values of these metrics are favorable. The results are shown in Figure 6. We note that, in terms of L2MR and MissPKI, Barnes-Hut is lower than FMM (1.21% for L2MR and 15.77% for MissPKI, respectively). This reflects that FMM requires more cache than Barnes-Hut; a system with limited cache could therefore be unsuitable for FMM. The ALU of Barnes-Hut is only 43.83% of that of FMM, which means an alleviated network load. It is noteworthy that, although the Z-axis values in Figure 4 are about twice as large as in Figure 5, each time slice in Figure 4 represents 12.1M cycles, compared with 1.69M cycles in Figure 5. Finally, the ANL of Barnes-Hut is 96.31% of that of FMM, indicating that the network performance of Barnes-Hut is better and hence its communication overhead lower.
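Read this way, both metrics can be derived from a per-packet trace. The following C sketch is illustrative only: the record fields and the interpretation of "packets transferred between NoC resources" as per-link transfers are our assumptions, not the simulator's actual interface.

#include <stddef.h>

typedef struct {
    unsigned long inject_cycle;   /* header injected at the source node    */
    unsigned long receive_cycle;  /* tail flit received at the destination */
    unsigned long link_hops;      /* links the packet traversed            */
} packet_rec_t;

/* Average Link Utilization: transfers between NoC resources per cycle.
 * Average Network Latency: mean cycles from injection to tail reception. */
static void noc_metrics(const packet_rec_t *p, size_t n,
                        unsigned long total_cycles,
                        double *alu, double *anl)
{
    unsigned long long transfers = 0, latency = 0;
    for (size_t i = 0; i < n; i++) {
        transfers += p[i].link_hops;
        latency   += p[i].receive_cycle - p[i].inject_cycle;
    }
    *alu = (double)transfers / (double)total_cycles;
    *anl = (double)latency / (double)n;
}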
5 Conclusion
The implementation of two hierarchical N-Body methods (Barnes-Hut and Fast Multipole) on a NoC platform was studied in this paper. Both the scalability and the network traffic of the two methods were analyzed. We studied an 8×8 NoC model based on state-of-the-art systems. The time distribution of the two methods, with 1 to 64 processing cores, was explored, and we investigated the advantages and disadvantages of the two algorithms. The network request rates of the 64 processing cores were illustrated for both methods. Our experiments have shown that the Barnes-Hut method generates more hot-spot traffic than Fast Multipole; however, it scales better and puts less overall pressure on the on-chip network and caches. The results of this paper give guidance for analyzing hierarchical N-Body methods on a NoC platform.
References
1. Dally, W.J., Towles, B.: Route packets, not wires: on-chip interconnection networks.
In: Proceedings of the 38th Conference on Design Automation, pp. 684–689 (June
2001)
2. Intel: Intel research areas on microarchitecture (May 2011),
http://techresearch.intel.com/projecthome.aspx?ResearchAreaId=11
3. Tilera: Tile-gx processor family (May 2011),
http://www.tilera.com/products/processors/TILE-Gx_Family
4. Aarseth, S.J., Henon, M., Wielen, R.: A comparison of numerical methods for the
study of star cluster dynamics. Astronomy and Astrophysics 37, 183–187 (1974)
5. Perrone, L., Nicol, D.: Using n-body algorithms for interference computation in
wireless cellular simulations. In: Proc. of 8th Int. Symp. on Modeling, Analysis
and Simulation of Computer and Telecommunication Systems, pp. 49–56 (2000)
6. Salmon, J.: Parallel n log n n-body algorithms and applications to astrophysics.
In: Compcon Spring 1991, Digest of Papers, February-1 March, pp. 73–78 (1991)
7. Barnes, J., Hut, P.: A hierarchical O(N log N) force-calculation algorithm. Nature
(1988)
8. Greengard, L.F.: The rapid evaluation of potential fields in particle systems. PhD
thesis, New Haven, CT, USA (1987) AAI8727216
9. Holt, C., Singh, J.P.: Hierarchical n-body methods on shared address space multi-
processors. In: Proc. of 7th SIAM Conf. on PPSC (1995)
10. Singh, J.P., Hennessy, J.L., Gupta, A.: Implications of hierarchical n-body methods
for multiprocessor architectures. ACM Tran. Comp. Sys. 13, 141–202 (1995)
11. Nyland, L., Harris, M., Prins, J.: Fast N-Body Simulation with CUDA. In: Nguyen,
H. (ed.) GPU Gems 3. Addison Wesley Professional (August 2007)
12. Jetley, P., Wesolowski, L., Gioachin, F., Kalé, L., Quinn, T.: Scaling hierarchical
n-body simulations on gpu clusters. In: SC 2010, pp. 1–11 (November 2010)
13. Hamada, T., Nitadori, K.: 190 tflops astrophysical n-body simulation on a cluster
of gpus. In: SC 2010, pp. 1–9 (November 2010)
14. Tremblay, M., Chaudhry, S.: A third-generation 65nm 16-core 32-thread plus 32-
scout-thread cmt sparc processor. In: ISSCC 2008, pp. 82–83 (February 2008)
15. Thoziyoor, S., Muralimanohar, N., Ahn, J.H., Jouppi, N.P.: Cacti 5.1. Technical
Report HPL-2008-20, HP Labs
16. Global, H.: Ddr 3 sdram memory controller ip core (May 2011),
http://www.hitechglobal.com/IPCores/DDR3Controller.htm
17. Kim, C., Burger, D., Keckler, S.W.: An adaptive, non-uniform cache structure for
wire-delay dominated on-chip caches. In: ACM SIGPLAN, pp. 211–222 (October
2002)
18. Patel, A., Ghose, K.: Energy-efficient mesi cache coherence with pro-active snoop
filtering for multicore microprocessors. In: Proceeding of the Thirteenth Interna-
tional Symposium on Low Power Electronics and Design, pp. 247–252 (August
2008)
19. Magnusson, P., Christensson, M., Eskilson, J., Forsgren, D., Hallberg, G., Hogberg,
J., Larsson, F., Moestedt, A., Werner, B.: Simics: A full system simulation platform.
Computer 35(2), 50–58 (2002)
20. Dejonghe, H.: A completely analytical family of anisotropic Plummer models. Royal
Astronomical Society, Monthly Notices 224, 13–39 (1987)
Extending a Highly Parallel Data Mining Algorithm to the Intel® Many Integrated Core Architecture
Keywords: Intel® Many Integrated Core Architecture, Intel® MIC Architecture, Intel® Knights Ferry, NVIDIA Fermi*, GPGPU, accelerators, coprocessors, data mining, sparse grids.
1 Introduction
Experts expect that future exascale supercomputers will likely be based on het-
erogeneous architectures that consist of a moderate amount of “fat” cores and
use a large number of accelerators or coprocessors to deliver a high ratio of
GFLOPS/Watt [21]. Today, Graphic Processing Units (GPU) are very popu-
lar for accelerating highly parallel kernels like dense linear algebra or Monte
Carlo simulations [20,8]. However, the performance increase is not for free and
requires the ability to rewrite compute kernels in GPU-specific languages such
as CUDA [13] or OpenCL [10]. This implies serious porting and tuning effort
for legacy compute-intensive applications (CPU-optimized codes), which are ex-
ecuted in thousands of compute centers every day.
Fig. 1. High-level view of the Intel MIC Architecture (left) and NVIDIA Fermi (right), taken from [12]
The Intel® Many Integrated Core Architecture (Intel® MIC Architecture) is
a massively parallel coprocessor based on Intel Architecture (IA). The existing
tool chain for software development on IA can be used to implement applications
for the Intel MIC Architecture. All traditional HPC programming models such as
OpenMP* and MPI* on top of C/C++ and Fortran will be available. Developers
do not need to accept the high learning curve and implementation effort to
(partially) rewrite their source code to retrofit it for a GPU-based accelerator.
In this paper, we compare a pre-release Intel coprocessor (“Knights Ferry”) of
the Intel MIC Architecture with a recent NVIDIA Tesla* C2050 GPU (Sect. 2).
We focus on the performance of an existing highly parallel workload and assess
the programming productivity during implementation. We use the SG++ data-
mining algorithm (Sect. 3) as the workload for the evaluation. As with most
HPC applications, SG++ is already available as highly optimized code for processors compatible with Intel® Xeon®; hence, we use this as our starting point for the evaluation. The paper then compares the implementations and their performance in Sect. 4. For the comparison, we restrict ourselves to the respective vendors' compilers and toolkits to ensure that the optimal software stack for each compute platform is evaluated.
In this section, we investigate the differences and similarities between the Intel MIC Architecture [7] and the NVIDIA Tesla 2050 accelerator [12]. The Intel MIC Architecture was announced at the International Supercomputing Conference [18] as a massively parallel coprocessor based on IA. It is currently available as pre-release hardware code-named Knights Ferry (based on Intel's previous Larrabee design [9]).
Fig. 1 gives an overview of the respective architectures. Knights Ferry offers 32
general-purpose cores with a fixed frequency of 1200 MHz. The cores are based
on a refreshed Intel® Pentium® (P54C) processor design [3] and have been extended with 64-bit instructions and a 512-bit wide Vector Processing Unit (VPU). Each of the cores offers four-way round-robin scheduling of hardware threads, i.e., in each clock cycle a core switches to the next instruction stream. Each core of the Knights Ferry coprocessor has a local L1 and L2 cache of 32 KB and 256 KB, respectively. With a total of 32 cores, this coprocessor offers
a total of 8 MB shared L2 cache. The cores are connected through a high-speed
ring-bus that interconnects the L2 caches for fast on-chip communication. An
L3 cache does not exist in this design because of the high-bandwidth GDDR5
memory (1800 MHz). In total, the memory subsystem delivers a peak memory
bandwidth of 115 GB/sec.
Since the Intel MIC Architecture is based on IA, it can support the pro-
gramming models that are available for traditional IA-based processors. The
compilers for the Intel MIC Architecture support Fortran (including Co-Array
Fortran) and C/C++. OpenMP [15] and Intel R
Threading Building Blocks [17]
may be used for parallelization as well as emerging parallel languages such as
Intel
R
CilkTM Plus [6] or IntelR
Array Building Blocks [5]. The VPU can be ac-
cessed through the auto-vectorization capabilities of the Intel compiler as well as
low-level programming through intrinsic functions. The Intel MIC Architecture
greatly simplifies programming, as well-known traditional programming models
can be utilized to implement codes for it.
In contrast to the Intel MIC Architecture, the Tesla 2050 architecture [12] does not contain general-purpose compute cores. Instead, it consists of 14 multiprocessors with 32 processing elements each. The processing elements run at a clock speed of 1.15 GHz, and the device offers a memory bandwidth of 144 GB/sec. A 768 KB L2 cache is shared across the 14 multiprocessors.
Because of its special architecture, the Tesla 2050 only supports a limited set of programming models. The most important ones are CUDA and OpenCL, which
are data-parallel programming languages that do not support arbitrary (task-
)parallel programming patterns. Some production compilers, such as the PGI
compiler suite [19] or HMPP [2] support offloading of Fortran code to the GPU,
but restrict the language features in order to fit the GPU programming model.
Fig. 2. Classical basis functions for the first three levels of discretization in 1d (left)
and modified ones with different types of basis functions (right). For both, a selection
of 2d basis functions and a 2d sparse grid of level 3 is shown.
dimensionality: a regular grid with equidistant meshes and k grid points in each dimension contains k^d grid points in d dimensions. This exponential growth typically prevents considering more than 4 dimensions for reasonable discretizations.
We rely on adaptive sparse grids (see [1,16] for details) to mitigate the curse
of dimensionality. They are based on a hierarchical grid structure with basis
functions defined on several levels of discretization in 1d, a hierarchical basis,
and d-dimensional basis functions as products of one-dimensional ones. We em-
ploy two kinds of basis functions: uniform and modified non-uniform. Uniform basis functions lead to grids with a large number of grid points on the domain's boundary, whereas modified non-uniform ones extrapolate towards the domain's boundary, which leads to a smaller grid structure; see Fig. 2.
The hierarchical tensor-product approach allows a function to be represented on several scales. Examining which scales contribute most to the overall solution, it can be seen that plenty of grid points can be omitted from the hierarchical representation, as they contribute only little, at least for sufficiently smooth functions. The costs are reduced from O(k^d) to O(k (log k)^(d-1)), while maintaining an accuracy similar to that of full grids.
The function f should be as close to the data S as possible, minimizing the mean squared error. At the same time, close data points should have similar function values in order to generalize from the data. We minimize the trade-off between both, balanced by the regularization parameter λ (Eq. 1, left), with the hierarchical basis allowing for a simple generalization functional. This leads to a system of linear equations (Eq. 1, right), with matrix B, B_{i,j} = φ_i(x_j), and identity matrix I.
\[
\arg\min_{f}\ \frac{1}{m}\sum_{i=1}^{m}\bigl(f(x_i)-y_i\bigr)^{2}+\lambda\sum_{j=1}^{N}\alpha_j^{2}
\quad\Rightarrow\quad
\Bigl(\tfrac{1}{m}BB^{T}+\lambda I\Bigr)\alpha=\tfrac{1}{m}By. \tag{1}
\]
Because of the storage required for the large matrices, the linear system is solved iteratively, with repeated recomputation of B and B^T. Both correspond to function evaluations, as (B^T α)_i = f(x_i). Unfortunately, from a parallelization point
Fig. 3. Data containers (level, index, α; dataset, y) used to manage adaptive sparse grids and datasets for streaming access
of view, efficient algorithms for function evaluations on sparse grids are inherently multi-recursive in both level and dimensionality. This imposes severe restrictions on parallelization and vectorization, especially on accelerators.
A straightforward alternative approach evaluates all basis functions (even those evaluating to zero) for all data points and sums up the results, as shown in Fig. 3. This is less computationally efficient, but the streaming access to the data and the avoidance of recursive structures and branching easily pay back the additional computation: the approach is arbitrarily parallelizable and can be vectorized.
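As an illustration of this streaming formulation, a scalar C sketch of evaluating every basis function for every data point might look as follows. It assumes the standard 1d hat basis with the level stored as 2^l; it is not the SG++ code itself.

#include <math.h>
#include <stddef.h>

/* 1d hat basis function on level l (stored as 2^l) with index i, for x in [0,1]. */
static double phi1d(double x, double level, double index)
{
    double v = 1.0 - fabs(x * level - index);
    return v > 0.0 ? v : 0.0;
}

/* Streaming evaluation: f(x_m) = sum_j alpha_j * prod_d phi(l_{j,d}, i_{j,d}, x_{m,d}).
 * All basis functions are evaluated for all data points, even when they are zero,
 * so the loops are regular and free of recursion and data-dependent branching.      */
static void eval_streaming(const double *level, const double *index,  /* [ngrid*dim] */
                           const double *alpha,                        /* [ngrid]     */
                           const double *data,                         /* [ndata*dim] */
                           double *result,                             /* [ndata]     */
                           size_t ngrid, size_t ndata, size_t dim)
{
    for (size_t m = 0; m < ndata; m++) {          /* one work item per data point */
        double sum = 0.0;
        for (size_t j = 0; j < ngrid; j++) {
            double prod = alpha[j];
            for (size_t d = 0; d < dim; d++)
                prod *= phi1d(data[m * dim + d],
                              level[j * dim + d], index[j * dim + d]);
            sum += prod;
        }
        result[m] = sum;
    }
}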
In the following, we use two test scenarios, both with a moderate dimensionality of d = 5 and distinct challenges. The first dataset, with 2^18 data points, classifies a regular 3 × · · · × 3 checkerboard pattern. The second one is a real-world dataset from astrophysics, predicting spectroscopic redshifts of galaxies based on more than 430,000 photometric measurements. For both, excellent numerical results are obtained using our method; see [11] for details.
Listing 1.1. Offloading computation and data for execution on the Intel MIC Archi-
tecture coprocessor by preprocessor directives and calling the compute kernel.
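Listing 1.1 itself is not reproduced in this excerpt. As a rough sketch of what such a directive-based offload can look like, the following C fragment uses the offload pragma syntax of the Intel compiler for MIC; the kernel and array names are placeholders and not the actual SG++ code.

#include <stddef.h>

/* Placeholder kernel, compiled for both host and coprocessor. */
__attribute__((target(mic)))
static void mult_transposed(const double *data, const double *alpha,
                            double *result, size_t ndata, size_t ngrid, size_t dim)
{
    /* Placeholder body: the real kernel evaluates all sparse-grid basis
     * functions for every data point (see the streaming scheme above).  */
    (void)alpha; (void)ngrid;
    for (size_t m = 0; m < ndata; m++)
        result[m] = data[m * dim];
}

void offload_mult(const double *data, const double *alpha, double *result,
                  size_t ndata, size_t ngrid, size_t dim)
{
    /* Copy the dataset and coefficients to coprocessor 0, run the kernel
     * there, and copy the result vector back to the host.                */
    #pragma offload target(mic:0) \
            in(data   : length(ndata * dim)) \
            in(alpha  : length(ngrid))       \
            out(result: length(ndata))
    {
        mult_transposed(data, alpha, result, ndata, ngrid, dim);
    }
}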
Uniform basis functions. Due to the shader-style code (see [14]), the implementation of B^T α evaluates one instance of the dataset per work item.
Table 1. Performance of both the simple and the optimized software ports to Intel Knights Ferry and to the NVIDIA C2050. Performance is measured in GFLOPS using single-precision floating-point numbers.
NVIDIA suggests the local size (number of work items in a work group) to be a
multiple of 32. Our tests have shown that a local size of 64 gives the best per-
formance on the Fermi GPU. Although the Fermi architecture offers standard
L1 and L2 caches, the performance of a simple straightforward port is clearly
behind the Intel MIC Architecture performance as shown in Table 1.
Several kernel optimization techniques can be applied to improve performance. First, the local storage of a workgroup can be used to prefetch data into the caches. Second, the OpenCL runtime compiler can be instructed to perform runtime code generation: at runtime, the compiler knows the loop length, so the loop over the dimensions can be completely unrolled to reduce the amount of control flow in the kernel.
Because the multiplication by B is parallelized along grid points, an implementation difficulty arises. The grid may contain an arbitrary number of grid points, but we have to map the grid to workgroups with a discrete distribution of points. There are two options to mitigate this issue: first, we could use a workgroup size of one (i.e., a workgroup is mapped to a work item); second, we may split the operator into a GPU and a CPU part. The GPU then handles all multiples of 64 that are smaller than the number of grid points and the CPU takes on the remainder. We use the second approach as it exhibits better performance (see the sketch below). However, besides an optimized GPU implementation, an optimized Xeon processor-based implementation is then also needed. The Intel MIC Architecture does not require such padding; its cores can handle odd numbers of iterations efficiently because of their standard IA-based instruction set.
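A minimal sketch of this padding logic, with hypothetical callbacks standing in for the OpenCL kernel launch and the optimized host path:

#include <stddef.h>

/* Split the grid-point loop: the device processes the largest multiple of the
 * workgroup size (64 here) not exceeding ngrid, the host handles the rest.    */
void mult_split(size_t ngrid, size_t wg_size,
                void (*gpu_part)(size_t first, size_t count),
                void (*cpu_part)(size_t first, size_t count))
{
    size_t gpu_count = (ngrid / wg_size) * wg_size;   /* multiple of wg_size   */
    size_t remainder = ngrid - gpu_count;

    if (gpu_count > 0)
        gpu_part(0, gpu_count);           /* e.g. enqueue the OpenCL kernel    */
    if (remainder > 0)
        cpu_part(gpu_count, remainder);   /* optimized host implementation     */
}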
However, modified non-uniform basis functions reduce grid sizes and memory consumption. On Knights Ferry, this halves the runtime, whereas on the C2050 it leads to a 63% higher runtime.
Since the Intel MIC Architecture relies on a mixture of traditional threading and vectorization, a suitable vectorization for the modified linear basis functions is as follows. As the if branches are independent of the evaluation point, several data-point instances can be loaded into a vector register while one grid point is broadcast into vector registers. Depending on the grid point's property in a certain dimension, the if condition is decided once for all data points currently stored in the vector registers; there is no need to evaluate the if statement for each data point (a sketch of this control structure is given below). Hence, the GFLOPS rate only drops by about 40%.
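The control structure can be sketched in scalar C as follows: the branch is taken once per grid point and dimension, and the selected formula is then applied to a whole block of data points (one vector register in the real code). The concrete boundary formulas are an assumption based on the usual modified-linear basis; the constant level-1 case is omitted.

#include <math.h>
#include <stddef.h>

/* Multiply the running products for a block of data points by the 1d basis
 * value of one grid point. The branch depends only on the grid point (left
 * boundary, right boundary, or interior), so it is hoisted out of the loop
 * over data points.                                                         */
static void eval_dim_block(double level, double index,
                           const double *x, double *prod, size_t nblock)
{
    if (index == 1.0) {                        /* modified left-boundary basis  */
        for (size_t k = 0; k < nblock; k++) {
            double v = 2.0 - level * x[k];
            prod[k] *= v > 0.0 ? v : 0.0;
        }
    } else if (index == level - 1.0) {         /* modified right-boundary basis */
        for (size_t k = 0; k < nblock; k++) {
            double v = level * x[k] - index + 1.0;
            prod[k] *= v > 0.0 ? v : 0.0;
        }
    } else {                                   /* interior hat function         */
        for (size_t k = 0; k < nblock; k++) {
            double v = 1.0 - fabs(level * x[k] - index);
            prod[k] *= v > 0.0 ? v : 0.0;
        }
    }
}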
The root-cause analysis for the NVIDIA C2050 exhibits two reasons for the
increase in runtime. First, noticeably more time is spent executing on the accel-
erator due to the frequent evaluation of the if statements. The if statements
slow down the code, as the GPU’s streaming processor executes them through
predicates and parts of the processor may execute no-op instructions. Second,
the grid sizes are significantly smaller for the non-uniform basis functions. Since
the operator B is parallelized over the number of grid points, a smaller grid leads
to a smaller degree of parallelism that can no longer satisfy the high number of
processing elements of the NVIDIA C2050.
Multi-device configurations. The offload model of the compiler for the Intel MIC Architecture directly supports multiple coprocessors. All Intel MIC Architecture devices in the system are uniquely identified by an integer number and can be selected by their ID in the offload pragma. For streaming applications, only the length of the offloaded arrays has to be adjusted according to the number of available devices. This boils down to simple arithmetic involving the array length, the number of devices, and the device ID (see the sketch following this paragraph). OpenCL multi-device support is based on a replication of API objects such as buffers, kernels, and command queues. Instead of simple handles for arrays, a second level of handles must be introduced to keep track of arrays on different devices. This complicates the implementation, as it requires additional boilerplate code.
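For instance, the per-device portion of a streamed array could be computed as in the following C sketch (illustrative only; here the last device simply absorbs any remainder):

#include <stddef.h>

/* Compute the slice of an array of length total that device dev (0-based)
 * out of ndev devices should process.                                      */
static void device_slice(size_t total, int ndev, int dev,
                         size_t *offset, size_t *length)
{
    size_t chunk = total / (size_t)ndev;
    *offset = (size_t)dev * chunk;
    *length = (dev == ndev - 1) ? total - *offset : chunk;
}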
As the grids with modified linear basis functions need more tuning and rewrit-
ing of the algorithm to fully exploit the GPU, we restrict ourselves to grids with
standard basis functions when evaluating the performance of the dual-device
configuration. Table 1 lists the measured results on both platforms in the last
two rows. It is obvious that the additional padding needed for the GPU has
negative effects on the dual-GPU version especially for the small grids in early
stages of the learning process. Since the Intel MIC Architecture implementation
does not need host-CPU padding, both coprocessors can unfold their full power
when dealing with small grids. For all input data, the Knights Ferry coprocessor
achieves a speed-up of at least 1.9x when adding a second device, whereas a
second NVIDIA C2050 yields a speed-up of about 1.7x.
but with a more compute-intensive kernel). The code can fully exploit the 16-times larger general-purpose L2 cache of Knights Ferry, which explains its better baseline performance compared with the Tesla C2050. Table 1 shows that prefetching for MIC only slightly speeds up the compute kernel, whereas adding manual local-storage loads boosts the performance of the C2050 kernel. For the smaller DR5 input data, the Fermi GPU is not able to utilize its full power, while the Intel MIC Architecture is less sensitive to the size of the input data.
Productivity. In total, only two workdays were spent enabling SG++ for the Intel MIC Architecture, since the tool chain of Intel® Composer XE, Intel® Debugger and Intel® VTune™ Amplifier XE helped root-cause and fix performance issues in a well-known workflow. Additional implementation complexity
arose from the workgroup padding needed for the C2050. We used the Visual
Compute Profiler to optimize the C2050 kernel. In total, five workdays were
required to implement the C2050 kernel. To keep the development time for the
devices comparable, all code variants have been developed by the same person
who is also one of the main developers of SG++ and has deep insight into the
mathematical structure of SG++ . Hence, we exclude the time needed to analyze
SG++ and acquaint the developer with the existing host implementation. The
developer had access to the documentation for Knights Ferry and the develop-
ment tools. For NVIDIA, the developer had access to both the official OpenCL
documents as well as best-practice guides that can be found on the Internet.
On both platforms standard dense linear algebra benchmarks are clearly above
0.5 TFLOPS, which highlights the excellent performance of the implementations
presented.
5 Conclusion
We demonstrated that Intel MIC Architecture devices can easily be used to bring
highly parallel applications into, or even beyond, GPU performance regions. Us-
ing well-known programming models like OpenMP and vectorization, the Intel
MIC Architecture minimizes the porting effort for existing high-efficiency pro-
cessor implementations. Moreover, programming for the Intel MIC Architecture
does not require any special tools since its support is integrated into the com-
plete Intel tool chain ranging from compilers over math libraries to performance
analysis tools. As future HPC systems will most likely be hybrid machines with
fat cores and coprocessors, programming for the Intel MIC Architecture eases
the burden for developers; codes developed for the CPU portion of the system
can be re-used on the coprocessor without too much of a porting effort, while
achieving a better level of performance than with GPU-based accelerators.
References
1. Bungartz, H.-J., Griebel, M.: Sparse Grids. Acta Numerica 13, 147–269 (2004)
2. CAPS Enterprise. Rapidly Develop GPU Accelerated Applications (2011)
Intel, Pentium, and Xeon are trademarks or registered trademarks of Intel Corporation
or its subsidiaries in the United States and other countries.
* Other brands and names are the property of their respective owners.
** Performance tests are measured using specific computer systems, components, soft-
ware, operations, and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and performance tests to as-
sist you in fully evaluating your contemplated purchases, including the performance of
that product when combined with other products. System configuration: Intel Shady
Cove Platform with 2S Intel Xeon processor X5680 [4] (24GB DDR3 with 1333 MHz,
SLES11.1) and single Intel 5520 IOH, Intel Knights Ferry with D0 ED silicon (GDDR5
with 3.6 GT/sec, driver v1.6.501, flash image/micro OS v1.0.0.1140/1.0.0.1140-EXT-
HPC, Intel Composer XE for MIC v048), and NVIDIA C2050 (GDDR5 with 3.0
GT/sec, driver v270.41.19, CUDA 4.0).
VHPC 2011: 6th Workshop on Virtualization
in High-Performance Cloud Computing
1 Introduction
Intrinsic trade-off between efficient resource utilization and performance isola-
tion arises in cloud computing environments where various services are provided
based on a shared pool of computing resources. For high resource utilization,
cloud providers typically service a virtual machine (VM) as an isolated compo-
nent and enable multiple VMs to share underlying physical resources. Although
aggressive resource sharing among customers gives a provider more profit, perfor-
mance interference from the sharing could degrade quality of service customers
expect. Performance isolation that ensures quality of service makes it difficult for
providers to increase VM consolidation level. Many researchers have addressed
this conflicting goal focusing on several sharable resources [2,4,5,7,12].
Among these sharable resources, machine memory is known as the resource that primarily inhibits a high degree of consolidation, due to the expensive cost of hardware extension and power consumption [6]. To deal with the memory space restriction, memory deduplication has gained traction as a way of increasing the available memory by eliminating redundant memory. Since memory deduplication was introduced by the VMware ESX server [15], it has
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2011-0000371).
been well studied with respect to how to effectively find redundant memory and how to take advantage of the saved memory [6,11]. Due to its effectiveness in reducing the memory footprint needed to host the requested VM instances, memory deduplication has been appealing to cloud providers who aim to reduce the total cost of ownership.
Existing memory deduplication techniques, however, lack performance isolation in spite of their efficiency. The problem stems from the system-wide operation of memory deduplication across all VMs that reside in a physical machine. In virtualized clouds, a physical machine can accommodate several VMs that belong to different customers, who may not want their sensitive memory contents to be shared with other customers' VMs. Existing schemes do not provide a knob to confine the deduplication process to a group of VMs that want to share their memory with one another (e.g., VMs of the same customer or of cooperating customers). In addition, the resource usage of system-wide deduplication cannot be properly accounted to the corresponding VMs that are involved in sharing. Since the resource usage of memory deduplication itself is nontrivial [11,8], appropriate accounting for the expense of deduplication is a requisite for cloud computing, which typically employs a pay-per-use model.
This paper proposes a group-based memory deduplication scheme that allows the hypervisor to run multiple deduplication threads, each of which is in charge of its designated group. Our scheme provides an interface for a group of VMs that want to share their memory to be managed by a dedicated deduplication thread. Group-based memory deduplication has the following advantages. Firstly, the memory contents of one group are securely protected from another group; this prevents security breaches that exploit memory deduplication [14]. Secondly, the resource usage of deduplication is properly accounted to the corresponding group. Thirdly, a deduplication thread can be customized based on the characteristics and demands of its group; for example, deduplication rates (i.e., scanning rates) can be set differently for each group based on its workload. Finally, memory pages that are reclaimed by a per-group deduplication thread can be readily redistributed to their corresponding group.
The rest of this paper is organized as follows: Section 2 describes the back-
ground and motivation behind this work. Section 3 explains the design and
implementation of the group-based memory deduplication. Then, Sect. 4 shows
experimental results and Sect. 5 discusses issues arising in our scheme and fur-
ther improvement. Finally, Sect. 6 presents our conclusion and future work.
[Figure: group-based deduplication architecture; labels: VM 1–VM 5, per-group Deduplication Threads, Memory Redistributors, reclaimed Machine Memory, Cgroup interface, Administrator.]
3.2 Implementation
We implemented a prototype of our scheme by extending Linux KSM [1]. The current KSM conducts system-wide memory deduplication over the virtual address spaces that are registered via the madvise system call. When the Kernel-based Virtual Machine (KVM) [9] creates a VM instance, it automatically registers the VM's entire memory regions with KSM. Once KSM is initiated, a global deduplication thread, named ksmd, performs deduplication over all VMs' memory regions. For group-based memory deduplication, we modified this system-wide deduplication algorithm by splitting the global ksmd into per-group ksmds. Each per-group ksmd operates on its own data structures, which are completely isolated from those of other ksmds.
As the grouping interface, we used cgroups [10], a general mechanism for grouping threads via the Linux VFS. We added a KSM cgroup subsystem so that administrators or user applications can easily define deduplication groups. Each group directory includes several logical files, which expose the scan rate and the number of shared pages, for interacting with the per-group ksmd.
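As an illustration of how such logical files could be driven from user space, the following C sketch writes a per-group scan rate. The cgroup mount point and file name are hypothetical, since the prototype's exact file layout is not given in this excerpt.

#include <stdio.h>

/* Set the scan rate (pages per second) for one deduplication group by writing
 * to a per-group logical file. Both the mount point and the file name below
 * are hypothetical placeholders, not the prototype's actual interface.        */
static int set_group_scan_rate(const char *group, long pages_per_sec)
{
    char path[256];
    snprintf(path, sizeof(path), "/sys/fs/cgroup/ksm/%s/ksm.scan_rate", group);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%ld\n", pages_per_sec);
    return fclose(f);
}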
Taking advantage of the cgroup interface, the memory redistributor is simply implemented as a user-level script. This script periodically checks the number of reclaimed pages for each group and reprovisions them to the corresponding group by interacting with a guest-side balloon driver. Regarding intra-group reprovisioning, our current version supplies the given memory evenly to the VMs within a group; more sophisticated policies could be applied by using working-set estimation techniques.
4 Evaluation
In this section, we present preliminary evaluation results to show how the group-based memory deduplication scheme affects memory sharing and redistribution behavior.
4.1 Experimental Environments
Our prototype is installed on a machine with an Intel i5-760 quad-core CPU at 2.80GHz, 4GB of RAM, and two 1TB HDDs. This host machine runs Ubuntu 10.10 with qemu-kvm 0.14.0 and our modified Linux kernel 2.6.36.2. We compared our scheme, called GRP, with two baseline schemes: NOGRP-equal and NOGRP-SE. While both baselines use non-grouped memory deduplication, they have different reprovisioning policies. NOGRP-equal reprovisions reclaimed memory evenly to the existing VMs, whereas NOGRP-SE gives a VM reclaimed memory in proportion to its sharing entitlement, i.e., how much the VM contributes to saving memory; this reprovisioning scheme was proposed by Milos et al. [11]. For example, if two VMs produce all of the reclaimed pages, they receive all of the additional memory they contributed. From the perspective of isolation, we believe that this scheme is more suitable for cloud environments than equal reprovisioning.
We evaluated a two-group scenario where each group has two VMs configured as follows:
To minimize interference between groups, we used the cpu, cpuset, and blkio cgroup subsystems for both NOGRP and GRP. The NOGRP baselines place the global ksmd in its own group, while our scheme makes each per-group ksmd belong to its corresponding group, so that the deduplication cost is accounted to that group. The groups of the main workloads, including the ksmd group in the NOGRP case, have sufficiently higher CPU shares than other system threads in order to minimize the effect of system daemon activity.
We evaluated the performance and memory changes, together with the sharing trends, for two configurations in which one group has enough memory to cover its working set while the other does not. FIO_low indicates that the FIO group does not have enough memory to cover its working set (MR-VM:FIO-VM = 640MB:640MB), whereas MR_low indicates that the MR group lacks memory for its working set (MR-VM:FIO-VM = 384MB:896MB). With respect to our scheme, we varied the scan rates for each group; GRP-x:y denotes the ratio of scan rates for MR and FIO. To compare the performance across all policies, we make the sum of the scan rates equal for each policy (10,000 pages/sec).
Figure 2 shows the normalized throughput of each group for the different policies. The first thing to note is that the two NOGRP schemes show different performance. In the FIO_low case, the FIO group under NOGRP-equal shows much higher performance than under NOGRP-SE. To investigate this difference, Fig. 3 shows the changes in memory for each VM, together with the amount of reclaimed memory, as time progresses. In both cases, the MR group emits a large amount of reclaimed memory between 25 and 60 seconds. Although the MR group is the contributor of the reclaimed pages during that period, NOGRP-equal reprovisions them evenly to the two groups. Since the FIO group lacks memory in FIO_low, this additional memory boosts its performance. Furthermore, the increased memory helps the FIO group produce more reclaimed memory by sharing more pages. On the other hand, NOGRP-SE reprovisions the initially reclaimed memory only to the MR group, based on its sharing entitlement, so the FIO group cannot benefit from any additional memory during the initial period.
Conversely, in the MR_low case, the MR group under NOGRP-SE achieves higher performance than under NOGRP-equal. As shown in Fig. 4, NOGRP-SE makes the MR group quickly receive more memory, contributed by its own sharing during the initial period, thereby boosting the performance of the MR
(a) FIO_low (b) MR_low
Fig. 2. Normalized performance for NOGRP-equal, NOGRP-SE, and GRP with various scan rates (x:y is the ratio of scan rates for MR:FIO)
Fig. 3. Memory changes (VM memory in MB) in the NOGRP cases with reclaimed memory (FIO_low)
Fig. 4. Memory changes (VM memory in MB) in the NOGRP cases with reclaimed memory (MR_low)
group. The results of FIO_low and MR_low imply that neither of the non-group schemes (NOGRP-equal and NOGRP-SE) always achieves the best performance, since each group's memory demand is different.
Figure 2 also shows the results of group-based memory deduplication with various scan rate settings. As shown in the figure, the best performance is achieved at certain scan rate ratios: 1:9 for FIO_low and 9:1 for MR_low. It is
Fig. 5. Memory changes in the best performance cases of GRP with reclaimed memory (VM memory in MB over time in seconds)
intuitive that a higher scan rate makes a group that lacks memory reap additional memory more quickly, thereby improving its performance. Figure 5 shows the two best-performance cases. As expected, a high scan rate quickly produces reclaimed memory, which is then reprovisioned to the group that desires more memory. Although a low scan rate slowly emits only a small amount of reclaimed memory, the performance of a group that already has enough memory is not affected. As a result, group-based memory deduplication can achieve the best performance if the scan rate for each group is chosen appropriately. NOGRP-SE is currently the most suitable non-group approach for clouds because of its contribution-proportional reprovisioning, but it leaves no room for customization on the basis of each group's memory demand and workload characteristics. In Sect. 5.3, we discuss our plan to devise dynamic adjustment of per-group scan rates.
5 Discussion
In this section, we discuss the promising applicability of group-based deduplication, focusing on VM colocation, various grouping policies, and feasible customization of per-group deduplication.
5.1 VM Colocation
For group-based memory deduplication to be effective, multiple VMs within the same group should be colocated in a physical machine. Assuming that a group is established per customer, there are several reasons why VMs from the same customer will be colocated. Firstly, as novel hardware (e.g., many-core processors and SR-IOV network cards) increasingly supports consolidation scalability [7], a physical machine becomes capable of colocating an increasing number of VMs. This trend increases the likelihood that VMs from the same customer are colocated. Secondly, VM colocation policies that favor cloud-wide
resource efficiency (e.g., memory footprint [16] and network bandwidth [13])
would encourage a cloud provider to colocate VMs from the same customer.
For example, if a cloud customer leases VMs for distributed computing on the MapReduce framework, the VMs have a homogeneous software stack, a common
working set, and much communication traffic among them. In this case, a cloud
provider seeks to colocate such VMs in a physical machine for efficiency as long
as their SLAs are satisfied.
Even when the same customer's VMs are not colocated, there are still opportunities to take advantage of group-based memory deduplication. As cloud computing embraces various services, there are growing opportunities to share data among related services. CloudViews [3] presents a blueprint of rich data sharing among cloud-based Web services. We expect that such a direction allows our scheme to group cooperating customers who agree to data sharing. In addition, intra-VM memory deduplication may not be negligible, depending on the workload, when a VM is the sole member of a group; some scientific workloads have a considerable amount of duplicate pages even in native environments [1].
References
1. Arcangeli, A., Eidus, I., Wright, C.: Increasing memory density by using ksm. In:
Proc. OLS (2009)
2. Cucinotta, T., Giani, D., Faggioli, D., Checconi, F.: Providing Performance
Guarantees to Virtual Machines Using Real-Time Scheduling. In: Guarracino,
M.R., Vivien, F., Träff, J.L., Cannatoro, M., Danelutto, M., Hast, A., Perla,
F., Knüpfer, A., Di Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010.
LNCS, vol. 6586, pp. 657–664. Springer, Heidelberg (2011)
3. Geambasu, R., Gribble, S.D., Levy, H.M.: Cloudviews: Communal data sharing in
public clouds. In: Proc. HotCloud (2009)
4. Gordon, A., Hines, M.R., da Silva, D., Ben-Yehuda, M., Silva, M., Lizarraga, G.:
Ginkgo: Automated, application-driven memory overcommitment for cloud com-
puting. In: Proc. RESoLVE (2011)
5. Gupta, D., Cherkasova, L., Gardner, R., Vahdat, A.: Enforcing Performance Iso-
lation Across Virtual Machines in Xen. In: van Steen, M., Henning, M. (eds.)
Middleware 2006. LNCS, vol. 4290, pp. 342–362. Springer, Heidelberg (2006)
6. Gupta, D., Lee, S., Vrable, M., Savage, S., Snoeren, A.C., Varghese, G., Voelker,
G.M., Vahdat, A.: Difference engine: Harnessing memory redundancy in virtual
machines. In: Proc. OSDI (2008)
7. Keller, E., Szefer, J., Rexford, J., Lee, R.B.: Nohype: Virtualized cloud infrastruc-
ture without the virtualization. In: Proc. ISCA (2010)
8. Kim, H., Jo, H., Lee, J.: XHive: Efficient cooperative caching for virtual machines.
IEEE Transactions on Computers 60(1), 106–119 (2011)
9. Kivity, A., Kamay, Y., Laor, D., Lublin, U., Liguori, A.: KVM: The Linux virtual
machine monitor. In: Proc. OLS (2007)
10. Menage, P.B.: Adding generic process containers to the Linux kernel. In: Proc.
OLS (2007)
11. Milós, G., Murray, D.G., Hand, S., Fetterman, M.A.: Satori: Enlightened page
sharing. In: Proc. USENIX ATC (2009)
12. Nathuji, R., Kansal, A., Ghaffarkhah, A.: Q-clouds: Managing performance inter-
ference effects for qos-aware clouds. In: Proc. EuroSys (2010)
13. Sonnek, J., Greensky, J., Reutiman, R., Chandra, A.: Starling: Minimizing commu-
nication overhead in virtualized computing platforms using decentralized affinity-
aware migration. In: Proc. ICPP (2010)
14. Suzaki, K., Iijima, K., Yagi, T., Artho, C.: Memory deduplication as a threat to
the guest OS. In: Proc. EuroSec (2011)
15. Waldspurger, C.A.: Memory resource management in VMware ESX server. In:
Proc. OSDI (2002)
16. Wood, T., Tarasuk-Levin, G., Shenoy, P., Desnoyers, P., Cecchet, E., Corner, M.D.:
Memory buddies: Exploiting page sharing for smart colocation in virtualized data
centers. In: Proc. VEE (2009)
A Smart HPC Interconnect for Clusters
of Virtual Machines
1 Introduction
MPI or shared memory libraries. Our approach exports basic RDMA semantics to the VM's user-space using the following operations:
Initialization: The guest side of our framework is responsible for setting up an initial communication path between the application and the backend.
Frontend-backend communication: This is achieved by utilizing the messaging mechanism between the VM and the backend. It serves as a means for applications to instruct the backend to transmit or wait for communication, and for the backend to inform the guest and the applications of error conditions or completion events. We implemented this mechanism using event channels and grant references.
Export interface instances to user-space: To support this type of mechanism we utilize endpoint semantics. The guest side provides operations to open and close endpoints, in terms of allocating or deallocating and memory-mapping control structures residing on the backend.
Memory registration: In order to
perform RDMA operations from
user-space buffers, applications
have to inform the kernel to ex-
clude these buffers from memory
handing / relocation operations.
To transfer data from application
buffers to the network, the back-
end needs to access memory ar-
eas. This happens as follows: the
frontend pins the memory pages,
grants them to the backend and
the latter accepts the grant in or- Fig. 1.
der to gain access to these pages.
An I/O Translation Look-aside Buffer (IOTLB) is used to cache the translations
of pages that will take part in communication. This approach ensures the valid-
ity of source and destination buffers, while enabling secure and isolated access
multiplexing.
Guest-to-Network: The backend performs a look-up in the IOTLB, finds the relevant machine address, and informs the NIC to program its DMA engines to start the transfer from the guest's memory. The DMA transfer is performed directly to the NIC and, as a result, packets are encapsulated into Ethernet frames before being transmitted to the network. We use a zero-copy technique on the send path in order to avoid extra, unnecessary copies. Packet headers are filled in by the backend and the relevant (granted) pages are attached to the socket buffer.
Network-to-Guest: When an Ethernet frame is received from the network, the backend invokes the associated packet handler. The destination virtual address and endpoint are defined in the header, so the backend performs a look-up in its IOTLB and performs the necessary operations. Data are then copied (or DMA'd) to the relevant (already registered) destination pages.
Wire protocol: Our protocol's packets are encapsulated into Ethernet frames containing the protocol type (a unique type field) and the source and destination MAC addresses.
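As an aside, a minimal C sketch of such a translation cache is given below; the entry layout and the linear lookup are our illustration, not the framework's actual data structures.

#include <stddef.h>
#include <stdint.h>

/* One cached translation: a registered guest buffer page and the machine
 * address that the backend may hand to the NIC's DMA engine.              */
typedef struct {
    uint64_t guest_vaddr;   /* page-aligned guest virtual address           */
    uint64_t machine_addr;  /* machine (host-physical) address of the page  */
    uint32_t endpoint;      /* endpoint that registered the buffer          */
    int      valid;
} iotlb_entry_t;

/* Look up the machine address for a registered page of a given endpoint.
 * Returns 0 on a hit, -1 on a miss (the backend then rejects the request). */
static int iotlb_lookup(const iotlb_entry_t *tlb, size_t nentries,
                        uint32_t endpoint, uint64_t guest_vaddr,
                        uint64_t *machine_addr)
{
    for (size_t i = 0; i < nentries; i++) {
        if (tlb[i].valid && tlb[i].endpoint == endpoint &&
            tlb[i].guest_vaddr == guest_vaddr) {
            *machine_addr = tlb[i].machine_addr;
            return 0;
        }
    }
    return -1;
}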
Data Movement: Figure 1 shows the data paths for both control and data movement.
Proposed approach: Applications issue requests for RDMA operations through endpoints. The frontend passes requests to the backend using the event channel mechanism (dashed arrow, b1). The backend performs the necessary operation, either registering memory buffers (filling up the IOTLB) or issuing transmit requests to the Ethernet driver (dashed arrow, c1). The driver then informs the NIC to DMA data from the application to the on-chip buffers (dashed arrow, d1).
Ideal approach: Although the proposed approach relieves the system of processing and context-switch overheads, ideally VMs could communicate directly with the hardware, lowering the multiplexing authority to the NIC's firmware (solid arrows).
4 Performance Evaluation
We use a custom synthetic microbenchmark to evaluate our approach over our interconnect, sending unidirectional RDMA write requests. To obtain a baseline measurement, we implemented the microbenchmark using TCP sockets; the TCP/IP results were verified using netperf [16] in TCP_STREAM mode with varying message sizes. As a testbed, we used two quad-core Intel Xeon 2.4GHz machines with an Intel 5500 chipset and 4GB of main memory. The network adapters are two PCIe-4x Myricom 10G-PCIE-8A 10GbE NICs in Ethernet mode, connected back-to-back. We used Xen version 4.1-unstable and Linux kernel version 2.6.32.24-pvops both for the privileged guest and the VMs. The MTU was set to 9000 for all tests. We use 1GB of memory for each VM and 2GB for the privileged guest. CPU utilization results are obtained from /proc/stat. To eliminate Linux and Xen scheduler effects, we pinned all vCPUs to physical CPUs and assigned 1 core per VM and 2 cores to the privileged guest, distributing the interrupt affinity of the event channels and the Myrinet NICs across the physical cores. In the following, TCP SOCK refers to the TCP/IP network stack and ZERO COPY refers to our proposed framework.
4.1 Results
To obtain a baseline for our experiments, we ran the pktgen utility of the Linux kernel. This benchmark uses raw Ethernet and thus constitutes the upper bound for all approaches. Figure 2(a) plots the maximum achievable socket buffer production rate when executed in vanilla Linux (first bar), inside the privileged guest (second bar), and in the VM (third bar). Clearly, the PVops Linux kernel encounters some issues with Ethernet performance, since the privileged guest can achieve only 59% of the vanilla Linux case. As mentioned in Section 2, Xen VMs are offered a virtual Ethernet device via the netfront driver. Unfortunately, in the default configuration, this device does not feature specific optimizations or accelerators and, as a result, its performance is limited to 416MiB/sec (56% of the PVops case)¹.
¹ For details on raw Ethernet performance in the Xen PVops kernel, see
http://lists.xensource.com/archives/html/xen-users/2010-04/msg00577.html
[Fig. 2: (a) socket buffer production rate with pktgen (throughput in MiB/sec) for vanilla Linux, the Xen driver domain, and a Xen VM; further panels plot bandwidth (MB/s) versus message size (128 B to 2 MB) for PKTGEN, ZERO COPY and TCP SOCKETS, and total CPU time for ZERO_COPY versus TCP_SOCK (panel (d) is referenced in the text).]
(a) CPU time breakdown for the driver domain (b) CPU time breakdown for the VM (send and receive path)
Fig. 3. CPU time breakdown for both the driver domain and the guests (user, nice, system, iowait, irq, softirq, steal_time; message sizes 4K to 256K; ZERO_COPY vs. TCP_SOCK)
CPU time for RDMA writes: In the HPC world, nodes participating in clusters require computational power in addition to low-latency and high-bandwidth communication. Our approach bypasses the TCP/IP stack, so we expect the CPU utilization of the system to be reduced. To validate this assumption, we examine the CPU time spent in both approaches. We measure the total CPU time when two VMs perform RDMA writes of varying message sizes over the network (TCP and ZERO COPY approach). In Figure 2(d), we plot the CPU time both for the driver domain and the VM. It is clearly shown that for 4K to 32K messages the CPU time of our framework is constant, as opposed to the TCP case, where the CPU time increases proportionally to the message size. When the 64K boundary is crossed, the TCP CPU time increases steeply due to intermediate switches and copies both in the VM and the driver domain. Our framework is able to sustain low CPU time in the privileged guest and almost negligible CPU time in the VM. To further investigate the sources of CPU time consumption, we plot the CPU time breakdown for the privileged guest and the VM in Figures 3(a) and 3(b), respectively.
In the driver domain (Figure 3(a)): (a) Our framework consumes more CPU time than the TCP case for 4KB and 8KB messages. This is because we use zero-copy only on the send side; on the receive side, we have to copy data from the socket buffer provided by the NIC to pages originating from the VM. (b) For messages larger than 32KB, our approach consumes at most 30% of the CPU time of the TCP case, reaching 15% (56 vs. 386) for 32K messages. (c) In our approach, system time is non-negligible, varying from 20% to 50% of the total CPU time spent in the privileged guest; this is because we have not yet implemented page swapping on the receive path. In the VM (Figure 3(b)): (d) Our approach consumes nearly constant CPU time for almost all message sizes (varying from 30μsecs to 60μsecs). This constant time is due to the way the application communicates with the frontend (IOCTLs). In the TCP case, however, CPU time increases significantly for messages larger than 64K. This is expected, as all the protocol processing (TCP/IP) is done inside the VM; system time is almost 60% of the total VM CPU time for 256K messages, reaching 75% for 128K. (e) Our approach exhibits negligible softirq
time (apparent mostly in the receive path). This is due to the fact that the
privileged guest is responsible for placing data originating from the network into
pages we have already pinned and granted. On the other hand, the TCP case
consumes softirq time as data climb up the TCP/IP network stack to reach the
application's socket.
5 Conclusions
Coexisting Scheduling Policies Boosting I/O
Virtual Machines
1 Introduction
Currently, cloud computing infrastructures feature powerful VM containers that
host numerous VMs running applications that range from CPU– / memory–
intensive to streaming I/O, random I/O, real-time, low-latency and so on. VM
containers are obliged to multiplex these workloads and maintain the desirable
Quality of Service (QoS), while VMs compete for a time-slice. However, running
VMs with contradicting workloads within the same VM container leads to sub-
optimal resource utilization and, as a result, to degraded system performance.
For instance, the Xen VMM [1], under a moderate degree of overcommitment (4
vCPUs per core), favors CPU–intensive VMs, while network I/O throughput is
capped at 40%.
In this work, we argue that by altering the scheduling concept on a busy VM
container, we can optimize the system's overall performance. We propose a frame-
work that provides multiple coexisting scheduling policies tailored to the work-
loads’ needs. Specifically, we realize the following scenario: the driver domain
is decoupled from the physical CPU sets on which the VMs execute and does
not get preempted. Additionally, VMs are deployed on CPU groups according
to their workloads, providing isolation and effective resource utilization despite
their competing demands.
We implement this framework in the Xen paravirtualized environment. Based
on an 8-core platform, our approach achieves 2.3 times faster I/O service, while
sustaining no less than 80% of the default overall CPU-performance.
2 Background
To comprehend how scheduling is related to I/O performance, in this section we
briefly describe the system components that participate in an I/O operation.
Hypervisor. The Xen VMM is a lightweight hypervisor that allows multiple
VM instances to co-exist in a single platform using ParaVirtualization (PV). In
the PV concept, OS kernels are aware of the underlying virtualization platform.
Additionally, I/O is handled by the driver domain, a privileged domain having
direct access to the hardware.
Breaking down the I/O path. Assuming for instance that a VM application
transmits data to the network, the following actions will occur: i) Descending
the whole network stack (TCP/IP, Ethernet), the netfront driver (residing in the
VM) acquires a socket buffer with the appropriate headers containing the data.
ii) The netfront pushes a request on the ring (preallocated shared memory) and
notifies the netback driver (residing in the driver domain) with an event (a virtual
IRQ) that there is a pending send request that it must service. iii) The netback
pushes a response to the ring and enqueues the request to the actual driver. iv)
The native device driver, which is authorized to access the hardware, eventually
transmits the packet to the network.
In PV, multiple components, residing in different domains, take part in an
I/O operation (frontend: VM, backend–native driver: driver domain). The whole
transaction stalls until pending tasks (events) are serviced; therefore the targeted
vCPU has to be running. This is where the scheduler interferes.
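The interplay just described can be pictured with a toy model. The following Python sketch is only a conceptual illustration (queue-based stand-ins for the shared ring and the event channel, with hypothetical names); it is not the actual Xen ring protocol or driver code.

from collections import deque

class SharedRing:
    """Toy stand-in for the preallocated shared-memory ring."""
    def __init__(self):
        self.requests = deque()   # netfront -> netback
        self.responses = deque()  # netback -> netfront

def netfront_transmit(ring, notify_backend, skb):
    # i) the socket buffer (skb) already carries the headers and the data
    # ii) push a send request on the ring and raise a virtual IRQ (event)
    ring.requests.append({"op": "tx", "data": skb})
    notify_backend()              # event: "there is a pending send request"

def netback_service(ring, native_driver_queue):
    # iii) consume pending requests, push responses, hand packets to the
    #      native driver; runs only when the driver domain's vCPU is scheduled
    while ring.requests:
        req = ring.requests.popleft()
        native_driver_queue.append(req["data"])  # iv) actual NIC transmit
        ring.responses.append({"op": "tx", "status": "ok"})

# Example: one transmission stalls until netback_service() gets CPU time.
ring, nic = SharedRing(), deque()
netfront_transmit(ring, notify_backend=lambda: None, skb=b"packet bytes")
netback_service(ring, nic)
print(len(nic), "packet(s) handed to the native driver")

In the model, as in the real path, nothing progresses until the backend routine actually runs, which is exactly where the scheduler's decisions matter.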
The Credit Scheduler. Currently, Xen’s default scheduler is the Credit sched-
uler and is based on the following algorithm: (a) Every physical core has a local
run-queue of vCPUs eligible to run. (b) The scheduler picks the head of the
run-queue to execute for a time-slice of 30ms at maximum. (c) The vCPU is
able to block and yield the processor before its time-slice expires. (d) Every
10ms, accounting occurs, which debits credits from the running domain. (e) A new
allocation of credits occurs when all domains have consumed their own. (f) A
vCPU is inserted into the run-queue after all vCPUs with greater or equal priority.
(g) vCPUs can be in one of 4 different priorities (ascending): IDLE, OVER,
UNDER, BOOST. A vCPU is in the OVER state when it has all its credits
consumed. BOOST is the state a vCPU enters when it is woken up. (h) When a
run-queue is empty or full with OVER / IDLE vCPUs, Credit migrates neigh-
boring UNDER / BOOST vCPUs to the specific physical core (load-balancing).
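The following Python sketch is a highly simplified illustration of the mechanics listed above (per-CPU run-queue, 30 ms slices, 10 ms accounting, BOOST on wake-up); the names are hypothetical, and load balancing, SMP details and the exact credit arithmetic are omitted, so it should not be read as Xen's implementation.

TIME_SLICE_MS = 30      # maximum slice per pick
ACCOUNT_MS    = 10      # accounting period that debits credits

# priorities in ascending order, as in the description above
PRIO = {"IDLE": 0, "OVER": 1, "UNDER": 2, "BOOST": 3}

class VCPU:
    def __init__(self, name, credits):
        self.name, self.credits, self.prio = name, credits, "UNDER"

def insert(runq, vcpu):
    # a vCPU goes after all vCPUs with greater or equal priority
    i = 0
    while i < len(runq) and PRIO[runq[i].prio] >= PRIO[vcpu.prio]:
        i += 1
    runq.insert(i, vcpu)

def account(vcpu, ran_ms):
    # debit credits for the time actually consumed
    vcpu.credits -= ran_ms
    if vcpu.credits <= 0:
        vcpu.prio = "OVER"

def wake(runq, vcpu):
    vcpu.prio = "BOOST"          # woken vCPUs (e.g. on an I/O event) get boosted
    insert(runq, vcpu)

def schedule(runq, ran_ms=TIME_SLICE_MS):
    vcpu = runq.pop(0)           # pick the head of the local run-queue
    account(vcpu, ran_ms)
    insert(runq, vcpu)           # re-queue according to its (new) priority
    return vcpu

# Example: an I/O vCPU that wakes up goes ahead of CPU-bound vCPUs.
runq = []
for v in (VCPU("cpu0", 300), VCPU("cpu1", 300)):
    insert(runq, v)
io = VCPU("io0", 300)
wake(runq, io)
print([v.name for v in runq])    # io0 (BOOST) sits before the UNDER vCPUs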
3 Motivation
3.1 Related Work
Recent advances in virtualization technology have minimized overheads associ-
ated with CPU sharing when every vCPU is assigned to a physical core. As a
result, CPU–bound applications achieve near-native performance when deployed
in VM environments. However, I/O is a completely different story: intermediate
virtualization layers impose significant overheads when multiple VMs share net-
work or storage devices [6]. Numerous studies present significant optimizations
on the network I/O stack using software [5,8] or hardware approaches [3].
These studies target the HPC case, where no CPU over-commitment occurs.
However, in service-oriented setups, vCPUs that belong to a vast number of
VMs and run different types of workloads need to be multiplexed. In such a
case, scheduling plays an important role.
Ongaro et al. [7] examine Xen's Credit scheduler and expose its vulner-
abilities from an I/O performance perspective. The authors evaluate two basic
existing features of Credit and propose run-queue sorting according to the credits
each VM has consumed. Contrary to our approach, which is based on multiple
co-existing scheduling policies, the authors in [7] optimize an existing, unified scheduler to
favor I/O VMs.
Cucinotta et al. [2], in the IRMOS¹ project, propose a real-time scheduler to fa-
vor interactive services. Such a scheduler could be one of those coexisting in our
concept.
Finally, Hu et al. [4] propose a dynamic partitioning scheme using VM mon-
itoring. Based on run–time I/O analysis, a VM is temporarily migrated to an
isolated core set, optimized for I/O. The authors evaluate their framework using
one I/O–intensive VM running concurrently with several CPU–intensive ones.
Their findings suggest that more insight should be obtained on the implications
of co-existing CPU– and I/O– intensive workloads. Based on this approach, we
build an SMP-aware, static CPU partitioning framework taking advantage of
contemporary hardware. As opposed to [4], we choose to bypass the run-time
profiling mechanism, which introduces overhead and whose accuracy cannot be
guaranteed.
Specifically, we use a monitoring tool to examine the bottlenecks that arise
when multiple I/O–intensive VMs co-exist with multiple CPU–intensive ones.
¹ More information is available at: http://www.irmosproject.eu
We then deploy VMs to CPU-sets (pools) with their own scheduler algorithm,
based on their workload characteristics. In order to put pressure on the I/O
infrastructure, we perform our experiments in a modern multi-core platform,
using multi-GigaBit network adapters. Additionally, we increase the degree of
overcommitment to apply for a real-world scenario. Overall, we evaluate the
benefits of coexisting scheduling policies in a busy VM container with VMs run-
ning various types of workloads. Our goal is to fully saturate existing hardware
resources and get the most out of the system’s performance.
[Fig. 1. Plots omitted: % overall performance vs. number of VMs; (a) CPU or I/O VMs (exclusive), (b) CPU and I/O VMs (concurrently)]
Exclusive CPU– or I/O–intensive VMs. Figure 1(a) shows that the overall
CPU operations per second are increasing until the number of vCPUs becomes
equal to the number of physical CPUs. This is expected as the Credit scheduler
provides fair time-sharing for CPU-intensive VMs. Additionally, we observe that
the link gets saturated but presents minor performance degradation at the max-
imum degree of overcommitment, as a result of bridging all network interfaces
together while the driver domain is being scheduled in and out repeatedly.
Concurrent CPU– and I/O–intensive VMs. Figure 1(b) points out that when
CPU and I/O VMs run concurrently we experience a significant negative effect
on the link utilization (less than 40%).
[Fig. 3. Plots omitted: % of maximum performance for the default, 2-pool and 3-pool setups with 3+3, 9+9 and 15+15 (I/O+CPU) VMs; (a) CPU overall performance, (b) I/O overall performance]
Taking a look back at Figure 2, we observe that the latency between domU and
dom0 (dark area) is eliminated. That is because dom0 never gets preempted and
achieves maximum responsiveness. Moreover, the time lost in the other direction
(light area) is apparently reduced; more data rate is available and I/O domains
can batch more work.
Figure 3 plots the overall performance (normalized to the maximum observed),
as a function of concurrent CPU and I/O VMs. The first bar (dark area) plots
the default setup (Section 3.2), whereas the second one (light area) plots the ap-
proach discussed in this Section. Figure 3(b) shows that even though the degree
of over-commitment is maximum (4 vCPUs per physical core) our framework
achieves link saturation. On the other hand, CPU performance drops propor-
tionally to the degree of over-commitment (Figure 3(a)).
The effect on CPU VMs is attributed to the driver domain's ability to process
I/O transactions in a more effective way; more data rate is available and I/O
VMs get notified more frequently; according to Credit’s algorithm I/O VMs get
boosted and eventually steal time-slices from the CPU VMs.
Trying to eliminate the negative effect on the CPU-intensive VMs, we
5 Discussion
5.1 Credit Vulnerabilities to I/O Service
The design so far has decoupled I/O– and CPU–intensive VMs, achieving iso-
lation and independence as well as near-optimal utilization of resources. But is the
Credit scheduler ideal for multiplexing only I/O VMs? We argue that slight
changes can benefit I/O service.
Time-slice allocation: Having achieved isolation between different [...]

We bind each NIC's interrupt handling (#VCPU=#NICs) to a dedicated physical
core by setting the smp_affinity of the corresponding IRQ. Thus the NIC's driver
does not suffer from interrupt processing contention. However, we observe that
after 2Gbps the links do not get saturated. Preliminary findings suggest that this
unexpected behavior is due to Xen's network path. Nevertheless, this approach is
applicable to cases where the driver domain or other stub-domains have demanding
responsibilities, such as multiplexing accesses to shared devices.

[Fig. 6. Multiple GigaBit NICs: link utilization (%) at 1-4 Gbps, comparing the "1 VCPU" and "#VCPU=#NICs" configurations]
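On Linux, such a binding is typically expressed by writing a CPU bitmask into /proc/irq/&lt;irq&gt;/smp_affinity; the snippet below is a minimal sketch with a hypothetical IRQ number and CPU index.

def pin_irq_to_cpu(irq: int, cpu: int) -> None:
    """Bind interrupt `irq` to a single CPU via /proc/irq/<irq>/smp_affinity."""
    mask = 1 << cpu                       # CPU bitmask, e.g. CPU 2 -> 0x4
    with open(f"/proc/irq/{irq}/smp_affinity", "w") as f:
        f.write(f"{mask:x}\n")

# Hypothetical example: steer the NIC's IRQ 58 to physical CPU 2 (needs root).
# pin_irq_to_cpu(58, 2)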
6 Conclusions
In this paper we examine the impact of VMM scheduling in a service-oriented
VM container and argue that co-existing scheduling policies can benefit the
overall resource utilization when numerous VMs run contradicting types of work-
loads. VMs are grouped into sets based on their workload characteristics, under
scheduling policies tailored to the needs of each group. We implement our
approach in the Xen virtualization platform. In a moderate overcommitment
scenario (4 vCPUs per physical core), our framework is able to achieve link satura-
tion, compared to less than 40% link utilization in the default case, while CPU-
intensive workloads sustain 80% of their default performance.
Our future agenda consists of exploring exotic scenarios using different types
of devices shared across VMs (multi-queue and VM-enabled multi-Gbps NICs,
References
1. Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer,
R., Pratt, I.A., Warfield, A.: Xen and the Art of Virtualization. In: SOSP 2003:
Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles,
pp. 164–177. ACM, New York (2003)
2. Cucinotta, T., Giani, D., Faggioli, D., Checconi, F.: Providing Performance Guaran-
tees to Virtual Machines Using Real-Time Scheduling. In: Guarracino, M.R., Vivien,
F., Träff, J.L., Cannatoro, M., Danelutto, M., Hast, A., Perla, F., Knüpfer, A.,
Di Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010. LNCS, vol. 6586,
pp. 657–664. Springer, Heidelberg (2011)
3. Dong, Y., Yu, Z., Rose, G.: SR-IOV networking in Xen: architecture, design and
implementation. In: WIOV 2008: Proceedings of the First Conference on I/O Vir-
tualization, p. 10. USENIX Association, Berkeley (2008)
4. Hu, Y., Long, X., Zhang, J., He, J., Xia, L.: I/o scheduling model of virtual machine
based on multi-core dynamic partitioning. In: IEEE International Symposium on
High Performance Distributed Computing, pp. 142–154 (2010)
5. Menon, A., Cox, A.L., Zwaenepoel, W.: Optimizing network virtualization in Xen.
In: ATEC 2006: Proceedings of the Annual Conference on USENIX 2006 Annual
Technical Conference, p. 2. USENIX Association, Berkeley (2006)
6. Nanos, A., Goumas, G., Koziris, N.: Exploring I/O Virtualization Data Paths for
MPI Applications in a Cluster of VMs: A Networking Perspective. In: Guarracino,
M.R., Vivien, F., Träff, J.L., Cannatoro, M., Danelutto, M., Hast, A., Perla, F.,
Knüpfer, A., Di Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010. LNCS,
vol. 6586, pp. 665–671. Springer, Heidelberg (2011)
7. Ongaro, D., Cox, A.L., Rixner, S.: Scheduling i/o in virtual machine monitors. In:
Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on
Virtual Execution Environments, VEE 2008, pp. 1–10. ACM, New York (2008)
8. Ram, K.K., Santos, J.R., Turner, Y.: Redesigning xen’s memory sharing mechanism
for safe and efficient I/O virtualization. In: WIOV 2010: Proceedings of the 2nd
Conference on I/O Virtualization, p. 1. USENIX Association, Berkeley (2010)
PIGA-Virt: An Advanced Distributed MAC
Protection of Virtual Systems
1 Introduction
A virtualization layer, i.e., a hypervisor, brings isolation between multiple sys-
tems, i.e., Virtual Machines (VMs), hosted on the same hardware. The hypervisor
reduces the interference between the VMs. But virtualization is not a security
guarantee. It increases the attack surface and adds new attack vectors. As a
consequence, virtualization must not be the sole technology for providing
isolation within a cloud. For example, in [14], the isolation is broken through
drivers that allow access to the underlying hardware from inside a VM. In-
deed, these drivers can access the physical memory without passing through the
kernel or the hypervisor, thus bypassing the protection layers. Preventing such
attacks requires guaranteeing the integrity of 1) the VM, 2) the hypervisor and
3) the underlying Operating System. Moreover, a VM can produce information
flows that reach another VM in order to break some of the requested secu-
rity objectives. Accordingly, a VM can attack the integrity, confidentiality and
availability of other VMs that run on the same hardware.

¹ http://www.agence-nationale-recherche.fr/magazine/actualites/detail/resultats-du-defi-sec-si-systeme-d-exploitation-cloisonne-securise-pour-l-internaute/
With the cloud paradigm, the data and the entire computing infrastructure are
outside the control of the end users. Thus, security is one of the top concerns
in clouds [9]. Indeed, with a cloud infrastructure relying on virtualization, the
hardware is shared between multiple users (multi-tenancy) and these users can
be adversaries. Moreover, as explained in [13,4], various security features and
functionalities are needed to support the criticality of missions within a cloud.
Thus, the major goal of this work is to increase the security assurance by
1) hardening the isolation of the virtualization layer, 2) providing a mission-aware
security component for the virtualization layer, and 3) balancing the security
with the performance.
The first section defines the precise objectives of our solution. Second, the
paper describes the different protection modes supported by PIGA-Virt. Third,
it describes how PIGA-Virt enforces the protection. Fourth, it gives the efficiency
of the different modes of the versatile PIGA-Virt solution. Fifth, it describes
related work. Finally, the paper concludes by outlining future work.
2 Motivation
In-depth end-to-end Mandatory Protection Inside a VM
The mandatory control minimizes the privileges that a process (a subject) has
regarding the various objects. But existing MAC approaches mainly deal with
direct flows. A first set of objectives considers the control of the flows inside a
given virtual machine.
The first purpose is thus to control indirect information flows transiting through
intermediate resources or processes (covert channels). The second purpose is to
ease the definition of a large set of security objectives, such as separation of priv-
ileges or indirect accesses to the information through covert channels. The third
purpose is to provide a mandatory protection controlling all the levels (in-depth
protection) of a virtual machine, such as processes, graphic interface, network,
etc. Our fourth purpose is to provide an efficient mandatory protection that
guarantees all the supported security properties with satisfying performance in
the context of the virtual machines sharing the same host. In contrast with our
previous works [5], the fourth objective is addressed in this paper, since perfor-
mance needs to be improved for hosting multiple VMs on the same machine.
In-depth end-to-end Protection between VMs
With multiple VMs sharing the same host, the flows between the VMs
must also be efficiently controlled. For example, the protection must prevent a
malicious information flow, coming from vm1, from going to vm2 through
an NFS share. But some NFS flows between vm1 and vm2 must be allowed
while others must be denied. A second set of objectives specifically addresses the
control of the flows between the VMs. The corresponding objectives have not
been addressed in our previous works.
A fifth purpose is that the in-depth end-to-end protections must control the
flows between the different VMs. Such a protection consists in controlling 1) all
the indirect flows that are visible to the VMs and 2) all the indirect flows, using
intermediate entities of the target host, that are invisible to the VMs. A sixth
purpose is to have a protection independent of the target hypervisor. It is
an important issue since several kinds of hypervisor technologies can cohabit
in the context of a cloud. A seventh purpose is that the proposed protection
must be easy to configure. An eighth purpose is to ease the tuning of the protec-
tion efficiency. Thus, the administrator can tune the protection to balance the
performance with the security.
3 Architecture of PIGA-Virt
define confidentiality( sc1 in SCS, sc2 in SCO ) [ ¬(sc2 > sc1) AND ¬(sc2 >> sc1) ];
In contrast with our previous works [5], the major improvement is the way the
dedicated virtual machine computes the controls in the shared mode. The PIGA
shared decision engine computes independent data for each virtual machine. The
PIGA shared decision engine communicates with the different piga-kernels
available in the different virtual machines. When the PIGA shared decision engine
finds a real activity of a virtual machine matching the precomputed set of illegal
activities associated with that virtual machine, it sends a deny to the corre-
sponding piga-kernel in order to cancel the corresponding system call.
The following rule guarantees the confidentiality of the /etc files in vm1 re-
garding the users of vm2. In contrast with previous works, a virtual machine
identifier is added to the SELinux context. Thus, the administrator can easily
express the control of the flows between two different virtual machines. The corre-
sponding control is only available in the shared mode. The PIGA shared decision
engine breaks an illegal activity into different subactivities for the correspond-
ing virtual machines. When all the subactivities are detected, the PIGA shared
decision engine sends a deny for the last system call.
confidentiality(user_u:user_r:user.*_t:vm2, system_u:object_r:etc_t:vm1);
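As a toy illustration of this matching logic (hypothetical names and data; not PIGA's actual engine), the shared decision engine can be thought of as tracking which per-VM subactivities of a precomputed illegal activity have already been observed, and denying the final system call once the whole sequence is complete:

# An illegal activity precomputed by the engine, split into per-VM subactivities.
ILLEGAL_ACTIVITY = [
    ("vm1", "user_t read /etc/shadow"),
    ("vm1", "user_t write nfs_t"),
    ("vm2", "user_t read nfs_t"),      # last step: the system call to deny
]

class SharedDecisionEngine:
    def __init__(self, activity):
        self.activity = activity
        self.matched = 0               # how many subactivities seen so far

    def report(self, vm, syscall_desc):
        """Called by a piga-kernel when it intercepts a system call."""
        expected_vm, expected_op = self.activity[self.matched]
        if (vm, syscall_desc) == (expected_vm, expected_op):
            self.matched += 1
            if self.matched == len(self.activity):
                self.matched = 0
                return "DENY"          # cancel the last system call
        return "ALLOW"

engine = SharedDecisionEngine(ILLEGAL_ACTIVITY)
print(engine.report("vm1", "user_t read /etc/shadow"))  # ALLOW
print(engine.report("vm1", "user_t write nfs_t"))       # ALLOW
print(engine.report("vm2", "user_t read nfs_t"))        # DENY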
5 Experimentations
PIGA-Virt is integrated into a Scientific Linux 6 host with KVM as the hypervisor.
The PIGA-Kernel consists of a small patch that captures the SELinux hooks.
PIGA-Decision is available as a Java process. Experiments run on an AMD
Phenom II X4 965 processor with 8 GB of RAM.
Performances
Figure 2 presents several types of PIGA-Virt instances for protecting two Linux
virtual machines VM1 and VM2. PIGA-Virt runs a) without SELinux, b) with
the targeted SELinux policy, c) with the strict SELinux policy, d) in local mode
to detect the violations of the requested properties and e) in shared mode to
detect the violations, f) in local mode to prevent the intra violations and g) in
shared mode to prevent the inter violations. The local mode controls the flows
inside a vm whereas the shared mode controls the flows between vms. In contrast
with the PIGA-Virt detection, the PIGA-Virt protection enables to evaluate the
overhead introduced to prevent the violations of the required security properties.
Several benchmarks (open/close of files, executing the ls -lR command to
parse the whole file system, fork and file access latency) show the performances
of PIGA-Virt. As shown in Figure 2, the overhead due to the in-depth end-to-
end protection of PIGA-Virt is very low. The performance of the environment
without any MAC protection corresponds to the a) column, i.e., SELinux OFF.
The performance of the controls inside the VM is given by the f) column, i.e., local
protection. The performance of the controls between the VMs is given by the
g) column, i.e., shared protection. Sometimes the MAC protections improve the
performance, e.g., the ls command takes more time without any MAC protection
since the MAC protections minimize the accesses to the file system. In contrast
with SELinux, the PIGA-Virt protection either reduces or equals the overhead.
The only exception is the fork result, but this benchmark is very stressful since
it corresponds to an unusually large number (millions) of simultaneous fork operations.
Globally, the PIGA-Virt protection brings a very low overhead. In some cases,
compared with no MAC protection, PIGA-Virt even improves the performance. In contrast with
the local mode associated with our previous works, the shared mode factorizes
PIGA-Decision into a single instance. So, our new shared approach minimizes
the overhead due to the security mechanisms. Moreover, the shared mode uses
TCP connections between the VMs and PIGA-Decision. So, PIGA-Decision can
be run on a dedicated machine with high-performance capabilities, further
reducing the CPU consumption.
Protection Efficiency
In contrast with the local mode, i.e., our previous work, the shared mode is of
major importance in terms of security assurance. Indeed, it is the only way to
control the flows between the VMs sharing the same host.
Let us give a small example of the protection carried out by the confidentiality
property. For example, the following global illegal activity, with a subactivity
on vm1 (user_t reading /etc before writing into nfs_t) and a subactivity on
vm2 (user_t reading nfs_t), is a violation of confidentiality($sc1 := "user_u:
user_r:user.*_t:vm2", $sc2 := "system_u:object_r:etc_t:vm1"). In such a
case, the shared decision engine cancels the reading of nfs_t on vm2 since it is
the last system call of the global activity. Such an activity corresponds, for
example, to a malware, executed from the user environment of vm1, trans-
mitting the /etc/shadow password file to a distant virtual machine vm2. Thus, the
shared mode eases the protection against generic malicious activities such as
NFS threats. It makes it possible to prevent illegal flows while authorizing safe flows,
since it allows, for example, sysadm_t to transmit data through NFS to a distant user.
PIGA-Virt is very effective, since defining a safe SELinux policy is a tricky
task: a couple of SPL rules is simpler than writing a SELinux policy that 1)
includes millions of rules and 2) still does not control the transitive flows.
Mission Efficiency
PIGA-Virt is a mission-aware environment. First, it takes into account the se-
curity objectives i.e. the requested security properties. Second, it provides the
efficiency of each security objective.
Table 1 provides the efficiency of the different properties used during our
experimentation. For example, the efficiency of the confidentiality property
is 108,045. That value means that 108,045 illegal activities enabling a violation of the
confidentiality property exist within the considered SELinux policy. Such a value has
two meanings: 1) it gives the security enforcement of a property, i.e., the higher
the value is, the stronger the property enforces the security, and 2) it evaluates
the cost of the property, i.e., the higher the value is, the more processing
time is needed by PIGA-Virt.
Mission Tuning
As demonstrated, the stronger a security property is, the higher the overhead.
It is a well-known trade-off between security and performance. How-
ever, the important point here is that the administrator has a precise evaluation
of each security property. Thus, he can tune the security objectives to fit the
performance needs. For example, the dutiesseparationbash property is a
broad over-approximation of the separation of duties, which protects against malicious
scripts, since it prevents millions of potential vulnerabilities. In contrast with
dutiesseparationbash, the dutiesseparation property is narrower since it protects
only against binary executions, thus preventing only 208,240 illegal activities.
However, dutiesseparationbash and dutiesseparation do not tackle the same
security objective. In order to tune a property such as dutiesseparationbash,
several facilities are available.
PIGA-Virt eases the tuning of the security missions in different ways. Thus,
the administrator can adjust a security objective by 1) providing different secu-
rity contexts for the security canvas, 2) modifying the definition of the canvas,
and 3) modifying the SELinux policy. The latter solution is usually the tricki-
est. However, PIGA-Virt facilitates this task. Let us consider the confidentiality
property preventing illegal activities including:
user_u:user_r:user_t -(dbus{send_msg})-> user_u:user_r:user_dbusd_t;
user_u:user_r:user_dbusd_t -(file{write})-> user_u:object_r:user_home_t;
user_u:user_r:gpg_agent_t -(file{read})-> user_u:object_r:user_home_t;
user_u:user_r:gpg_agent_t -(file{write})-> system_u:object_r:nfs_t
Thus, the administrator sees that dbus and gpg are involved in that threat.
PIGA-Virt shows that the problem can be corrected with a separation of duties
for dbus or gpg. Thus, the tuning consists of a new SELinux policy including, for
example, separation of duties for dbus (e.g., removing the permission to write
into user_home_t for dbus_t).
Table 1. Efficiency of the security properties of the security mission

Property                  Efficiency
transitionsequence        101,533
notrereadconfigfile       2
ourreadconfigfile         4
dutiesseparation          208,240
dutiesseparationbash      194,629,680
confidentiality           108,045
integrity                 30
trustedpathexecution      8,715
trustedpathexecutionuser  204
trustedpathexecutionuser  26
consistentaccess          50,470
6 Related Works
A frequent approach is to use integrity verification technologies. [1] uses a dedi-
cated hypervisor to encrypt the data and the network transmission. GuardHype [2]
and [10] verify the integrity of the hypervisor itself or the integrity of the kernel
and critical applications. But these approaches are limited to statically verifying the
integrity of an image, a binary or a part of the memory; those solu-
tions do not control the access to the resources. Another approach is to
put Mandatory Access Control (MAC) outside of the VMs. Thus, the multiple virtual ma-
chines can be controlled consistently and safely using a single security monitor.
MAC [6] is the only way to guarantee security objectives. In [3], that approach
is limited to the control inside an untrusted virtual machine and cannot guar-
antee the isolation between the virtual machines. For example, sHype [12] brings
Type Enforcement to control the inter-VM communications. But sHype only
controls overt channels, thus missing implicit covert channels. Moreover, it does
not propose a way to express security properties. The MAC enforcement of the
hypervisor can be extended to the MAC enforcement inside the virtual machine.
Thus, [8] divides the overall policy into specialized policies (one per VM and one
for the interactions between VMs). For example, Shamon [7] is a prototype based
on Xen/XSM (inter-VM MAC) and SELinux (OS-level MAC) to control applica-
tions running on different VMs. As explained in [11], the common way to analyze
MAC policies is to search for illegal information flows inside them. In order to re-
duce the complexity, [11] analyses each layer (hypervisor, then OS). The analysis
is too complex and the illegal flows cannot be blocked in real time. So exist-
ing solutions cannot control in real time advanced security properties associated
with multiple information flows between the different virtual machines.
7 Conclusion
This paper presents the first mission-aware security approach for VMs that sup-
ports a large range of security objectives and provides a precise evaluation of the
security efficiency. In contrast with existing approaches, it provides real-time
protection of advanced security objectives with a very low overhead. Moreover,
PIGA-Virt eases the work of the administrator, since around ten security rules
are generally sufficient to efficiently control the flows between the different VMs
sharing the same host. Finally, PIGA-Virt is an extensible approach. Indeed, it
requires only security contexts associated with the different system resources.
For example, a Windows 7 module is available providing consistent security la-
bels that can be processed through PIGA-Virt. It is an excellent way to improve
the security of heterogeneous VMs such as required in cloud infrastructures. Fu-
ture work deals with the distributed scheduling of VMs as a security mission-aware
service providing Security as a Service ([Sec]aaS) in the context of anything-as-a-
Service approaches (XaaS clouds).
References
1. BitVisor 1.1 Reference Manual (2010), http://www.bitvisor.org/
2. Carbone, M., Zamboni, D., Lee, W.: Taming virtualization. IEEE Security and
Privacy 6(1), 65–67 (2008)
3. Chen, X., Garfinkel, T., Christopher Lewis, E., Subrahmanyam, P., Waldspurger,
C.A., Boneh, D., Dwoskin, J., Ports, D.R.K.: Overshadow: a virtualization-based
approach to retrofitting protection in commodity operating systems. SIGOPS
Oper. Syst. Rev. 42, 2–13 (2008)
4. Jaeger, T., Schiffman, J.: Outlook: Cloudy with a chance of security challenges and
improvements. IEEE Security and Privacy 8, 77–80 (2010)
5. Briffaut, C.T.J., Peres, M.: A dynamic end-to-end security for coordinating multi-
ple protections within a linux desktop. In: Proceedings of the 2010 IEEE Workshop
on Collaboration and Security (COLSEC 2010), pp. 509–515. IEEE Computer So-
ciety, Chicago (2010)
6. Loscocco, P.A., Smalley, S.D., Muckelbauer, P.A., Taylor, R.C., Turner, S.J., Far-
rell, J.F.: The Inevitability of Failure: The Flawed Assumption of Security in Mod-
ern Computing Environments. In: Proceedings of the 21st National Information
Systems Security Conference, Arlington, Virginia, USA, pp. 303–314 (October
1998)
7. McCune, J.M., Jaeger, T., Berger, S., Caceres, R., Sailer, R.: Shamon: A sys-
tem for distributed mandatory access control. In: Proceedings of the 22nd Annual
Computer Security Applications Conference, pp. 23–32. IEEE Computer Society,
Washington, DC (2006)
8. Payne, B.D., Sailer, R., Cáceres, R., Perez, R., Lee, W.: A layered approach to
simplified access control in virtualized systems. SIGOPS Oper. Syst. Rev. 41,
12–19 (2007)
9. Pearson, S., Benameur, A.: Privacy, security and trust issues arising from cloud
computing. In: Proceedings of the 2010 IEEE Second International Conference on
Cloud Computing Technology and Science, CLOUDCOM 2010, pp. 693–702. IEEE
Computer Society, Washington, DC (2010)
10. Quynh, N.A., Takefuji, Y.: A real-time integrity monitor for xen virtual machine.
In: ICNS 2006: Proceedings of the International Conference on Networking and
Services, p. 90. IEEE Computer Society, Washington, DC (2006)
11. Rueda, S., Vijayakumar, H., Jaeger, T.: Analysis of virtual machine system poli-
cies. In: Proceedings of the 14th ACM Symposium on Access Control Models and
Technologies, SACMAT 2009, pp. 227–236. ACM, New York (2009)
12. Sailer, R., Jaeger, T., Valdez, E., Caceres, R., Perez, R., Berger, S., Griffin, J.L.,
Van Doorn, L., Center, I.B.M.T.J.W.R., Hawthorne, N.Y.: Building a MAC-based
security architecture for the Xen open-source hypervisor. In: 21st Annual Computer
Security Applications Conference, p. 10 (2005)
13. Sandhu, R., Boppana, R., Krishnan, R., Reich, J., Wolff, T., Zachry, J.: Towards
a discipline of mission-aware cloud computing. In: Proceedings of the 2010 ACM
Workshop on Cloud Computing Security Workshop, CCSW 2010, pp. 13–18. ACM,
New York (2010)
14. Wojtczuk, R.: Subverting the Xen hypervisor. BlackHat USA (2008)
An Economic Approach for Application QoS
Management in Clouds
1 Introduction
2 Architecture
In this Section we describe the architecture of our solution. We detail the main
components and the interaction between them. We then describe the current im-
plementation of the proportional-share allocation algorithm and the assumptions
that we make regarding the infrastructure’s virtual currency management.
instance of the virtual cluster and the manager’s willingness to pay for the al-
located resources, s (spending rate). After its virtual cluster is allocated, the
manager starts its application. During the application execution the manager
monitors the application and uses application performance metrics (e.g., number
of processed tasks/time unit), or system information (e.g., resource utilization
metrics) to adapt its resource request to its application performance goal. This
can be done in two different ways: (i) by changing the virtual cluster size; (ii)
by changing the spending rate for the virtual cluster.
The resource controller allocates a resource fraction (e.g., 10% CPU or 1MB of
memory) on a physical node for each virtual machine instance of a virtual clus-
ter. This allocation is enforced by a Virtual Machine Monitor (e.g., Xen [1])
and is proportional to the manager's spending rate and inversely proportional
to the current resource price. If the allocation becomes lower than the min-
imum resource allocation requested by the manager, then the virtual cluster is
preempted.
Resource Allocation. The resource controller recomputes the allocations for all
running virtual machines periodically. At the beginning of each time period, the
resource controller aggregates all newly received and existing requests and dis-
tributes the total infrastructure capacity between them through a proportional-
share allocation rule. This rule is applied as follows.
We consider that the infrastructure has a total capacity C that needs to be shared
between M virtual machine instances. Each virtual machine j receives a resource
amount defined as $a_j = \frac{s_j}{P} \cdot C$, where $s_i$ is the spending rate per virtual machine
and $P = \sum_{i=1}^{M} s_i$ is the total resource price. However, because the capacity of the
infrastructure is partitioned between different physical nodes, after computing
the allocations we may reach a situation in which we cannot accommodate all the
virtual machines on the physical nodes. Thus, instead of computing the allocation
from the total infrastructure capacity, we compute the allocation considering
the node capacity and we try to minimize the resulting error. For simplicity we
assume that the physical infrastructure is homogeneous and we treat only the
CPU allocation case.
The algorithm applied by the resource controller has the following steps. To
ensure that the allocation of the virtual machine instances belonging to the
same group is uniform, the spending rate of the group is distributed between
the virtual machine instances in an equal way. Then, the instances are sorted
in descending order by their spending rates s. Afterwards, each virtual machine
instance from each virtual cluster is assigned to the node with the smallest price
$p = \sum_{k=1}^{m} s_k$, given that there are m instances already assigned to it. This ensures
that the virtual machine gets the highest allocation for the current iteration,
fully utilizing the resources and minimizing the allocation error. The resource
allocations for the current period are computed by iterating through all nodes
and applying the proportional-share rule locally.
Finally, the application managers are charged with the cost of using resources
for the previous period, $c = \frac{s}{M} \sum_{i=1}^{M} u_i$, where $u_i$ represents the total amount of
resource used by virtual machine instance i belonging to the virtual cluster of size M.
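The steps above can be summarized in a short sketch. The following Python code is a simplified illustration of the greedy placement and per-node proportional-share rule under the stated assumptions (homogeneous nodes, CPU only); the function and variable names are hypothetical and it is not the controller's actual implementation.

def allocate(node_capacity, num_nodes, clusters):
    """clusters: {name: (group_spending_rate, num_instances)}."""
    # spread each group's spending rate equally over its instances
    instances = []
    for name, (s, m) in clusters.items():
        instances += [(name, s / m)] * m
    # sort instances by descending spending rate
    instances.sort(key=lambda x: -x[1])

    nodes = [[] for _ in range(num_nodes)]        # instances placed per node
    for inst in instances:
        # assign to the node with the smallest current price p = sum of rates
        target = min(nodes, key=lambda n: sum(rate for _, rate in n))
        target.append(inst)

    # apply the proportional-share rule locally on every node
    allocations = {}
    for node in nodes:
        price = sum(rate for _, rate in node)
        for name, rate in node:
            allocations.setdefault(name, []).append(
                node_capacity * rate / price if price else 0.0)
    return allocations

def charge(spending_rate, used):
    """Cost for the previous period: c = (s / M) * sum(u_i)."""
    return spending_rate / len(used) * sum(used)

# Hypothetical example: two virtual clusters sharing two 100-unit nodes.
alloc = allocate(node_capacity=100, num_nodes=2,
                 clusters={"rigid": (6.0, 2), "elastic": (2.0, 2)})
print(alloc)                       # per-instance CPU fractions per cluster
print(charge(6.0, used=[40, 35]))  # credits charged to the "rigid" manager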
Budget Management. The logic of distributing amounts of credits to application
managers is abstracted by the budget manager component of our architecture.
For now we consider that this entity applies a credit distribution policy that
follows the principle "use it or lose it". That is, each manager receives an
amount of credits at a large time interval. To prevent hoarding of credits, the
manager is not allowed to save any credits from one time interval (i.e., renew
period) to another. We also consider that this amount of credits can come from
a user's account, at a rate established by the user; we do not deal with the
management of the user's credits in the rest of this paper.
3 Use Cases
We illustrate how the agents can adapt either their spending rates or their virtual
cluster size to take advantage of the resource availability and to meet specific
application goals. We consider two examples: (i) a rigid application (e.g., MPI
job) that needs to execute before a deadline; (ii) an elastic application (i.e., bag-
of-tasks application) composed of a large number of tasks that can be executed
as soon as resources become available on the infrastructure; we assume that
a master component keeps the tasks in a queue and submits them to worker
components to be processed. For the first case the manager requires a virtual
cluster of fixed size to the resource controller and then it controls the virtual
cluster’s allocation by scaling its spending rate. For the second case the manager
requires a virtual cluster with an initial size, which is then scaled according to
the infrastructure's utilization level. Both application models are well known in the
scientific community and are representative of scientific clouds.
We analyzed the behavior of our designed managers by implementing and
evaluating our architecture in CloudSim [2]. We don’t consider the overheads
of virtual machine operations, as we only want to show the managers' behavior
and not the architecture's performance. As we focus on the proportional-share
of CPU resources, we consider that the memory capacity of the node is enough
to accommodate all submitted applications. We describe next the design and
behavior of each manager.
where α and β are configurable parameters that establish the scaling rate of the
bid and pr is the minimum price of using resources. To avoid depleting its budget
before the application completion, the manager limits its maximum submitted
bid to an amount bmax . For a more efficient use of the budget, we choose the
smallest time period between the remaining time to the budget renew and the
estimated remaining execution time of the application and we distribute the cur-
rent budget over it. The remaining execution time is estimated as the remaining
time to completion if the application continues to make $p_{current}$ progress each
scheduling period. Given a budget B, the manager computes bmax as follows:
has tasks in its queue, the manager expands the virtual cluster. To ensure that
the application’s tasks already submitted to virtual machines are processed as
fast as possible, the manager shrinks the virtual cluster when the existing virtual
machines don’t have enough CPU. The virtual cluster size (i.e., the number of
virtual machines), n, is updated as follows:
$$n = \begin{cases} n + \alpha, & \text{if } a_{avg} \ge T_a \text{ and remaining tasks to process} > 0 \\ n\beta, & \text{otherwise} \end{cases} \quad (4)$$
where α and β are configurable parameters that establish the scaling rate of the
virtual cluster size and $T_a$ is a threshold on the virtual cluster allocation.
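As a small illustration of this policy (hypothetical names, and with the shrink branch following the additive-grow / multiplicative-shrink reading of rule (4) above), the resize decision can be sketched as:

def resize(n, avg_alloc, remaining_tasks, alpha=2, beta=0.5, threshold=0.8):
    """Decide the next virtual-cluster size for the elastic manager.

    Grow by `alpha` workers while the CPU allocation stays above the threshold
    and tasks are still queued; otherwise shrink multiplicatively by `beta`.
    """
    if avg_alloc >= threshold and remaining_tasks > 0:
        return n + alpha
    return max(1, int(n * beta))

# Hypothetical example: plenty of CPU and queued tasks -> grow; starved -> shrink.
print(resize(8, avg_alloc=0.9, remaining_tasks=120))  # 10
print(resize(8, avg_alloc=0.4, remaining_tasks=120))  # 4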
4 Related Work
Many recent research efforts have focused on designing algorithms for dynamic
resource provisioning in shared platforms. However, few of them decouple the
[Plots omitted: curves compare Proportional-share vs. Static allocation]
Fig. 3. Application allocation in terms of CPU (a), number of virtual machines (b)
and datacenter utilization (c) in time
5 Conclusions
In this paper we presented a new architecture for managing applications and
resources in a cloud infrastructure. To allocate resources between multiple com-
petitive applications, this architecture uses a proportional-share economic model.
The main advantage of this model is the decentralization of the resource control.
Each application is managed by an independent agent that requests resources
by submitting bids to a resource controller. The manager’s bid is limited by its
given budget. To meet its application performance goals the manager can apply
different strategies to vary its bid in time. Through this approach, our archi-
tecture supports different types of applications and allows them to meet their
performance goals while having a simple resource management mechanism.
We validated our architecture by designing and simulating application man-
agers for rigid and elastic applications. We showed how managers can use simple
feedback-based policies to scale the allocation of their applications according
to a given goal. This opens the path towards designing more efficient managers
that optimize their budget management to meet several application performance
goals. For example, in the elastic application case, the manager would make de-
cisions to manage its budget and scale its virtual cluster based on an estimated
finish time of the tasks and a possible deadline. A further step would be then
to consider applications with time-varying resource demands. Optimizing the
resource allocation mechanism and adding support for multiple resource types
will also be our next focus. To improve the support of many application types,
we plan to add the possibility for applications to express placement preferences.
Finally, we plan to implement and validate our architecture in a real system.
References
[1] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer,
R., Pratt, I., Warfield, A.: Xen and the art of virtualization. In: Proceedings of
the Nineteenth ACM Symposium on Operating Systems Principles (SOSP 2003),
pp. 164–177. ACM Press, New York (2003)
[2] Calheiros, R.N., Ranjan, R., Beloglazov, A., De Rose, C.A.F., Buyya, R.:
Cloudsim: a toolkit for modeling and simulation of cloud computing environments
and evaluation of resource provisioning algorithms. Software: Practice and Expe-
rience 41(1), 23–50 (2011)
[3] Carrera, D., Steinder, M., Whalley, I., Torres, J., Ayguade, E.: Utility-based place-
ment of dynamic web applications with fairness goals. In: IEEE Network Opera-
tions and Management Symposium, pp. 9–16 (2008)
[4] Chun, B.N., Culler, D.E.: REXEC: A Decentralized, Secure Remote Execution
Environment for Clusters. In: Falsafi, B., Lauria, M. (eds.) CANPC 2000. LNCS,
vol. 1797, pp. 1–14. Springer, Heidelberg (2000)
[5] Lai, K., Rasmusson, L., Adar, E., Zhang, L., Huberman, B.: Tycoon: An imple-
mentation of a distributed, market-based resource allocation system. Multiagent
and Grid Systems 1(3), 169–182 (2005)
[6] Nguyen Van, H., Dang Tran, F., Menaud, J.-M.: SLA-aware virtual resource man-
agement for cloud infrastructures. In: 9th IEEE International Conference on Com-
puter and Information Technology (CIT 2009), pp. 1–8 (2009)
[7] Norris, J., Coleman, K., Fox, A., Candea, G.: Oncall: Defeating spikes with a free-
market application cluster. In: Proceedings of the First International Conference
on Autonomic Computing (2004)
[8] Sandholm, T., Lai, K.: Dynamic Proportional Share Scheduling in Hadoop.
In: Frachtenberg, E., Schwiegelshohn, U. (eds.) JSSPP 2010. LNCS, vol. 6253,
pp. 110–131. Springer, Heidelberg (2010)
[9] Sotomayor, B., Montero, R., Llorente, I., Foster, I.: An Open Source Solution for
Virtual Infrastructure Management in Private and Hybrid Clouds. IEEE Internet
Computing 13(5), 14–22 (2009)
[10] Stratford, N., Mortier, R.: An economic approach to adaptive resource manage-
ment. In: Proceedings of the The Seventh Workshop on Hot Topics in Operating
Systems, HOTOS 1999. IEEE Computer Society (1999)
[11] Tesauro, G., Kephart, J.O., Das, R.: Utility functions in autonomic systems. In:
ICAC 2004: Proceedings of the First International Conference on Autonomic Com-
puting, pp. 70–77. IEEE Computer Society (2004)
[12] Yeo, C.S., Buyya, R.: A taxonomy of market-based resource management systems
for utility-driven cluster computing. Softw. Pract. Exper. 36, 1381–1419 (2006)
Evaluation of the HPC Challenge Benchmarks
in Virtualized Environments
Piotr Luszczek, Eric Meek, Shirley Moore, Dan Terpstra, Vincent M. Weaver,
and Jack Dongarra
1 Introduction
With the advent of cloud computing, more and more workloads are being moved
to virtual environments. High Performance Computing (HPC) workloads have
been slow to migrate, as it has been unclear what kinds of trade-offs will occur
when running these workloads in such a setup [10,13].

(This material is based upon work supported in part by the National Science
Foundation under Grant No. 0910812 to Indiana University for "FutureGrid: An
Experimental, High-Performance Grid Test-bed." Partners in the FutureGrid project
include U. Chicago, U. Florida, San Diego Supercomputer Center - UC San Diego, U.
Southern California, U. Texas at Austin, U. Tennessee at Knoxville, U. of Virginia,
Purdue I., and T-U. Dresden.)

We evaluated the over-
heads of several different virtual environments and investigated how different
aspects of the system are affected by virtualization.
The virtualized environments we investigated were VMware Player, KVM and
VirtualBox. We used the HPC Challenge (HPCC) benchmarks [6] to evaluate
these environments. HPC Challenge examines performance of HPC architectures
using kernels with memory access patterns more challenging than those of the
High Performance LINPACK (HPL) benchmark used in the TOP500 list. The
tests include four local (matrix-matrix multiply, STREAM, RandomAccess and
FFT) and four global (High Performance Linpack – HPL, parallel matrix trans-
pose – PTRANS, RandomAccess and FFT) kernel benchmarks.
We ran the benchmarks on an 8-core system with Core i7 processors using
Open MPI. We ran on bare hardware and inside each of the virtual environments
for a range of problem sizes.
As expected, the HPL results had some overhead in all the virtual environ-
ments, with the overhead becoming more significant with larger problem sizes
and VMware Player having the least overhead. The latency results showed higher
latency in the virtual environments, with KVM being the highest.
We do not intend for this paper to provide a definitive answer as to which
virtualization technology achieves the highest performance results. Rather, we
seek to provide guidance on more generic behavioral features of various virtual-
ization packages and to further the understanding of VM technology paradigms
and their implications for performance-conscious users.
2 Related Work
There has been previous work that looked at measuring the overhead of HPC
workloads in a virtualized environment. Often these works measure timing external
to the guest, or, when they use the guest, they do not explain in great detail what
problems they encountered when trying to extract meaningful performance
measurements: the very gap we attempt to bridge with this paper.
Youseff et al. [14] measured HPC Challenge and ASCI Purple benchmarks.
They found that Xen has better memory performance than real hardware, and
not much overhead.
Walters et al. [12] compared the overheads of VMWare Server (not ESX), Xen
and OpenVZ with Fedora Core 5, Kernel 2.6.16. They used NetPerf and Iozone
to measure I/O and the NAS Parallel benchmarks (both serial, OpenMP and
MPI) for HPC. They found Xen best in networking, OpenVZ best for filesystems.
On serial NAS, most are close to native, some even ran faster. For OpenMP runs,
Xen and OpenVZ are close to real hardware, but VMware has large overhead.
Danciu et al. [1] measured both high-performance and high-throughput work-
loads on Xen, OpenVZ, and Hyper-V. They used LINPACK and Iometer. For
timing, they used UDP packets sent out of the guest to avoid timer scaling is-
sues. They found that Xen ran faster than native on many workloads, and that
I/O did not scale well when running multiple VMs on the same CPU.
438 P. Luszczek et al.
Han et al. [2] ran Parsec and MPI versions of the NAS Parallel benchmarks
on Xen and kernel 2.6.31. They found that the overhead becomes higher when
more cores are added.
Huang et al. [4] ran the MPI NAS benchmarks and HPL inside of Xen. They
measured performance using the Xenoprof infrastructure and found most of the
overhead to be I/O related.
Li et al. [5] ran SPECjvm2008 on a variety of commercial cloud providers.
Their metrics include cost as well as performance.
Mei et al. [7] measured performance of webservers using Xenmon and Xentop.
Performance of OpenMP benchmarks was studied in detail and showed a
wide range of overheads that depended on the workload and parallelization
strategies [9].
3 Setup
3.1 Self-monitoring Results
When conducting large HPC experiments on a virtualized cluster, it would be
ideal if performance results could be gathered from inside the guest. Most HPC
workloads are designed to be measured that way, and doing so requires no change
to existing code.
Unfortunately, measuring time from within the guest has its own difficulties.
These are spelled out in detail by VMware [11]. Time as observed inside a guest
may not correspond at all to outside wall clock time. The virtualization software
will try its best to keep things relatively well synchronized, but, especially if
multiple guests are running, there are no guarantees.
On modern Linux, either gettimeofday() or clock_gettime() is used by
most applications to gather timing information. PAPI, for example, uses
clock_gettime() for its timing measurements. The C library translates these calls
into kernel calls and executes them either by system call, or by the faster VDSO
mechanism that has lower overhead. Linux has a timer layer that supports these
calls. There are various underlying timers that can be used to generate the tim-
ing information, and an appropriate one is picked at boot time. The virtualized
host emulates the underlying hardware and that is the value passed back to
the guest. Whether the returned value is “real” time or some sort of massaged
virtual time is up to the host.
A list of timers that are available can be found by looking at the file
/proc/timer_list.
There are other methods of obtaining timing information. The rdtsc call
reads a 64-bit time-stamp counter on all modern x86 chips. Usually this can be
read from user space. VMs like VMware can be configured to pass through the
actual system TSC value, allowing access to actual time. Some processor imple-
mentations stop or change the frequency of the TSC during power management
situations, which can limit the usefulness of this resource.
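A minimal sketch of reading the time-stamp counter directly, assuming an x86 target and the GCC/Clang intrinsic __rdtsc(); whether the returned value is the physical or a virtualized TSC depends on the hypervisor configuration, as discussed above.

```c
/* Sketch of reading the 64-bit TSC from user space (x86 only). */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
    uint64_t t0 = __rdtsc();
    /* ... code region to be timed ... */
    uint64_t t1 = __rdtsc();
    printf("elapsed cycles: %llu\n", (unsigned long long)(t1 - t0));
    return 0;
}
```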
The RTC (real time clock) can also generate time values and can be accessed
directly from user space. However, this timer is typically virtualized.
Others have attempted to obtain real wall clock time measurements by sending
messages outside the guest and measuring time there. For example, Danciu et
al. [1] send a UDP packet to a remote guest at the start and stop of timing,
which allows outside wallclock measurements. We prefer not to do this, as it
requires network connectivity to the outside world that might not be available
on all HPC virtualized setups.
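For illustration, a hedged sketch of the external-timing idea attributed to Danciu et al. [1]: the guest emits a small UDP datagram at the start and stop of a measurement, and a machine outside the VM records the wall-clock arrival times. The host address, port, and message format below are placeholders, not details from the cited work.

```c
/* Illustrative sketch only: send START/STOP markers to an external recorder. */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

static void send_marker(const char *host, int port, const char *msg)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, host, &addr.sin_addr);
    sendto(s, msg, strlen(msg), 0, (struct sockaddr *)&addr, sizeof addr);
    close(s);
}

int main(void)
{
    send_marker("192.0.2.1", 5555, "START");   /* placeholder recorder address */
    /* ... workload to be timed ... */
    send_marker("192.0.2.1", 5555, "STOP");
    return 0;
}
```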
For our measurements we use the values returned by the HPC Challenge
programs, which obtain time via MPI_Wtime(); this in turn invokes the gettimeofday() interface.
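A minimal sketch of the timing pattern described above, bracketing a measured region with MPI_Wtime(); the kernel being timed is elided.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double t0 = MPI_Wtime();
    /* ... benchmark kernel ... */
    double t1 = MPI_Wtime();
    printf("elapsed: %f s (timer resolution %g s)\n", t1 - t0, MPI_Wtick());
    MPI_Finalize();
    return 0;
}
```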
[Figure 1 plot: percentage difference versus LOG2(problem size) for bare metal, VMware Player, VirtualBox, and KVM.]
Fig. 1. Variation in percentage difference between the measured CPU and wall clock
times for MPIRandomAccess test of HPC Challenge. The vertical axis has been split
to offer a better resolution for the majority of data points.
We may readily observe nearly a 50% difference between the reported CPU and
wall clock times for small problem sizes on bare metal. The discrepancy dimin-
ishes for larger problem sizes that require longer processing times and render
low timer resolution much less of an issue. In fact, the difference drops below a
single percentage point for most problems larger than 2^22. Virtual environments
do not enjoy such consistent behavior though. For small problem sizes, both
timers diverge only by about 5%. Our understanding attributes this behavior to
a much greater overhead imposed by virtual environments on system calls such
as the timing primitives that require hardware access. More information on the
sources of this overhead may be found in Section 2.
[Figure 2 plot: percentage difference versus matrix size for VMware Player, VirtualBox, and KVM.]
Fig. 2. Variation in percentage difference between the measured wall clock times for
HPL (a computationally intensive problem) for ascending and descending orders of
problem sizes during execution
In summary, for a wide range
of problem sizes we observed nearly an order of magnitude difference between
the observed behavior of CPU and wall clock time.
Another common problem is a timer inversion whereby the system reports
that the process’ CPU time exceeds the wall clock time: T_CPU > T_wall. On bare
metal, the timer inversion occurs due to a large difference in relative accuracy
of both timers and is most likely to occur when measuring short periods of time
that usually result in large sampling error [3]. Inside the virtual machines we
tested, however, the observed likelihood of timer inversion is many-fold greater
than on bare metal. In addition, the
inversions occur even for some of the largest problem sizes we used: a testament
to a much diminished accuracy of the wall clock timer that we attribute to the
virtualization overhead.
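The following sketch (ours, not the benchmark's code) shows one way to observe a timer inversion around a measured region, comparing process CPU time from getrusage() with wall-clock time from gettimeofday().

```c
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

static double walltime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static double cputime(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

int main(void)
{
    double w0 = walltime(), c0 = cputime();
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)   /* stand-in for the measured work */
        x += 1e-9;
    double dw = walltime() - w0, dc = cputime() - c0;
    printf("wall=%.6f s  cpu=%.6f s%s\n", dw, dc,
           dc > dw ? "  <-- timer inversion" : "");
    return 0;
}
```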
We also found that the order in which experiments are run may lead to inconsistent
results due to, in our understanding, a change of state within the software underlying
the experiment. We understand that this
may be related to the fact that VMs maintain internal data structures that
evolve over time and change the virtualization overheads. To illustrate this phe-
nomenon, we ran HPL, part of the HPC Challenge suite, twice with the same
set of input problem sizes. In the first instance, we made the runs in ascending
order: starting with the smallest problem size and ending with the largest. Then
after a reboot, we used the descending order: the largest problem size came first.
On bare metal, the resulting performance shows no noticeable impact. This is
not the case inside the VMs as shown in Figure 2. We plot the percentage dif-
ference of times measured for the same input problem size for the ascending and
descending orders of execution:
( T_descending / T_ascending − 1 ) × 100% .    (2)
The most dramatic difference was observed for the smallest problem size of
1000. In fact, it was well over 50% for both VirtualBox and VMware Player. We
attribute this to the short running time of this experiment and, consequently,
a large influence of the aforementioned anomaly. Even if we were to dismiss
this particular data point, the effects of the accuracy drift are visible across the
remaining problem sizes but the pattern of influence is appreciably different.
For VMware Player, the effect gradually abates for large problem sizes after
attaining a local maximum of about 15% at 4000. KVM, on the contrary, shows
comparatively little change for small problem sizes and then increases drastically
halfway through, staying over 20% for nearly all of the remaining large problem sizes.
Finally, the behavior of VirtualBox initially resembles that of VMware Player,
and later the accuracy drift diminishes, fluctuating under 10%.
From a performance measurement standpoint this resembles the problem
faced when benchmarking file systems. The factors influencing the results include
the state of the memory file cache and the level of file and directory fragmenta-
tion [8]. In virtual environments, we also observe a persistence of state that
in the end influences the performance of the VMs and the results observed inside their
guest operating systems. Clean-boot results proved to be the most consistent in
our experiments. However, we recognize that, for most users, rebooting the VM
after each experiment might not be a feasible deployment requirement.
6 Results
In previous sections we have outlined the potential perils of accurate performance
evaluation of virtual environments. With these in mind, we attempt to show in
this section the performance results we obtained by running the HPC Challenge
suite across the tested VMs. We consider the ultimate goal for a VM to match
the performance of the bare metal run. In our performance plots then we use
a relative metric – the percentage fraction of bare metal performance that is
achieved inside a given VM:
( performance_VM / performance_bare_metal ) × 100% .    (3)
[Figure 3 plots: fraction of bare metal performance versus problem size for VMware Player, VirtualBox, and KVM.]
Fig. 3. Percentage of bare metal performance achieved inside VMware Player, Virtu-
alBox, and KVM for HPC Challenge’s HPL test. Each data bar shows the standard
deviation bar to indicate the variability of the measurement.
Due to space constraints we cannot present a detailed view of all of our results.
Instead, we focus on two tests from the HPC Challenge suite: HPL and MPI-
RandomAccess. By selecting these two tests we intend to contrast the behavior
of two drastically different workloads. The former represents codes that spend
most of their running time inside highly optimized library kernels that nearly op-
timally utilize the cache hierarchy and exhibit very low OS-level activity which
could include servicing TLB misses and network card interrupts required for
inter-process communication. Such workloads are expected to suffer little from
the introduction of a virtualization layer and our results confirm this as shown
in Figure 3. In fact, we observed that virtualization adds very little overhead for
such codes and the variability of the results caused by the common overheads is
relatively small across a wide range of input problem sizes. On the contrary, MPI-
RandomAccess represents workloads that exhibit high demand on the memory
subsystem including TLBs and require handling of very large counts of short
messages exchanged between processors. Each of these characteristics stresses
the bare metal setup and is expected to do so inside a virtualized environment.
Our results from Figure 4 fully confirm this prediction. The virtualization over-
head is very high and could reach 70% performance loss. Furthermore, a large
[Figure 4 plot: fraction of bare metal performance versus log(problem size) for the MPIRandomAccess test.]
Fig. 4. Percentage of bare metal performance achieved inside VMware Player, Virtu-
alBox, and KVM for HPC Challenge’s MPIRandomAccess test. Each data bar shows
the standard deviation bar to indicate the variability of the measurement.
Future work includes testing Xen and VMware ESXi to see how our observations carry over to these
technologies. They are much closer to the hardware, and we believe that this will
give them an advantage over the virtualization platforms presented in this paper.
References
1. Danciu, V.A., gentschen Felde, N., Kranzlmüller, D., Lindinger, T.: High-
performance aspects in virtualized infrastructures. In: 4th International DMTF
Academic Alliance Workshop on Systems and Virtualization Management,
pp. 25–32 (October 2010)
2. Han, J., Ahn, J., Kim, C., Kwon, Y., Choi, Y.-r., Huh, J.: The Effect of Multi-
core on HPC Applications in Virtualized Systems. In: Guarracino, M.R., Vivien,
F., Träff, J.L., Cannatoro, M., Danelutto, M., Hast, A., Perla, F., Knüpfer, A.,
Di Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010. LNCS, vol. 6586,
pp. 615–623. Springer, Heidelberg (2011)
3. Hines, S., Wyatt, B., Chang, J.M.: Increasing timing resolution for processes and
threads in Linux (2000) (unpublished)
4. Huang, W., Liu, J., Abali, B., Panda, D.: A case for high performance computing
with virtual machines. In: Proceedings of the 20th Annual International Conference
on Supercomputing (2006)
5. Li, A., Yang, X., Kandula, S., Zhang, M.: CloudCmp: comparing public cloud
providers. In: 10th Annual Conference on Internet Measurement (2010)
6. Luszczek, P., Bailey, D., Dongarra, J., Kepner, J., Lucas, R., Rabenseifner, R.,
Takahashi, D.: The HPC Challenge (HPCC) benchmark suite. In: SuperComputing
2006 Conference Tutorial (2006)
7. Mei, Y., Liu, L., Pu, X., Sivathanu, S.: Performance measurements and analy-
sis of network I/O applications in virtualized cloud. In: IEEE 3rd International
Conference on Cloud Computing, pp. 59–66 (August 2010)
8. Smith, K.A., Seltzer, M.: File layout and file system performance. Computer Sci-
ence Technical Report TR-35-94, Harvard University (1994)
9. Tao, J., Fürlinger, K., Marten, H.: Performance Evaluation of OpenMP Appli-
cations on Virtualized Multicore Machines. In: Chapman, B.M., Gropp, W.D.,
Kumaran, K., Müller, M.S. (eds.) IWOMP 2011. LNCS, vol. 6665, pp. 138–150.
Springer, Heidelberg (2011)
10. Tsugawa, M., Fortes, J.A.B.: Characterizing user-level network virtualization: per-
formance, overheads and limits. International Journal of Network Management
(2009), doi:10.1002/nem.733
11. Timekeeping in VMware Virtual Machines: VMware ESX 4.0/ESXi 4.0, VMware
workstation 7.0 information guide
12. Walters, J., Chaudhary, V., Cha, M., Guercio, S.J., Gallo, S.: A comparison of vir-
tualization technologies for HPC. In: 22nd International Conference on Advanced
Information Networking and Applications, pp. 861–868 (March 2008)
13. Younge, A.J., Henschel, R., Brown, J.T., von Laszewski, G., Qiu, J., Fox, G.C.:
Analysis of virtualization technologies for High Performance Computing environ-
ments. In: Proceedings of The Fourth IEEE International Conference on Cloud
Computing (CLOUD 2011), Washington Marriott, Washington DC, USA, July
4-9 (2011); technical Report (February 15, 2011), updated (April 2011)
14. Youseff, L., Wolski, R., Gorda, B., Krintz, C.: Paravirtualization for HPC Sys-
tems. In: Min, G., Di Martino, B., Yang, L.T., Guo, M., Rünger, G. (eds.) ISPA
Workshops 2006. LNCS, vol. 4331, pp. 474–486. Springer, Heidelberg (2006)
DISCOVERY, Beyond the Clouds
DIStributed and COoperative Framework to Manage
Virtual EnviRonments autonomicallY:
A Prospective Study
1 Introduction
Since the first proposals almost ten years ago [15,20], the use of virtualization technolo-
gies has radically changed the perception of distributed infrastructures. Through
the encapsulation of software layers into a new abstraction – the virtual machine
(VM) – users can run their own runtime environment without considering, in
most cases, the software and hardware restrictions formerly imposed by
computing centers. Relying on specific APIs, users can create, configure and up-
load their VMs to cloud computing providers, which in turn are in charge of
deploying and running the requested virtual environment (VE) on their physical
2 Related Work
– And finally the application of each reconfiguration that may occur concur-
rently throughout the infrastructure.
Given the lack of solutions that try to address these concerns, and con-
sidering that the LRT component is central to the DISCOVERY architecture,
we chose to start our investigation from it.
4 DISCOVERY in a Nutshell
of several VETs). Once the VET has been assigned to another node and once
all VMHs have been relocated (or suspended), the peer can properly leave.
Regarding reliability, two cases must be considered: the crash of VMs and the
crash of nodes. In the first case, reliability relies on (i) the snapshots of the
VE, which are periodically performed by the VEH, and (ii) the heartbeats that are
periodically sent by each VMH to the VEH. If the VEH does not receive one
of the VMHs’ heartbeats, it has to suspend all remaining VMs and resume the
whole VE from its latest consistent state. This process is similar to the starting
one: the missing VMHs are launched locally and the LRT is in charge of assigning
them throughout the DISCOVERY network. When the LRT completes this op-
eration, the VMHs receive a notification and in turn contact the VET to resume
all VMs from their latest consistent state. Before resuming each VM, the VET
checks whether it has to deliver the snapshot images to the nodes. Regarding
the crash of a node, the recovery process relies on DHT mechanisms used by the
DNT. When a VET starts a new VEH, the description of the associated VE is
stored in the DHT. Similarly, this description is updated/completed each time
the VEH snapshots the VE (mainly to update the locations of the snapshots).
In this way, when the failure of a node is detected (either by leveraging DHT
principles or simply by implementing a heartbeat approach between nodes), the
“neighbor” node is able to restart the VET and the associated VEHs from the
information that has been previously replicated through the DHT. Once all
VEHs have recovered, the VMH heartbeat mechanism is used either to reat-
tach the VMHs to the VEH or to resume the VE from its latest consistent state
if needed.
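To make the recovery logic above concrete, here is a purely illustrative sketch of a VEH-side heartbeat check; all names, the timeout value, and the stubbed actions are our assumptions, not DISCOVERY code.

```c
#include <stdio.h>
#include <time.h>

#define NB_VMH    4     /* assumed number of VMHs attached to this VEH   */
#define TIMEOUT_S 10    /* assumed heartbeat timeout in seconds          */

static time_t last_heartbeat[NB_VMH];

/* Stubbed actions standing in for the mechanisms described in the text. */
static void suspend_remaining_vms(void)   { puts("suspending remaining VMs"); }
static void resume_ve_from_snapshot(void) { puts("resuming VE from latest snapshot"); }

/* Called whenever a heartbeat arrives from VMH number 'vmh'. */
static void on_heartbeat(int vmh) { last_heartbeat[vmh] = time(NULL); }

/* Called periodically by the VEH. */
static void check_heartbeats(void)
{
    time_t now = time(NULL);
    for (int i = 0; i < NB_VMH; i++) {
        if (now - last_heartbeat[i] > TIMEOUT_S) {
            printf("heartbeat from VMH %d missing\n", i);
            suspend_remaining_vms();
            resume_ve_from_snapshot();
            return;
        }
    }
}

int main(void)
{
    for (int i = 0; i < NB_VMH; i++)   /* pretend all VMHs just reported in */
        on_heartbeat(i);
    check_heartbeats();                /* nothing missing yet */
    return 0;
}
```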
5 Conclusion
It is undeniable: virtualization technology has become a key element of dis-
tributed architectures. Although there have been considerable improvements, much
work continues to focus on virtualization internals, and only a few efforts
address the design and implementation concerns of the frameworks that leverage
virtualization technologies to manage distributed architectures. Considering the
growing size of infrastructures in terms of nodes and virtual machines, new
proposals relying on more autonomic and decentralized approaches should be
discussed to overcome the limitations of traditional server-centric solutions.
In this paper, we introduced the DISCOVERY initiative that aims at leverag-
ing recent contributions on virtualization technologies and previous distributed
operating systems proposals to design and implement a new kind of virtualiza-
tion framework ensuring the scalability, reliability and reactivity of the whole sys-
tem. Our proposal relies on micro-kernel approaches and peer-to-peer models.
Starting from the observation that each node may be seen as bare hardware provid-
ing basic functionality to manipulate VMs and monitor resource usage, we
designed an agent composed of several services that cooperate in managing virtual
environments throughout the DISCOVERY network.
Although the design may look simple at first sight, the implementation
of each block will require specific expertise. As an example, strong assumptions
on the internals of the Virtual Environments Tracker have been made (assum-
ing that the three layers – Image, Network and Reliability – are available). Each
of them requires deeper investigation with contributions from the scientific
community. Furthermore, the DISCOVERY framework should be extended to
other concerns, such as security and user quotas, to meet our objective of designing
and implementing a complete distributed OS for VMs. Again, this cannot be done
without involving the scientific community.
References
1. Nimbus is cloud computing for science, http://www.nimbusproject.org/
2. Openqrm, http://www.openqrm.com/
3. Openstack: The open source, open standards cloud. open source software to build
private and public clouds, http://www.openstack.org/
4. XenServer Administrator’s Guide 5.5.0. Tech. rep., Citrix Systems (February 2010)
5. Anedda, P., Leo, S., Gaggero, M., Zanetti, G.: Scalable Repositories for Virtual
Clusters. In: Lin, H.-X., Alexander, M., Forsell, M., Knüpfer, A., Prodan, R.,
Sousa, L., Streit, A. (eds.) Euro-Par 2009 Workshop. LNCS, vol. 6043, pp. 414–423.
Springer, Heidelberg (2010)
6. Anedda, P., Leo, S., Manca, S., Gaggero, M., Zanetti, G.: Suspending, migrating
and resuming hpc virtual clusters. Future Generation Computer Systems 26(8),
1063–1072 (2010)
7. Bolte, M., Sievers, M., Birkenheuer, G., Niehörster, O., Brinkmann, A.: Non-
intrusive virtualization management using libvirt. In: Proceedings of the Con-
ference on Design, Automation and Test in Europe, DATE 2010, pp. 574–579.
European Design and Automation Association, Leuven (2010)
8. Borthakur, D.: The Hadoop Distributed File System: Architecture and Design.
The Apache Software Foundation (2007)
9. Bose, S.K., Brock, S., Skeoch, R., Rao, S.: CloudSpider: Combining Replication
with Scheduling for Optimizing Live Migration of Virtual Machines Across Wide
Area Networks. In: 11th IEEE/ACM International Symposium on Cluster, Cloud,
and Grid Computing (CCGrid), Newport Beach, California, U.S.A (May 2011)
10. Bradford, R., Kotsovinos, E., Feldmann, A., Schiöberg, H.: Live wide-area migra-
tion of virtual machines including local persistent state. In: Proceedings of the
3rd International Conference on Virtual Execution Environments, VEE 2007, pp.
169–179. ACM, San Diego (2007)
11. Chanchio, K., Leangsuksun, C., Ong, H., Ratanasamoot, V., Shafi, A.: An efficient
virtual machine checkpointing mechanism for hypervisor-based hpc systems. In:
High Availability and Performance Computing Workshop, Denver, USA (2008)
12. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I.,
Warfield, A.: Live migration of virtual machines. In: Proceedings of the 2nd con-
ference on Symposium on Networked Systems Design & Implementation, NSDI
2005, vol. 2, pp. 273–286. USENIX Association, Berkeley (2005)
13. Claudel, B., Huard, G., Richard, O.: Taktuk, adaptive deployment of remote exe-
cutions. In: Proceedings of the 18th ACM International Symposium on High Per-
formance Distributed Computing, HPDC 2009. ACM, Munich (2009)
14. DMTF: Open Virtualization Format Specification (January 2010),
http://www.dmtf.org/standards/ovf
15. Figueiredo, R.J., Dinda, P.A., Fortes, J.A.B.: A case for grid computing on virtual
machines. In: Proceedings of the 23rd International Conference on Distributed
Computing Systems (ICDCS). IEEE, Washington, DC (2003)
16. Ghosh, R., Longo, F., Naik, V.K., Trivedi, K.S.: Quantifying resiliency of iaas
cloud. In: SRDS, pp. 343–347. IEEE (2010)
17. Hermenier, F., Lèbre, A., Menaud, J.M.: Cluster-wide context switch of virtual-
ized jobs. In: Proceedings of the 19th ACM International Symposium on High
Performance Distributed Computing, HPDC 2010. ACM, New York (2010)
18. Hines, M.R., Gopalan, K.: Post-copy based live virtual machine migration using
adaptive pre-paging and dynamic self-ballooning. In: Proceedings of the 2009 ACM
SIGPLAN/SIGOPS International Conference on Virtual Execution Environments,
VEE 2009, pp. 51–60. ACM, Washington, DC (2009)
19. Jin, H., Deng, L., Wu, S., Shi, X., Pan, X.: Live virtual machine migration with
adaptive, memory compression. In: IEEE International Conference on Cluster
Computing and Workshops, CLUSTER 2009, pp. 1–10 (September 2009)
20. Keahey, K.: From sandbox to playground: Dynamic virtual environments in the
grid. In: Proceedings of the 5th International Workshop on Grid Computing (2004)
21. Keahey, K., Tsugawa, M., Matsunaga, A., Fortes, J.: Sky computing. IEEE Internet
Computing 13, 43–51 (2009)
22. Kephart, J.O., Chess, D.M.: The vision of autonomic computing. Computer 36(1),
41–50 (2003)
23. Lagar-Cavilla, H.A., Whitney, J., Bryant, R., Patchin, P., Brudno, M., de Lara,
E., Rumble, S.M., Satyanarayanan, M., Scannell, A.: Snowflock: Virtual ma-
chine cloning as a first class cloud primitive. Transactions on Computer Systems
(TOCS) 19(1) (February 2011)
24. Lowe, S.: Introducing VMware vSphere 4, 1st edn. Wiley Publishing Inc., Indi-
anapolis (2009)
25. McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford,
J., Shenker, S., Turner, J.: OpenFlow: Enabling Innovation in Campus Networks.
SIGCOMM Comput. Commun. Rev. 38(2), 69–74 (2008)
26. McNett, M., Gupta, D., Vahdat, A., Voelker, G.M.: Usher: An Extensible Frame-
work for Managing Clusters of Virtual Machines. In: Proceedings of the 21st Large
Installation System Administration Conference (LISA) (November 2007)
27. Nicolae, B., Bresnahan, J., Keahey, K., Antoniu, G.: Going back and forth: Effi-
cient multi-deployment and multi-snapshotting on clouds. In: Proceedings of the
20th ACM International Symposium on High Performance Distributed Computing,
HPDC 2011. ACM, New York (2011)
28. Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L.,
Zagorodnov, D.: The eucalyptus open-source cloud-computing system. In: Pro-
ceedings of the 9th IEEE/ACM International Symposium on Cluster Computing
and the Grid, CCGRID, Washington, DC, USA (2009)
29. Quesnel, F., Lebre, A.: Operating Systems and Virtualization Frameworks: From
Local to Distributed Similarities. In: Cotronis, Y., Danelutto, M., Papadopoulos,
G.A. (eds.) PDP 2011: Proceedings of the 19th Euromicro International Confer-
ence on Parallel, Distributed and Network-Based Computing, pp. 495–502. IEEE
Computer Society, Los Alamitos (2011)
30. Rowstron, A., Druschel, P.: Pastry: Scalable, Decentralized Object Location, and
Routing for Large-Scale Peer-to-Peer Systems. In: Guerraoui, R. (ed.) Middleware
2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001)
31. Ruth, P., Rhee, J., Xu, D., Kennell, R., Goasguen, S.: Autonomic live adaptation
of virtual computational environments in a multi-domain infrastructure. In: IEEE
International Conference on Autonomic Computing, ICAC 2006 (June 2006)
32. Sotomayor, B., Montero, R., Llorente, I., Foster, I., et al.: Virtual infrastructure
management in private and hybrid clouds. IEEE Internet Computing 13(5) (2009)
33. Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.R., Kaashoek, M.F., Dabek,
F., Balakrishnan, H.: Chord: a scalable peer-to-peer lookup protocol for internet
applications. IEEE/ACM Transactions on Networking 11(1), 17–32 (2003)
34. Tsugawa, M., Fortes, J.: A virtual network (vine) architecture for grid computing.
In: International Parallel and Distributed Processing Symposium, p. 123 (2006)
35. Zhao, B.Y., Huang, L., Stribling, J., Rhea, S.C., Joseph, A.D., Kubiatowicz, J.D.:
Tapestry: a resilient global-scale overlay for service deployment. IEEE Journal on
Selected Areas in Communications 22(1), 41–53 (2004)
Cooperative Dynamic Scheduling of Virtual
Machines in Distributed Systems
1 Introduction
Scheduling jobs has been a major concern in distributed computer systems.
Traditional approaches rely on batch schedulers [2] or on distributed operating
systems (OS) [7]. Although batch schedulers are the most deployed solutions,
they may lead to a suboptimal use of resources. They usually schedule processes
statically – each process is assigned to a given node and stays on it until its
termination – according to user requests for resource reservations, which may
be overestimated. On the contrary, preemption mechanisms were developed for
distributed OSes to make them schedule processes dynamically, in line with
their effective resource requirements. However, these mechanisms were hard to
implement due to the problem of residual dependencies [1].
Using system virtual machines (VMs) [14] instead of processes makes it possible to per-
form dynamic scheduling of jobs while avoiding the issue of residual dependen-
cies [4,12]. However, some virtual infrastructure managers (VIMs) still schedule
VMs in a static way [6,10]; this conflicts with a common objective of virtual infras-
tructure providers: maximizing system utilization while ensuring the quality of
service (QoS). Other VIMs implement dynamic VM scheduling [5,8,15], which
enables a finer management of resources and resource overcommitment.
[Figure 1: centralized dynamic VM scheduling – a service node (1) monitors the worker nodes, (2) computes a schedule, and (3) applies it.]
However,
they often rely on a centralized design, which prevents them from scaling and from being
reactive. Scheduling is indeed an NP-hard problem; the time needed to solve it
grows exponentially with the number of nodes and VMs considered. Besides, it
takes time to apply a new schedule, because manipulating VMs is costly [4]. Dur-
ing the computation and the application of a schedule (cf. Fig. 1(a)), centralized
managers do not enforce the QoS anymore, and thus cannot react quickly to QoS
violations. Moreover, the schedule may be outdated when it is eventually applied
if the workloads have changed (cf. Fig. 1(b)). Finally, centralization can lead to
fault-tolerance issues: VMs may not be managed anymore if the master node
crashes, as it is a single point of failure (SPOF). Considering all the limitations
of centralized solutions, more decentralized ones should be investigated. Indeed,
scheduling takes less time if the work is distributed among several nodes, and
the failure of a node does not stop the scheduling anymore.
Several proposals have been made precisely to distribute dynamic VM man-
agement [3,13,17]. However, the resulting prototypes are still partially central-
ized. Firstly, at least one node has access to a global view of the system. Secondly,
several VIMs consider all nodes for scheduling, which limits scalability. Thirdly,
several VIMs still rely on service nodes, which are potential SPOFs.
In this paper, we introduce a VIM that schedules and manages VMs
cooperatively and dynamically in distributed systems. We designed it to be non-
predictive and event-driven, to work with partial views of the system, and to
require no SPOF. We made these choices for the VIM to be reactive, scalable
and fault-tolerant. In our proposal, when a node cannot guarantee the QoS for
its hosted VMs or when it is under-utilized, it starts an iterative scheduling pro-
cedure (ISP) by querying its neighbor to find a better placement. If the request
cannot be satisfied by the neighbor, it is forwarded to the following one until the
ISP succeeds. This approach allows each ISP to consider a minimum number of
nodes, thus decreasing the scheduling time, without requiring a central point. In
addition, several ISPs can occur independently at the same moment throughout
the infrastructure, which significantly improves the reactivity of the system. It
should be noted that nodes are reserved for exclusive use by a single ISP, to
prevent conflicts that can occur if several ISPs perform concurrent operations on the
same nodes or VMs. In other words, scheduling is performed on partitions of the
system that are created dynamically. Moreover, communication between nodes
is done through a fault-tolerant overlay network, which relies on distributed hash
table (DHT) mechanisms to mitigate the impact of a node crash [9]. We eval-
uated our prototype by means of simulations, to compare our approach with
the centralized one. Preliminary results were encouraging and showed that our
scheduler was reactive even if it had to manage several nodes and VMs.
The remainder of this article is structured as follows. Section 2 presents re-
lated work. Section 3 gives an overview of our proposal, while Sect. 4 details its
implementation and Sect. 5 compares it to a centralized proposal [5]. Finally,
Sect. 6 discusses perspectives and Sect. 7 concludes this article.
2 Related Work
This section presents work that aims at distributing resource management,
especially work related to the dynamic scheduling of VMs. Contrary to previous
solutions that performed scheduling periodically, recent proposals tend to rely
on an event-based approach: scheduling is started only if an event occurs in the
system, for example if a node is overloaded.
In the DAVAM project [16], VMs are dynamically distributed among man-
agers. When a VM does not have enough resources, its manager tries to relocate it
by considering all the resources of the system (the manager builds this global view
by communicating with its neighbors).
Another proposal [13] relies on peer-to-peer networks. It is very similar to the
centralized approaches, except that there is no service node, which makes it more
fault-tolerant. When an event occurs on a node, this node collects monitoring
information on all nodes, finds which nodes can help it to fix the problem, and
performs appropriate migrations.
A third proposal [17] relies on the use of a service node that collects mon-
itoring information on all worker nodes. When an event occurs on a worker
node, this node retrieves information from the service node, computes a new
schedule and performs appropriate migrations. This approach does not consider
fault-tolerance issues.
Snooze [3] has a hierarchical design: nodes are dynamically distributed among
managers, a super manager oversees managers and has a global view of the
system. When an event occurs, it is processed by a manager that considers all
nodes it is in charge of. The design of Snooze is close to that of Hasthi [11]; the main
difference is that Snooze targets virtualized systems and single system images,
while Hasthi is presented as system agnostic.
3 Proposal Overview
In this section, we describe the theoretical foundations of our proposal. After
giving its main characteristics, we explain shortly how it works.
When a node N_i retrieves its local monitoring information and detects a problem
(e.g. it is overloaded), it starts a new iterative scheduling procedure by generating
an event, reserving itself for the duration of this ISP, and sending the event to
its neighbor, node N_{i+1} (cf. Fig. 2).
Node N_{i+1} reserves itself, updates the node reservations and retrieves monitoring
information on all nodes reserved for this ISP, i.e. on nodes N_i and N_{i+1}. It then
computes a new schedule. If it fails, it forwards the event to its neighbor, node
N_{i+2}.
Node N_{i+2} performs the same operations as node N_{i+1}. If the computation of
the new schedule succeeds, node N_{i+2} applies it (e.g. by performing appropriate
VM migrations) and finally cancels the reservations, so that nodes N_i, N_{i+1} and
N_{i+2} are free to take part in another ISP.
Considering that a given node can take part only in one of these iterative
scheduling procedures at a time, several ISPs can occur simultaneously and
independently throughout the infrastructure, thus improving reactivity.
Note that if a node receives an event while it is reserved, it just forwards it
to its neighbor.
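A toy simulation of the iterative scheduling procedure described above may help fix the idea; this is our own sketch, not the Java prototype, and the compute_schedule() placeholder simply succeeds once three nodes are involved.

```c
#include <stdbool.h>
#include <stdio.h>

#define NB_NODES 8

static bool reserved[NB_NODES];

/* Placeholder scheduler: pretend a schedule is found once 3 nodes joined. */
static bool compute_schedule(int nb_involved) { return nb_involved >= 3; }

static void start_isp(int origin)
{
    reserved[origin] = true;                      /* origin reserves itself   */
    int involved = 1;

    /* The event travels around the ring until a schedule is found. */
    for (int n = (origin + 1) % NB_NODES; n != origin; n = (n + 1) % NB_NODES) {
        if (reserved[n])                          /* busy node: just forward  */
            continue;
        reserved[n] = true;                       /* node joins this ISP      */
        involved++;
        if (compute_schedule(involved)) {
            printf("ISP from node %d solved on node %d with %d nodes\n",
                   origin, n, involved);
            for (int i = 0; i < NB_NODES; i++)    /* cancel reservations      */
                reserved[i] = false;              /* (only one ISP in this toy run) */
            return;
        }
    }
    printf("ISP from node %d could not be solved\n", origin);
}

int main(void)
{
    start_isp(2);   /* e.g., node 2 detects that it is overloaded */
    return 0;
}
```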
4 Implementation
4.1 Current State
We implemented our proposal in Java. The prototype can currently process
‘overloaded node’ and ‘underloaded node’ events; these events are defined by
means of CPU and memory thresholds by the system administrator. Moreover,
the overlay network is a simple ring (cf. Fig. 3) without any fault-tolerance
mechanism, i.e. it cannot recover from a node crash. Furthermore, the prototype
manipulates virtual VMs, i.e. Java objects.
5 Experiments
We compared our approach with the Entropy [5] one by means of simulation.
Basically, the simulator injected a random CPU workload into each virtual VM
and waited until the VIM solved all ‘overloaded node’ issues. Comparison cri-
teria included the average time to solve an event, the time elapsed from the
load injection until all ‘overloaded node’ issues were solved, and the cost of the
schedule to apply. This cost is related to the kind of actions to perform on VMs
(e.g. migrations) and to the amount of memory allocated to the VMs that are
manipulated [5].
The experiments were done on a HP Proliant DL165 G7 with 2 CPUs (AMD
Opteron 6164 HE, 12 cores, 1.7 GHz) and 48 GB of RAM. The software stack
was composed of Debian 6/Squeeze, Sun Java VM 6 and Entropy 1.1.1. The
simulated nodes had 2 CPUs (2 GHz) and 4 GB of RAM. The simulated VMs
had 1 virtual CPU (2 GHz) and 1 GB of RAM. The virtual CPU load could take
only one of the following values (in percentage): 0, 20, 40, 60, 80, 100. Entropy
has timeouts to prevent it from spending too much time computing a new schedule;
these timeouts were set to twice the number of nodes considered (in seconds).
Our VIM considers that a node is overloaded if the hosted VMs try to consume
more than 100% of its CPU or RAM; it is underloaded if less than 20% of its CPU
and less than 50% of RAM are used.
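The event-detection thresholds described above could be expressed as follows; the values come from the text, while the structure and function names are ours.

```c
#include <stdio.h>

typedef struct {
    double cpu_demand_pct;   /* CPU demanded by hosted VMs, % of node capacity */
    double ram_demand_pct;   /* RAM demanded by hosted VMs, % of node capacity */
} node_load;

static int is_overloaded(node_load l)
{
    return l.cpu_demand_pct > 100.0 || l.ram_demand_pct > 100.0;
}

static int is_underloaded(node_load l)
{
    return l.cpu_demand_pct < 20.0 && l.ram_demand_pct < 50.0;
}

int main(void)
{
    node_load l = { 120.0, 60.0 };   /* hosted VMs demand 120% CPU, 60% RAM */
    printf("overloaded=%d underloaded=%d\n", is_overloaded(l), is_underloaded(l));
    return 0;
}
```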
As we can see in Table 1, our VIM is more reactive, i.e. it quickly solved
individual events, especially the ‘overloaded node’ ones. This can be explained
by the fact that our VIM generally considers a small number of nodes compared
to Entropy. This also leads to a smaller cost for applying schedules.
In detail, the first row shows the iteration length, which corresponds to the
time required to solve all events occurring during one iteration. The second row
gives the time to solve one event, that is, the time between the appearance of an event
and its resolution. The third row focuses on overloaded events. These events refer
to QoS violations and must be solved as quickly as possible. For these two rows,
we do not mention the values of the centralized approach since it relies on a
periodic scheme: Entropy monitors the configuration at the beginning of the
iteration, analyzes the configuration and applies the schedule at the end. The
fourth row shows the size of each partition, i.e. the number of nodes considered
for a scheduling operation. As we can see in the fifth row, the smaller the partition is, the
cheaper the reconfiguration cost. However, it is worth noting that the values
for the Entropy approach, as previously, consider the total cost for the whole
iteration, whereas for the DVMS approach the cost of the reconfiguration related
to one event is considered. As a consequence, the sum of all reconfiguration costs in
the DVMS approach can be higher than the corresponding cost of the Entropy
approach. However, since we are trying to solve each event as soon as possible, we are
not interested in the global cost but in the cost of solving one event. Finally, the last
row presents the consolidation rate, which is the percentage of nodes hosting
at least one VM. We can see that, despite the fact that our approach is more
reactive, it does not negatively impact the consolidation rate.
6 Future Work
Several ways should be explored to improve the prototype, with regard to event
management, fault-tolerance and network topology.
Network Topology. The current prototype does not take the network topology
into account. However, the knowledge of network bandwidth between each pair
of nodes could lead to faster migrations in a heterogeneous system.
7 Conclusion
In this article, we proposed a new approach to schedule VMs dynamically and
cooperatively in distributed systems, keeping in mind the following objective:
maximizing system utilization while ensuring the quality of service.
References
1. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I.,
Warfield, A.: Live migration of virtual machines. In: NSDI 2005: Proceedings of the
2nd Conference on Symposium on Networked Systems Design and Implementation,
NSDI 2005, pp. 273–286. USENIX Association, Berkeley (2005)
2. Etsion, Y., Tsafrir, D.: A Short Survey of Commercial Cluster Batch Schedulers.
Tech. rep., The Hebrew University of Jerusalem, Jerusalem, Israel (May 2005)
3. Feller, E., Rilling, L., Morin, C., Lottiaux, R., Leprince, D.: Snooze: A Scalable,
Fault-Tolerant and Distributed Consolidation Manager for Large-Scale Clusters.
Tech. rep., INRIA Rennes, Rennes, France (September 2010)
4. Hermenier, F., Lebre, A., Menaud, J.M.: Cluster-Wide Context Switch of Virtu-
alized Jobs. In: VTDC 2010: Proceedings of the 4th International Workshop on
Virtualization Technologies in Distributed Computing. ACM, New York (2010)
5. Hermenier, F., Lorca, X., Menaud, J.M., Muller, G., Lawall, J.: Entropy: a consoli-
dation manager for clusters. In: Hosking, A.L., Bacon, D.F., Krieger, O. (eds.) VEE
2009: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference
on Virtual Execution Environments, pp. 41–50. ACM, New York (2009)
6. Hoffa, C., Mehta, G., Freeman, T., Deelman, E., Keahey, K., Berriman, B., Good,
J.: On the Use of Cloud Computing for Scientific Workflows. In: ESCIENCE
2008: Proceedings of the 2008 Fourth IEEE International Conference on eScience,
pp. 640–645. IEEE Computer Society, Washington, DC (2008)
7. Lottiaux, R., Gallard, P., Vallee, G., Morin, C., Boissinot, B.: OpenMosix, OpenSSI
and Kerrighed: a comparative study. In: CCGRID 2005: Proceedings of the Fifth
IEEE International Symposium on Cluster Computing and the Grid, vol. 2,
pp. 1016–1023. IEEE Computer Society, Washington, DC (2005)
8. Lowe, S.: Introducing VMware vSphere 4, 1st edn. Wiley Publishing Inc., Indi-
anapolis (2009)
9. Milojicic, D.S., Kalogeraki, V., Lukose, R., Nagaraja, K., Pruyne, J., Richard, B.,
Rollins, S., Xu, Z.: Peer-to-Peer Computing. Tech. rep., HP Laboratories, Palo
Alto, CA, USA (July 2003)
10. Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L.,
Zagorodnov, D.: The Eucalyptus Open-Source Cloud-Computing System. In: Cap-
pello, F., Wang, C.L., Buyya, R. (eds.) CCGRID 2009: Proceedings of the 2009
9th IEEE/ACM International Symposium on Cluster Computing and the Grid,
pp. 124–131. IEEE Computer Society, Washington, DC (2009)
11. Perera, S., Gannon, D.: Enforcing User-Defined Management Logic in Large Scale
Systems. In: Services 2009: Proceedings of the 2009 Congress on Services - I,
pp. 243–250. IEEE Computer Society, Washington, DC (2009)
12. Quesnel, F., Lebre, A.: Operating Systems and Virtualization Frameworks: From
Local to Distributed Similarities. In: Cotronis, Y., Danelutto, M., Papadopoulos,
G.A. (eds.) PDP 2011: Proceedings of the 19th Euromicro International Confer-
ence on Parallel, Distributed and Network-Based Computing, pp. 495–502. IEEE
Computer Society, Los Alamitos (2011)
13. Rouzaud-Cornabas, J.: A Distributed and Collaborative Dynamic Load Balancer
for Virtual Machine. In: Guarracino, M.R., Vivien, F., Träff, J.L., Cannatoro, M.,
Danelutto, M., Hast, A., Perla, F., Knüpfer, A., Di Martino, B., Alexander, M.
(eds.) Euro-Par-Workshop 2010. LNCS, vol. 6586, pp. 641–648. Springer, Heidel-
berg (2011)
14. Smith, J.E., Nair, R.: The Architecture of Virtual Machines. Computer 38(5),
32–38 (2005)
15. Sotomayor, B., Montero, R.S., Llorente, I.M., Foster, I.: Virtual Infrastructure
Management in Private and Hybrid Clouds. IEEE Internet Computing 13(5),
14–22 (2009)
16. Xu, J., Zhao, M., Fortes, J.A.B.: Cooperative Autonomic Management in Dynamic
Distributed Systems. In: Guerraoui, R., Petit, F. (eds.) SSS 2009. LNCS, vol. 5873,
pp. 756–770. Springer, Heidelberg (2009)
17. Yazir, Y.O., Matthews, C., Farahbod, R., Neville, S., Guitouni, A., Ganti, S.,
Coady, Y.: Dynamic Resource Allocation in Computing Clouds Using Distributed
Multiple Criteria Decision Analysis. In: Cloud 2010: IEEE 3rd International Con-
ference on Cloud Computing, pp. 91–98. IEEE Computer Society, Los Alamitos
(2010)
Large-Scale DNA Sequence Analysis
in the Cloud: A Stream-Based Approach
1 Introduction
Today, huge amounts of data are being generated at ever increasing rates by
a wide range of sources, from networks of sensing devices to social media and
specialized scientific devices such as DNA sequencing machines and astronomical
telescopes. Using these data sets in intelligent applications, such as detecting
and preventing diseases or spotting business trends, is an exciting opportunity,
but managing their capture, transfer, storage, and analysis is a major challenge.
Recent advances in cloud computing technologies have made it possible to
analyze very large data sets in scalable and cost-effective ways. Various platforms
and frameworks have been proposed to make use of cloud infrastructures
for solving this problem, such as the MapReduce framework [2], [4]. Most of
these solutions are primarily designed for batch processing of data stored in a
distributed file system. While such a design supports scalable and fault-tolerant
processing very well, it may pose some limitations when transferring data. More
specifically, large amounts of data have to be uploaded into the cloud before the
processing starts, which not only causes significant data transfer latencies, but
also adds to the cloud storage costs [19], [26].
In this short paper, we mainly investigate the performance problems that
arise from having to transfer large amounts of data in and out of the cloud
based on a real data-intensive use case from bioinformatics, for which we pro-
pose a stream-based approach as a promising solution. Our key idea is that data
transfer latencies can be hidden by providing an incremental data processing
architecture, similar in spirit to pipelined query evaluation models in traditional
database systems [15]. It is important though that this is done in a way to also
support linear scalability through parallel processing, which is an indispensable
requirement for handling data and compute-intensive workloads in the cloud.
More specifically, we propose to use a stream-based data management archi-
tecture, which not only provides an incremental and parallel data processing
model, but also facilitates in-memory processing, since data is processed on the
fly and intermediate data need not be materialized on disk (unless it is explicitly
needed by the application), which can further reduce end-to-end response time
and cloud storage costs.
The rest of this paper is outlined as follows: In Section 2, we describe our use
case for large-scale DNA sequence analysis which has been the main motivation
for the work presented in this paper. We present our stream-based solution
approach in Section 3, including an initial implementation and evaluation of our
use case based on the IBM InfoSphere Streams computing platform [5] deployed
on Amazon EC2 [1]. Finally, we conclude with a discussion of future work in
Section 4.
Table 1. Compared to the Sanger method, NGS methods have significantly higher
throughput at a fraction of their costs
“Genome biologists will have to start acting like the high energy physi-
cists, who filter the huge datasets coming out of their collectors for a tiny
number of informative events and then discard the rest.”
NGS is used to sequence DNA in an automated and high-throughput process.
DNA molecules are fragmented into pieces of 100 to 800 bps, and digital versions
of DNA fragments are generated. These fragments, called reads, originate from
random positions of DNA molecules. In re-sequencing experiments the reads are
mapped back to a reference genome (e.g., human) [19] or - without a reference
genome - they can be assembled de novo [23]. However, de novo assembly is more
complex due to the short read length as well as to potential repetitive regions
in the genome. In re-sequencing experiments, polymorphisms between analyzed
DNA and the reference genome can be observed. A polymorphism of a single bp
is called a Single Nucleotide Polymorphism (SNP) and is recognized as the main
cause of human genetic variability [9]. Figure 1 shows an example, with a ref-
erence genome at the top row and two SNPs identified on the analyzed DNA
sequences depicted below. As stated by Fernald et al., once NGS technology be-
comes available on a clinical level, it will become part of the standard healthcare
process to check patients’ SNPs before medical treatment (a.k.a., “personalized
medicine”) [12]:
“We are on the verge of the genomic era: doctors and patients will have
access to genetic data to customize medical treatment.”
Aligning NGS reads to genomes is computationally intensive. Li et al. give an
overview of algorithms and tools currently in use [19]. To align reads containing
SNPs, probabilistic algorithms have to be used, since finding an exact match
between reads and a given reference is not sufficient because of polymorphisms
and sequencing errors. Most of these algorithms are based on a basic pattern
called seed and extend [8], where small matching regions between reads and the
reference genome are identified first (seeding), and then further extended. Ad-
ditionally, to be able to identify seeds that contain SNPs, a special algorithm
that allows for a certain difference during seeding needs to be used [16]. Un-
fortunately, this adaptation further increases the computational complexity. For
example, on a small cluster used by FGCZ [3] (25 nodes with a total of 232 CPU
compute cores and 800 GB main memory), a single genome alignment process
can take up to 10 hours.
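To illustrate the seed-and-extend pattern mentioned above, here is a toy sketch using exact k-mer seeding only; real aligners such as SHRiMP use spaced seeds that tolerate mismatches [16], so this is not the algorithm of any cited tool, and the sequences are invented.

```c
#include <stdio.h>
#include <string.h>

#define SEED_LEN 8

/* Count mismatches when the whole read is placed at reference position pos. */
static int extend(const char *ref, const char *read, size_t pos)
{
    int mismatches = 0;
    for (size_t i = 0; i < strlen(read); i++)
        if (ref[pos + i] != read[i])
            mismatches++;
    return mismatches;
}

int main(void)
{
    const char *ref  = "ACGTACGTTTGACCTGAACGTTGCA";   /* toy reference */
    const char *read = "TTGACCTGAACGTT";               /* toy read      */
    size_t rlen = strlen(ref), qlen = strlen(read);

    /* Seeding: find exact matches of the read's first SEED_LEN bases. */
    for (size_t pos = 0; pos + qlen <= rlen; pos++) {
        if (strncmp(ref + pos, read, SEED_LEN) == 0) {
            /* Extension: score the full-length placement at this seed hit. */
            printf("seed hit at %zu, full-read mismatches: %d\n",
                   pos, extend(ref, read, pos));
        }
    }
    return 0;
}
```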
Fig. 1. SNP identification: The top row shows a subsequence of the reference genome.
The following rows are aligned NGS reads. Two SNPs can be identified. T is replaced
by C (7th column) and C is replaced by T (25th column). In one read (line 7), a
sequencing error can be observed where A has been replaced by G (last column).
Source: http://bioinf.scri.ac.uk/tablet/.
Read alignment algorithms have been shown to have a great potential for
linear scalability [24]. However, sequencing throughput increases faster than
computational power and storage size [25]. As a result, although NGS machines
are becoming cheaper, using dedicated compute clusters for read alignment is
still a significant investment. Fortunately, even small labs can do the alignment
by using cloud resources [11]. Li et al. state that cloud computing might be a
possible solution for small labs, but also raise concerns about data transfer
bottlenecks and storage costs [19]. Thus, existing cloud-based solutions such as
CloudBurst [24] and Crossbow [17] as well as the cloud-enabled version of Galaxy
[14] have a common disadvantage: before processing starts, large amounts of data
have to be uploaded into the cloud, potentially causing significant data transfer
latency and storage costs [26].
In this work, our main focus is to develop solutions for the performance prob-
lems that stem from having to transfer large amounts of data in and out of the
cloud for data-intensive use cases such as the one described above. If we roughly
capture the overall processing time with a function f(n, s) ∝ c·s + s/n, where n is
the number of CPU cores, s is the problem size¹, and c is a constant for the data
transfer rate between a client and the cloud, our main goal is to bring down the
first component (c·s) in this formula. Furthermore, we would like to do it in a way
that supports linear scalability. In the next section, we will present the solution
that we propose, together with an initial evaluation study which indicates that
our approach is a promising one.
¹ Problem size for NGS read alignment depends on a number of factors including the
number of reads to be aligned, the size of the reference genome, and the “fuzziness”
of the alignment algorithm.
3 A Stream-Based Approach
Fig. 3. With our stream-based approach, the client streams the reads into the cloud,
where they instantly get mapped to a reference genome and results are immediately
streamed back to the client
Fig. 4. Operator and dataflow graph for our stream-based incremental processing im-
plementation of SHRiMP
Figure 4 shows a detailed data flow graph of our implementation. A client ap-
plication implemented in Java compresses and streams raw NGS read data into
the cloud, where a master Streams node first receives it. At the master node, the
read stream gets uncompressed by an Uncompress operator and is then fed into a
TCPSource operator. In order to be able to run parallel instances of SHRiMP for
increased scalability, TCPSource operator feeds the stream into a ThreadedSplit
operator. ThreadedSplit is aware of the data input rates that can be handled by
its downstream operators, and therefore, it can provide an optimal load distri-
bution. The number of substreams that ThreadedSplit generates determines the
number of processing (i.e., slave) nodes in the compute cluster, each of which
will run a SHRiMP instance. SHRiMP instances are created by instantiating a
custom Streams operator using standard Unix pipes. The resulting aligned read
data (in the form of SAM output [6]) on different SHRiMP nodes are merged
by the master node using a Merge operator. Then a TCPSink operator passes
the output stream to a Compress operator, which ensures that results are sent
back to the client application in compact form, where they should be uncom-
pressed again before being presented to the user. The whole chain, including the
compression stages, is fully incremental.
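As a rough illustration of the splitting idea (the actual implementation uses InfoSphere Streams operators), the sketch below distributes records from stdin round-robin to worker processes over plain Unix pipes; the worker command is a placeholder, the real ThreadedSplit operator is load-aware rather than round-robin, and merging of results is handled separately.

```c
#include <stdio.h>

#define NB_WORKERS 4

int main(void)
{
    FILE *worker[NB_WORKERS];
    char line[4096];

    for (int i = 0; i < NB_WORKERS; i++) {
        worker[i] = popen("./shrimp_worker.sh", "w");   /* placeholder command */
        if (!worker[i]) {
            perror("popen");
            return 1;
        }
    }

    int next = 0;
    while (fgets(line, sizeof line, stdin)) {           /* one read per line   */
        fputs(line, worker[next]);
        next = (next + 1) % NB_WORKERS;                  /* round-robin split   */
    }

    for (int i = 0; i < NB_WORKERS; i++)
        pclose(worker[i]);
    return 0;
}
```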
Fig. 5. At cluster sizes of 4 nodes and above, the stream-based solution incurs
less total processing time than the standalone application. This is because the data transfer
time always adds to the curve of the standalone application.
Scalability. Figure 5 shows the result of our scalability experiment. The bottom
flat line corresponds to the data transfer time of 90 minutes for our specific input
dataset. This time is included in the SHRiMP standalone curve, where input data
has to be uploaded into the cloud in advance. On the other hand, the stream-
based approach does not transfer any data in advance, thus does not incur this
additional latency. Both approaches show linear scalability in total processing
time as the number of Amazon EC2 nodes are increased. Upto 4 nodes, the
standalone approach takes less processing time. However, we see that as the
cluster size increases beyond this value, the relative effect of the initial data
transfer latency for the standalone approach starts to show itself, reaching to
almost a 30-minute difference in processing time over the stream-based approach
for the 16-node setup. We expect this difference to become even more significant
as the input dataset size and the cluster size further increase.
Costs. As our solution allows data processing to start as soon as the data arrives
in the cloud, we can show that the constant c in the formula f(n, s) ∝ c·s + s/n
introduced in the previous section can be brought to nearly zero, leading to
f(n, s) ∝ s/n for the overall data processing time. Since we have shown linear scale
out, we can calculate the CPU cost using p(n, s) ∝ n·f(n, s) ∝ n·(s/n) ∝ s. Since
the cost ends up being dependent only on the problem size, one can minimize
the processing time f(n, s) by maximizing n without any significant effect on
the cost. Data transfer and storage costs are relatively small in comparison
to the CPU cost, therefore, we have ignored them in this initial cost analysis.
Nevertheless, it is not difficult to see that these costs will also decrease with our
stream-based approach.
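Written out, the cost argument of this paragraph is (c is the data-transfer constant, n the number of nodes, s the problem size):

```latex
\[
  f(n,s) \;\propto\; c\,s + \frac{s}{n}
  \;\xrightarrow{\;c \to 0\;}\;
  f(n,s) \;\propto\; \frac{s}{n},
  \qquad
  p(n,s) \;\propto\; n \cdot f(n,s) \;\propto\; n \cdot \frac{s}{n} \;=\; s .
\]
```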
Ease of Use. Our client, a command line tool, behaves exactly the same way as
a command line tool for any read alignment software package. Therefore, existing
data processing chains can be sped up by simply replacing the existing aligner
with our client without changing anything else. Even flexible and more complex
bioinformatics data processing engines (e.g., Galaxy [14] or Pegasus [10]) can be
transparently enhanced by simply replacing the original data processing stages
with our solution.
References
1. Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/
2. Apache Hadoop, http://hadoop.apache.org/
3. Functional Genomics Center Zurich, http://www.fgcz.ch/
4. Google MapReduce, http://labs.google.com/papers/mapreduce.html
5. IBM InfoSphere Streams,
http://www.ibm.com/software/data/infosphere/streams
6. The SAM Format Specification, samtools.sourceforge.net/SAM1.pdf
7. Abadi, D., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J.,
Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., Zdonik, S.:
The Design of the Borealis Stream Processing Engine. In: Conference on Innovative
Data Systems Research (CIDR 2005), Asilomar, CA (January 2005)
8. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic Local Align-
ment Search Tool. Journal of Molecular Biology 215(3) (October 1990)
9. Collins, F.S., Guyer, M., Chakravarti, A.: Variations on a Theme: Cataloging Hu-
man DNA Sequence Variation. Science 278(5343) (November 1997)
10. Deelman, E., Mehta, G., Singh, G., Su, M., Vahi, K.: Pegasus: mapping large-scale
workflows to distributed resources. In: Workflows for e-Science, pp. 376–394 (2007)
11. Dudley, J.T., Butte, A.J.: In Silico Research in the Era of Cloud Computing. Nature
Biotechnology 28(11) (2010)
12. Fernald, G.H., Capriotti, E., Daneshjou, R., Karczewski, K.J., Altman, R.B.: Bioin-
formatics Challenges for Personalized Medicine. Bioinformatics 27(13) (July 2011)
13. Gedik, B., Andrade, H., Wu, K.L., Yu, P.S., Doo, M.: SPADE: The System S
Declarative Stream Processing Engine. In: ACM SIGMOD Conference, Vancouver,
BC, Canada (June 2008)
14. Goecks, J., Nekrutenko, A., Taylor, J., Team, G.: Galaxy: A Comprehensive Ap-
proach for Supporting Accessible, Reproducible, and Transparent Computational
Research in the Life Sciences. Genome Biology 11(8) (2010)
15. Graefe, G.: Query Evaluation Techniques for Large Databases. ACM Computing
Surveys 25(2) (June 1993)
16. Keich, U., Ming, L., Ma, B., Tromp, J.: On Spaced Seeds for Similarity Search.
Discrete Applied Mathematics 138(3) (April 2004)
17. Langmead, B., Schatz, M.C., Lin, J., Pop, M., Salzberg, S.L.: Searching for SNPs
with Cloud Computing. Genome Biology 10(11) (2009)
18. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and Memory-efficient
Alignment of Short DNA Sequences to the Human Genome. Genome Biology 10(3)
(2009)
19. Li, H., Homer, N.: A Survey of Sequence Alignment Algorithms for Next-
Generation Sequencing. Briefings in Bioinformatics 11(5) (September 2010)
20. Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., Wang, J.: SNP Detec-
tion for Massively Parallel Whole-Genome Resequencing. Genome Research 19(6)
(June 2009)
21. Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A., Brudno, M.:
SHRiMP: Accurate Mapping of Short Color-space Reads. PLOS Computational
Biology 5(5) (May 2009)
22. Sanger, F., Coulson, A.R.: A Rapid Method for Determining Sequences in DNA by
Primed Synthesis with DNA Polymerase. Journal of Mol. Biol. 94(3) (May 1975)
23. Schatz, M., Delcher, A., Salzberg, S.: Assembly of large genomes using second-
generation sequencing. Genome Research 20(9), 1165 (2010)
24. Schatz, M.C.: CloudBurst: Highly Sensitive Read Mapping with MapReduce.
Bioinformatics 25(11) (June 2009)
25. Stein, L.D.: The Case for Cloud Computing in Genome Informatics. Genome Bi-
ology 11(5) (2010)
26. Viedma, G., Olias, A., Parsons, P.: Genomics Processing in the Cloud. International
Science Grid This Week (February 2011),
http://www.isgtw.org/feature/genomics-processing-cloud
27. Voelkerding, K.V., Dames, S.A., Durtschi, J.D.: Next-Generation Sequencing:
From Basic Research to Diagnostics. Clinical Chemistry 55(4) (February 2009)
Author Index