
Lecture Notes in Computer Science 7156

Commenced Publication in 1973


Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Michael Alexander
Pasqua D’Ambra
Adam Belloum
George Bosilca
Mario Cannataro
Marco Danelutto
Beniamino Di Martino
Michael Gerndt
Emmanuel Jeannot
Raymond Namyst
Jean Roman
Stephen L. Scott
Jesper Larsson Träff
Geoffroy Vallée
Josef Weidendorfer (Eds.)

Euro-Par 2011:
Parallel Processing
Workshops
CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC,
HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC
Bordeaux, France, August 29 – September 2, 2011
Revised Selected Papers, Part II

Volume Editors

Michael Alexander, E-mail: malexander@scilytics.com


Pasqua D’Ambra, E-mail: pasqua.dambra@na.icar.cnr.it
Adam Belloum, E-mail: a.s.z.belloum@uva.nl
George Bosilca, E-mail: bosilca@eecs.utk.edu
Mario Cannataro, E-mail: cannataro@unicz.it
Marco Danelutto, E-mail: marcod@di.unipi.it
Beniamino Di Martino, E-mail: beniamino.dimartino@unina.it
Michael Gerndt, E-mail: michael.gerndt@in.tum.de
Emmanuel Jeannot, E-mail: emmanuel.jeannot@inria.fr
Raymond Namyst, E-mail: raymond.namyst@labri.fr
Jean Roman, E-mail: jean.roman@inria.fr
Stephen L. Scott, E-mail: scottsl@ornl.gov
Jesper Larsson Träff, E-mail: traff@par.univie.ac.at
Geoffroy Vallée, E-mail: valleegr@ornl.gov
Josef Weidendorfer, E-mail: josef.weidendorfer@in.tum.de

ISSN 0302-9743 e-ISSN 1611-3349


ISBN 978-3-642-29739-7 e-ISBN 978-3-642-29740-3
DOI 10.1007/978-3-642-29740-3
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2012935785

CR Subject Classification (1998): C.4, D.2, C.2, D.4, C.2.4, C.3

LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

© Springer-Verlag Berlin Heidelberg 2012


This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

Euro-Par is an annual series of international conferences dedicated to the promotion and advancement of all aspects of parallel and distributed computing.
Euro-Par 2011 was the 17th edition in this conference series. Euro-Par covers a
wide spectrum of topics from algorithms and theory to software technology and
hardware-related issues, with application areas ranging from scientific to mobile
and cloud computing. Euro-Par provides a forum for the introduction, presenta-
tion and discussion of the latest scientific and technical advances, extending the
frontier of both the state of the art and the state of the practice.
Since 2006, Euro-Par conferences have provided a platform for a number of accompanying technical workshops. This is a great opportunity for small and emerging communities to meet and discuss focussed research topics. The 2011 edition established a new record: 12 workshops were organized. Among these workshops, we had the pleasure of welcoming four newcomers: HPCVirt (previously held in conjunction with EuroSys), HPSS (first edition), MDGS (first edition) and Resilience (previously held in conjunction with CCGrid). It was also great to see the CCPI, HiBB and UCHPC workshops attracting a broad audience for their second edition. Here is the complete list of workshops that were held in 2011:
1. Cloud Computing Projects and Initiatives (CCPI)
2. CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS)
3. Algorithms, Models and Tools for Parallel Computing on Heterogeneous
Platforms (HeteroPar)
4. High-Performance Bioinformatics and Biomedicine (HiBB)
5. System-Level Virtualization for High-Performance Computing (HPCVirt)
6. Highly Parallel Processing on a Chip (HPPC)
7. Algorithms and Programming Tools for Next-Generation High-Performance
Scientific Software (HPSS)
8. Managing and Delivering Grid Services (MDGS)
9. Productivity and Performance (Proper)
10. Resiliency in High-Performance Computing (Resilience) in Clusters, Clouds,
and Grids
11. UnConventional High-Performance Computing 2011 (UCHPC)
12. Virtualization in High-Performance Cloud Computing (VHPC).
The present volume includes the proceedings of all workshops. Each workshop had its own paper-reviewing process. Special thanks are due to the authors of all the submitted papers, the members of the Program Committees, all the reviewers and the workshop organizers. They all contributed to the success of this edition.
We are also grateful to the members of the Euro-Par Steering Committee for their support, in particular Luc Bougé and Christian Lengauer for all their advice regarding the coordination of workshops. We thank Domenico Talia, Pasqua D’Ambra and Mario Rosario Guarracino from the organization of Euro-Par 2010 for sharing their experience with us.
A number of institutional and industrial sponsors contributed toward the
organization of the conference. Their names and logos appear on the Euro-Par
2011 website http://europar2011.bordeaux.inria.fr/
It was our pleasure and honor to organize and host the Euro-Par 2011 work-
shops in Bordeaux. We hope all the participants enjoyed the technical program
and the social events organized during the conference.

January 2011 Emmanuel Jeannot


Raymond Namyst
Jean Roman
Organization

Euro-Par Steering Committee


Chair
Chris Lengauer University of Passau, Germany

Vice-Chair
Luc Bougé ENS Cachan, France

European Representatives
José Cunha New University of Lisbon, Portugal
Marco Danelutto University of Pisa, Italy
Emmanuel Jeannot INRIA, France
Paul Kelly Imperial College, UK
Harald Kosch University of Passau, Germany
Thomas Ludwig University of Heidelberg, Germany
Emilio Luque University Autonoma of Barcelona, Spain
Tomàs Margalef University Autonoma of Barcelona, Spain
Wolfgang Nagel Dresden University of Technology, Germany
Rizos Sakellariou University of Manchester, UK
Henk Sips Delft University of Technology,
The Netherlands
Domenico Talia University of Calabria, Italy

Honorary Members
Ron Perrott Queen’s University Belfast, UK
Karl Dieter Reinartz University of Erlangen-Nuremberg, Germany

Euro-Par 2011 Organization


Conference Co-chairs
Emmanuel Jeannot INRIA, France
Raymond Namyst University of Bordeaux, France
Jean Roman INRIA, University of Bordeaux, France

Local Organizing Committee


Olivier Aumage INRIA, France
Emmanuel Agullo INRIA, France
Alexandre Denis INRIA, France

Nathalie Furmento CNRS, France


Laetitia Grimaldi INRIA, France
Nicole Lun LaBRI, France
Guillaume Mercier University of Bordeaux, France
Elia Meyre LaBRI, France

Euro-Par 2011 Workshops


Chair
Raymond Namyst University of Bordeaux, France

Workshop on Cloud Computing Projects and Initiatives (CCPI)
Program Chairs
Beniamino Di Martino Second University of Naples, Italy
Dana Petcu West University of Timisoara, Romania
Antonio Puliafito University of Messina, Italy

Program Committee
Pasquale Cantiello Second University of Naples, Italy
Maria Fazio University of Messina, Italy
Florin Fortis West University of Timisoara, Romania
Francesco Moscato Second University of Naples, Italy
Viorel Negru West University of Timisoara, Romania
Massimo Villari University of Messina, Italy

CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing – CGWS2011
Program Chairs
M. Danelutto University of Pisa, Italy
F. Desprez INRIA and ENS Lyon, France
V. Getov University of Westminster, UK
W. Ziegler SCAI, Germany
Program Committee
Artur Andrzejak Institute For Infocomm Research (I2R),
Singapore
Marco Aldinucci University of Torino, Italy
Alvaro Arenas IE Business School, Madrid, Spain
Rosa M. Badia Technical University of Catalonia, Spain
Alessandro Bassi HITACHI, France

Augusto Ciuffoletti University of Pisa, Italy


Marco Danelutto University of Pisa, Italy
Marios Dikaiakos University of Cyprus, Cyprus
Dick H.J. Epema Delft University of Technology,
The Netherlands
Thomas Fahringer University of Innsbruck, Austria
Gilles Fedak INRIA, France
Paraskevi Fragopoulou FORTH-ICS, Greece
J. Gabarro Technical University of Catalonia, Spain
Vladimir Getov University of Westminster, UK
Sergei Gorlatch University of Münster, Germany
T. Harmer Belfast e-Science Center, UK
Ruben S. Montero Complutense University of Madrid, Spain
Peter Kacsuk MTA SZTAKI, Hungary
Thilo Kielmann Vrije Universiteit, The Netherlands
Derrick Kondo INRIA, France
Philippe Massonet CETIC, Belgium
Carlo Mastroianni ICAR-CNR, Italy
Norbert Meyer Poznan Supercomputing and Networking
Center, Poland
Ignacio M. Llorente Complutense University of Madrid, Spain
Christian Pérez INRIA/IRISA, France
Ron Perrott Queen’s University of Belfast, UK
Thierry Priol INRIA, France
Omer Rana Cardiff University, UK
Rizos Sakellariou University of Manchester, UK
Alan Stewart Queen’s University of Belfast, UK
Junichi Suzuki University of Massachusetts, Boston, USA
Domenico Talia University of Calabria, Italy
Ian Taylor Cardiff University, UK
Jordi Torres Technical University of Catalonia - BSC, Spain
Paolo Trunfio University of Calabria, Italy
Ramin Yahyapour University of Dortmund, Germany
Demetrios Zeinalipour-Yazti University of Cyprus, Cyprus
Wolfgang Ziegler Fraunhofer Institute SCAI, Germany

5th Workshop on System-Level Virtualization for High-Performance Computing (HPCVirt 2011)
Program Chairs
Stephen L. Scott Oak Ridge National Laboratory, USA
Geoffroy Vallée Oak Ridge National Laboratory, USA
Thomas Naughton Tennessee Tech University, USA

Program Committee
Patrick Bridges UNM, USA
Thierry Delaitre The University of Westminster, UK
Christian Engelmann ORNL, USA
Douglas Fuller ORNL, USA
Ada Gavrilovska Georgia Tech, USA
Jack Lange University of Pittsburgh, USA
Adrien Lebre Ecole des Mines de Nantes, France
Laurent Lefevre INRIA, University of Lyon, France
Jean-Marc Menaud Ecole des Mines de Nantes, France
Christine Morin INRIA, France
Thomas Naughton ORNL, USA
Dimitrios Nikolopoulos University of Crete, Greece
Josh Simons VMWare, USA
Samuel Thibault LaBRI, France

HPPC 2011: 5th Workshop on Highly Parallel Processing on a Chip
Program Chairs
Martti Forsell VTT, Finland
Jesper Larsson Träff University of Vienna, Austria

Program Committee
David Bader Georgia Institute of Technology, USA
Martti Forsell VTT, Finland
Jim Held Intel, USA
Peter Hofstee IBM, USA
Magnus Jahre NTNU, Norway
Chris Jesshope University of Amsterdam, The Netherlands
Ben Juurlink Technical University of Berlin, Germany
Jörg Keller University of Hagen, Germany
Christoph Kessler University of Linköping, Sweden
Avi Mendelson Microsoft, Israel
Vitaly Osipov Karlsruhe Institute of Technology, Germany
Martti Penttonen University of Eastern Finland, Finland
Sven-Bodo Scholz University of Hertfordshire, UK
Jesper Larsson Träff University of Vienna, Austria
Theo Ungerer University of Augsburg, Germany
Uzi Vishkin University of Maryland, USA
Sponsors
VTT, Finland http://www.vtt.fi
University of Vienna http://www.univie.ac.at
Euro-Par http://www.euro-par.org

Algorithms and Programming Tools for Next-Generation High-Performance Scientific Software (HPSS 2011)
Program Chairs
Stefania Corsaro University of Naples Parthenope and
ICAR-CNR, Italy
Pasqua D’Ambra ICAR-CNR, Naples, Italy
Francesca Perla University of Naples Parthenope and
ICAR-CNR, Italy

Program Committee
Patrick Amestoy University of Toulouse, France
Peter Arbenz ETH Zurich, Switzerland
Rob Bisseling Utrecht University, The Netherlands
Daniela di Serafino Second University of Naples and ICAR-CNR,
Italy
Jack Dongarra University of Tennessee, USA
Salvatore Filippone University of Rome Tor Vergata, Italy
Laura Grigori INRIA, France
Andreas Grothey University of Edinburgh, UK
Mario Rosario Guarracino ICAR-CNR, Italy
Sven Hammarling University of Manchester and NAG Ltd., UK
Mike Heroux Sandia National Laboratories, USA
Gerardo Toraldo University of Naples Federico II and
ICAR-CNR, Italy
Bora Ucar CNRS, France
Rich Vuduc Georgia Tech, USA
Ulrike Meier Yang Lawrence Livermore National Laboratory, USA

HeteroPar 2011: Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms
Program Chairs
George Bosilca ICL, University of Tennessee, Knoxville, USA

Program Committee
Jacques Bahi University of Franche-Comté, France
Jorge Barbosa FEUP, Portugal
George Bosilca Innovative Computing Laboratory - University
of Tennessee, Knoxville, USA
Andrea Clematis IMATI CNR, Italy
Michel Dayde IRIT - INPT / ENSEEIHT, France
Frederic Desprez INRIA, France
Pierre-Francois Dutot Laboratoire LIG, France
Alfredo Goldman University of São Paulo - USP, Brazil

Thomas Herault Innovative Computing Laboratory - University


of Tennessee, Knoxville, USA
Shuichi Ichikawa Toyohashi University of Technology, Japan
Emmanuel Jeannot LaBRI, INRIA Bordeaux Sud-Ouest, France
Helen Karatza Aristotle University of Thessaloniki, Greece
Zhiling Lan Illinois Institute of Technology, USA
Pierre Manneback University of Mons, Belgium
Kiminori Matsuzaki Kochi University of Technology, Japan
Wahid Nasri Higher School of Sciences and Techniques of
Tunis, Tunisia
Dana Petcu West University of Timisoara, Romania
Serge Petiton Université des Sciences et Technologies de Lille,
France
Casiano Rodriguez-Leon Universidad de La Laguna, Spain
Franciszek Seredynski Polish Academy of Sciences, Poland
Howard J. Siegel CSU, USA
Antonio M. Vidal Universidad Politécnica de Valencia, Spain
Ramin Yahyapour TU University Dortmund, Germany

HiBB 2011: Second Workshop on High-Performance Bioinformatics and Biomedicine
Program Chairs
Mario Cannataro University Magna Græcia of Catanzaro, Italy

Program Committee
Pratul K. Agarwal Oak Ridge National Laboratory, USA
David A. Bader College of Computing, Georgia Institute of Technology, USA
Ignacio Blanquer Universidad Politécnica de Valencia,
Valencia, Spain
Daniela Calvetti Case Western Reserve University, USA
Werner Dubitzky University of Ulster, UK
Ananth Y. Grama Purdue University, USA
Concettina Guerra University of Padova, Italy
Vicente Hernández Universidad Politécnica de Valencia, Spain
Salvatore Orlando University of Venice, Italy
Omer F. Rana Cardiff University, UK
Richard Sinnott National e-Science Centre, University of
Glasgow, Glasgow, UK
Fabrizio Silvestri ISTI-CNR, Italy
Erkki Somersalo Case Western Reserve University, USA
Paolo Trunfio University of Calabria, Italy
Albert Zomaya University of Sydney, Australia

Managing and Delivering Grid Services 2011 (MDGS2011)


Program Chairs
Thomas Schaaf Ludwig-Maximilians-Universität, Munich,
Germany
Owen Appleton Emergence Tech Limited, London, UK
Adam S.Z. Belloum University of Amsterdam, The Netherlands
Joan Serrat-Fernández Universitat Politècnica de Catalunya,
Barcelona, Spain
Tomasz Szepieniec AGH University of Science and Technology,
Krakow, Poland

Program Committee
Nazim Agoulmine University of Evry, France
Michael Brenner Leibniz Supercomputing Centre, Germany
Ewa Deelman University of Southern California, USA
Karim Djemame University of Leeds, UK
Thomas Fahringer University of Innsbruck, Austria
Alex Galis University College London, UK
Dieter Kranzlmüller Ludwig-Maximilians-Universität, Germany
Laurent Lefèvre INRIA, France
Edgar Magana CISCO research labs, USA
Patricia Marcu Leibniz Supercomputing Centre, Germany
Carlos Merida Barcelona Supercomputing Center, Spain
Steven Newhouse European Grid Initiative, The Netherlands
Omer F. Rana Cardiff University, UK
Stefan Wesner High Performance Computing Center
Stuttgart, Germany
Philipp Wieder Technische Universität Dortmund, Germany
Ramin Yahyapour Technische Universität Dortmund, Germany

4th Workshop on Productivity and Performance Tools for HPC Application Development (PROPER 2011)
Program Chairs
Michael Gerndt TU München, Germany

Program Committee
Andreas Knüpfer TU Dresden, Germany
Dieter an Mey RWTH Aachen, Germany
Jens Doleschal TU Dresden, Germany
Karl Fürlinger University of California at Berkeley, USA
Michael Gerndt TU München, Germany
Allen Malony University of Oregon, USA

Shirley Moore University of Tennessee, USA


Matthias Müller TU Dresden, Germany
Martin Schulz Lawrence Livermore National Lab, USA
Felix Wolf German Research School for Simulation
Sciences, Germany
Josef Weidendorfer TU München, Germany
Shajulin Benedict St. Xavier’s College, India
Beniamino Di Martino Seconda Università di Napoli, Italy
Torsten Höfler University of Illinois, USA

Workshop on Resiliency in High-Performance Computing (Resilience) in Clusters, Clouds, and Grids
Program Chairs
Stephen L. Scott Oak Ridge National Laboratory, USA
Chokchai (Box) Leangsuksun Louisiana Tech University, USA

Program Committee
Vassil Alexandrov Barcelona Supercomputing Center, Spain
David E. Bernholdt Oak Ridge National Laboratory, USA
George Bosilca University of Tennessee, USA
Jim Brandt Sandia National Laboratories, USA
Patrick G. Bridges University of New Mexico, USA
Greg Bronevetsky Lawrence Livermore National Laboratory, USA
Franck Cappello INRIA/UIUC, France/USA
Kasidit Chanchio Thammasat University, Thailand
Zizhong Chen Colorado School of Mines, USA
Nathan DeBardeleben Los Alamos National Laboratory, USA
Jack Dongarra University of Tennessee, USA
Christian Engelmann Oak Ridge National Laboratory, USA
Yung-Chin Fang Dell, USA
Kurt B. Ferreira Sandia National Laboratories, USA
Ann Gentile Sandia National Laboratories, USA
Cecile Germain University Paris-Sud, France
Rinku Gupta Argonne National Laboratory, USA
Paul Hargrove Lawrence Berkeley National Laboratory, USA
Xubin He Virginia Commonwealth University, USA
Larry Kaplan Cray, USA
Daniel S. Katz University of Chicago, USA
Thilo Kielmann Vrije Universiteit Amsterdam, The Netherlands
Dieter Kranzlmueller LMU/LRZ Munich, Germany
Zhiling Lan Illinois Institute of Technology, USA
Chokchai (Box) Leangsuksun Louisiana Tech University, USA
Xiaosong Ma North Carolina State University, USA
Celso Mendes University of Illinois at Urbana Champaign,
USA

Christine Morin INRIA Rennes, France


Thomas Naughton Oak Ridge National Laboratory, USA
George Ostrouchov Oak Ridge National Laboratory, USA
DK Panda The Ohio State University, USA
Mihaela Paun Louisiana Tech University, USA
Alexander Reinefeld Zuse Institute Berlin, Germany
Rolf Riesen IBM Research, Ireland
Eric Roman Lawrence Berkeley National Laboratory, USA
Stephen L. Scott Oak Ridge National Laboratory, USA
Jon Stearley Sandia National Laboratories, USA
Gregory M. Thorson SGI, USA
Geoffroy Vallee Oak Ridge National Laboratory, USA
Sudharshan Vazhkudai Oak Ridge National Laboratory, USA

UCHPC 2011: Fourth Workshop on UnConventional High-Performance Computing
Program Chairs
Anders Hast University of Gävle, Sweden
Josef Weidendorfer Technische Universität München, Germany
Jan-Philipp Weiss Karlsruhe Institute of Technology, Germany

Steering Committee
Lars Bengtsson Chalmers University, Sweden
Ren Wu HP Labs, Palo Alto, USA

Program Committee
David A. Bader Georgia Tech, USA
Michael Bader Universität Stuttgart, Germany
Denis Barthou Université de Bordeaux, France
Lars Bengtsson Chalmers, Sweden
Karl Fürlinger LMU, Munich, Germany
Dominik Göddeke TU Dortmund, Germany
Georg Hager University of Erlangen-Nuremberg, Germany
Anders Hast University of Gävle, Sweden
Ben Juurlink TU Berlin, Germany
Rainer Keller HLRS Stuttgart, Germany
Gaurav Khanna University of Massachusetts Dartmouth, USA
Harald Köstler University of Erlangen-Nuremberg, Germany
Dominique Lavenier INRIA, France
Manfred Mücke University of Vienna, Austria
Andy Nisbet Manchester Metropolitan University, UK
Ioannis Papaefstathiou Technical University of Crete, Greece
Franz-Josef Pfreundt Fraunhofer ITWM, Germany

Bertil Schmidt Johannes Gutenberg University Mainz,


Germany
Thomas Steinke Zuse Institute, Berlin, Germany
Robert Strzodka Max Planck Center for Computer Science,
Germany
Carsten Trinitis Technische Universität München, Germany
Josef Weidendorfer Technische Universität München, Germany
Jan-Philipp Weiss KIT, Germany
Gerhard Wellein University of Erlangen-Nuremberg, Germany
Stephan Wong Delft University of Technology,
The Netherlands
Ren Wu HP Labs, Palo Alto, USA
Peter Zinterhof Jr. University of Salzburg, Austria
Yunquan Zhang Chinese Academy of Sciences, Beijing, China

Additional Reviewers
Antony Brandon Delft University of Technology,
The Netherlands
Roel Seedorf Delft University of Technology,
The Netherlands

VHPC 2011: Sixth Workshop on Virtualization in High-Performance Cloud Computing
Program Chairs
Michael Alexander scaledinfra technologies GmbH, Vienna,
Austria
Gianluigi Zanetti CRS4, Italy

Program Committee
Padmashree Apparao Intel Corp., USA
Hassan Barada Khalifa University, UAE
Volker Buege University of Karlsruhe, Germany
Isabel Campos IFCA, Spain
Stephen Childs Trinity College Dublin, Ireland
William Gardner University of Guelph, Canada
Derek Groen UVA, The Netherlands
Ahmad Hammad FZK, Germany
Sverre Jarp CERN, Switzerland
Xuxian Jiang NC State, USA
Kenji Kaneda Google, Japan
Krishna Kant Intel, USA
Yves Kemp DESY Hamburg, Germany
Marcel Kunze Karlsruhe Institute of Technology, Germany

Naoya Maruyama Tokyo Institute of Technology, Japan


Jean-Marc Menaud Ecole des Mines de Nantes, France
Oliver Oberst Karlsruhe Institute of Technology, Germany
Jose Renato Santos HP Labs, USA
Deepak Singh Amazon Webservices, USA
Yoshio Turner HP Labs, USA
Andreas Unterkircher CERN, Switzerland
Lizhe Wang Rochester Institute of Technology, USA
Table of Contents – Part II

HiBB 2011: 2nd Workshop on High-Performance Bioinformatics and Biomedicine
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Mario Cannataro
On Parallelizing On-Line Statistics for Stochastic Biological
Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Marco Aldinucci, Mario Coppo, Ferruccio Damiani,
Maurizio Drocco, Eva Sciacca, Salvatore Spinella,
Massimo Torquati, and Angelo Troina
Scalable Sequence Similarity Search and Join in Main Memory on
Multi-cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Astrid Rheinländer and Ulf Leser
Enabling Data and Compute Intensive Workflows in Bioinformatics . . . . 23
Gaurang Mehta, Ewa Deelman, James A. Knowles, Ting Chen,
Ying Wang, Jens Vöckler, Steven Buyske, and Tara Matise
Homogenizing Access to Highly Time-Consuming Biomedical
Applications through a Web-Based Interface . . . . . . . . . . . . . . . . . . . . . . . . . 33
Luigi Grasso, Nuria Medina-Medina, Rosana Montes-Soldado, and
Marı́a M. Abad-Grau
Distributed Management and Analysis of Omics Data . . . . . . . . . . . . . . . . 43
Mario Cannataro and Pietro Hiram Guzzi

Managing and Delivering Grid Services (MDGS)


Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Thomas Schaaf, Adam S.Z. Belloum, Owen Appleton,
Joan Serrat-Fernández, and Tomasz Szepieniec
Resource Allocation for the French National Grid Initiative . . . . . . . . . . . . 55
Gilles Mathieu and Hélène Cordier
On Importance of Service Level Management in Grids . . . . . . . . . . . . . . . . 64
Tomasz Szepieniec, Joanna Kocot, Thomas Schaaf, Owen Appleton,
Matti Heikkurinen, Adam S.Z. Belloum, Joan Serrat-Fernández, and
Martin Metzker
On-Line Monitoring of Service-Level Agreements in the Grid . . . . . . . . . . 76
Bartosz Balis, Renata Slota, Jacek Kitowski, and Marian Bubak

Challenges of Future e-Infrastructure Governance . . . . . . . . . . . . . . . . . . . . 86


Dana Petcu

Influences between Performance Based Scheduling and Service Level


Agreements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Antonella Galizia, Alfonso Quarati, Michael Schiffers, and
Mark Yampolskiy

User Centric Service Level Management in mOSAIC Applications . . . . . . 106


Massimiliano Rak, Rocco Aversa, Salvatore Venticinque, and
Beniamino Di Martino

Service Level Management for Executable Papers . . . . . . . . . . . . . . . . . . . . 116


Reginald Cushing, Spiros Koulouzis, Rudolf Strijkers,
Adam S.Z. Belloum, and Marian Bubak

Change Management in e-Infrastructures to Support Service Level


Agreements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Silvia Knittl, Thomas Schaaf, and Ilya Saverchenko

PROPER 2011: Fourth Workshop on Productivity and Performance: Tools for HPC Application Development
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Michael Gerndt

Scout: A Source-to-Source Transformator for SIMD-Optimizations . . . . . 137


Olaf Krzikalla, Kim Feldhoff, Ralph Müller-Pfefferkorn, and
Wolfgang E. Nagel

Scalable Automatic Performance Analysis on IBM BlueGene/P


Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Yury Oleynik and Michael Gerndt

An Approach to Creating Performance Visualizations in a Parallel


Profile Analysis Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Wyatt Spear, Allen D. Malony, Chee Wai Lee, Scott Biersdorff, and
Sameer Shende

INAM - A Scalable InfiniBand Network Analysis and Monitoring


Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
N. Dandapanthula, H. Subramoni, J. Vienne, K. Kandalla, S. Sur,
Dhabaleswar K. Panda, and Ron Brightwell

Auto-tuning for Energy Usage in Scientific Applications . . . . . . . . . . . . . . . 178


Ananta Tiwari, Michael A. Laurenzano, Laura Carrington, and
Allan Snavely

Automatic Source Code Transformation for GPUs Based on Program


Comprehension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Pasquale Cantiello and Beniamino Di Martino
Enhancing Brainware Productivity through a Performance Tuning
Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Christian Iwainsky, Ralph Altenfeld, Dieter an Mey, and
Christian Bischof

Workshop on Resiliency in High-Performance Computing (Resilience) in Clusters, Clouds, and Grids
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Stephen L. Scott and Chokchai (Box) Leangsuksun
The Malthusian Catastrophe Is Upon Us! Are the Largest HPC
Machines Ever Up? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Patricia Kovatch, Matthew Ezell, and Ryan Braby
Simulating Application Resilience at Exascale . . . . . . . . . . . . . . . . . . . . . . . 221
Rolf Riesen, Kurt B. Ferreira, Maria Ruiz Varela,
Michela Taufer, and Arun Rodrigues
Framework for Enabling System Understanding . . . . . . . . . . . . . . . . . . . . . . 231
J. Brandt, F. Chen, A. Gentile, Chokchai (Box) Leangsuksun,
J. Mayo, P. Pebay, D. Roe, N. Taerat, D. Thompson, and M. Wong
Cooperative Application/OS DRAM Fault Recovery . . . . . . . . . . . . . . . . . . 241
Patrick G. Bridges, Mark Hoemmen, Kurt B. Ferreira,
Michael A. Heroux, Philip Soltero, and Ron Brightwell
A Tunable, Software-Based DRAM Error Detection and Correction
Library for HPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
David Fiala, Kurt B. Ferreira, Frank Mueller, and
Christian Engelmann
Reducing the Impact of Soft Errors on Fabric-Based Collective
Communications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
José Carlos Sancho, Ana Jokanovic, and Jesus Labarta
Evaluating Application Vulnerability to Soft Errors in Multi-level
Cache Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
Zhe Ma, Trevor Carlson, Wim Heirman, and Lieven Eeckhout
Experimental Framework for Injecting Logic Errors in a Virtual
Machine to Profile Applications for Soft Error Resilience . . . . . . . . . . . . . . 282
Nathan DeBardeleben, Sean Blanchard, Qiang Guan,
Ziming Zhang, and Song Fu

High Availability on Cloud with HA-OSCAR . . . . . . . . . . . . . . . . . . . . . . . . 292


Thanadech Thanakornworakij, Rajan Sharma, Blaine Scroggs,
Chokchai (Box) Leangsuksun, Zeno Dixon Greenwood,
Pierre Riteau, and Christine Morin

On the Viability of Checkpoint Compression for Extreme Scale Fault


Tolerance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
Dewan Ibtesham, Dorian Arnold, Kurt B. Ferreira, and
Patrick G. Bridges

Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data


Staging? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Raghunath Rajachandrasekar, Xiangyong Ouyang, Xavier Besseron,
Vilobh Meshram, and Dhabaleswar K. Panda

Impact of Over-Decomposition on Coordinated Checkpoint/Rollback


Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Xavier Besseron and Thierry Gautier

UCHPC 2011: Fourth Workshop on UnConventional High-Performance Computing
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
Anders Hast, Josef Weidendorfer, and Jan-Philipp Weiss

PACUE: Processor Allocator Considering User Experience . . . . . . . . . . . . 335


Tetsuro Horikawa, Michio Honda, Jin Nakazawa,
Kazunori Takashio, and Hideyuki Tokuda

Workload Balancing on Heterogeneous Systems: A Case Study of


Sparse Grid Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Alin Muraraşu, Josef Weidendorfer, and Arndt Bode

Performance Evaluation of a Multi-GPU Enabled Finite Element


Method for Computational Electromagnetics . . . . . . . . . . . . . . . . . . . . . . . . 355
Tristan Cabel, Joseph Charles, and Stéphane Lanteri

Study of Hierarchical N-Body Methods for Network-on-Chip


Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Thomas Canhao Xu, Pasi Liljeberg, and Hannu Tenhunen

Extending a Highly Parallel Data Mining Algorithm to the Intel®
Many Integrated Core Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375


Alexander Heinecke, Michael Klemm, Dirk Pflüger,
Arndt Bode, and Hans-Joachim Bungartz

VHPC 2011: 6th Workshop on Virtualization in High-Performance Cloud Computing
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
Michael Alexander and Gianluigi Zanetti

Group-Based Memory Deduplication for Virtualized Clouds . . . . . . . . . . . 387


Sangwook Kim, Hwanju Kim, and Joonwon Lee

A Smart HPC Interconnect for Clusters of Virtual Machines . . . . . . . . . . . 398


Anastassios Nanos, Nikos Nikoleris, Stratos Psomadakis,
Elisavet Kozyri, and Nectarios Koziris

Coexisting Scheduling Policies Boosting I/O Virtual Machines . . . . . . . . . 407


Dimitris Aragiorgis, Anastassios Nanos, and Nectarios Koziris

PIGA-Virt: An Advanced Distributed MAC Protection of Virtual


Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
J. Briffaut, E. Lefebvre, J. Rouzaud-Cornabas, and C. Toinard

An Economic Approach for Application QoS Management in Clouds . . . . 426


Stefania Costache, Nikos Parlavantzas, Christine Morin, and
Samuel Kortas

Evaluation of the HPC Challenge Benchmarks in Virtualized


Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
Piotr Luszczek, Eric Meek, Shirley Moore, Dan Terpstra,
Vincent M. Weaver, and Jack Dongarra

DISCOVERY, Beyond the Clouds: DIStributed and COoperative


Framework to Manage Virtual EnviRonments autonomicallY:
A Prospective Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
Adrien Lèbre, Paolo Anedda, Massimo Gaggero, and Flavien Quesnel

Cooperative Dynamic Scheduling of Virtual Machines in Distributed


Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
Flavien Quesnel and Adrien Lèbre

Large-Scale DNA Sequence Analysis in the Cloud: A Stream-Based


Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
Romeo Kienzler, Rémy Bruggmann, Anand Ranganathan, and
Nesime Tatbul

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477


Table of Contents – Part I

CCPI 2011: Workshop on Cloud Computing Projects and Initiatives
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Beniamino Di Martino and Dana Petcu

Towards Cross-Platform Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . . 5


Magdalena Slawinska, Jaroslaw Slawinski, and Vaidy Sunderam

QoS Monitoring in a Cloud Services Environment: The SRT-15


Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Giuseppe Cicotti, Luigi Coppolino, Rosario Cristaldi,
Salvatore D’Antonio, and Luigi Romano

Enabling e-Science Applications on the Cloud with COMPSs . . . . . . . . . . 25


Daniele Lezzi, Roger Rafanell, Abel Carrión,
Ignacio Blanquer Espert, Vicente Hernández, and
Rosa M. Badia

OPTIMIS and VISION Cloud: How to Manage Data in Clouds . . . . . . . . 35


Spyridon V. Gogouvitis, George Kousiouris, George Vafiadis,
Elliot K. Kolodner, and Dimosthenis Kyriazis

Integrated Monitoring of Infrastructures and Applications in Cloud


Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Roberto Palmieri, Pierangelo di Sanzo, Francesco Quaglia,
Paolo Romano, Sebastiano Peluso, and Diego Didona

Towards Collaborative Data Management in the VPH-Share Project . . . . 54


Siegfried Benkner, Jesus Bisbal, Gerhard Engelbrecht, Rod D. Hose,
Yuriy Kaniovskyi, Martin Koehler, Carlos Pedrinaci, and
Steven Wood

SLM and SDM Challenges in Federated Infrastructures . . . . . . . . . . . . . . . 64


Matti Heikkurinen and Owen Appleton

Rapid Prototyping of Architectures on the Cloud Using Semantic


Resource Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Houssam Haitof

Cloud Patterns for mOSAIC-Enabled Scientific Applications . . . . . . . . . . . 83


Teodor-Florin Fortiş, Gorka Esnal Lopez, Imanol Padillo Cruz,
Gábor Ferschl, and Tamás Máhr

Enhancing an Autonomic Cloud Architecture with Mobile Agents . . . . . . 94


A. Cuomo, M. Rak, S. Venticinque, and U. Villano

Mapping Application Requirements to Cloud Resources . . . . . . . . . . . . . . . 104


Yih Leong Sun, Terence Harmer, Alan Stewart, and Peter Wright

CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing – CGWS2011
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Marco Danelutto, Frédéric Desprez, Vladimir Getov, and
Wolfgang Ziegler

A Perspective on the CoreGRID Grid Component Model . . . . . . . . . . . . . 115


Françoise Baude

Towards Scheduling Evolving Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 117


Cristian Klein and Christian Pérez

Model Checking Support for Conflict Resolution in Multiple


Non-functional Concern Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Marco Danelutto, P. Kilpatrick, C. Montangero, and L. Semini

Consistent Rollback Protocols for Autonomic ASSISTANT


Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Carlo Bertolli, Gabriele Mencagli, and Marco Vanneschi

A Dynamic Resource Management System for Real-Time Online


Applications on Clouds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Dominik Meiländer, Alexander Ploss, Frank Glinka, and
Sergei Gorlatch

Cloud Federations in Contrail . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159


Emanuele Carlini, Massimo Coppola, Patrizio Dazzi,
Laura Ricci, and Giacomo Righetti

Semi-automatic Composition of Ontologies for ASKALON Grid


Workflows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
Muhammad Junaid Malik, Thomas Fahringer, and Radu Prodan

The Chemical Machine: An Interpreter for the Higher Order Chemical


Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Vilmos Rajcsányi and Zsolt Németh

Design and Performance of the OP2 Library for Unstructured Mesh


Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Carlo Bertolli, Adam Betts, Gihan Mudalige, Mike Giles, and
Paul Kelly

Mining Association Rules on Grid Platforms . . . . . . . . . . . . . . . . . . . . . . . . . 201


Raja Tlili and Yahya Slimani

5th Workshop on System-Level Virtualization for High-Performance Computing (HPCVirt 2011)
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Stephen L. Scott, Geoffroy Vallée, and Thomas Naughton

Performance Evaluation of HPC Benchmarks on VMware’s ESXi


Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
Qasim Ali, Vladimir Kiriansky, Josh Simons, and Puneet Zaroo

Virtualizing Performance Counters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223


Benjamin Serebrin and Daniel Hecht

A Case for Virtual Machine Based Fault Injection in a High-Performance


Computing Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
Thomas Naughton, Geoffroy Vallée, Christian Engelmann, and
Stephen L. Scott

HPPC 2011: 5th Workshop on Highly Parallel Processing on a Chip
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Martti Forsell and Jesper Larsson Träff

Thermal Management of a Many-Core Processor under Fine-Grained


Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Fuat Keceli, Tali Moreshet, and Uzi Vishkin

Mainstream Parallel Array Programming on Cell . . . . . . . . . . . . . . . . . . . . . 260


Paul Keir, Paul W. Cockshott, and Andrew Richards

Generating GPU Code from a High-Level Representation for Image


Processing Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Richard Membarth, Anton Lokhmotov, and Jürgen Teich

A Greedy Heuristic Approximation Scheduling Algorithm for 3D


Multicore Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
Thomas Canhao Xu, Pasi Liljeberg, and Hannu Tenhunen

Algorithms and Programming Tools for Next-Generation High-Performance Scientific Software (HPSS 2011)
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
Stefania Corsaro, Pasqua D’Ambra, and Francesca Perla

European Exascale Software Initiative: Numerical Libraries, Solvers


and Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Iain S. Duff

On Reducing I/O Overheads in Large-Scale Invariant Subspace


Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Hasan Metin Aktulga, Chao Yang, Ümit V. Çatalyürek,
Pieter Maris, James P. Vary, and Esmond G. Ng

Enabling Next-Generation Parallel Circuit Simulation with Trilinos . . . . . 315


Chris Baker, Erik Boman, Mike Heroux, Eric Keiter,
Siva Rajamanickam, Rich Schiek, and Heidi Thornquist

DAG-Based Software Frameworks for PDEs . . . . . . . . . . . . . . . . . . . . . . . . . 324


Martin Berzins, Qingyu Meng, John Schmidt, and
James C. Sutherland

On Partitioning Problems with Complex Objectives . . . . . . . . . . . . . . . . . . 334


Kamer Kaya, François-Henry Rouet, and Bora Uçar

A Communication-Avoiding Thick-Restart Lanczos Method on a


Distributed-Memory System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Ichitaro Yamazaki and Kesheng Wu

Spherical Harmonic Transform with GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . 355


Ioan Ovidiu Hupca, Joel Falcou, Laura Grigori, and Radek Stompor

Design Patterns for Scientific Computations on Sparse Matrices . . . . . . . . 367


Davide Barbieri, Valeria Cardellini, Salvatore Filippone, and
Damian Rouson

High-Performance Matrix-Vector Multiplication on the GPU . . . . . . . . . . 377


Hans Henrik Brandenborg Sørensen
Relaxed Synchronization with Ordered Read-Write Locks . . . . . . . . . . . . . 387
Jens Gustedt and Emmanuel Jeanvoine
The Parallel C++ Statistical Library ‘QUESO’: Quantification of
Uncertainty for Estimation, Simulation and Optimization . . . . . . . . . . . . . 398
Ernesto E. Prudencio and Karl W. Schulz
Use of HPC-Techniques for Large-Scale Data Migration . . . . . . . . . . . . . . . 408
Jan Dünnweber, Valentin Mihaylov, René Glettler,
Volker Maiborn, and Holger Wolff

Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar 2011)
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
George Bosilca
A Genetic Algorithm with Communication Costs to Schedule Workflows
on a SOA-Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
Jean-Marc Nicod, Laurent Philippe, and Lamiel Toch
An Extension of XcalableMP PGAS Language for Multi-node GPU
Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku, and
Mitsuhisa Sato
Performance Evaluation of List Based Scheduling on Heterogeneous
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
Hamid Arabnejad and Jorge G. Barbosa
Column-Based Matrix Partitioning for Parallel Matrix Multiplication
on Heterogeneous Processors Based on Functional Performance
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
David Clarke, Alexey Lastovetsky, and Vladimir Rychkov
A Framework for Distributing Agent-Based Simulations . . . . . . . . . . . . . . . 460
Gennaro Cordasco, Rosario De Chiara, Ada Mancuso,
Dario Mazzeo, Vittorio Scarano, and Carmine Spagnuolo
Parallel Sparse Linear Solver GMRES for GPU Clusters with
Compression of Exchanged Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
Jacques M. Bahi, Raphaël Couturier, and Lilia Ziane Khodja
Two-Dimensional Discrete Wavelet Transform on Large Images for
Hybrid Computing Architectures: GPU and CELL . . . . . . . . . . . . . . . . . . . 481
Marek Blażewicz, Milosz Ciżnicki, Piotr Kopta,
Krzysztof Kurowski, and Pawel Lichocki

Scheduling Divisible Loads on Heterogeneous Desktop Systems with


Limited Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
Aleksandar Ilic and Leonel Sousa

Peer Group and Fuzzy Metric to Remove Noise in Images Using


Heterogeneous Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
Ma. Guadalupe Sánchez, Vicente Vidal, and Jordi Bataller

Estimation of MPI Application Performance on Volunteer


Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
Girish Nandagudi, Jaspal Subhlok, Edgar Gabriel, and Judit Gimenez

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521


HiBB 2011: 2nd Workshop on High
Performance Bioinformatics and Biomedicine

Mario Cannataro

Bioinformatics Laboratory,
Department of Medical and Surgical Sciences,
University Magna Græcia of Catanzaro,
88100 Catanzaro, Italy
cannataro@unicz.it

Foreword
The availability of high-throughput technologies, such as microarray and mass
spectrometry, and the diffusion of genomics and proteomics studies to large
populations, are producing an increasing amount of experimental and clinical
data. Biological databases and bioinformatics tools are key instruments for organizing and exploring such biological and biomedical data, with the aim of discovering new knowledge in biology and medicine. However, the storage, preprocessing and analysis of experimental data are becoming the main bottleneck of the analysis pipeline.
High-performance computing may play an important role in many phases of life sciences research, from raw data management and processing, to data integration and analysis, to data exploration and visualization. Hence, well-known high-performance computing techniques such as Parallel and Grid Computing, as well as emerging computational models such as Graphics Processing and Cloud Computing, are increasingly used in bioinformatics.
The sheer size of experimental data is the first reason to implement large distributed data repositories, while high-performance computing is necessary both to face the complexity of bioinformatics algorithms and to allow the efficient analysis of huge datasets. In such a scenario, novel parallel architectures (e.g. CELL processors, GPUs, FPGAs, hybrid CPU/FPGA) coupled with emerging programming models may overcome the limits posed by conventional computers to the mining and exploration of large amounts of data.
The second edition of the Workshop on High Performance Bioinformatics
and Biomedicine (HiBB) aimed to bring together scientists in the fields of high
performance computing, computational biology and medicine to discuss the par-
allel implementation of bioinformatics algorithms, the application of high per-
formance computing in biomedical applications, and the organization of large
scale databases in biology and medicine. As in the past, this year the workshop was organized in conjunction with Euro-Par, the main European (but international in scope) conference on all aspects of parallel processing.
Presentations were organized in three sessions. The first session (Bioinformatics and Systems Biology) comprised two papers discussing the parallel implementation of bioinformatics and systems biology algorithms on multicore architectures:
– On Parallelizing On-Line Statistics for Stochastic Biological Simulations
– Scalable Sequence Similarity Search and Join in Main Memory on Multi-
Cores
The second session (Software Platforms for High Performance Bioinformatics)
comprised two papers describing software environments for the development of
bioinformatics workflows:
– Enabling Data and Compute Intensive Workflows in Bioinformatics
– Homogenizing Access to Highly Time-Consuming Biomedical Applications
through a Web-Based Interface
Finally, the third session included a tutorial on:
– Distributed Management and Analysis of Omics Data.
This post-workshop proceedings volume includes the final revised versions of the HiBB papers and tutorial, taking the feedback from reviewers and the workshop audience into account.
The program chair sincerely thanks the Euro-Par organization for providing the opportunity to arrange the HiBB workshop in conjunction with the Euro-Par 2011 conference, the program committee and the additional reviewers for the time and expertise they put into the reviewing work, and all the workshop attendees who contributed to a lively day.

October 2011
Mario Cannataro
On Parallelizing On-Line Statistics
for Stochastic Biological Simulations

Marco Aldinucci¹, Mario Coppo¹, Ferruccio Damiani¹, Maurizio Drocco¹,
Eva Sciacca¹, Salvatore Spinella¹, Massimo Torquati², and Angelo Troina¹
¹ Department of Computer Science, University of Torino, Italy
{aldinucci,coppo,damiani,drocco,sciacca,spinella,troina}@di.unito.it
² Department of Computer Science, University of Pisa, Italy
torquati@di.unipi.it

Abstract. This work concerns a general technique to enrich parallel versions of stochastic simulators for biological systems with tools for on-line statistical analysis of the results. In particular, within the FastFlow parallel programming framework, we describe the methodology and the implementation of a parallel Monte Carlo simulation infrastructure extended with user-defined on-line data filtering and mining functions. The simulator and the on-line analysis were validated on large multi-core platforms and representative proof-of-concept biological systems.

Keywords: multi-core, parallel simulation, stochastic simulation, on-line clustering.

1 Introduction

The traditional approach to describing biological systems relies on deterministic mathematical tools like, e.g., Ordinary Differential Equations (ODEs). This kind of modelling becomes more and more difficult as the complexity of the biological systems increases. To address these issues, in the last decade, formalisms developed in Computer Science for the description of stochastically behaving computational entities have been exploited for the modelling of biological systems [15].
Biochemical processes, such as gene transcription, regulation and signalling,
often take place in environments containing a (relatively) limited number of
some reactants, or involve very slow reactions, and thus result in high random
fluctuations, determining phenomena like transients or multi-stable behaviour.
Stochastic methods can give an exact account of the system evolution in all situ-
ations and are playing a growing role in modelling biological systems. Stochastic
modeling keeps track of the exact number of species present in a system and all
reactions are simulated individually. These methods can be highly demanding in
terms of computational power (e.g., when a large number of molecules or species is involved) and data storage (e.g., when the amounts of each species for each time sample of a simulation have to be tracked).

* This research has been funded by the BioBITs Project (Converging Technologies 2007, Biotechnology-ICT, Regione Piemonte). The authors acknowledge the HPC Advisory Council (www.hpcadvisorycouncil.com) University Award spring 2011.
A single stochastic simulation represents just one possible way in which the
system might react over the entire simulation time-span. Many simulations are
usually needed to get a representative picture of how the system behaves on the
whole. Multiple simulations exhibit a natural independence that would allow
them to be treated in a rather straightforward parallel way. On a multicore platform, however, they might exhibit serious performance degradation due to concurrent usage of the underlying memory and I/O resources.
In [2] we presented a highly parallelized simulator for the Calculus of Wrapped
Compartments (CWC) [5] which exploits, in an efficient way, the multi-core
architecture using the FastFlow programming framework [8]. The framework
relies on selective memory [1], i.e. a data structure designed to perform the on-line alignment and reduction of multiple computations. A stack of layers progressively abstracts the shared-memory parallelism at the level of cores up to the definition of useful programming constructs supporting structured parallel programming on cache-coherent shared-memory multi- and many-core architectures.
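A minimal sketch of this farm-of-simulations pattern is shown below; it deliberately uses plain C++11 threads rather than FastFlow's API, and run_replica and OnlineStats are illustrative placeholders for the simulator kernel and the shared reduction stage, not names from the actual simulator.

// Minimal C++11 sketch of a farm of independent simulation instances.
// NOT FastFlow code: run_replica() and OnlineStats are hypothetical stand-ins.
#include <future>
#include <iostream>
#include <mutex>
#include <random>
#include <vector>

struct OnlineStats {                         // shared, thread-safe reduction stage
    std::mutex m;
    double sum = 0.0;
    long   n   = 0;
    void push(double x) { std::lock_guard<std::mutex> g(m); sum += x; ++n; }
    double mean() const { return n ? sum / n : 0.0; }
};

// One stochastic replica: a dummy random walk standing in for a full CWC run;
// it streams its final value to the reduction stage instead of storing a trajectory.
double run_replica(unsigned seed, OnlineStats& stats) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> step(0.0, 1.0);
    double x = 0.0;
    for (int i = 0; i < 10000; ++i) x += step(gen);
    stats.push(x);
    return x;
}

int main() {
    const int replicas = 64;
    OnlineStats stats;
    std::vector<std::future<double>> jobs;
    for (int r = 0; r < replicas; ++r)        // "farm": one task per replica
        jobs.emplace_back(std::async(std::launch::async, run_replica, r, std::ref(stats)));
    for (auto& j : jobs) j.get();             // wait for all replicas
    std::cout << "mean endpoint over replicas: " << stats.mean() << "\n";
}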
Even in distributed computing, the processing of data from hundreds (or even thousands) of simulations is often demoted to a secondary aspect of the computation and treated as an off-line post-processing task. The storage and processing of simulation data, however, may require a huge amount of storage space (linear in the number of simulations and in the observation size of the time courses) and an expensive post-processing phase, since data must be retrieved from permanent storage and processed.
In this paper, we adapt the approach presented in [2] to support concurrent real-time data analysis and mining. Namely, we enrich the parallel version of the CWC simulator with on-line (parallel) statistics tools for the analysis of results on cache-coherent, shared-memory multicores. To this aim, we exploit the FastFlow framework, which makes it possible not only to run multiple parallel stochastic simulations but also to combine their results on the fly according to user-defined analysis functions, e.g. statistical filtering or clustering. In this respect, it is worth noticing that while running independent simulations is an embarrassingly parallel problem, running them aligned at the simulation time and combining their trajectories with on-line procedures definitely is not, as it amounts to merging high-frequency data streams. This, in turn, requires enforcing that simulations proceed aligned according to the simulation time, in order to avoid the explosion of the working set of the statistical and mining reduction functions.
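To make the "combine trajectories on the fly" idea concrete, the following sketch maintains running mean and variance per aligned time point using Welford's on-line update, so that no trajectory ever needs to be stored in full. It is a generic illustration of the idea, not the simulator's selective-memory implementation or its user-defined statistics interface.

// Per-time-point running statistics (Welford's algorithm): each simulation,
// once aligned at time index t, feeds its observed value and is then discarded.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

struct RunningStat {
    std::size_t n = 0;
    double mean = 0.0, m2 = 0.0;
    void push(double x) {                     // Welford's on-line update
        ++n;
        double delta = x - mean;
        mean += delta / n;
        m2   += delta * (x - mean);
    }
    double variance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
};

int main() {
    const std::size_t time_points = 5, simulations = 1000;
    std::vector<RunningStat> stats(time_points);   // one accumulator per aligned time point
    for (std::size_t s = 0; s < simulations; ++s)
        for (std::size_t t = 0; t < time_points; ++t)
            stats[t].push(double(s % 7) + 0.1 * t); // stand-in for a sampled trajectory value
    for (std::size_t t = 0; t < time_points; ++t)
        std::cout << "t=" << t << "  mean=" << stats[t].mean
                  << "  sd=" << std::sqrt(stats[t].variance()) << "\n";
}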

2 The CWC Formalism and Its Parallel Simulator

The Calculus of labelled Wrapped Compartments (CWC) [5,2] has been designed
to describe biological entities (like cells and bacteria) by means of a nested
structure of ambients delimited by membranes.
The terms of the calculus are built on a set of atoms (representing species, e.g. molecules, proteins or DNA strands), ranged over by a, b, . . ., and on a set of labels (representing compartment types, e.g. cells or tissues), ranged over by ℓ, ℓ′, . . .. A term is a multiset t of simple terms, where a simple term is either an atom a or a compartment (a ⌋ t)ℓ consisting of a wrap (a multiset of atoms a), a content (a term t) and a type (a label ℓ).
Multisets are denoted by listing the elements separated by a space. As usual, the notation n ∗ t denotes n occurrences of the simple term t. For instance, the term 2 ∗ a (b c ⌋ d e)ℓ represents a multiset containing two occurrences of the atom a and an ℓ-type compartment (b c ⌋ d e)ℓ which, in turn, consists of a wrap with two atoms b and c on its surface, and containing the atoms d and e.¹
Interaction between biological entities are described by rewriting rules written
as  : P → O where P and O are terms built on an extended set of atomic
elements which includes variables (ranged over by X, Y ,...) and  represents the
compartment type to which the rule can be applied. An example of rewrite rule
is  : a b X → c X that is often written as  : a b → c giving X for understood
to simplify notations.2 The application of a rule  : P → O to a term t consists in
finding (if it exists) a subterm u in a compartment of type  such that u = σ(P )
for a ground substitution σ and replacing it with σ(O) in t. We write t → t to
mean that t cam be obtained from t by applying a rewrite rule.
The standard way to model the time evolution of biological systems is that
presented by Gillespie [9]. In Gillespie’s algorithm a rate function is associated
with each considered chemical reaction which is used as the parameter of an
exponential distribution modelling the probability that the reaction takes place.
In the standard approach this reaction rate is obtained by multiplying the kinetic
constant of the reaction by the number of possible combinations of reactants that
may occur in the region in which the reaction takes place, thus modelling the
law of mass action. In this case a stochastic rule is written as ℓ : P −k→ O, where
k represents the kinetic constant of the corresponding reaction.
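The core loop of Gillespie's direct method is compact. The following sketch is a
generic illustration only (plain C++, operating directly on species counts rather
than on CWC terms; for concreteness it borrows the two crystallization rules and
kinetic constants used later in Sect. 4): mass-action propensities determine both
the exponentially distributed time advance and the choice of the next reaction.

#include <array>
#include <cmath>
#include <random>

// Minimal Gillespie "direct method" sketch for the two reactions
//   2a -> b   and   a + c -> d   (kinetic constants k1 = k2 = 1e-7, cf. Sect. 4).
// Propensity = kinetic constant * number of distinct reactant combinations.
int main() {
    const double k1 = 1e-7, k2 = 1e-7;
    std::array<double, 4> x = {1e6, 0, 10, 0};      // counts of a, b, c, d
    double t = 0.0, t_end = 100.0;

    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> U(0.0, 1.0);

    while (t < t_end) {
        double a1 = k1 * x[0] * (x[0] - 1) / 2.0;   // 2a -> b
        double a2 = k2 * x[0] * x[2];               // a + c -> d
        double a0 = a1 + a2;
        if (a0 <= 0.0) break;                       // no reaction can fire any more

        t += -std::log(U(rng)) / a0;                // exponential waiting time
        if (U(rng) * a0 < a1) { x[0] -= 2; x[1] += 1; }            // fire 2a -> b
        else                  { x[0] -= 1; x[2] -= 1; x[3] += 1; } // fire a + c -> d
    }
    return 0;
}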
The CWC simulator [6], an open source tool under development at the Computer
Science Department of the University of Turin, implements Gillespie's algorithm
on CWC terms. It handles CWC models with different rating semantics (law of
mass action, Michaelis-Menten kinetics, Hill equation) and it can run indepen-
dent stochastic simulations, featuring deep parallel optimizations for multi-core
platforms on top of FastFlow [8].

3 On-Line Statistical Tools


Most biological data from the dynamical kinetics of species might require further
processing with statistical or mining tools to be really useful to biologists.
In particular, the bulk of trajectories coming from Monte Carlo simulators can
exhibit a natural unevenness due to the stochastic nature of the tool and is typ-
ically represented by many and large data series. This unevenness, in the form
of deviant trajectories, high variance of results and multi-stable behaviours, often
represents the real nature of the phenomena, which is not captured by traditional
approaches such as ODEs.

Fig. 1. CWC simulator with on-line parallel filtering: architecture (left box: parallel
simulation, with Simulation Engines feeding the selective memory that aligns and
buffers simulation objects by simulation time; right box: parallel on-line filtering,
with a farm of Statistic Engines computing mean, variance and k-means over windows
of simulation-time-aligned data)
Several techniques for analysing such data, e.g. principal components analysis,
linear modelling, and canonical correlation analysis, have been proposed. We envision
next-generation software tools for the natural sciences as able to perform this kind of
processing in a pipeline with the running data source, as a partially or totally on-
line process, because: 1) it will be necessary to manage an ever-increasing amount
of experimental data, either coming from measurement or simulation, and 2) it
will substantially improve the overall experimental workflow by providing the
natural scientists with almost real-time feedback, enabling the early tuning
or sweeping of the experimental parameters.
On-line data processing requires data filtering and mining operators to work
on streamed data and, in general, the random access to data is guaranteed only
within a limited window of the whole dataset, while already accessed data can
be stored only in synthesized form. When data filtering techniques that require
access to the whole data set in random order cannot be used, on-line data filtering
and mining requires novel algorithms. The extensive study of these algorithms is
an emerging topic in the data discovery community and is beyond the scope of this
work, which focuses on the design of a parallel infrastructure with the following
general objectives: 1) efficient support for data streams and their parallel processing
on multi-core platforms, and 2) easy engineering of batteries of filters that can
be plugged into the tool without any concern for parallelism exploitation, data
hazards and synchronisation.
These issues will be demonstrated by extending the existing CWC parallel
simulator with a sample set of parallel on-line statistical measures, including
mean, variance, quantiles and the clustering of trajectories (according to
different methodologies such as K-means and Quality Threshold). The flexibility
given by the possibility of running many different filters is of particular interest
for the present work since, in many cases, the searched pattern in experimental
results is unknown and might require different kinds of analysis tools.
The CWC parallel simulator, which is extensively discussed in [2] and sketched
in Fig. 1 (left box), employs the selective memory concept, i.e. a data structure
supporting the on-line reduction of time-aligned trajectory data by way of one
or more user-defined associative functions (e.g. statistic and mining operators).
Selective memory differs from a standard parallel reduce operation because
it works on (possibly unbounded) streams, and aligns simulation points (i.e. stream
items) according to simulation time before reducing them: since each simulation
advances at its own speed while producing samples at a fixed simulation-time step,
simulation points coming from different simulations cannot simply be reduced as
soon as they are produced [1].
In this work, we further extend the selective memory concept by making it
parallel via a FastFlow accelerator [8], which makes it possible to offload selective
memory operators onto a parallel on-line statistical tool implementing the same
functions in a parallel fashion. The pipeline has two stages: 1) statistic buffering,
and 2) a farm of statistic engines. The first stage creates dataset windows (i.e. a
number of arrays of simulation-time-aligned trajectory data from different sim-
ulations). The second stage farms out the execution of one or more filtering or
mining functions, which are independently executed on different (possibly over-
lapping) dataset windows. Additional filtering functions can easily be plugged in
by simply extending the list of statistics with additional (reentrant) sequential
or parallel functions (i.e. adding a function pointer to that list). Overall, the
parallel simulation (Fig. 1, left box) and the parallel on-line filtering (Fig. 1, right
box) work, in turn, in a two-stage pipeline fashion.
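A minimal sketch of this two-stage scheme is given below, written with plain C++
threads rather than the actual FastFlow accelerator and with hypothetical types
and filters: a buffer of simulation-time-aligned windows is drained by a statistic
engine that applies every function of a user-extensible filter list, so that plugging
in a new statistic amounts to appending a function to that list.

#include <cmath>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One dataset window: the value of the observed species in every simulation
// run, aligned at the same simulation time.
struct Window { double time; std::vector<double> values; };

// User-pluggable filters: adding a statistic means appending a function here.
using Filter = std::function<void(const Window&)>;

static void mean_filter(const Window& w) {
    double m = 0.0;
    for (double v : w.values) m += v;
    std::printf("t=%g mean=%g\n", w.time, m / w.values.size());
}
static void variance_filter(const Window& w) {
    double m = 0.0, s = 0.0;
    for (double v : w.values) m += v;
    m /= w.values.size();
    for (double v : w.values) s += (v - m) * (v - m);
    std::printf("t=%g var=%g\n", w.time, s / w.values.size());
}

int main() {
    std::vector<Filter> filters = {mean_filter, variance_filter};

    std::queue<Window> buffer;                 // stage 1 output / stage 2 input
    std::mutex mx; std::condition_variable cv; bool done = false;

    // Stage 2: a statistic engine draining the buffer (the real tool uses a
    // farm of such engines, each handling different windows).
    std::thread stat_engine([&] {
        for (;;) {
            std::unique_lock<std::mutex> lk(mx);
            cv.wait(lk, [&] { return !buffer.empty() || done; });
            if (buffer.empty()) return;        // done and nothing left to process
            Window w = std::move(buffer.front()); buffer.pop(); lk.unlock();
            for (auto& f : filters) f(w);      // apply every plugged-in filter
        }
    });

    // Stage 1 (stand-in for simulation engines plus selective memory): emit
    // aligned windows of three fake trajectories at fixed sampling times.
    for (int i = 0; i < 5; ++i) {
        Window w{i * 1.0, {std::sin(i + 0.0), std::sin(i + 0.1), std::sin(i + 0.2)}};
        { std::lock_guard<std::mutex> lk(mx); buffer.push(std::move(w)); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(mx); done = true; }
    cv.notify_one();
    stat_engine.join();
    return 0;
}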

3.1 Typical Patterns for Biological Trajectories


Monostable Systems. Analytical mathematical methods for steady-state analysis
of deterministic models give insights on the dynamic equilibrium of a biological
system over time. In the case of stochastic models, statistics on the mean and
standard deviation of the system are usually computed, comparing the results
with the corresponding deterministic mathematical model. Another useful analysis
is the one based on the calculation of quantiles, which approximate the distribution
of the simulation trajectory data over time.

Multi-stable Systems. Multi-stable biological systems play a significant role in
some of the basic processes of life. The core behavior of these systems is based on
genetic switches. Stochastic effects in these systems can be substantial as noise
can influence the convergence to different equilibria.
Deterministic modeling of multi-stable systems is problematic. Bifurcation
analysis of ODE based models traces time-varying changes in the state of the
system in a multidimensional space where each dimension represents a particular
concentration of the biochemical factor involved.
The effect of molecular noise in stochastic simulations causes the switching
between the two stable equilibria if the noise amplitude is sufficient to drive the
trajectories occasionally out of the basin of attraction of one equilibrium to the
other. When stochastic simulations are performed, a useful mining tool to capture
these multi-stable behaviors is represented by curve clustering techniques. In
the presence of stochasticity in the data, direct clustering methods on aligned
simulation results are not reliable. In order to keep the structure of the molecular
evolution over time, we propose to apply the clustering procedure to data stream
portions, numerically filtering the noise of the stochastic simulation out of the data
and calculating the relative local trends.
In this work we employed two clustering techniques: K-means [10] and Quality
Threshold (QT) [11] clustering. The clustering procedure collects, for all simulation
trajectories, the filtered data contained in the constant sliding time window ΔW
centered at the current data point x_i ≡ f(t_i), where t_i ≡ t_0 + iΔS (with ΔS a
constant sampling time), together with the extrapolated forecast point x_i^E, which
refers to a future trend in time obtained using the information of the Savitzky-Golay
filter. The Savitzky-Golay filter f_SG replaces the data value x_i by a linear
combination of itself and some number of equally spaced nearby neighbors to the
left (n_L) and to the right (n_R) of the data point x_i:

    x_i^SG = f_SG(x_i) = Σ_{j=−n_L}^{n_R} c_j x_{i+j}.

The idea of the numerical filter is to find the coefficients c_j that approximate the
underlying function within the sliding time window by a polynomial of degree M. The
extrapolated forecast point x_i^E is calculated at a chosen time step ΔF, exploiting
the derivatives coming from the filter in a Taylor series truncated at the third term.
The couple (x_i^SG, x_i^E) represents the trend of the curve at time t_i. A weighted
metric distance, employed by the clustering procedures on these couples, expresses
the similarity of behaviour between curves at time t_i using the information of the
data stream portions contained in the sliding time window ΔW. This method
is comparable with other curve clustering techniques (traditionally performed
off-line) that partition the data while keeping their functional structure.
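For concreteness, the sketch below computes a smoothed value and a simple
extrapolated forecast on a symmetric five-point window with a quadratic fit; it is
an illustration of the idea under these assumptions rather than the simulator's own
filter (which, as described above, also exploits higher-order derivative terms).

#include <cstdio>
#include <vector>

// Local-trend sketch: Savitzky-Golay smoothing on a symmetric five-point
// window (nL = nR = 2) with a quadratic fit.  The classic coefficients are
// (-3, 12, 17, 12, -3)/35 for the smoothed value and (-2, -1, 0, 1, 2)/10
// (per sampling step) for the slope.  The extrapolation is truncated here
// after the first-order term.
struct Trend { double smoothed, forecast; };   // (x_i^SG, x_i^E)

Trend sg_trend(const std::vector<double>& x, int i, double dS, double dF) {
    static const double c[5] = {-3.0/35, 12.0/35, 17.0/35, 12.0/35, -3.0/35};
    static const double d[5] = {-2.0/10, -1.0/10, 0.0, 1.0/10, 2.0/10};
    double xs = 0.0, slope = 0.0;
    for (int j = -2; j <= 2; ++j) {
        xs    += c[j + 2] * x[i + j];
        slope += d[j + 2] * x[i + j];
    }
    slope /= dS;                               // from "per sample" to "per time unit"
    return {xs, xs + slope * dF};              // forecast at time t_i + dF
}

int main() {
    std::vector<double> traj = {1.0, 1.2, 0.9, 1.4, 1.1, 1.3, 1.0};
    Trend t = sg_trend(traj, 3, /*dS=*/1.0, /*dF=*/2.0);
    std::printf("x^SG = %g, x^E = %g\n", t.smoothed, t.forecast);
    return 0;
}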

Oscillatory Systems. Many processes in living organisms are oscillatory (e.g. the
beating of the heart or, on a microscopic scale, the cell cycle). In these systems
molecular noise plays a fundamental role in inducing oscillations and spikes. We
are currently working on statistical tools to synthesize the qualitative behavior
of oscillations through peak detection and frequency analysis [16].

4 Examples
We now consider two motivating examples that illustrate the effectiveness of the
presented real-time statistical and mining reduction functions.

Simple Crystallization. Consider a simplified CWC set of rules for the crystal-
lization of species “a”:

⊤ : 2 ∗ a −1e−7→ b        ⊤ : a c −1e−7→ d
We here show how to reconstruct the first two moments of species “c” using the
on-line statistics based upon 100 simulations running for 100 time units using
a sampling time ΔS = 1 time unit. The starting term was: T = 10^6 ∗ a 10 ∗ c.
Figure 2(a) shows the on-line computation of the mean and standard deviation
for species c. Notice that in these cases of mono-stable behaviors, the mean of
the stochastic simulations overlaps the solution of the corresponding deterministic
simulation using ODEs.

Fig. 2. Mean and standard deviation on the simple crystallization (a) and on the stable
switch (b). The figures report also the raw simulation trajectories.

Switches. We here consider two sets of CWC rules abstracting the behavior of
a stable and an unstable biochemical switch [4], showing how to reconstruct the
equilibria of the species using the on-line clustering techniques on the filtered
trajectories. The stable switch with two competing agents a and c is based on a
very simple population model (with only 3 agents) that computes the majority
value quickly, provided the initial majority is sufficiently large. The essential idea
of the model is that when two agents a and c with different preferences meet,
one drops its preference and enters a special “blank” state b; b then adopts the
preference of any non-blank agent it meets. The rules modeling this case are:
⊤ : a c −10→ c b     ⊤ : c a −10→ a b     ⊤ : b a −10→ a a     ⊤ : b c −10→ c c
The unstable switch is based on a direct competition where a species a catalyzes
the transformation of another species c into a and, in turn, c catalyzes the
transformation of a into c. In this example any perturbation of a stable state
can initiate a random walk to the other stable state. The CWC rules
modeling this case are:
⊤ : a c −10→ a a        ⊤ : c a −10→ c c
In these cases, simple mean and standard deviation are not significant to summa-
rize the overall behavior. For instance in Fig. 2(b) the mean is not representative
of any simulation trajectory.
Figures 3 a) and b) show the resulting clusters (black circles) computed on-
line using K-means on the stable switch and QT on the unstable switch for
species a over 60 stochastic simulations. The stable switch was run for 2 · 10^−4
time units with ΔS = 4 · 10^−6. The number of clusters for K-means was set to
2. The starting term was: T = 10^5 ∗ a 10^5 ∗ c. The unstable switch was run for
0.1 time units with ΔS = 2 · 10^−3. The threshold of the clustering diameter for QT
was set to 100. The starting term was: T = 100 ∗ a 100 ∗ c. Circle diameters are
proportional to each cluster size.

Fig. 3. On-line clustering results (black circles) on the stable switch using K-means (a)
and on the unstable switch using QT (b). The figures report also the raw simulations.
K-means is suitable for stable systems where the number of clusters and their
tendencies are known in advance; in the other cases QT, although more compu-
tationally expensive, can build accurate partitions of the trajectories, giving evidence
of instabilities with a dynamic number of clusters.
Figure 4 shows the speedup of the simulation engines equipped with mean,
standard deviation, quantiles, K-means, and QT filters on an 8-core Intel plat-
form against the number of Simulation Engines, with one and two Statistic
Engines, respectively, for varying numbers of simulations and sampling rates. The
first experiment shows the ability of selective memory to reduce the I/O traffic,
as the speedup remains stable with an increasing number of simulations, and thus
output size. In the second experiment, the speedup decreases as the number of
samples increases, highlighting that the bottleneck of the system is in the data
analysis stage of the pipeline: any further increase in the number of Simulation
Engines does not bring performance benefits.

5 Related Work
The parallelisation of stochastic simulators has been extensively studied in the
last two decades. Many of these efforts focus on distributed architectures. Our
work differs from these efforts in three aspects: 1) it addresses multicore-specific
parallelisation issues; 2) it advocates a general parallelisation schema rather than
a specific simulator; and 3) it addresses on-line data analysis, and is thus designed
to manage large streams of data. To the best of our knowledge, several related
works cover some of these aspects, but few of them address all three.
The Swarm algorithm [14], which is well suited for biochemical pathway opti-
misation, has been used in a distributed environment, e.g., in Grid Cellware [7], a
grid-based modelling and simulation tool for the analysis of biological pathways
that offers an integrated environment for several mathematical representations
ranging from stochastic to deterministic algorithms.
Fig. 4. Speedup on the stable switch simulation with 1 Statistic Engine for different
number of parallel simulations and 200 samples (left), and with 2 Statistic Engines for
different sampling rates and 200 simulations (right). The grey region delimits available
platform parallelism (Intel x86 64 with 8 cores).

DiVinE is a general distributed verification environment meant to support
the development of distributed enumerative model checking algorithms including
probabilistic analysis features used for biological systems analysis [3].
StochKit [13] is a C++ stochastic simulation framework. Among other meth-
ods, it implements the Gillespie algorithm, and in its second version it targets
multi-core platforms; it is therefore similar to our work. It does not, however,
implement on-line trajectory reduction, which is performed in a post-processing
phase. A first form of on-line reduction of simulation trajectories has been
experimented within StochKit-FF [1], which is an extension of StochKit using
the FastFlow runtime.
StochSimGPU [12] exploits GPUs for parallel stochastic simulations of biologi-
cal systems. The tool allows computing averages and histograms of the molecular
populations across the sampled realizations on the GPU. It leverages a
GPU-accelerated version of the Matlab framework, which can hardly be compared
in flexibility and performance with a C++ implementation.

6 Conclusions

Starting from the Calculus of Wrapped Compartments and its parallel simulator,
we have discussed the problem of the analysis of stochastic simulation results,
which can be complex to interpret also due to the intrinsic stochastic “noise” and
the overlapping of the many experiments required by the Monte Carlo method.
To this aim, we characterised some patterns of behaviour for biological sys-
tem dynamics, e.g. monostable, multi-stable, and oscillatory systems, and we
exemplified them with minimal yet paradigmatic examples from the literature.
For these, we identified data filters able to provide statistically significant in-
formation to the biological scientists in order to simplify the data analysis.
Both the simulations and the on-line statistical filters, which are both parallel
and pipelined, can easily be extended with new simulation algorithms and filters
thanks to the FastFlow-based parallel infrastructure, which exempts the programmer
from the synchronization and orchestration of concurrent activities.
Preliminary experiments demonstrated a fair speedup on a standard multi-
core platform. We plan to further investigate the performance tuning of the
simulation pipeline on larger problems and platforms.

Acknowledgements. We wish to thank Luca Cardelli for the inspiring talk
and the discussion on multi-stable biological systems and switches, and An-
drea Bracciali for the discussion on data filtering for biological simulations. We
also thank M. Mazumder and E. Macchia of Etica Srl for the simulator GUI
implementation.

References
1. Aldinucci, M., Bracciali, A., Liò, P., Sorathiya, A., Torquati, M.: StochKit-FF: Ef-
ficient Systems Biology on Multicore Architectures. In: Guarracino, M.R., Vivien,
F., Träff, J.L., Cannataro, M., Danelutto, M., Hast, A., Perla, F., Knüpfer, A., Di
Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010. LNCS, vol. 6586, pp.
167–175. Springer, Heidelberg (2011)
2. Aldinucci, M., Coppo, M., Damiani, F., Drocco, M., Torquati, M., Troina, A.:
On designing multicore-aware simulators for biological systems. In: Proc. of Intl.
Euromicro PDP 2011: Parallel Distributed and Network-Based Processing, pp.
318–325. IEEE, Ayia Napa (2011)
3. Barnat, J., Brim, L., Safránek, D.: High-performance analysis of biological systems
dynamics with the divine model checker. Briefings in Bioinformatics 11(3), 301–312
(2010)
4. Cardelli, L.: On switches and oscillators (2011), http://lucacardelli.name
5. Coppo, M., Damiani, F., Drocco, M., Grassi, E., Troina, A.: Stochastic Calculus
of Wrapped Compartments. In: QAPL 2010, vol. 28, pp. 82–98. EPTCS (2010)
6. CWC Simulator website (2010), http://cwcsimulator.sourceforge.net/
7. Dhar, P.K., et al.: Grid cellware: the first grid-enabled tool for modelling and
simulating cellular processes. Bioinformatics 7, 1284–1287 (2005)
8. FastFlow website (2009), http://mc-fastflow.sourceforge.net/
9. Gillespie, D.: Exact stochastic simulation of coupled chemical reactions. J. Phys.
Chem. 81, 2340–2361 (1977)
10. Hartigan, J., Wong, M.: A k-means clustering algorithm. Journal of the Royal
Statistical Society C 28(1), 100–108 (1979)
11. Heyer, L., Kruglyak, S., Yooseph, S.: Exploring expression data: identification and
analysis of coexpressed genes. Genome Research 9(11), 1106 (1999)
12. Klingbeil, G., Erban, R., Giles, M., Maini, P.: Stochsimgpu: parallel stochastic
simulation for the systems biology toolbox 2 for matlab. Bioinformatics 27(8),
1170 (2011)
13. Petzold, L.: StochKit: stochastic simulation kit web page (2009),
http://www.engineering.ucsb.edu/~cse/StochKit/index.html
14. Ray, T., Saini, P.: Engineering design optimization using a swarm with an intelli-
gent information sharing among individuals. Eng. Opt. 33, 735–748 (2001)
15. Regev, A., Shapiro, E.: Cells as computation. Nature 419, 343 (2002)
16. Sciacca, E., Spinella, S., Genre, A., Calcagno, C.: Analysis of calcium spiking in
plant root epidermis through cwc modeling. Electronic Notes in Theoretical Com-
puter Science 277, 65–76 (2011)
Scalable Sequence Similarity Search and Join
in Main Memory on Multi-cores

Astrid Rheinländer and Ulf Leser

Humboldt-Universität zu Berlin, Department of Computer Science,
Berlin, Germany

Abstract. Similarity-based queries play an important role in many large-
scale applications. In bioinformatics, DNA sequencing produces huge col-
lections of strings that need to be compared and merged. We present
PeARL, a data structure and algorithms for similarity-based queries on
many-core servers. PeARL indexes large string collections in compressed
tries which are entirely held in main memory. Parallelization of searches
and joins is performed using MapReduce as the underlying execution
paradigm. We show that our data structure is capable of performing
many real-world applications in sequence comparisons in main memory.
Our evaluation reveals that PeARL reaches a significant performance
gain compared to single-threaded solutions. However, the evaluation also
shows that scalability should be further improved, e.g., by reducing se-
quential parts of the algorithms.

1 Introduction

Similarity-based searches and joins are important for many applications such as
document clustering or plagiarism detection [7,16]. In bioinformatics, similarity-
based queries are used for sequence read alignment or for finding homologous
sequences between different species. In recent years, much effort has been spent
on developing tools to speed up similarity-based queries on sequences. Many
prominent tools use sophisticated index structures and filter techniques that
enable significant runtime improvements [2,8,9].
A challenge arises from the immense growth of sequence databases in the
past few years. For example, the number of sequences stored in EMBL grows
exponentially every year and sums up to more than 300 billion nucleotides as
of May 2011. One strategy to deal with this huge amount of data is to divide
it into smaller parts and perform analyses partition-wise in parallel. For this
scenario, Google developed the programming paradigm MapReduce to enable a
massively-parallel processing of huge data sets in large distributed systems of
commodity hardware [4]. However, the main bottleneck of distributed MapRe-
duce is network bandwidth and disk I/O. Therefore, another option is to design
data structures and algorithms that adapt the MapReduce paradigm for many-
core servers [11]. We argue that modern many-core servers, combined with the
constantly falling prices for main memory, are perfectly suited to perform many
real-world applications in sequence analysis. Such settings are much easier to
maintain and do not suffer from bandwidth problems.
In this paper, we challenge the current opinion that problems in sequence anal-
ysis have already grown so big that distributed systems are the only solution.
We present PeARL, a main-memory data structure and parallel algorithms for
similarity-based search and join operations on sequence data. In particular, our
data structure uses compressed tries. In tries, the complexity for exact searches
only depends on string lengths and not on the number of stored strings [14].
This allows an efficient execution of exact searches even in large tries. In order
to retain these advantages for similarity-based queries, we store additional infor-
mation at each node that enables early pruning of whole subtries. Previously, we
demonstrated that these strategies effectively speed up similarity-based queries
in PETER [12], a disk-based index structure and predecessor of PeARL.
A crucial aspect in designing data structures for similarity based queries that
interact with MapReduce is to support proper data partitioning. Specifically, we
show how tries on top of large string collections can be compressed and par-
titioned for enabling in-memory MapReduce based search and join operations.
To our knowledge, this is the first work that parallelizes similarity-based string
searches and joins in tries. Our evaluation reveals that PeARL’s similarity-based
algorithms scale well.
The rest of this paper is organized as follows: Section 2 introduces basic con-
cepts needed for the design of our data structure and algorithms. We describe
design principles of PeARL and algorithms for similarity search and join, as well
as our parallelization strategy in Sect. 3. We evaluate our tool in Sect. 4 and
discuss related work in Sect. 5. Finally, we conclude our paper with an outlook
to future work.

2 Preliminaries

Let Σ ∗ be the set of all strings of any finite length over an alphabet Σ. The
length of a string s ∈ Σ ∗ is denoted by |s|. A substring s[i . . . j] of s starts at
position i and ends at position j, (1 ≤ i ≤ j ≤ |s|). Any substring of length q ∈ N
is called q-gram. Conceptually, we will ground our algorithms on operators for
similarity search and similarity join, which are defined as follows:
Let s be a string, R a bag of strings, d a distance function, and k a threshold. The
similarity-based search operator is defined as simsearch (s, R, k) = {r|d(r, s) ≤
k, r ∈ R}. Similarly, for two bags of strings R, S, the similarity-based join operator
is defined as simjoin (R, S, k) = {(r, s)|d(r, s) ≤ k, r ∈ R, s ∈ S}.
In PeARL, we support Hamming and edit distance as similarity measures. We
focus on edit distance based operations in this paper, but see [12] for the key
ideas on queries using Hamming distance. In general, the edit distance of r and
s is computed in O(|r| ∗ |s|) using dynamic programming. As we are mostly
interested in finding highly similar strings within a previously defined distance
threshold k, we use the k-banded alignment algorithm [5] with time complexity
O(k ∗ max{|r|, |s|}) instead.
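A minimal sketch of such a banded computation (a generic k-banded Levenshtein
routine in the spirit of [5], not PeARL's own code) is shown below: only the cells
within k of the diagonal are filled, and the routine returns k + 1 as soon as the
threshold is provably exceeded.

#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Generic k-banded edit (Levenshtein) distance: only cells with |i - j| <= k
// are filled.  Returns the exact distance if it is <= k, and k + 1 otherwise.
int banded_edit_distance(const std::string& r, const std::string& s, int k) {
    const int n = r.size(), m = s.size();
    if (std::abs(n - m) > k) return k + 1;            // length filter
    const int INF = k + 1;
    std::vector<int> prev(m + 1, INF), cur(m + 1, INF);
    for (int j = 0; j <= std::min(m, k); ++j) prev[j] = j;
    for (int i = 1; i <= n; ++i) {
        std::fill(cur.begin(), cur.end(), INF);
        if (i <= k) cur[0] = i;
        int best = (i <= k) ? cur[0] : INF;
        for (int j = std::max(1, i - k); j <= std::min(m, i + k); ++j) {
            cur[j] = std::min({ prev[j - 1] + (r[i - 1] != s[j - 1]),  // (mis)match
                                prev[j] + 1,                           // delete r[i-1]
                                cur[j - 1] + 1 });                     // insert s[j-1]
            best = std::min(best, cur[j]);
        }
        if (best > k) return k + 1;                   // whole band already exceeds k
        std::swap(prev, cur);
    }
    return std::min(prev[m], INF);
}

int main() {
    std::printf("%d\n", banded_edit_distance("ACGT", "AGGT", 2));  // prints 1
    return 0;
}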
Our parallelization strategy is inspired by the well-known programming model
MapReduce, a two-step approach that consists of a map and a reduce phase [4].
Essentially, data is stored in <key, value>-pairs and partitioned into several
subsets. In the map step, a user-defined function is applied to each input item
<ki ,vi > and an intermediate list of <kj ,vj > pairs is emitted. All intermediate
items generated by map are grouped on the basis of the keys and finally, the
user-defined reduce function is applied to each group in order to assemble the
final result set.

3 Data Structure and Algorithms


In this section, we introduce our data structure PeARL together with algorithms
for executing similarity string searches and joins in parallel.
Conceptually, a PeARL index (see Fig. 1) is based on radix trees [10] and
defined as follows: Let R be a bag of strings. A PeARL index PR for R consists
of a set of rooted, compressed tries TR, a sequence string seq, and a <key,value>
data structure StringIDMap, and meets the following conditions:
– (Identification of strings) The string seq is a concatenation of all r ∈ R. We
assign a unique ID to each r, assembled from a serial number, the length of r,
and the start position of r in seq.
– (Node types) We distinguish between infix nodes and string nodes. An infix
node is a node that represents some substring rl of r, |rl | ≥ 1. Each node
u represents a sequence of characters of length l ≥ 1. The labels of any two
children v, w of u start with different characters. Every r maps to exactly one
node x ∈ TR such that the concatenation of all labels from TR ’s root to x exactly
is r. Such nodes x are called string node. We store a pair that consists of the
node ID of x and the UID of r in the StringIDMap. If R contains multiple copies
of r, all corresponding UIDs are assigned to x.
– (Storing infixes) Node labels are not stored directly in node u, but retrieved
via lookups in seq. Thus, u stores length and start position of the represented
infix in seq.
– (Additional information) Each node u stores additional attributes, namely the
minimum (min) and maximum (max) lengths of strings stored in the subtrie
starting at u, a character frequency vector fv and a bit-string qGr. The character
frequency vector fv(u) consists of |Σ| components and counts the number of
occurrences of ci ∈ Σ in the prefix represented by u in component i. Similarly,
a bit in qGr at position i represents the ith string of all strings over Σ of length
q in lexicographical order. Bit i is set to 1, if the prefix represented by node u
contains the corresponding q-gram.
– (Trie partitioning) For very large string collections, we expect the upper levels
of a trie to be completely filled. Therefore, we partition a single PeARL trie into
multiple tries on the basis of shared prefixes. Each partition is identified by the
prefix which was used for partitioning (see Fig. 1). The prefix length used for
partitioning is user-defined.
Figure 1 displays a PeARL index for strings over Σ = {A, C, G, T}. Grey nodes
are string nodes, white nodes are infix nodes. Edge labels are not stored in
the index itself, but are displayed for better comprehensibility only. Displayed
q-gram sets indicate which bits in qGr are set. A minimal sketch of such a node
layout is given below.

Fig. 1. PeARL index structure
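The following sketch shows how such a node could be laid out in C++ for
Σ = {A, C, G, T} and q = 2; all field names are hypothetical and PeARL's actual
layout is more compact, but the min/max, frequency-vector and q-gram attributes
used for pruning are the ones described above.

#include <array>
#include <bitset>
#include <cstdint>
#include <vector>

// Sketch of one compressed-trie node.  The infix is addressed inside the
// global sequence string seq; min/max lengths, the character frequency vector
// and the q-gram bit-string are the per-node attributes that allow whole
// subtries to be pruned during similarity queries.
struct PearlNode {
    uint64_t infix_start;                  // start position of the infix in seq
    uint32_t infix_len;                    // length of the represented infix
    uint32_t min_len, max_len;             // string lengths stored below this node
    std::array<uint16_t, 4> freq;          // occurrences of A, C, G, T in the prefix
    std::bitset<16> qgrams;                // 4^2 = 16 possible 2-grams of the prefix
    std::vector<uint32_t> children;        // child node IDs; labels start with distinct chars
    std::vector<uint64_t> string_uids;     // UIDs; non-empty only for string nodes
};

int main() {
    PearlNode n{};
    n.infix_start = 0; n.infix_len = 2;    // node represents an infix of length 2 in seq
    n.min_len = 4; n.max_len = 7;          // shortest/longest string in its subtrie
    n.freq = {1, 1, 0, 0};                 // prefix "AC": one A, one C
    n.qgrams.set(1);                       // bit 1 = "AC" in lexicographic 2-gram order
    return 0;
}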

3.1 Algorithms
Building the PeARL index for a set of strings R works as follows: In a first step,
R is sorted lexicographically, UIDs are assembled, and R is split into multiple
partitions based on shared prefixes. For each partition Ri ⊆ R, we start with an
empty trie TRi and iteratively insert each string contained in Ri using preorder
DFS traversal. After all strings from Ri have been inserted, we iterate once over
the whole trie and update the information min/max, fv and qGr.
Similar to indexing, our algorithms for similarity-based searches and joins are
also grounded on preorder DFS traversal of all trie partitions. Each algorithm
is equipped with filtering strategies. These filters, namely prefix and edit dis-
tance pruning [14], character frequency pruning [1], and q−gram filtering [6],
have been introduced in slightly different contexts before. Their concrete usage
and efficiency for trie-based search and join queries is shown in [12]. Therefore,
we only briefly summarize our search and join strategies in the following and
concentrate on our novel parallelization scheme later.
Similarity search starts with a given search string q and traverses each trie
partition in a PeARL index starting at root. Whenever a new child of the current
node is reached, we first check whether we can prune this node (see [12] for details
on filtering). If all filters have been passed successfully, we compute the edit
distance between the query and the prefix of the node. If the distance exceeds a
threshold k, we start a backtracking routine and traverse the remaining, not yet
examined paths in the trie. Otherwise we descend forward to the leaves. When
a string node x is reached and d(q, x) ≤ k holds, we report a match.
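The skeleton of such a traversal is sketched below for a simplified, uncompressed
trie (one character per edge) and with only the prefix-pruning filter, i.e. a subtrie
is abandoned when every entry of the current dynamic-programming row already
exceeds k. All names are hypothetical; PeARL additionally works on compressed
infixes and applies the frequency and q-gram filters before computing any distances.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Simplified uncompressed trie: one character per edge, no infix compression.
struct Node {
    std::string labels;                    // labels[i] = first character of child i
    std::vector<Node> children;            // children, parallel to labels
    bool is_string = false;                // marks a string node
};

static Node& child_of(Node& n, char c) {
    std::size_t pos = n.labels.find(c);
    if (pos != std::string::npos) return n.children[pos];
    n.labels.push_back(c);
    n.children.emplace_back();
    return n.children.back();
}

static void insert(Node& root, const std::string& s) {
    Node* cur = &root;
    for (char c : s) cur = &child_of(*cur, c);
    cur->is_string = true;
}

// DFS carrying one dynamic-programming row: row[j] is the edit distance
// between the prefix spelled so far and q[0..j-1].
static void search(const Node& node, const std::string& prefix, const std::string& q,
                   int k, const std::vector<int>& row, std::vector<std::string>& hits) {
    if (node.is_string && row[q.size()] <= k) hits.push_back(prefix);
    for (std::size_t idx = 0; idx < node.children.size(); ++idx) {
        const char c = node.labels[idx];
        std::vector<int> next(q.size() + 1);
        next[0] = row[0] + 1;
        for (std::size_t j = 1; j <= q.size(); ++j)
            next[j] = std::min({ row[j - 1] + (q[j - 1] != c),   // (mis)match
                                 row[j] + 1,                     // deletion
                                 next[j - 1] + 1 });             // insertion
        // prefix pruning: descend only if some cell of the row is still <= k
        if (*std::min_element(next.begin(), next.end()) <= k)
            search(node.children[idx], prefix + c, q, k, next, hits);
    }
}

int main() {
    Node root;
    for (const char* s : {"ACGT", "ACGA", "TTGA"}) insert(root, s);
    const std::string q = "ACGG"; const int k = 1;
    std::vector<int> row(q.size() + 1);
    for (std::size_t j = 0; j <= q.size(); ++j) row[j] = (int)j;
    std::vector<std::string> hits;
    search(root, "", q, k, row, hits);
    for (const auto& h : hits) std::printf("match: %s\n", h.c_str());  // ACGT, ACGA
    return 0;
}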
Similarity join for two sets R, S takes two PeARL indices PR , PS as input.
Each trie partition TRi is joined with each partition TSj . Recall that both tries
are partitioned by prefixes. We first check the partition prefixes on edit distance
and it might happen that k is already exceeded. In this case, we skip the cor-
responding trie pair. Otherwise, we compute the similarity-based intersection of
both partitions. As for search, we start at the root nodes and traverse both tries
concurrently. When unseen nodes are reached, we check all filters and prune, if
possible. Whenever two string nodes x ∈ TRi , y ∈ TSj are reached, and given
that d(x, y) ≤ k holds, we report a match.

Fig. 2. MapReduce workflow of similarity joins in PeARL

3.2 Parallelization with In-memory MapReduce


We use MapReduce to parallelize PeARL for an execution on multi-core servers.
However, a usage in distributed scenarios is conceptually also possible as PeARL
trie partitions could as well be spread over nodes in a distributed file system.
Recall that a user-defined function is applied to each input item <ki ,vi > in
the map phase. Depending on the specific task, we use the map phase to either
execute the similarity join of any two PeARL partitions or, to search a certain
string in each partition of a PeARL index. Reduce phases are typically used
to compute aggregates of intermediate results. Figure 2 shows the workflow for
parallelizing similarity joins in PeARL with MapReduce. A master routine takes
two PeARL indices PR , PS as input, together with an error threshold k, and
a number of available threads t. As string collections stored with PeARL are
already partitioned into multiple tries, we get a natural data partitioning for the
map phase. The master generates a set of map tasks (stored in a FIFO data
structure mapTaskList), such that each trie partition TRi ∈ PR is joined with
each trie partition TSj ∈ PS, and then starts the map phase.
Each map thread has access to mapTaskList and extracts one task (TRi , TSj )
out of this list. After some initialization steps, map calls the join routine, which
executes the similarity join TRi ⋈k TSj and returns the set of all similar
string pairs contained in (TRi , TSj ) within the given distance k. These items are
inserted into an intermediate data structure. For each similar string pair (r, s),
an intermediate key is set to the UID of r. When one map iteration has finished
and as long as mapTaskList is not empty, the map thread extracts the next
(TRi , TSj ) pair out of this list and again computes the similarity join.
When all map tasks have been processed, the master partitions all inter-
mediate data on the basis of intermediate keys and passes each partition to a
separate reduce thread. This ensures that all similar string pairs which involve
r are assigned to the same intermediate partition. Finally, reduce sorts all (r, s)
pairs by edit distance. Optionally, reduce can also emit the number of
similar strings found in S for each r, or filter the results found for r by best
score.
Parallelizing similarity searches is analogous to the parallelization of similarity
joins. The main difference is that PS is replaced with one or a list of search se-
quences. If not yet existent, each search pattern is assigned a unique ID. For searches,
the mapTaskList contains <ki ,vi > pairs where ki is the partition prefix of TRi and vi
consists of TRi and the search sequence(s). As for join, similarity search is per-
formed in the map phase.
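The sketch below illustrates the in-memory map phase of this scheme with plain
C++ threads and a mutex-protected task list; the join routine, the grouping and
the reduce step are only stubbed, and all names are hypothetical.

#include <cstdio>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct TriePartition { std::string prefix; /* ... compressed trie ... */ };
typedef std::pair<std::string, std::string> Match;      // (UID of r, UID of s)

// Stub standing in for the trie-vs-trie similarity join of one partition pair.
static std::vector<Match> sim_join(const TriePartition& a, const TriePartition& b, int k) {
    (void)k;
    return { Match(a.prefix + "#1", b.prefix + "#7") };  // fake result, for illustration
}

int main() {
    std::vector<TriePartition> R = {{"AC"}, {"GT"}}, S = {{"AC"}, {"TT"}};
    const int k = 2, threads = 4;

    // mapTaskList: every partition of R is paired with every partition of S.
    std::vector<std::pair<int, int>> tasks;
    for (int i = 0; i < (int)R.size(); ++i)
        for (int j = 0; j < (int)S.size(); ++j) tasks.push_back({i, j});

    std::mutex task_mx, out_mx;
    std::size_t next = 0;
    std::map<std::string, std::vector<Match>> intermediate;  // grouped by UID of r

    auto map_worker = [&] {
        for (;;) {
            std::size_t t;
            { std::lock_guard<std::mutex> lk(task_mx);
              if (next == tasks.size()) return;
              t = next++; }
            std::vector<Match> matches =
                sim_join(R[tasks[t].first], S[tasks[t].second], k);  // the map function
            std::lock_guard<std::mutex> lk(out_mx);
            for (const Match& m : matches) intermediate[m.first].push_back(m);
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) pool.emplace_back(map_worker);
    for (auto& th : pool) th.join();

    // "Reduce": here each group is only printed; the real reducer sorts the
    // pairs of a group by edit distance and optionally filters them.
    for (const auto& kv : intermediate)
        std::printf("%s: %zu similar strings\n", kv.first.c_str(), kv.second.size());
    return 0;
}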

4 Evaluation

We evaluated the performance of PeARL on a NUMA server with 24 cores and
256 GB of main memory available. All experiments were executed with numactl
-localalloc to control the memory allocation strategy and thread placement.
Test data sets (see Table 1) are extracted from dbEST¹ as of March 7th, 2011 for
the organism mouse. Indexing is linear in the number of indexed strings [12] and
is not included in the reported measurements. In terms of memory consumption,
PeARL needs roughly 20 GB of main memory to index all infixes of length 2,000
bp in the C. elegans genome (roughly 100M strings). For computing the similarity
join III ⋈k IV, PeARL needs approx. 8 GB of main memory.

Table 1. Data sets extracted from dbEST

Set   # strings   avg. length (min / max)   # characters


I 10,000 511.99 (49 / 1,190) 5,120,495
II 240,000 455.94 (18 / 2,160) 109,425,487
III 300,000 446.74 (18 / 2,160) 134,023,819
IV 1,000,000 512.12 (7 / 3,920) 512,123,043

¹ www.ncbi.nlm.nih.gov/dbEST/
4.1 Performance of Similarity-Based Operators

First, we compared the performance of all similarity-based operations in PeARL
with its predecessor PETER in single-threaded mode. The main difference between
both tools is that in PeARL, all parts of the index are kept in main memory
whereas in PETER, disk I/O was necessary during search and join. Another
difference is that q-gram sets in PeARL are stored persistently in the index
whereas previously, q-grams were computed on the fly. Trie partitioning and
parallelization was also not present in the predecessor. Overall, we observed that
these improvements increased the efficiency of our filters. Whereas in PETER,
filtering led to runtime improvements of up to 80% compared to the baseline
with no filters enabled, we now achieve runtime improvements of up to 99%
caused by filtering (data not shown).
We evaluated the runtime of similarity search and measured 10,000 individual
searches of non-indexed patterns from set I in the PeARL index for set IV, see
Fig. 3 (left). In single-threaded mode, searches in PETER ran significantly faster
than in PeARL (up to factor 10 on k = 2). This is not surprising, as there is
some overhead introduced in PeARL by the added functionality for MapReduce
based parallelization, which is also present in single-threaded searches. However,
we will see in the following section that this overhead pays out for multi-threaded
similarity searches and joins. We also compared PeARL to Flamingo, a library
for string searching developed at UC Irvine². As displayed in Fig. 3 (left), PeARL
outperforms Flamingo for search in single-threaded mode for small thresholds
(factor 20 for k = 1 and factor 3 for k = 2). For larger k, Flamingo begins to
outperform PeARL.

² http://flamingo.ics.uci.edu/
For evaluating the runtime of similarity joins in PETER and PeARL, we
computed the join between set IV and varying subsets taken from set II. As
shown in Fig. 3 (right), similarity joins in PeARL are computed considerably
faster than in PETER. For example, we reached an improvement of factor 3
on k = 2 at a join cardinality of 2e+11. Generally speaking, the implemented
improvements in PeARL are the more profitable when indexed string sets grow
large. We could not compare PeARL to Flamingo for joins, since no reference
implementation was available.

4.2 Scalability of PeARL

We compared the multi-threaded execution of 10,000 individual searches of pat-
terns from set I in set IV with PeARL (24 threads) to a single-threaded execution
with PeARL and Flamingo. As displayed in Fig. 4 (left), the multi-threaded ex-
ecution in PeARL outperforms the single-threaded execution in Flamingo with
factors in the range of 6 (k = 3) to 57 (k = 1). We also observed that the
24-threaded execution outperforms the single-threaded execution in PeARL with
factors in the range of 5.5 (k = 1) to 6.2 (k = 3). For similarity joins, we could
only compare the 24-threaded to the single-threaded execution in PeARL since no
external reference implementation was available. Thus, we measured the execution
times of III ⋈k∈{1,2,3} IV. As displayed in Fig. 4 (right), we measured a runtime
improvement of factors in the range of 4.2 (k = 2) to 4.9 (k = 1).
When analyzing the parallelized search and join algorithms in terms of speed-
up, the first step is to estimate the fractions of parallelizable and non-parallelizable
parts in our algorithms. In general, the parallelizable fractions dominate, since
only reading the indices into main memory, extracting tasks from mapTaskList,
sorting intermediate partitions before executing reduce, and writing the final out-
put to file are performed serially. We estimated the size of the parallelizable fraction
based on the measured speed-up using N = 24 CPU cores. According to this, 10 %
of our search and 20 % of our join algorithm remain serial.
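This estimate is a standard application of Amdahl's law (our reading of the
numbers, not a derivation taken from elsewhere in the paper): with a parallelizable
fraction P executed on N cores, the expected speed-up and the fraction implied by
a measured speed-up are

    S(N) = 1 / ((1 − P) + P/N),        P = (1 − 1/S(N)) / (1 − 1/N).

For instance, the join speed-up of about 4.3 measured at N = 24 gives
P ≈ (1 − 1/4.3)/(1 − 1/24) ≈ 0.80, with a limiting speed-up of 1/(1 − P) ≈ 5 as N
grows.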
Figure 5 (left) displays the speed-up of searches of all ESTs from set I in the
indexed set IV with regard to the number of CPU cores. We observed that the
speed-up for measured runtimes almost perfectly fits the theoretical curve of
Amdahl’s law for P = 0.90. Similarly, we observed for joins that the measured
speed-up fits well to Amdahl’s law for P = 0.80 (see Fig. 5 (right)). This indicates
that estimating the non-parallelizable fraction with 10 % for searches and 20 %
for joins is sound. Using 24 CPU cores with 24 map and reduce workers, we
achieve a speed-up of our join algorithm of 4.3. According to that, the maximal
speed-up for join is 4.9 using ≥ 1, 000 cores. This indicates that executing the
current implementation of PeARL is limited by the serial parts contained in our
algorithms.

Fig. 3. Performance of single-threaded similarity operations. Left: search of 10,000
patterns from set I in set IV. Right: join IV ⋈k=2 II on subsets of II.

5 Related Work

Morrison [10] introduced prefix trees as an index structure for storing strings and
exact string matching. Shang et al. [14] extended prefix trees with dynamic pro-
gramming techniques to perform inexact matching. Prefix pruning was studied
in [14] and is based on the observation that edit distance can only grow with pre-
fix length. Aghili et al. [1] proposed character frequency distance based filtering
to reduce candidate sets for similarity-based string searches. Indexing methods
based on q-grams restrict search spaces efficiently for edit distance based opera-
tions. They take advantage of the observation that two strings are within a small
edit distance iff they share a large number of q-grams [15].

Fig. 4. Performance of parallelization. Left: search. Right: join.

Fig. 5. PeARL speed-up. Left: sim. search on k = 2. Right: sim. join on k = 3.
The MapReduce programming model for parallel data analysis was initially
proposed by Dean and Ghemawat [4]. Vernica et al. [17] present an algorithm for
set-similarity string joins with distributed MapReduce. We could not compare
to their solution, since no in-memory version was available. Ranger et al. [11] de-
veloped a MapReduce based programming framework for shared-memory multi-
core servers with a scalability almost reaching hand-coded solutions.
A main application for similarity-based string searches and joins in bioinfor-
matics is read alignment. Almost all tools follow the seed-and-extend approach.
BLAST [2] seeds the alignment with hash-table indices and extends the initially
ungapped seeds with a banded local alignment algorithm. However, algorithms
that use only ungapped seeds might miss some valuable alignments. BWA-SW [8]
is one tool that allows gaps and mismatches in the seeds. We also applied PeARL
to read alignment and compared the execution times to BWA-SW. BWA-SW
significantly outperforms PeARL (data not shown), but it must be noted that it
is a heuristic that misses solutions, while PeARL solves the alignment problem
exactly. CloudBurst [13] is another tool for read alignment using MapRe-
duce on top of Hadoop [3]. A comparison between PeARL and CloudBurst is
pending.
6 Conclusions and Future Work


In this paper, we presented PeARL, a data structure and parallel algorithms
for similarity-based search and join operations in compressed tries. PeARL is
parallelized in main memory with MapReduce on a multi-core server. Our eval-
uation revealed that the speed-up of our search and join algorithms executed on
multi-core servers cannot grow infinitely large due to the serial parts contained
in our workflow. We are currently working on reducing these bottlenecks and on
performing a detailed comparison between PeARL and CloudBurst.

References
1. Aghili, S.A., Agrawal, D., El Abbadi, A.: BFT: Bit Filtration Technique for Ap-
proximate String Join in Biological Databases. In: Nascimento, M.A., de Moura,
E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 326–340. Springer,
Heidelberg (2003)
2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local align-
ment search tool. J. Molecular Biology 215(3), 403–410 (1990)
3. Bialecki, A., Cafarella, M., Cutting, D., O’Malley, O.: Hadoop,
http://hadoop.apache.org/
4. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters.
Communications of the ACM 51(1), 107 (2008)
5. Fickett, J.W.: Fast optimal alignment. Nucl. Acids Res. 12(1Part1), 175–179 (1984)
6. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Sri-
vastava, D.: Approximate string joins in a database (Almost) for free. In: Proc.
VLDB, pp. 491–500 (2001)
7. Hoad, T.C., Zobel, J.: Methods for identifying versioned and plagiarized documents.
J. American Society for Information Science and Technology 54, 203–215 (2003)
8. Li, H., Durbin, R.: Fast and accurate long-read alignment with Burrows–Wheeler
transform. Bioinformatics 26(5), 589–595 (2010)
9. Li, H., Homer, N.: A survey of sequence alignment algorithms for next-generation
sequencing. Briefings in Bioinformatics 11(5), 473–483 (2010)
10. Morrison, D.R.: PATRICIA – practical algorithm to retrieve information coded in
alphanumeric. Journal of the ACM 15(4), 514–534 (1968)
11. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G.R., Kozyrakis, C.: Evalu-
ating mapreduce for multicore and multiprocessor systems. In: Proc. HPCA, pp.
13–24 (2007)
12. Rheinländer, A., Knobloch, M., Hochmuth, N., Leser, U.: Prefix Tree Indexing for
Similarity Search and Similarity Joins on Genomic Data. In: Gertz, M., Ludäscher,
B. (eds.) SSDBM 2010. LNCS, vol. 6187, pp. 519–536. Springer, Heidelberg (2010)
13. Schatz, M.C.: Cloudburst. Bioinformatics 25, 1363–1369 (2009)
14. Shang, H., Merrett, T.H.: Tries for approximate string matching. IEEE TKDE 8(4),
540–547 (1996)
15. Sutinen, E., Tarhio, J.: Filtration with q-Samples in Approximate String Matching.
In: Hirschberg, D.S., Meyers, G. (eds.) CPM 1996. LNCS, vol. 1075, pp. 50–63.
Springer, Heidelberg (1996)
16. Vakali, A., Pokorný, J., Dalamagas, T.: An Overview of Web Data Clustering
Practices. In: Lindner, W., Fischer, F., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds.)
EDBT 2004. LNCS, vol. 3268, pp. 597–606. Springer, Heidelberg (2004)
17. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapre-
duce. In: Proc. SIGMOD, pp. 495–506 (2010)
Enabling Data and Compute Intensive Workflows
in Bioinformatics

Gaurang Mehta1, Ewa Deelman1, James A. Knowles2, Ting Chen3, Ying Wang3,5,
Jens Vöckler1, Steven Buyske4, and Tara Matise4
1 USC Information Sciences Institute
2 Keck School of Medicine of USC
3 University of Southern California
4 Rutgers University
5 Xiamen University, P.R. China
{gmehta,deelman}@isi.edu

Abstract. Accelerated growth in the field of bioinformatics has resulted in large
data sets being produced and analyzed. With this rapid growth has come the
need to analyze these data in a quick, easy, scalable, and reliable manner on a
variety of computing infrastructures including desktops, clusters, grids and
clouds. This paper presents the application of workflow technologies, and,
specifically, Pegasus WMS, a robust scientific workflow management system,
to a variety of bioinformatics projects from RNA sequencing, proteomics, and
data quality control in population studies using GWAS data.

Keywords: workflows, bioinformatics, sequencing, epigenetics, proteomics.

1 Introduction
Advances in the fields of molecular chemistry, molecular biology, and computational
biology have resulted in accelerated growth in bioinformatics research. In the last
decade there have been rapid developments in genome sequencing technology,
enabling large volumes of RNA and DNA to be sequenced from humans, animals,
and plants. Advances in biochemistry have also enabled protein analysis and bacterial
RNA studies to be carried out on larger scale than ever before. A sharp drop in the
cost of genome sequencing instruments is enabling a larger number of scientists to
sequence genomes from a wide variety of species.
These developments have resulted in petabytes of raw data being generated in
individual laboratories. These massive data need to be analyzed quickly and in an
easy, efficient manner. At the same time, there is an increase in the availability of
large-scale clusters at most universities as well as national grid infrastructures, and
cheap and easily accessible cloud computing resources. Thus, scientists are looking
for simple tools and techniques to manage and analyze their data to produce scientific
results along with their provenance. This paper provides the motivation for the use of
workflow technologies in bioinformatics, followed by a description of the Pegasus
Workflow Management System (WMS) [1,2,28] and its application to the data
management and analysis issues arising in a few bioinformatics projects. The paper
concludes with related work and future plans.


2 Motivation

Generally, most laboratories and small projects that perform data-intensive
bioinformatics experiments lack the necessary expertise, tools, and manpower to create
complex computational pipelines to analyze large datasets. Running these pipelines is
often complicated, and requires researchers to gain access to computational resources,
create pipelines, and train lab staff on running and maintaining complex software.
Additionally, scaling these experiments to take advantage of the large computing
infrastructure present in the laboratories, on campus, and in commercial cloud
environments is an even bigger challenge. The generated datasets need to be moved
efficiently to remote computational resources, analyzed, and mapped to genomes and
reference files. The results need to be collected in a robust and secure manner. Finally,
scientists require that the provenance of the generated data be recorded. In order to meet
these requirements we have developed several bioinformatics application pipelines
using Pegasus WMS workflow technologies, which enable the execution of large-scale
computations on peta-scale datasets on a variety of resources.

3 Workflow Technology

Workflows are defined as a collection of computational tasks linked via data and
control dependencies. Each task in a workflow is either a single invocation of an
executable or a sub-workflow containing more tasks. Several workflow technologies
have been developed over the last decade, each tackling different problems [22].
Business workflows attempt to coordinate business processes and are generally highly
customized for a specific company. Scientific workflows, on the other hand, tend to
be shared more frequently with collaborators and run on various types of platforms.
To enable scientific workflows, there are a wide variety of software systems from
GUI-based drag and drop workflow systems [19,20,21] to web services-based
workflow enactors [19,21]. Pegasus WMS was originally developed to enable large-
scale physics experiments in the GriPhyN project [24]. As the scale of data and
analysis of bioinformatics applications has grown, it has been a natural fit to apply
the experiences and technology of Pegasus to these projects as well.
The Pegasus Workflow Management System is a software system that supports
the development of large-scale scientific workflows and manages their execution
across local, grid [1,2,28], and cloud [3] resources simultaneously. Pegasus provides
APIs in Java, Python, and Perl to create workflow descriptions in the Abstract
Directed Acyclic Graph in XML (DAX) format. A DAX contains information about
all the steps or tasks in the workflow, including the arguments used to invoke the task,
the input and output datasets used and generated, as well as any relationships between
the tasks. DAXes are abstract descriptions of the workflow that are agnostic of the
resources available to run it, and the location of the input data and executables.
Pegasus compiles these abstract workflows into executable workflows by querying
information catalogs that contain information about the available resources and
sending computations across local and distributed computing infrastructures such as
the Teragrid [29], the Open Science Grid [30], campus clusters, emerging commercial
and community cloud environments [31] in an easy and reliable manner using Condor
[5] and DAGMan [6]. Fig. 1 shows the block diagram of Pegasus WMS.

Fig. 1. Pegasus Workflow Management System
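As a purely conceptual illustration (this is neither the DAX XML schema nor any
of the Pegasus APIs, and all task and file names are invented), an abstract
workflow of the kind Pegasus compiles can be thought of as a DAG of tasks
annotated with their arguments, input and output files, and parent-child
dependencies:

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// A resource-agnostic "abstract workflow": which tasks run, on which logical
// files, and in which order -- but not where the data lives or which cluster
// or cloud eventually executes the jobs.
struct Task {
    std::string name;
    std::vector<std::string> args, inputs, outputs;
    std::vector<int> parents;              // indices of tasks that must finish first
};

int main() {
    std::vector<Task> dag = {
        {"align",  {"-r", "ref.fa"}, {"reads.fq", "ref.fa"}, {"aln.sam"}, {}},
        {"filter", {"-q", "30"},     {"aln.sam"},            {"aln.bam"}, {0}},
    };
    for (std::size_t i = 0; i < dag.size(); ++i)
        std::printf("task %zu (%s): %zu inputs, %zu parents\n", i, dag[i].name.c_str(),
                    dag[i].inputs.size(), dag[i].parents.size());
    return 0;
}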

Pegasus WMS optimizes data movement by leveraging existing grid and cloud
technologies via flexible, pluggable interfaces. It provides advanced features to
manage data transfers, data reuse, and automatic cleanup of data generated on remote
resources. It also provides for optimization of the execution by allowing several small
tasks to be clustered into larger jobs, thus minimizing execution overheads. Pegasus
interfaces with several job-scheduling systems via the Condor-G [4] interface,
allowing the various tasks in the workflow to be executed on a variety of resources.
Reproducibility is a very important part of computational science. To enable
scientists to track the progress of their workflows and tackle data reproducibility
issues, Pegasus captures all the provenance of the workflow, from the compilation
stage to the execution and the generated data. Pegasus also monitors and captures
statistics during the run of the workflow allowing scientists to accurately measure the
performance of their workflow.
Pegasus WMS also supports the use of hierarchical workflows, allowing users to
divide large pipelines into several smaller, more manageable sub-workflows. Each
sub-workflow is planned and executed only when all the necessary dependencies for
that sub-workflow have been satisfied. As a result, an application can induce different
sub-workflows to execute based on previous analysis in the upper-level workflow.
Pegasus WMS is a very reliable and robust system with several options for failure
recovery. Cloud and grid environments are inherently unreliable, as are the applications
themselves. To manage this, Pegasus automatically resubmits failed tasks to the
same or another resource several times before declaring the task permanently failed. Pegasus will
also finish as many tasks and sub-workflows as possible regardless of one or more failed
tasks. When the workflow can proceed no further, a rescue workflow is created that can
be resubmitted after fixing whatever caused the failures. If re-planning of the workflow is
required (e.g. to make use of additional or new resources), Pegasus will reduce the
original workflow, eliminating tasks that have completed successfully and leaving only those
tasks that previously failed or were not submitted due to dependencies on the failed tasks.
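The retry policy just described can be pictured with a small conceptual sketch; this is not Pegasus code (the real resubmission is handled through Condor and DAGMan), only a toy illustration of the resubmit-then-rescue idea with an arbitrary retry limit and a simulated submission layer.

    import random

    def submit(task, resource):
        # Stand-in for the real submission layer; here it simply simulates a
        # resource that fails part of the time.
        return random.random() > 0.3

    def run_with_retries(task, resources, max_retries=3):
        # Resubmit a failed task to the same or another resource a few times
        # before declaring it permanently failed.
        for attempt in range(max_retries):
            resource = resources[attempt % len(resources)]
            if submit(task, resource):
                return True
        return False

    if __name__ == "__main__":
        resources = ["local-cluster", "osg", "ec2"]   # hypothetical resource names
        failed = [t for t in ["align", "sort", "report"]
                  if not run_with_retries(t, resources)]
        # Tasks that never succeeded would be recorded in a rescue workflow and
        # resubmitted after the underlying problem is fixed.
        print("tasks for the rescue workflow:", failed)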

4 Workflows in Bioinformatics
Recently, an ever-increasing number of bioinformatics applications have started
adopting workflows and workflow technologies to help them in their continuous
analysis of the large-scale data generated by scientific studies. Below we present a
variety of bioinformatics projects, including RNA sequencing, protein studies, and
quality control in population epidemiology studies, which are among the many
bioinformatics projects that use Pegasus WMS for their work.

4.1 Proteomics: MassMatrix


MassMatrix [7] is a database search software package for tandem mass spectrometric
data. It uses a mass accuracy-sensitive probabilistic scoring model to rank peptide and
protein matches. MassMatrix provides improvements in sensitivity over Mascot [26]
and SEQUEST [25] with comparably low false-positive rates.
A major requirement in MassMatrix is the ability to handle a large degree of
parallelism in the analysis jobs, as well as the ability to run these workflows on cloud
computing environments that can scale in size. After evaluating several solutions to
simplify and automate the process of these peptide and protein matches, MassMatrix
implemented the proteomic workflows using Pegasus WMS as it offered the
flexibility of incorporating parallel and serial codes in the same workflow, as well as the
ability to run these workflows on multiple computing infrastructures simultaneously.

Fig. 2. a) Pegasus workflow template. b) Implementation of workflow for five shotgun proteomic
data sets. c) Hierarchical cluster analysis of shotgun proteomic data.

The MassMatrix workflow was generated using the Pegasus Python API, which
produced the required XML workflow description, and executed on the available
distributed resources [8], which include high-performance clusters at the Ohio State
University and Amazon EC2. Fig. 2 shows a MassMatrix workflow template, its
instantiation for 5 shotgun datasets, and the final result shown as a hierarchical cluster
analysis. Currently MassMatrix is looking at ways to optimize the allocation and
efficient usage of computational resources for executing these workflows on a larger
scale by balancing the costs and execution time requirements as well as dynamically
modifying the parallelism in the workflows [1].

4.2 RNA Sequencing: Transcriptional Atlas of the Developing Human Brain


The Transcriptional Atlas of the Developing Human Brain (TADHB)[9] project seeks
to find when and where in the brain a gene is expressed. This information holds clues
to potential causes of disease. A recent study [23] found that forms of a gene
associated with schizophrenia are over-expressed in the fetal brain. To make
discoveries about abnormal gene expression, scientists first need to know what the
normal patterns of gene expression are during brain development. To this end, the
National Institute of Mental Health (NIMH), part of the National Institutes of Health
(NIH), has funded the creation of TADHB. To map human brain transcriptomes,
researchers identify the composition of intermediate products, called transcripts or
messenger RNAs, which translate genes into proteins throughout development.
The biggest issue in creating the brain atlas was handling and analyzing the large
amount of RNA sequence data in an easy and reliable manner without the need to
train users on advanced software concepts and without worrying about configuring
remote resources individually. The analysis was to be performed on a shared local
campus cluster while ensuring that other users of the cluster are not adversely affected
due to the large amount of I/O occurring in the application. To enable TADHB,
workflows were developed to map the genetic sequences and to map environmental,
or epigenetic, regulation of gene expression across development using the Pegasus
Java API. The lab scientists were then able to run and submit an analysis of over 225
sequence samples in a short time using the workflow and data management
capabilities in Pegasus WMS. Two workflows using different mapping algorithms
were created to analyze the RNA sequences: one based on the ELAND [10] algorithm
from Illumina and the other using an alignment and mapping package, PERM [11].

Fig. 3. TADHB Workflow in production using Illumina ELAND

Fig. 3 shows the ELAND-based production TADHB workflow. Each workflow
aligns a single lane of RNA sequence, or a whole flowcell (8 sequences) in qseq format,
to the human transcriptome. The output of ELAND is an aligned sequence file in the
export format. This aligned sequence file is then used to compute the expression
levels of genes, exons and splice junctions.
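A hedged sketch of how such a per-lane structure could be expressed with the Python DAX API used for the MassMatrix workflows is shown below; the executable names (eland_align, compute_expression) and the file naming scheme are invented for illustration and do not correspond to the actual production code.

    from Pegasus.DAX3 import ADAG, File, Job, Link

    dax = ADAG("eland-flowcell")

    for lane in range(1, 9):                      # one alignment job per lane of the flowcell
        qseq = File("lane%d.qseq" % lane)
        export = File("lane%d.export" % lane)     # ELAND "export" alignment file
        counts = File("lane%d.counts" % lane)

        align = Job(name="eland_align")           # hypothetical wrapper around ELAND
        align.addArguments(qseq, export)
        align.uses(qseq, link=Link.INPUT)
        align.uses(export, link=Link.OUTPUT)
        dax.addJob(align)

        quant = Job(name="compute_expression")    # hypothetical gene/exon/junction counter
        quant.addArguments(export, counts)
        quant.uses(export, link=Link.INPUT)
        quant.uses(counts, link=Link.OUTPUT)
        dax.addJob(quant)

        dax.depends(parent=align, child=quant)

    dax.writeXML(open("eland-flowcell.dax", "w"))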

Table 1. Statistics for workflow runs using the ELAND-based pipeline


Workflow | Lanes | Tasks | I/p Files | O/p Files | I/p Data | O/p Data | Saved Data | Cumulative Runtime
Eland WF | 225 | 2,757 | 26,919 | 20,198 | 897 GB | 9.9 TB | 3.8 TB | 1,202 hr

The production run computed approximately 225 lanes of Brain RNA sequences,
using about 50 days worth of CPU time and producing approximately 10 TB of data.
Table 1 shows the number of lanes, files used and generated, and data size from the
workflow runs. A production pipeline using PERM that aligns sequences to the
transcriptome and the human genome, and computes advanced differential analysis
[12] is currently being run.

4.3 RNA Sequencing: Cancer Genome Atlas Using SeqWare


SeqWare [13] is a project that provides several tools to perform genome mapping,
variant calling, and data management for events inferred from genetic sequence
data that was produced using sequencing technologies provided by Illumina, ABI
Solid and 454. The SeqWare Pipeline tool consists of many different programs useful
for processing and annotating sequence data. These can be combined with other tools
(BFAST, BWA, SAMtools, etc.) and strung together to form more complex
workflows to support many experiment types.

Fig. 4. Cancer Atlas RNA Seq Alignment and Variant Calls using Pegasus in SeqWare

One of the requirements of SeqWare for running their workflows is the capability to
easily run similar workflows on the local campus cluster, on Amazon EC2, or inside a
simple Virtual Machine, enabling the user to scale the analysis in a flexible way. Also
due to strict data privacy issues, SeqWare wanted to use their own mechanisms for data

transfers. SeqWare analyzed several workflow technologies used in bioinformatics, but
none provided the extensibility, scalability and reliability offered by Pegasus.
SeqWare leveraged the advanced configurations available in Pegasus to transfer data
between local computers, clusters and Amazon EC2 as well as Pegasus’ task clustering
capability to optimize running a mixture of short- and long-running tasks. Additionally,
SeqWare relied upon the automatic cleanup feature provided by Pegasus to
continuously delete files that are no longer needed from the limited temporary storage space
available in the cloud environment, enabling large workflows to run.
Fig. 4 shows the RNA sequence alignment and variant calls workflows developed
for SeqWare. SeqWare is currently being used in production for supporting human
RNA sequence processing as part of a $200 million grant for “The Cancer Genome
Atlas project”. Using Pegasus, the TCGA group at the University of North Carolina
was recently able to process more than 800 samples of RNA sequences for the Atlas.

4.4 Quality Assurance and Quality Control: Population Architecture Using Genomics and Epidemiology (PAGE)
Genome-wide association studies (GWAS) have allowed researchers to uncover
hundreds of genetic variants associated with common diseases. However, the
discovery of genetic variants through GWAS research represents just the first step in
the challenging process of piecing together the complex biological picture of common
diseases. The National Human Genome Research Institute (NHGRI)-funded PAGE
[14] project investigates genetic variants initially identified through GWAS research
to assess their impact in diverse populations, to identify genetic and environmental
modifiers, and to investigate associations with novel phenotypes.
One of the main requirements of the PAGE project is to submit data from the various
participating studies to the database of Genotypes and Phenotypes (dbGaP) [15]. One of
the challenges in PAGE is to ensure the quality of the data that is being submitted to the
repository. More often than not, the data submitted by individual studies is formatted
inconsistently, fields may not be documented, and data may not be standardized in terms
of given data types. To ensure that the data submitted to dbGaP adheres to the standards
required by the service, we are developing Pegasus-based Quality Assurance and
Quality Control (QA/QC) workflows that automatically check the data submission,
verify coherence between data fields and even between documents of the same submission,
and alert the submitter to the issues found via a brief report.
Fig. 5 shows the QA/QC workflow being developed for PAGE. The four
participating PAGE studies submit their results to the PAGE coordinating center
website via ftp uploads. After the data is uploaded to the results archive, the data
reception process checks the submission for completeness and re-runs sanity checks
on the submission to quickly detect simple errors and to type-check certain cells, e.g.
verifying that certain columns contain properly formatted floating point numbers. Also checked during
the data reception step is the strand orientation, a critical step when combining data
from different genotyping assays. Once the reception process is complete, 3 sets of
files for each set of submitted study data exist: the SNP summary, the phenotype
summary, and the association results. These files are then loaded into a relational
database. Rows with too low a count are prevented from loading, indices are added,

Fig. 5. The PAGE Quality Control/Analysis Workflow

and views are created as necessary for later QC steps. Each of these QC steps
comprises a sub-workflow containing several steps to verify the submitted data.
Failure of some steps is considered a critical failure resulting in rejection of the
submitted data while other steps may flag interesting data that requires verification by
the study. Additionally, the QC for association results is only performed if the QC for
SNP summaries and phenotypes succeeded. Finally an aggregated report for each
study data set submitted is produced and provided to the study for further manual
analysis and verification.
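As a small, hedged illustration of the cell-level type checking performed during data reception, the sketch below flags cells that do not parse as floating point numbers; the file name, column names and tab-delimited layout are hypothetical and not taken from the actual PAGE submission format.

    import csv

    def check_float_columns(path, float_columns, delimiter="\t"):
        # Return (line number, column, value) for every cell in the given
        # columns that does not parse as a floating point number.
        problems = []
        with open(path) as handle:
            reader = csv.DictReader(handle, delimiter=delimiter)
            for line_number, row in enumerate(reader, start=2):   # header is line 1
                for column in float_columns:
                    value = (row.get(column) or "").strip()
                    try:
                        float(value)
                    except ValueError:
                        problems.append((line_number, column, value))
        return problems

    # Hypothetical association-results file and column names.
    issues = check_float_columns("association_results.txt", ["BETA", "SE", "PVALUE"])
    for line, column, value in issues:
        print("line %d: column %s has non-numeric value %r" % (line, column, value))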

5 Workflows in a Virtual Machine

A large number of bioinformatics projects deal with human data. These data have
strict requirements regarding who can access them, how they must be stored, etc.
Because of these restrictions it can be difficult to have a hosted workflow service for
users where they can upload their datasets for analysis. In order to provide users with
an easy way to utilize existing workflows for analyzing their data, we have bundled
Pegasus WMS with several workflow pipelines [12] that users can install and run
directly on their laptops, desktops, or in a cloud environment. The virtual machine
(VM) image is built and shipped as a vmdk file. This file can be used directly using
Virtual Box [16], VMware [17] or kvm [18] software. Simple scripts are provided to
upload data into the VM, configure the workflows and execute them in a few steps.
Users can also use these virtual machines as an easy way to evaluate several
different algorithms for their analysis, or as a way to get their application code and data
ready to be used for cloud environments. Currently we have two virtual machines
available: one with two RNA sequence analysis workflows, and the other with a portal
interface that includes several smaller workflows such as copy number variation
detection, association tests, imputation, etc.

6 Related Works
Several workflow systems [22] provide a way to automate bioinformatics pipelines to
aid the burgeoning field of bioinformatics. A few of the ones that are most popular are
mentioned below. Galaxy [20] is a Python based GUI that allows a user to create
bioinformatics pipelines by creating Python wrapper modules. Galaxy is primarily a
desktop tool but now support is available to run Galaxy on clusters and clouds.
Galaxy only supports scheduling tasks on a single set of resources that it is
preconfigured to use. Taverna [21] is a GUI-based workflow manager that primarily
supports web services-based pipelines. Recent support for non-web services
workflows has been added by providing automatic wrappers around non-web service
executables. While several bioinformatics projects have used Taverna to create and
share small workflows, it has not been suitable for creating and running large-scale
pipelines. Kepler [19], a workflow framework based on Ptolemy2 [27], provides both a
GUI and a command-line interface to create and run workflows.

7 Future Works and Conclusion


With the explosion of data and computation in the bioinformatics field, a large
number of researchers are now starting to use workflow technologies to manage their
data movement and computation. While there are several different workflow systems
available, Pegasus WMS provides a proven solution when the data and computation
problems are quite large, involve legacy codes, are cross-institutional collaborative
projects, or require using a large array of resources from local desktops to clusters,
grids, and clouds. Currently, issues such as optimizing data transfers, advanced data
placements, support for status notifications, and metadata management for the data
products generated by the workflow are being investigated.
Acknowledgments. We would like to thank Michael Freitas, Brian O’Connor and the
Pegasus WMS Team. Pegasus WMS is supported by NSF OCI grant 0722019.
Population Architecture Using Genomics and Epidemiology program is funded by the
National Human Genome Research Institute (NHGRI) grant U01HG004801. The
BrainSpan project (Transcriptional Atlas of the Developing Human Brain) is
supported by NIH grants RC2MH089921, RC2MH090047 and RC2MH089929.

References
1. Deelman, E., Mehta, G., Singh, G., Su, M.H., Vahi, K.: Pegasus: Mapping Large-Scale
Workflows to Distributed Resources. In: Workflows for e-Science (2007)
2. Deelman, E., et al.: Pegasus: a Framework for Mapping Complex Scientific Workflows
onto Distributed Systems. Scientific Programming Journal 13, 219–237 (2005)
3. Juve, G., Deelman, E., Vahi, K., Mehta, G., et al.: Data Sharing Options for Scientific
Workflows on Amazon EC2. In: Proceedings of the 2010 ACM/IEEE International
Conference for High Performance Computing, Networking, Storage and Analysis (2010)
4. Frey, J., Tannenbaum, T., Livny, M., Foster, I., Tuecke, S.: Condor-G: a computation
management agent for multi-institutional grids. In: Proceedings 10th IEEE International
Symposium on High Performance Distributed Computing, vol. 5(3), pp. 55–63 (2002)
5. Litzkow, M.J., Livny, M., Mutka, M.W.: Condor: A Hunter of Idle Workstations. In: 8th
International Conference on Distributed Computing Systems (1988)

6. Couvares, P., Kosar, T., Roy, A., et al.: Workflow in Condor. In: Taylor, I., Deelman, E.,
et al. (eds.) Workflows for e-Science. Springer Press (January 2007)
7. Xu, H., Freitas, M.A.: Bioinformatics 25(10), 1341–1343 (2009)
8. Freitas, M.A., Mehta, G., et al.: Large-Scale Proteomic Data Analysis via Flexible Scalable
Workflows. In: RECOMB Satellite Conference on Computational Proteomics (2010)
9. Transcriptional Atlas of the Developing Human Brain,
http://www.brainspan.org/
10. Illumina Eland Alignment Algorithm, http://www.illumina.com
11. Chen, Y., Souaiaia, T., Chen, T.: PerM: Efficient mapping of short sequencing reads with
periodic full sensitive spaced seeds. Bioinformatics 25(19), 2514–2521 (2009)
12. Wang, Y., Mehta, G., Mayani, R., Lu, J., Souaiaia, T., et al.: RseqFlow: Workflows for
RNA-Seq data analysis. Submission: Oxford Bioinformatics-Application Notes
13. O’Connor, B., Merriman, B., Nelson, S.: SeqWare Query Engine: storing and searching
sequence data in the cloud. BMC Bioinformatics 11(suppl. 12), S2 (2010)
14. Matise, T.C., Ambite, J.L., et al.: For the PAGE Study. Population Architecture using
Genetics and Epidemiology. Am. J. Epidemiol (2011), doi:10.1093/aje/kwr160
15. Mailman, M.D., Feolo, M., Jin, Y., Kimura, M., Tryka, K., et al.: The NCBI dbGaP
Database of Genotypes and Phenotypes. Nat Genet. 39(10), 1181–1186 (2007)
16. Virtual Box, http://www.virtualbox.org/
17. VMware, http://www.vmware.com/
18. Kivity, A., Kamay, Y., Laor, D., Lublin, U., Liguori, A.: kvm: the Linux virtual machine
monitor. In: OLS 2007: The 2007 Ottawa Linux Symposium, pp. 225–230 (July 2007)
19. Ludascher, B., Altintas, I., Berkley, C., et al.: Scientific Workflow Management and the
Kepler System. Concurrency and Computation: Practice & Experience (2005)
20. Blankenberg, D., et al.: Galaxy: a web-based genome analysis tool for experimentalists. In:
Current Protocols in Molecular Biology, ch. 19, Unit 19.10.1-21 (2010)
21. Hull, D., Wolstencroft, K., Stevens, R., Goble, C., et al.: Taverna: a tool for building and
running workflows of services. Nucleic Acids Research 34, 729–732 (2006)
22. Romano, P.: Automation of in-silico data analysis processes through workflow
management systems. Briefings in Bioinformatics 9(1), 57–68 (2008)
23. Nakata, K., Lipska, B.L., Hyde, T.M., Ye, T., et al.: DISC1 splice variants are upregulated
in schizophrenia and associated with risk polymorphisms. PNAS, August 24 (2009)
24. Deelman, E., Kesselman, C., Mehta, G., et al.: GriPhyN and LIGO, Building a Virtual
Data Grid for Gravitational Wave Scientists. In: 11th Int. Symposium HPDC, HPDC11
2002, p. 225 (2002)
25. Eng, J.K., McCormack, A.L., Yates III, J.R.: An Approach to Correlate Tandem Mass
Spectral Data of Peptides with Amino Acid Sequences in a Protein Database. J. Am. Soc.
Mass. Spectrom. 5(11), 976–989 (1994)
26. Perkins, D.N., Pappin, D.J., et al.: Probability-based protein identification by searching
sequence databases using mass spectrometry data. Electrophoresis 20(18), 3551–3567 (1999)
27. Eker, J., Janneck, J., Lee, E.A., Liu, J., et al.: Taming heterogeneity - the Ptolemy
approach. Proceedings of the IEEE 91(1), 127–144 (2003)
28. Pegasus Workflow Management System, http://pegasus.isi.edu/wms
29. Teragrid, http://www.teragrid.org
30. Open Science Grid, http://www.opensciencegrid.org
31. FutureGrid, http://www.futuregrid.org
32. Nagavaram, A., Agrawal, G., et al.: A Cloud-based Dynamic Workflow for Mass
Spectrometry Data Analysis. In: Proceedings of the 7th IEEE International Conference on
e-Science (e-Science 2011) (December 2011)
Homogenizing Access to Highly
Time-Consuming Biomedical Applications
through a Web-Based Interface

Luigi Grasso, Nuria Medina-Medina, Rosana Montes-Soldado, and María M. Abad-Grau

Dept. Lenguajes y Sistemas Informáticos - CITIC, Universidad de Granada, Granada, Spain

Abstract. The exponential increase in the production of biomedical data is forcing a higher level of automation in its analysis. Therefore, biomedical researchers have to entrust bioinformaticians with developing software able to process huge amounts of data on high-performance unix-based servers. However, most of this software is developed with a very basic, text-based user interface, usually because of a lack of time. In addition, the applications are developed as independent tools, yielding a set of specific software tools with very different ways of use. This implies that end users continuously need developer support. Even worse, in many situations only the developers themselves are able to run the software every time it is required. In this contribution we present a Web-based user interface that homogenizes the way users interact with the applications installed on a server. In this way, authorized users can add their applications to the Web site at a very low cost, and researchers with no special computational skills can perform analyses by themselves, gaining the independence to run applications whenever they want at the cost of very little effort. The application is portable to any unix-like system with a php+mysql server.

Keywords: High performance biomedical applications, Web-based user interface, AJAX.

1 Introduction
Complex diseases are explained by the interaction of many genetic factors together with the environment. To shed light on which factors increase the risk of developing a complex disease and how they interact with each other, genome-wide association studies [1] as well as gene expression profiling [2], or a combination of both [4], are currently being used.
The main feature of these data is their large size, which makes analysis a highly time-consuming task. As an example, genome-wide data sets of thousands of gigabytes are becoming a common source of data to be analyzed to detect genetic factors of complex diseases. Analyses are usually performed on high-performance computers, usually running under a unix-like operating system and equipped with many processors and a large amount of random access memory. Quite often they are part of large clusters or grids. Perhaps because of the lack of time imposed by the highly competitive nature that scientific research has acquired in the biomedical field, software developers focus mostly on functional requirements and computational time. Therefore, most biomedical applications developed at the laboratory have a very simple text-based user interface. In addition, the user manual, in case it exists, is incomplete, hardly understandable and/or does not follow any standard for user manual production. This is the case of rTDT [8], BLink [7], Phase [10], FastPhase [6] or TDTP [1], all of them used to process genetic data sets.
Generally, biomedical researchers ask bioinformaticians not only to develop software but also to run it to perform their analyses, as coping with shell commands and scripts in a unix-like OS requires a steep learning curve that they usually cannot climb. The need to use applications that have text-based user interfaces and scarce user manuals forces them to invest a considerable effort every time they want to run a new software application.
In recent years, many Web-based tools have been offered to biomedical researchers to easily launch high-performance applications [5], such as those to perform DNA annotation or phylogeny reconstruction. However, as they usually can be freely accessed, the resources on their servers become very busy and they cannot be used to process large files. They usually lack flexibility too. Moreover, among those that provide a common entrance to launch more than one application, they do not gather applications from different research fields, let alone provide an integrated tool to add a new application to the system. As an example, NPS@ (Network Protein Sequence Analysis) or GPS@ (Grid Protein Sequence Analysis) [3], a more powerful version of NPS@ for grid computing, allows the user to easily interact with many of the most common resources for protein sequence analysis but cannot cope with other tasks such as combined protein sequence and expression analysis. An example of a Web-based user interface providing access to any software resource in a grid computing environment is the UCLA Grid Portal [9], which is therefore very useful to biomedical researchers with granted access to the grid. However, only portal administrators can add an application to the portal. Moreover, it is hardly portable, as it is only a portal and not a framework that can be used to create portals on any web server.
In this contribution we present a different approach to provide a Web-based user interface for easy access to any software resource. Our approach consisted in developing BioBench, a framework to create Web-based computational workbenches, i.e. Web-based user interfaces that provide access to any software installed in a computer system. The main goals of BioBench were (1) efficacy: the user can access through the Web all the disk and processing resources they are granted in the system; (2) portability, so that any laboratory may install BioBench to create its own Web application to homogenize access to its software; (3) flexibility: the Web application evolves depending on the needs of each laboratory, by easily adding every new piece of software that can be useful and removing those that are not used any more; and (4) simplicity: the tool has been designed to be used by non-expert users. Compared with more complex frameworks, the tool functionality and design are simple. Therefore, for it to run in a grid configuration there must be a software layer between the Web server and the OS with high-level networking protocols and more stringent security capabilities.
Section 2 is devoted to explaining the main features of BioBench, the framework developed for the automatic construction of Web-based computational workbenches. In Section 3 we introduce BiosH, a workbench (http://bios.ugr.es/BiosH) that has been created to provide and use software applications through a Web-based user interface. The main conclusions appear in Section 4.

2 BioBench: A Framework to Integrate and to Transform Text-Based User Interfaces into Homogeneous Web-Based Ones

2.1 Description

The main purpose of BioBench was to easily equip text-based bioinformatic


applications running under unix-like servers with a more friendly user interface
in a way that required very little time for software developers to produce this
user interface and for biomedical researchers to launch these applications.
A Web-based application seemed to be a very appropriate way to do this, as it
also would reduce installation work for users to zero. The Web interface will ease
application sharing among different users and will improve availability. Another
important requirement was to use batch processing whenever a task was going to
need a long time to be completed. Task completion would be communicated to
the user through email. Therefore, BioBench was developed with these features.
Five different types of users are managed by the system, identified by the
following roles: unidentified user, visitor, standard user, expert user and ad-
ministrator. All except an unidentified user are registered users. User roles are
related by an inheritance relationship, so that all functionalities of an ancestral
role are inherited by all its descendants (see Fig. 1).

Fig. 1. The inheritance relationship between the user roles in BioBench
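The role hierarchy of Fig. 1 can be pictured as a plain inheritance chain. The sketch below uses Python classes purely for illustration (BioBench itself is implemented in PHP), and the method names are hypothetical summaries of the functions listed in Table 1.

    class UnidentifiedUser:
        def login(self): pass
        def register_as_visitor(self): pass
        def download_biobench(self): pass

    class Visitor(UnidentifiedUser):
        def see_account_info(self): pass
        def ask_for_promotion(self): pass

    class StandardUser(Visitor):
        def run_registered_application(self, app, args): pass
        def manage_files(self): pass

    class ExpertUser(StandardUser):
        def register_application(self, app): pass
        def modify_or_delete_own_application(self, app): pass

    class Administrator(ExpertUser):
        def promote_or_step_down(self, user, role): pass
        def export_biobench(self): pass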

An unidentified user is only allowed to log in as a registered user, to register as a visitor, to access information about BioBench functionality and to download BioBench. Registered users must have a unix account on the server where a Web-based workbench, i.e. a Web-based application using BioBench, is installed.

Visitors may see their account information and ask to be promoted to a standard or expert user. Standard users can also run any software already registered in the Web-based workbench, see information about other registered users and manage their own user profiles. As standard users may want to know which other users use the applications on the server for a further collaboration, we have added the possibility for them to get that information. In order to launch an application, they have to choose the software to be run, the server path where the input files are placed, the server path where the output files should be placed, in case this is necessary, and the software arguments. Expert users have the same rights as standard users plus the ability to register a new application and to delete or modify software created by the same user. The software to be registered must be already installed on the server. The expert user has to provide several pieces of information to the system, such as the server path where the software can be found or a description of the parameters required for the software to be run. Additional responsibilities of the system administrator are to promote/step-down a user and to install and export BioBench. Its main functions are summarized in Table 1.

Table 1. Main functions of BioBench

Functionality for unidentified users


Register as a visitor
Login as a registered user
Read information about the functionality of BioBench
Read install documentation
Download BioBench
Functionality for visitors
See information about my account
Ask for promotion as standard or expert user
Functionality for standard users
See information about registered users
Run a registered application
Manage system files
Functionality for expert users
Register a new application
Modify/delete a registered application
Functionality for the system administrator
Promote/step-down a user
Install BioBench1
Export BioBench
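The sketch below illustrates, in Python rather than the PHP actually used by BioBench, how a launch request of the kind described above (choice of software, input and output paths, and arguments) might be turned into a background job; the argument-passing convention, the paths and the notification mechanism are assumptions for illustration only.

    import shlex
    import subprocess

    def launch(app_path, arguments, input_dir, output_dir, notify_email):
        # Run a registered application in the background, redirecting its output
        # to a log file in the user's output folder. A real deployment would also
        # enforce the permission protocol and disk quotas before reaching this point.
        log_path = "%s/run.log" % output_dir
        command = [app_path] + shlex.split(arguments) + [input_dir, output_dir]
        with open(log_path, "w") as log_file:
            process = subprocess.Popen(command, stdout=log_file,
                                       stderr=subprocess.STDOUT)
        # Batch semantics: do not wait here; a separate monitor would poll the
        # process and send an e-mail to notify_email when it terminates.
        return process.pid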

2.2 Design
The logic architecture of BioBench is structured in three main layers, according
to the model-view-controller design pattern, in order to separate responsibilities
1 This is the only function that has to be performed through a text-based user interface (unix terminal).

Fig. 2. Logic architecture of BioBench

and ensure that the code implementing the software functionality and accessing
the data (the model) is independent of the user interface provided to access to
this information (the view). The controller layer updates the view every time
a change is performed in the model. The model layer is divided into two sub-
systems: the user subsystem and the application subsystem. Figure 2 shows the
architecture of the model layer. The user subsystem contains the user model,
responsible for the representation and management of users. The application
subsystem is subsequently divided into three models: (1) the application model,
in charge of all the software tools for which BioBench provides a unified Web
interface, (2) the parameters model, in charge of the management of the param-
eters for each software application and (3) the folder model, which represents
and controls disk units and folders accessed by users and applications.
The physical architecture of BioBench and its interaction with users, other
software and hardware is shown in Fig. 3. BioBench can be used to create a
Web-based computational workbench in any server with a unix-like OS, Apache
server, php and MySQL. However, for it to work in a cluster or, even more, in a grid
configuration, so that applications and/or data from more than one computer can
be accessed by it, other software and additional security procedures are required.
Therefore, the use of the Apache capability has to be complemented with an
extended version named General Remote Access, which considers the URI of
the linked servers (cluster nodes or grids), the list of granted users and their
credentials. According to this architecture, we need to install a small application
on each linked server to act as client software that interacts with the main
server. This also enables the possibility of monitoring the processes running in
the server.

2.3 Software Development

To develop a highly interactive system with a fast human-machine interaction,


we chose to use Asynchronous JavaScript And XML (AJAX) to develop the
Web-based application. To speed up the application development, we chose to
use Xajax, an open-code PHP library to build Web-based applications using
AJAX. The Xajax library includes all the JavaScript functions needed to build the front-end application or update the view of the Web page, obtaining the html code from the php functions. We also used the Prototype library (version 1.6) in order to benefit from the high potential of its functionality and to simplify the implementation task. The application requires a database to store data such as information about all the applications and their parameters and the system users, and to set up a permission protocol to model relationships between actions and user roles. BioBench uses a relational database with tables created from a set of 9 entities: Action, Application, Args, Description-App, Description-Arg, Labels-act, Labels-arg, Types and User. As a database management system we chose to use MySQL. Each php object creates a connection with the database using ADOdb, a database abstraction library for PHP (and Python), and MySQL. We have adapted a simple library, called eXplorer, which allows remote management of folders and files, interacting with the Xajax library. BioBench has been developed under the GNU General Public License (GNU GPL) 3.0. A Web site (http://bios.ugr.es/BioBench), from which the application can be downloaded, has been built at bios.ugr.es, a linux server where several bioinformatic applications have been built for biomedical analyses.

Fig. 3. Physical architecture of BioBench
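A minimal sketch of how the Application/Args part of this schema might be represented, written as Python data classes purely for illustration (the real schema is a set of MySQL tables), is given below; the field names are hypothetical.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Arg:
        name: str          # label shown in the launch form
        arg_type: str      # e.g. "input file", "output file", "string", "integer"
        description: str
        position: int      # order on the generated command line

    @dataclass
    class Application:
        name: str
        path: str          # server path where the executable is installed
        background: bool   # whether it must be run as a batch job
        owner: str         # expert user who registered it
        args: List[Arg] = field(default_factory=list)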

3 BiosH: A Case Study

BioBench has been used to create BiosH, a Website (http://bios.ugr.es/BiosH) to


centralize the software that is being used at the laboratory of a group of molecular
biologists at the Spanish Council of Scientific Research (CSIC) in Granada. Lately,
one member of the team has moved to the University of Sevilla but still does scien-
tific research in collaboration with her former laboratory in Granada. Therefore,
the existence of BiosH has now become even more useful, as it is a Web-based tool
which does not need to be installed every time a user moves to a different place.

Traditionally, biologists at this laboratory needed bioinformaticians to write new software for them any time there was no available software for doing the task. As all the research fields regarding the analysis of genome or transcriptome data sets evolve at an amazingly fast pace, the need for ad-hoc software appeared frequently at the laboratory. As a solution, these bioinformaticians used BioBench to build BiosH and, as soon as it was launched, it started evolving with more and more software applications added to it.
As an example of the potential of BioBench, we describe the steps that were
done in order to use BiosH to perform a set of calculations related with genome
and transcriptome combined analysis to satisfy the research needs of some bi-
ologists at the laboratory. Table 2 shows the input data the biologists at the
laboratory had (first row) and the outputs they were looking for (second row).
The last row describes the sequence of applications and their arguments (parameters inside square brackets have to be replaced by the real arguments) that are required to get the outputs from the inputs. Therefore, all these applications were incorporated into BiosH for the biologists to use them. The ImportFormat and Transpose applications are only required to change input and output formats
respectively. SelectCommonSNPs is an application to perform a preprocessing
step of marker filtering. Finally, Genetranassoc is the application that performs
the computations to obtain association results using the Spearman correlation
measure and their p values.
For the whole task to be performed through BiosH, an expert user had to
introduce the applications that were not already in BiosH. In this use case, only Genetranassoc was not in BiosH, so it was the only application that had to be added to the system. However, its developer was not in BiosH either

Table 2. An example of task that was performed through BiosH

Input
I1. Text file with transcriptions for a set of i individuals (columns) and g genes (rows)
I2. Text file (makeped format) with genotypes from the same population
I3. Text file (makeped format) with genotypes from another population to select major alleles
I4. One-column text file with p rows with genetic positions to compute Spearman correlations
I5. The amount of permutations to be performed in order to assess statistical evidence
Output
O1. Text file with gene Spearman correlation coefficients and p values
O2. Text file with [I5] rows and g × p columns with the Spearman value for each permutation
Applications
1. ImportFormat PED [I2] [I2].gou
2. ImportFormat PED [I3] [I3].gou
3. SelectCommonSNPs [I2].gou [I2]Selection.gou I4
4. SelectCommonSNPs [I3].gou [I3]Selection.gou I4
5. Genetranassoc [I1] [I2]Selection.gou [I3]Selection.gou [O1].t
6. Transpose [O1].t.csv [O1]

and was added by the Web administrator using the option under the ‘Settings’
link available to administrators to add users. Figure 4 shows the screenshot with
the first form that was filled in to add Genetranassoc to BiosH. The main information that had to be provided, besides the application name and the path where it is installed, was whether the application has to be run in the background, the arguments required and their type and description. On the left of the figure, we can observe all the functionality an expert user has regarding the applications (named programs in the workbench). As Genetranassoc can accept 5 arguments, 5 additional forms collecting information for each parameter were filled in for the application to be used through BiosH. Once all the applications required for the
task in Tab. 2 were in BiosH, the biologist at the laboratory interested in that

Fig. 4. Screenshot showing the first form filled in to add the application Genetranassoc to BiosH

Fig. 5. Screenshot showing the form that has to be filled in to run the application SelectCommonSNPs through BiosH

task was able to do it without any help and under the role of standard user, provided that she had a user account and enough disk space to store output results in the linux system where BiosH is installed. Figure 5 shows as an example the
screenshot with the form that was filled by her to perform the step 3 described
in Tab. 2.

4 Conclusions
The quick growth of scientific research in the biomedical field and the huge amount of data from genomes, transcriptomes, etc. that has to be processed are significantly changing the way researchers work. Hand-made processing is not conceivable any more, and software applications are constantly developed to be used in the laboratory. These applications are usually run on high-performance computers with several processors and a large central memory under a unix-like OS. However, the high increase in the workload that bioinformaticians, biostatisticians or any other researchers have as software developers is forcing them to write applications with a simple text-based user interface and no user documentation, which are usually only understood and used by their creators. Moreover, many biomedical researchers are not used to text-based interfaces on unix servers and they usually ask somebody to run the applications for them. Therefore, easing the use of bioinformatic applications is a task in high demand by biomedical researchers: this way, they, instead of the software developers, may run these applications. BioBench reaches this goal as a tool to create workbenches able to provide a friendly and homogeneous Web interface to the applications installed by any user on a server. Unlike the very few similar tools, BioBench can be installed on any unix-like OS with a mysql+php Web server and every user can add their self-written software so that it can be easily shared.

Acknowledgment. The authors were supported by the Spanish Research Pro-


gram under project TIN2010-20900-C04, the Andalusian Research Program under
project P08-TIC-03717 and the European Regional Development Fund (ERDF).

References
1. Abad-Grau, M.M., Medina-Medina, N., Montes-Soldado, R., Moreno-Ortega, J.,
Matesanz, F.: Genome-wide association filtering using a highly locus-specific trans-
mission/disequilibrium test. Human Genetics 128(3), 325–344 (2010)
2. Alekseev, O.M., Richardson, R.T., Alekseev, O., O’Rand, M.G.: Analysis of gene
expression profiles in hela cells in response to overexpression or sirna-mediated
depletion of nasp. Reprod. Biol. Endocrinol. 7, 7–45 (2009)
3. Blanchet, C., Combet, C., Daric, V., Deléage, G.: Web Services Interface to Run
Protein Sequence Tools on Grid, Testcase of Protein Sequence Alignment. In:
Maglaveras, N., Chouvarda, I., Koutkias, V., Brause, R. (eds.) ISBMDA 2006.
LNCS (LNBI), vol. 4345, pp. 240–249. Springer, Heidelberg (2006)
4. Dimas, A.S., Deutsch, S., Stranger, B.E., Montgomery, S.B., Borel, C., Attar-
Cohen, H., Ingle, C., Beazley, C., Arcelus, M.G., Sekowska, M., Gagnebin, M.,
Nisbett, J., Deloukas, P., Dermitzakis, E., Antonarakis, S.E.: Common regula-
tory variation impacts gene expression in a cell type dependent manner. Sci-
ence 325(5945), 1246–1250 (2001)
5. Fox, J.A., McMillan, S., Ouellete, B.F.: A compilation of molecular biology web
servers: 2006 update on the bioinformatics links directory. Nucleic Acids Research
34, W3–W5 (2001)
6. Scheet, P., Stephens, M.: A fast and flexible statistical model for large-scale popu-
lation genotype. data: Applications to inferring missing genotypes and haplotypic
phase. Am. J. Hum. Genet. 78, 629–644 (2006)
7. Sebastiani, P., Abad-Grau, M.M.: Bayesian estimates of linkage disequilibrium.
BMC Genetics 8, 1–13 (2007)
8. Sebastiani, P., Abad-Grau, M.M., Alpargu, G., Ramoni, M.F.: Robust Transmis-
sion Disequilibrium Test for incomplete family genotypes. Genetics 168, 2329–2337
(2004)
9. Slottow, J., Korambath, P., Jin, K.: The integration of ajax, interactive x windows
applications and application input generation into the ucla grid portal. In: Proceed-
ings of the IEEE International Symposium on Parallel and Distributed Processing
(2008)
10. Stephens, M., Smith, N.J., Donelly, P.: A new statistical method for haplotype
reconstruction from population data. Am. J. Hum. Genet. 68, 978–989 (2001)
Distributed Management and Analysis
of Omics Data

Mario Cannataro and Pietro Hiram Guzzi

Department of Medical and Surgical Sciences, Bioinformatics Laboratory,


University Magna Græcia of Catanzaro, Italy
{cannataro,hguzzi}@unicz.it

Abstract. The omics term refers to different biology disciplines such


as, for instance, genomics, proteomics, or interactomics. The suffix -ome
is used to indicate the objects of study of such disciplines, such as the
genome, proteome, or interactome, and usually refers to a totality of
some sort. This paper introduces omics data and the main computational
techniques for their storage, preprocessing and analysis. The increasing
availability of omics data due to the advent of high throughput technolo-
gies poses novel issues on data management and analysis that can be
faced by parallel and distributed storage systems and algorithms. After
a survey of main omics databases, preprocessing techniques and analy-
sis approaches, the paper describes some recent bioinformatics tools in
genomics, proteomics and interactomics that use a distributed approach.

Keywords: Omics Data, Genomics, Proteomics, Interactomics,


Distributed Computing.

1 Introduction

The omics term refers to different biology disciplines such as, for instance, ge-
nomics, proteomics, or interactomics. The suffix -ome is used to indicate the
objects of study of such disciplines, for instance the genome, proteome, or in-
teractome, and usually refers to a totality of some sort. Main omics disciplines
are thus genomics, proteomics, and interactomics, that respectively study the
genome, proteome and interactome. The term omics data is used here to re-
fer to experimental data regarding the genome, proteome or interactome of an
organism.
The development of novel technologies for the investigation of the omics disciplines has caused an increased availability of omics data. Consequently, the need for both support and space for data storage, as well as procedures and structures for data exchange, arises. The resulting scenario is thus characterized by the introduction of a set of methodologies and tools enabling the management of data stored in geographically distributed databases using distributed tools often implemented as services.

Corresponding author.


Distribution of data may improve data availability, allowing scalability in terms of data and users; parallel data manipulation by different users improves the overall knowledge stored in distributed databases and, of course, enhances performance.
Main requirements of distributed management of omics data are:

– the introduction of a common shared data model able to capture both raw
data of the experiment and related metadata;
– the definition of a uniform and widely accepted access and manipulation
strategy for such large datasets;
– the design of algorithms that are aware of data distribution and thus may
improve their performance;
– the design of ad-hoc infrastructures for efficient data transfer.

For instance the distributed processing of protein interaction data involves the
following activities: (i) Sharing and dissemination of PPI data among different
databases; (ii) Collection of data stored in heterogeneous databases; and (iii)
Parallel and distributed analysis of data.
The first activity requires the development of both standards and tools to
manage the process of data curation and exchange between interaction databases.
Currently there is an ongoing project, namely the International Molecular Ex-
change Consortium (IMEx)1 , that aims to standardize the exchange of inter-
actomics data. The second activity requires solving the classical bioinformatic
problem of linking identical data identified with different primary keys. Finally,
the rationale for the third activity comes from the algorithmic nature of problems
regarding graphs. A large class of algorithms that mine interaction data can be
reduced to classical problems of graph and subgraph isomorphism, which are
computationally hard. So the need for high-performance computational platforms
as well as parallel algorithms arises.
The rest of the paper is structured as follows. Section 2 discusses the manage-
ment issues of omics data and presents some omics databases. Section 3 recalls
main techniques for analysing omics data, while Section 4 describes some parallel
and distributed bioinformatics tools for the analysis of omics data. Finally,
conclusions and future work are reported in Section 5.

2 Management of Omics Data

2.1 Genomics Databases

These databases store information about the primary sequence of proteins. Each
sequence is generally annotated with several pieces of information, e.g. the name of
the scientist who discovered the sequence or the post-translational modifications.
Users can query these databases by using a protein identifier or a fragment of
sequence in order to retrieve the most similar proteins.
1
http://imex.sourceforge.net

The EMBL Nucleotide Sequence Database2 [5], maintained at the European


Bioinformatics Institute (EBI), collects nucleotide sequences and annotations
from publicly available sources. The database, involved in an international collab-
oration, is synchronized with DDBJ (DNA Data Bank of Japan) and GenBank
(USA) (see next Sections). Core data are the protein and nucleotide sequences.
The Annotations section of this database describes the following items: (i) Func-
tion(s) of the protein; (ii) Post-translational modification(s); (iii) Domains and
sites; (iv) Disease(s) associated with deficiencies; and (v) Secondary structure.
The GenBank database3 [4] stores information about nucleotide sequences
maintained by the National Center for Biotechnology Information (NCBI). Gen-
Bank entries are structured as flat files (like the EMBL database) and share the
same structure with EMBL and DDBJ. All the entries are grouped following
both taxonomic and biochemical criteria. GenBank is accessible through a web
interface. Through the ENTREZ system, the entries of GenBank are integrated
with many data sources, enabling the search for information about proteins and
their structures, as well as literature about the functions of genes.
Finally, the UniProt [11] consortium is structured on three main knowledge
bases: (i) UniProt (also referred to as UniProt Knowledge base), that is the main
archive storing information about protein sequences and annotations extracted
from Swiss-Prot, TrEMBL and PSD-PIR; (ii) UniParc (Uniprot archive) that
contains information about proteins extracted from the main publicly available
archives; and (iii) UniRef (Uniprot reference), a set of databases that organize
entries of UniProt by their similarity sequence, e.g. the UniRef90 groupes in
a single record entries of UniProt that present at least the 90% of sequence
similarity.

2.2 Proteomics Databases


The Global Proteome Machine Database4 [12] was constructed to utilize the in-
formation obtained by the different servers included into the Global Proteome
Machine project (GPM), to validate peptide MS/MS spectra and protein cov-
erage. GPM is a system for analyzing, storing, and validating proteomics in-
formation derived from tandem mass spectrometry. The system is based on a
relational database, on different servers for data analysis, and on a user-friendly
interface to retrieve and analyze data. This database has been integrated into
GPM server pages. The gpmDB data model is based on a modification of the
HUPO-PSI Minimum Information About a Proteomics Experiment (MIAPE) [16]
scheme. The system is available both through a web interface and as a stand-alone
application, allowing users to compare their experimental results with those
previously observed by other scientists.
PeptideAtlas5 [13] is a database that aims to annotate the human genome
with protein-level information. It contains data coming from identified peptides
2
http://www.ebi.ac.uk/embl
3
http://www.ncbi.nlm.nih.gov/genbank
4
http://www.thegpm.org/GPMDB/index.html
5
http://www.peptideatlas.org/overview.php

analyzed by liquid chromatography tandem mass spectrometry (LC-MS/MS)


and thus mapped onto the genome. PeptideAtlas is not a simple repository
for mass spectrometry experiments, but uses spectra as primary information
source to annotate the genome, combining different information. Consequently
the population of this database involves two main phases: (i) a proteomic phase
in which samples are analyzed through LC-MS/MS, and resulting spectra are
mined to identify the contained peptides, (ii) an in silico phase in which peptides
are processed by applying a bioinformatic pipeline and each peptide is used to
annotate a genome. Resulting derived data, both genomics and proteomics, are
stored in the PeptideAtlas database.

2.3 Interactomics Databases

The accumulation of protein interaction data caused the introduction of several


databases [6]. Here we distinguish between databases of experimentally determined
interactions, which include all the databases storing interactions extracted from
both literature and high-throughput experiments, and databases of predicted
interactions, which store data obtained by in silico prediction. Another important
class that we report is constituted by integrated databases or meta-databases,
i.e. databases that aim to integrate data stored in other publicly available datasets.
Currently, there exist many databases that differ on biological and information science
criteria: the covered organism, the kind of interactions, the kind of interface,
the query language, the file format and the visualization of results.
Despite the existence of many databases, the resulting amount of data presents
three main problems [10]: (i) the low overlap among databases, (ii) the resulting
lack of completeness with respect to the real interactome, and (iii) the absence of
integration. Consequently, in order to perform an exhaustive data collection (e.g.
for an experiment), researchers must manually query different data sources. This
problem is faced with the introduction of databases based on the integration of
existing ones. Nevertheless, in the interactomics field, the integration of existing
databases is a complex problem not yet completely solved.
In such a scenario many different laboratories are producing data by using
different experimental techniques. Then, data can be modeled as a graph and
stored in repositories by using different technical solutions. Finally, data stored in
such databases can be mined to derive novel interactions or to extract functional
modules, i.e. subgraphs of the PPI network that have a biological meaning.
The distributed processing of protein interaction data consequently involves
the following activities: (i) Sharing and dissemination of PPI data among dif-
ferent databases; (ii) Collection of data stored in heterogeneous databases; and
(iii) Parallel and distributed analysis of data.
The first activity requires the development of both standards and tools to
manage the process of data curation and exchange between interaction databases.
Currently there is an ongoing project, namely the International Molecular Ex-
change Consortium (IMEx)6, devoted to building an enabling framework for data
6
http://imex.sourceforge.net

exchange. It is based on an existing standard for protein interaction data, the


HUPO PSI-MI format. Databases that participate in this consortium accept the
deposition of interaction data from authors, helping the researchers to annotate
the dataset through a set of ad hoc developed tools.
The second activity requires solving the classical bioinformatic problem of
linking identical data identified with different primary keys. The cPath7 tool
[9] is an open source software package for collecting and storing pathways coming from
different data sources. From a technological point of view this software is an open
source database integrated in a web application capable of collecting data from
different data sources and exporting these data through a web service interface.
The third activity is related to the possibility of processing omics data in a
parallel way. Issues include the development of parallel bioinformatics algorithms
and the development of collaborative analysis platforms (collaboratories) where
remote users can analyse data in a collaborative way.

3 Omics Data Analysis

3.1 Microarray Data Analysis

The typical dimension of microarray datasets is growing for two main reasons:
the dimension of the files generated when using a single chip and the number of
arrays involved in a single experiment are both increasing. Let us consider, for
instance, two common Affymetrix microarray files (also known as CEL files): the
older Human 133 Chip CEL file, which has a dimension of 5 MB and contains 20000
different genes, and the newer Human Gene 1.0 st, which has a typical dimension
of 10 MB and contains 33000 genes. Moreover, a single array of the Exon family
(e.g. Human Exon or Mouse Exon) can have a size of up to 100 MB. In addition,
the recent trend in genomics is to perform microarray experiments considering
a large number of samples (e.g. coming from patients and controls) [1].
From this scenario arises the need for tools and technologies to
process such huge volumes of data in an efficient way. A possible way to
achieve efficient preprocessing of microarray data is the par-
allelization of existing algorithms on multicore architectures. In such a scenario
the whole computation is distributed onto different processors, which perform
computations on smaller sets of data, and the results are finally integrated. Such a sce-
nario requires the design of new algorithms for summarisation and normalisation
that take advantage of the underlying parallel architectures. Nevertheless, a first
step in this direction can be the replication on different nodes of
existing preprocessing software that runs on smaller datasets.
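As a rough, hedged illustration of this replication strategy, the sketch below distributes CEL files across worker processes with Python's multiprocessing facilities and then merges the per-file results; preprocess_file() is a hypothetical stand-in for an existing single-node summarisation/normalisation routine, not code from any of the tools discussed here.

# Illustrative sketch only: replicate an existing preprocessing step over
# several processes, each working on a smaller set of CEL files.
from concurrent.futures import ProcessPoolExecutor

def preprocess_file(cel_path):
    # placeholder for an existing summarisation/normalisation routine run on one file
    return {"file": cel_path, "expression": {}}

def parallel_preprocess(cel_paths, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(preprocess_file, cel_paths))
    # integration step: collect the per-file results into a single structure
    return {r["file"]: r["expression"] for r in partial_results}

if __name__ == "__main__":
    print(list(parallel_preprocess(["s1.CEL", "s2.CEL", "s3.CEL"], workers=2)))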
Despite its relevance, the parallel processing of microarray data is a rela-
tively new field. An important work is affyPara [15], a
Bioconductor package for parallel preprocessing of Affymetrix microarray data.
It is freely available from the Bioconductor project. Similarly the μ-CS project
presents a framework for the analysis of microarray data based on a distributed
architecture made of different web services, internally parallel, for the annotation
and preprocessing of data. Compared to affyPara, such an approach presents
three main differences: (i) it supports more summarisation schemes,
such as Plier, (ii) it can easily be extended to newer SNP arrays, and (iii) it does not
require the installation of the Bioconductor platform.

3.2 Mass Spectrometry Data Analysis

Mass Spectrometry-based proteomics is becoming a powerful, widely used tech-
nique to identify molecular targets in different pathological conditions.
Classical bioinformatics tasks, such as protein sequence alignment, protein struc-
ture prediction, peptide identification, etc., are more and more combined with
data mining and machine learning algorithms to obtain powerful computational
platforms. Mass spectrometry produces huge volumes of data, called spectra, that
may be affected by errors and noise due to sample preparation and instrument
approximation. As a result, preprocessing and data mining algorithms require
huge amounts of computational resources. The collection, storage, and analysis
of huge mass spectra can leverage the computational power of Grids, that offer
efficient data transfer primitives, effective management of large data stores (e.g.
replica management), and high computing power.

3.3 Protein-to-Protein Interaction Data Analysis

Once an interaction network is modeled using graphs, the study of bio-
logical properties can be done using graph-based algorithms [6], associating
graph properties with biological properties of the modeled PPI. Algorithms for the
analysis of local properties of graphs may be used to analyze local properties
of PPI networks: for example, a dense distribution of nodes in a small graph region may
identify proteins (nodes) and interactions (edges) that together represent a
biological function.
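As a simplified, purely illustrative example of such a local analysis (not taken from the cited tools), the following Python sketch uses the networkx library to extract dense regions of a PPI graph through k-core decomposition; the edge list and the degree threshold are invented for the example.

import networkx as nx

def dense_regions(edges, min_degree=2):
    # build the PPI graph from (protein, protein) interaction pairs
    g = nx.Graph()
    g.add_edges_from(edges)
    # k-core decomposition keeps only nodes with at least min_degree neighbours
    core = nx.k_core(g, k=min_degree)
    # each connected component of the core is a candidate functional module
    return [core.subgraph(c).copy() for c in nx.connected_components(core)]

toy_edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
             ("P2", "P4"), ("P3", "P4"), ("P4", "P5")]
for module in dense_regions(toy_edges, min_degree=2):
    print(sorted(module.nodes()))        # e.g. ['P1', 'P2', 'P3', 'P4']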
The rationale for the distributed analysis of PPI data lies in the algorithmic
nature of problems regarding graphs. A large class of algorithms that mine inter-
action data must rely on classical algorithms for solving the graph and
subgraph isomorphism problems, which are computationally hard. Hence the need for
high-performance computational platforms arises. Currently, different software tools
that mine protein interaction networks are available through web interfaces. For
instance, NetworkBlast8 and Graemlin9, which allow the comparison of multiple
interaction networks, are both available through a web interface. Alignment algo-
rithms usually employ different heuristics to deal with the subgraph isomorphism
problem. Even so, they are usually time consuming, and the dimension of
input data is still growing, so the development of high-performance architectures
will be an important challenge in the future.
8
http://www.cs.tau.ac.il/~bnet/networkblast.htm
9
http://graemlin.stanford.edu
4 Tools for Distributed Management of Omics Data

4.1 Micro-CS
μ-CS (Microarray Cel file Summarizer) [14] is a distributed tool for the auto-
matic normalization, summarization and annotation of Affymetrix binary data.
μ-CS is based on a client-server architecture. The μ-CS client is provided both
as a plug-in of the TIGR M4 (TM4) platform and as a Java standalone tool and
enables users to read, preprocess and analyse binary microarray data, avoiding
the manual invocation of external tools (e.g. the Affymetrix Power Tools), the
manual loading of preprocessing libraries, and the management of intermediate
files. The μ-CS server automatically updates the references to the summariza-
tion and annotation libraries that are provided to the μ-CS client before the
preprocessing. The μ-CS server is based on the web services technology. Thus
μ-CS users can directly manage binary data without worrying about locating
and invoking the proper preprocessing tools and chip-specific libraries. More-
over, users of the μ-CS plugin for TM4 can manage and mine Affymetrix binary
files without using external tools, such as APT (Affymetrix Power Tools) and
related libraries.

4.2 MS-Analyzer
The analysis of Mass Spectrometry proteomics data requires the combination of
large storage systems, effective preprocessing techniques, and data mining and
visualization tools. The collection, storage and analysis of huge mass spectra pro-
duced in different laboratories can leverage the services of Computational Grids,
that offer efficient data transfer primitives, effective management of large data
stores, and large computing power. MS-Analyzer [7] is a software platform that
uses ontologies and workflows to combine spectra preprocessing tools, efficient
spectra management techniques, and off-the-shelf data mining tools to analyze
proteomics data on the Grid. Domain ontologies are used to model bioinformat-
ics knowledge about: (i) biological databases; (ii) experimental data sets; (iii)
bioinformatics software tools; and (iv) bioinformatics processes. MS-Analyzer
adopts the Service Oriented Architecture and provides both specialized spectra
management services and public available off-the-shelf data mining and visu-
alization software tools. Composition and execution of such services is carried
out through an ontology-based workflow editor and scheduler, and services are
discovered with the help of the ontologies. Finally, spectra are managed by a
specialized database.

4.3 IMPRECO
Starting from protein interaction data, a number of algorithms for the identifica-
tion of biologically meaningful modules have been introduced, such as algorithms
for the prediction of protein complexes. A protein complex is a set of mutually
interacting proteins that play a common biological role. The identification of
protein complexes in protein interaction networks is often done by searching for
small dense subgraphs. The performance of a prediction algorithm is therefore
influenced by: (i) the kind and the initial configuration of the used algorithm,
and (ii) the validity of the initial protein to protein interactions (i.e., reliability
of edges in the graph representing of the input interaction network). IMPRECO
(IMproving PREdiction of COmplexes) is a tool that combines the results of dif-
ferent predictors using an integration algorithm which is able to gather (partial)
results from different predictors and eventually produce novel predictions [8]. IM-
PRECO is based on a distributed architecture that implements the IMPRECO
integration algorithm and demonstrates its ability to predict protein complexes.
The proposed meta-predictor first invokes different available predictors wrapped
as services in a parallel way, then integrates their results using graph analysis,
and finally evaluates the predicted results by comparing them against external
databases storing experimentally determined protein complexes.
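The general pattern described above can be sketched as follows; the two toy predictors and the Jaccard-overlap merging rule are illustrative placeholders chosen for the example, not the actual IMPRECO integration algorithm.

from concurrent.futures import ThreadPoolExecutor

# toy stand-ins for protein-complex predictors wrapped as services
def predictor_a(network): return [{"P1", "P2", "P3"}]
def predictor_b(network): return [{"P2", "P3", "P4"}, {"P7", "P8"}]

def jaccard(s, t):
    return len(s & t) / len(s | t)

def integrate(predictions, threshold=0.5):
    merged = []
    for complexes in predictions:
        for c in complexes:
            for m in merged:
                if jaccard(c, m) >= threshold:
                    m |= c                 # merge strongly overlapping predictions
                    break
            else:
                merged.append(set(c))
    return merged

network = None                             # stands for the input PPI graph
with ThreadPoolExecutor() as pool:         # invoke the predictors in parallel
    results = list(pool.map(lambda f: f(network), [predictor_a, predictor_b]))
print(integrate(results))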

4.4 OntoPIN
PPI databases are often publicly available on the Internet offering to the user
the possibility to retrieve data of interest through simple querying interfaces.
Users, in fact, can conduct a search through the insertion of: (i) one or more
protein identifiers, (ii) a protein sequence, or (iii) the name of an organism.
Results may consist of, respectively, a list of proteins that interact directly with
the seed protein or that are at distance k from the seed protein, or the list of all
the interactions of an organism. However, it is often impossible to formulate even simple
queries involving biological concepts, such as retrieving all the interactions related
to glucose synthesis.
The OntoPIN project [2], conversely, demonstrates the effectiveness of the use
of ontologies for annotating interactions, starting from the annotation of nodes,
and their subsequent use for querying interaction data. The OntoPIN project is
based on three main modules:

– A framework able to extend existing PPI databases with annotations ex-
tracted from ontologies: at the bottom of the proposed software platform
there is an annotation module able to extend an existing PPI database
with annotation extracted from the Gene Ontology Annotation Database [3]
(GOA). For each protein three kinds of annotations are currently provided:
biological process, cellular compartment and molecular function.
– A system to annotate interactions starting from the annotations of interact-
ing proteins: usually annotated databases contain annotations only for single
proteins, not for interactions. For instance, if the protein A is annotated with
terms T1 , T2 , and T3 , and the protein B is annotated with terms T1 , T2 , T4 ,
and T5 , then the annotation of the interaction (A, B) is the common set:
{T1 , T2 } (see the sketch after this list).
– A system for querying such database using semantic similarity in addition to
key-based search. The realized query interface supports the following query-
ing parameters: (i) protein identifier, (ii) molecular function annotation, (iii)
cellular process annotation, (iv) cellular compartment. The user can insert a
list of parameters that will be joined in a conjunctive way, i.e. the system will
retrieve interactions whose participants are annotated with all the selected
terms.
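A minimal sketch of the last two points, with invented protein names and terms (this is not OntoPIN code), could look as follows: an interaction inherits the terms shared by its two proteins, and a conjunctive query keeps only the interactions annotated with all requested terms.

protein_terms = {
    "A": {"T1", "T2", "T3"},
    "B": {"T1", "T2", "T4", "T5"},
    "C": {"T2", "T6"},
}

def annotate_interaction(p, q):
    # the interaction annotation is the common set of terms of its proteins
    return protein_terms[p] & protein_terms[q]

def query(interactions, required_terms):
    required = set(required_terms)
    # conjunctive semantics: every requested term must annotate the interaction
    return [(p, q) for p, q in interactions
            if required <= annotate_interaction(p, q)]

interactions = [("A", "B"), ("A", "C")]
print(annotate_interaction("A", "B"))      # {'T1', 'T2'}
print(query(interactions, ["T1", "T2"]))   # [('A', 'B')]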

5 Conclusion and Future Work
Nowadays the efficient management and analysis of omics data has a big impact
on molecular biology and is a key technology in genomics as well as in molecular
medicine and clinical applications. The storage and analysis of omics data is
becoming the bottleneck in this process, so well known high performance com-
puting techniques such as Parallel and Grid Computing, as well as emerging
computational models such as Graphics Processing and Cloud Computing, are
more and more used in bioinformatics. The huge dimension of experimental data
is a first reason to implement large distributed data repositories, while high per-
formance computing is necessary both to face the complexity of bioinformatics
algorithms and to allow the efficient analysis of huge data volumes. The paper intro-
duced the main omics data types and described the use of distributed management
and analysis techniques along the whole pipeline of analysis, from data storage,
to data analysis and knowledge extraction.

References
1. Guzzi, P.H., Cannataro, M.: Challenges in microarray data management and analy-
sis. In: Proceedings of the 24th IEEE International Symposium on Computer-Based
Medical Systems, Bristol, United Kingdom, June 27-30 (2011)
2. Cannataro, M., Guzzi, P.H., Veltri, P.: Using ontologies for querying and analysing
protein-protein interaction data. Procedia CS 1(1), 997–1004 (2010)
3. Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O’Donovan, C., Apweiler, R.:
The GOA database in 2009–an integrated Gene Ontology Annotation resource.
Nucleic Acids Research 37, D396–D403 (2009)
4. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L.: Gen-
Bank. Nucleic Acids Research 36(Database issue) (2008)
5. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.-C.C., Estreicher, A.,
Gasteiger, E., Martin, M.J., Michoud, K., O’Donovan, C., Phan, I., Pilbout,
S., Schneider, M.: The SWISS-PROT protein knowledgebase and its supplement
TrEMBL in 2003. Nucleic Acids Research 31(1), 365–370 (2003)
6. Cannataro, M., Guzzi, P.H., Veltri, P.: Protein-to-protein interactions: Technolo-
gies, databases, and algorithms. ACM Comput. Surv. 43 (2010)
7. Cannataro, M., Guzzi, P.H., Mazza, T., Tradigo, G., Veltri, P.: Using ontologies
for preprocessing and mining spectra data on the grid. Future Generation Comp.
Syst. 23(1), 55–60 (2007)
8. Cannataro, M., Guzzi, P.H., Veltri, P.: Impreco: Distributed prediction of protein
complexes. Future Generation Comp. Syst. 26(3), 434–440 (2010)
9. Cerami, E., Bader, G., Gross, B.E., Sander, C.: Cpath: open source software for
collecting, storing, and querying biological pathways. BMC Bioinformatics 7(497),
1–9 (2006)
10. Chaurasia, G., Iqbal, Y., Hanig, C., Herzel, H., Wanker, E.E., Futschik, M.E.:
UniHI: an entry gate to the human protein interactome. Nucl. Acids Res. 35(suppl.
1), D590–D594 (2007)
11. The UniProt Consortium: The universal protein resource (UniProt) in 2010. Nu-
cleic Acids Research 38(suppl. 1), D142–D148 (2010)
12. Craig, R., Cortens, J.P., Beavis, R.C.: Open source system for analyzing, validating,
and storing protein identification data. Journal of Proteome Research 3(6), 1234–
1242 (2004)
13. Desiere, F., Deutsch, E.W., King, N.L., Nesvizhskii, A.I., Mallick, P., Eng, J., Chen,
S., Eddes, J., Loevenich, S.N., Aebersold, R.: The peptideatlas project. Nucleic
Acids Research 34(suppl. 1), D655–D658
14. Guzzi, P.H., Cannataro, M.: mu-cs: An extension of the tm4 platform to manage
affymetrix binary data. BMC Bioinformatics 11, 315 (2010)
15. Schmidberger, M., Vicedo, E., Mansmann, U.: Affypara: a bioconductor package
for parallelized preprocessing algorithms of affymetrix microarray data
16. Taylor, C.F., Hermjakob, H., Julian, R.K., Garavelli, J.S., Aebersold, R., Apweiler,
R.: The work of the human proteome organisation’s proteomics standards initiative
(HUPO PSI). OMICS 10(2), 145–151 (2006)
Managing and Delivering Grid Services (MDGS)

Thomas Schaaf1 , Adam S.Z. Belloum2 , Owen Appleton3 ,
Joan Serrat-Fernández4 , and Tomasz Szepieniec5
1 Ludwig-Maximilians-Universität, Munich
2 University of Amsterdam
3 Emergence Tech Limited, London
4 Universitat Politècnica de Catalunya, Barcelona
5 AGH University of Science and Technology, Krakow

The aim of the MDGS workshop is to bring together Grid experts from the (Grid)
infrastructure community with experts in IT service management in order to
present and discuss the state of the art in managing the delivery of ICT services
and how to apply these concepts and techniques to Grid environments. Up to
now, work in this area has proceeded mostly on a best-effort basis. Little effort
has been put into adopting the processes and approaches of professional (often
commercial) IT service management (ITSM).
The workshop creates a platform for both the users of Grid-based services
(e.g., high performance distributed computing users) and the people involved
in contributing to Grids and their operation (e.g., members of grid initiatives,
resource providers) to share their views on the topic of managed service deliv-
ery and related requirements and constraints. This reveals the need for defined
service levels in the form of service level agreements (SLAs) in Grid environ-
ments. Based on this, the workshop provides insight into the ITSM frameworks,
and focuses on the exchange of ideas on how the Grid community may adopt and
adapt the concepts and mechanisms of these frameworks (and the ITSM domain
in general) to benefit from them. In this context, the specific features and
characteristics of Grid environments are taken into account.
Contributions to MDGS 2011 describe ongoing work on various topics re-
lated to Service Level Management in Grid-based systems. The accepted papers
cover topics such as current best practices in Grid Service Level Manage-
ment, problems faced, potential models to be adopted from commercial IT Service Management,
and specific case studies that highlight the full complexities of the
situation.
Resource Allocation for the French National Grid
Initiative

Gilles Mathieu and Hélène Cordier

IN2P3/CNRS Computing Centre

43 bd du 11 Novembre 1918, 69622 Villeurbanne Cedex, France
{gilles.mathieu,helene.cordier}@in2p3.fr

Abstract. Distribution of resources between different communities in
production grids is the combined result of needs and policies: where the users’
needs dictate what is required, resource providers’ policies define how much is
offered and how it is offered. From a provider point of view, getting a
comprehensive and fair understanding of resources distribution is then a key
element for the establishment of any scientific policy, and a prerequisite for
delivering a high quality of service to users.
The resource allocation model which is currently applied within most
national grid initiatives (NGIs) was designed with the needs of the EGEE
(Enabling Grids for E-sciencE) projects and should now be revised: NGIs now
especially need to assess how resources and services are delivered to their
national community, and expose the return on investment for resources
delivered to international communities.
The French NGI “France Grilles” is currently investigating this route,
trying to define key principles for a national resource allocation strategy that
would answer this concern while allowing for the proper definition of service
level agreements (SLA) between users, providers and the NGI itself.
After providing clear definitions of the communities we are
dealing with, we propose to look at how resource allocation is done in other
environments such as high performance computing (HPC) and the concepts we
could possibly reuse from there while keeping the specificities of the Grid. We
then review different use-cases and scenarios before concluding on a proposal
which, together with open questions, could constitute a base for a resource
allocation strategy for the French national grid.

1 Context and Definitions

1.1 Context of Current Work

The EGI-Inspire [1] project started in May 2010, as a continuation of around 6 years of
EGEE projects [2]. In this context, the French National Grid Initiative “France Grilles”
[3] has emerged as EGI’s partner for federating and operating grid resources in France.
Within the EGI operational context [4], allocation of resources to grid users has changed
scope, since national grids are now privileged interlocutors and interfaces between users
and providers. This work is a preliminary reflection on the topic of resource allocation,

and a possible basis for establishing policies and procedures in the medium term
specific to France’s context and based on international collaboration.

1.2 Definition of “User Communities”

In this paper, “user community” is used to represent a logical grouping of users that
can be seen as a unique interlocutor for all other actors. In our Grid context, a typical
example of a user community is a Virtual Organization (VO), but this can also be
extended to Virtual Research Communities (VRCs) or a specific scientific community
federated around a given project.

1.3 Definition of “Resource Providers” and “Service Providers”

Resource providers are the entities that provide user communities with access to
computing and storage resources. They are grid resource centres or “sites” as
described in EGI operational architecture definition [4]. Service providers are entities
offering services that can be technical – e.g. core VO services, monitoring tools etc. –
or not – e.g. support or expertise. France Grilles places itself both as a service
provider and an operation centre as defined in [4].

1.4 Definition of “Resource Allocation”

We consider “resource allocation” as a process involving different partners with the
aim of providing resources and services to Grid users. The result is actual resources and
services being provided, but also agreements being set up between providers and
consumers. Involved actors are user communities, resource providers and hyper-
structures such as Grid Infrastructures. “Allocation” is understood in this context, and
should not be interpreted as “reservation”.

2 Identified Needs and Goals

2.1 Improve Service Delivery to the French Community

The French Grid Infrastructure “France Grilles”, like most of its counterparts, has been
set up to answer specific scientific needs according to a national scientific policy. Overall
supervision of resource allocation and distribution is highly desirable and should be done
with respect to this policy. An allocation policy is clearly needed to get a comprehensive and fair
understanding of resource distribution. If there is a need to rebalance this distribution
of resources between different communities (VOs, VRCs, projects, etc.), this should be
done according both to the needs and the overall scientific policy. This is an essential
contribution to a better quality of service delivered to our users.

2.2 Measure What Is Done

There has been no clear resource allocation policy so far within France Grilles: the
current resource allocation model, which was designed with the needs of the EGEE
projects in mind, should be revised to ensure the visibility and sustainability of the French Grid
Infrastructure. Beyond that, there is a clear need of accountability. Especially, France
Grilles needs to be able to:
- Assess how resources and services are delivered to the French community;
- Justify that resources delivered to international communities are not wasted,
and that there is a return on investment.

3 Inspiration from Existing Resource Allocation Mechanisms

3.1 High Performance Computing (HPC) World
Resource allocation is an important aspect of all computing infrastructures, and High
Performance Computing is no exception. In this particular domain, resource allocation
is based on scientific evaluation, through a priori (e.g. evaluation of answers to a call for
proposals) and a posteriori (regular review of supported demands) analyses. This is how
resources are allocated in the GENCI [5] project, as explained as early as 2007 in [6]
and reflected in the yearly activity report from 2009 onwards [7].
These concepts could be adapted to the Grid context. However this has to be done
with care to take into account Grid specificities such as “free” access to resources and
the absence of the concept of resource reservation.

3.2 Worldwide LHC Computing Grid (WLCG)

The WLCG [8] resource allocation model is based on a principle of pledges: to
answer LHC experiments’ needs, participating resource providers offer amounts of
computing resources in the form of pledges, under the supervision of a “Resource
Scrutiny Group”, on a yearly basis [9]. WLCG being the biggest user of France Grilles
resources, it is particularly important to take this procedure into account.

3.3 Other National Grids

Discussions and collaborations with the Polish NGI PL-Grid [10] have led to sharing
ideas and concepts about resource allocation at a national level. The PL-Grid model is a
resource-allocation-centric model [11] which makes extensive use of an SLA
management tool, the Grid Bazaar [12]. Interactions with our Polish colleagues have
already produced some of the ideas described in this paper. The use of a bazaar-like
tool is also one of the tracks we could follow in the future.

4 Definition of the Strategy

4.1 Key Principles
We propose the following principles as a basis for our resource allocation strategy:
- Decisions on how to allocate resources are based on both a priori and a
posteriori analyses, the former allowing agreement on estimated needs and the
latter focusing on measuring how much has been used
- New communities can join in and use resources without necessarily being
filtered, provided their needs are reasonable (filtering is applied only above a given
threshold on the amount of resources requested – if the user asks for any precise
amount of resources at all)
- Established user communities provide the scientific expertise needed to
validate resource allocation above this threshold
- There is a unique point of contact for all users in demand of resources
- The complexity of the model is not visible to users
- The whole model allows measuring and reporting on resource usage for both
new and established communities, either French or international

4.1.1 Who Are the User Communities We Have to Consider?

There are various kinds of user communities using French NGI resources, spanning
from international to regional, thematic or project-driven. Moreover, we are now
considering Virtual Research Communities within EGI. These VRCs will gather
several VOs spread across several projects, countries and groups. Our needs and key
principles lead to two clearly different use cases:
- Resource allocation to new users (not using the grid yet) and French
scientific communities. Those communities might not be structured yet and
can be identified by the project that federates them.
- Resource allocation to established international communities. These can
be international VOs or VRCs.

4.2 Allocating Resources to New Users and French Communities

4.2.1 “A Priori” Analysis
In the overall scenario described in Figure 1, a new user with a predefined project asks
for resources, or simply expresses interest in joining the Grid without any precise demand on
the amount of needed resources. The request is handled by the NGI through a single
point of contact that acts as a “broker”. The 3 basic questions to answer at this stage are:
1. Is there an existing VO on the Grid that could integrate this project/demand
to its activities?
2. Is the user “grid aware”, e.g. is the project ready for grids, are all
applications ported etc.?
3. Is the requested amount of resources above a given threshold, if any?
As shown in Fig. 1 (a code sketch of this decision flow follows the list below), the result of the analysis can be one of the following:
- Rejection of the project if it is considered non valid by the scientific
committee
- Redirection of the user:
o To the training activity if it is felt the project has potential but is
not grid-enabled or grid-focused;
o To a more suitable framework (e.g. HPC) if it is felt the project is not a good
use case for grids. A bi-directional process is the long-term aim
here: e.g. HPC potential users could be redirected to grids if their
need is better matched.
Fig. 1. A priori analysis for resource allocation requests from new users

- Project support through a VO based resource allocation agreement. In
this case, an existing VO accepts the new user as one of its members and
applies its own policies with regards to how much resources this user can get
from what is already available to the VO. Example: a new user with a project
in biology will probably be redirected to biomed, who will then decide what
place to give to this project within their activities.
- Project support through an NGI based resource allocation agreement.
This is the case we present in detail below.
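The decision flow of Fig. 1, referenced above, can be summarised by the following hypothetical Python sketch; the request fields, the single threshold and the returned labels are assumptions made for illustration, not France Grilles policy.

def scientific_committee_approves(request):
    # placeholder for the scientific evaluation of demands above the threshold
    return True

def a_priori_analysis(request, threshold=None):
    # hypothetical summary of the broker's decision flow
    if request.get("matching_vo"):
        return "transfer the request to the existing VO (VO based allocation)"
    if not request.get("grid_aware"):
        return "redirect the user to the training activity"
    if threshold is not None and request.get("resources", 0) > threshold:
        if not scientific_committee_approves(request):
            return "redirect the project (e.g. to HPC) or reject it"
    return "establish an NGI based resource allocation agreement"

print(a_priori_analysis({"matching_vo": False, "grid_aware": True, "resources": 500},
                        threshold=1000))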
The exact composition of the scientific committee deciding on demands above
defined thresholds will soon be defined in the context of the first French User Forum
in September 2011. This should certainly involve scientific coordinators from the user
communities, under the NGI umbrella.
The scientific committee also decides on the values for thresholds, as well as on
any additional criterion needed for the evaluation of scientific validity of a given
project and its interest for France Grilles.

4.2.2 Project Support through an NGI Based Resource Allocation Agreement
Depending on the scope of the resource allocation, each involved body should be able
to decide at its own level. Agreements on “physical” resources (e.g. CPU) should be
decided by sites, while agreements on services (e.g. support) should be made by the
NGI. This is because the final decision should be taken by whoever controls the resources.
Each site, as resource provider, has a different funding schema and is the best placed
entity to commit to provide resources. At a higher level, the NGI doesn’t have to
control these resources but could just act as a relay.
The threshold principle applied within the a priori analysis can also be used to
determine whether the project will be supported by the NGI through the creation of a
new VO or through a catch-all VO.
A process proposal is described in Fig. 2.

Fig. 2. Establishment of an NGI based resource allocation agreement
The result of the process is the establishment of a resource allocation agreement
between the resource providers (sites), service providers (sites and/or NGI) and the
user.

4.2.3 “A Posteriori” Analysis

Resource usage verification for supported projects (i.e. those that have been allocated
resources through an a priori analysis) leads to an a posteriori analysis of the initial
application and possible review of new requests by projects.
The goal of this analysis is to:
- Assess the validity of the initial request
- Monitor the possible growth of the project, and take into account new
resulting needs
At this stage, there is a need to define a second threshold on the amount of used
resources above which a user/project that has been integrated into the catch-all
community needs to “emancipate” and start a new community.
In the case of a “VO based resource allocation” (see Fig. 1), this a posteriori
analysis should allow the assessment of new needs for the considered VO. This will then be
taken into account as part of the process of allocating resources to established
communities.
Detailed workflows and implementation of an a posteriori analysis will be the
subject of a deeper study in the months to come.

4.3 Allocating Resources to Established International Communities

4.3.1 Scope of the Process

We aim for the France Grilles resource allocation strategy to include the case of international
VOs whenever possible.
We are fully aware that some project driven communities (e.g. WLCG) already
have a clear resource allocation mechanism: our goal is neither to tamper with this nor
to add an extra layer that would unnecessarily complicate the process. However, it is
particularly important to provide a framework for international VOs who wish to negotiate
resource allocation with NGIs, and to ensure that our model is in line with WLCG’s.

4.3.2 Proposed Principle

We propose to deal with international VOs/VRCs in a similar way to French
communities, by considering only the French part of this VO/VRC (e.g. LCG-France
for LHC VOs). From an NGI point of view, the interlocutor is then the representation
of this VO/VRC in France.
From a VO point of view, France Grilles can act as a facilitator to reach
agreements with sites. Depending on the granularity the VO considers convenient
to deal with, agreements can be built at the NGI level or at the site level. As an example,
the biomed VO already has MoUs of the latter category in place.
Fig. 3. Interactions between partners involved in resource allocation

4.3.3 Measuring “French” Usage of Resources

One of the needs expressed at the beginning of this paper is to report on service
delivery to the French community. While this is easy to do in the context of regional or
national VOs, it is a more complex problem in the case of international ones.
Practical methods can be implemented to distinguish between French and foreign
usage within a VO (e.g. calculating the ratio of certificate DNs issued by the French
CA). Ideally, the request made to EGI to produce as a metric the percentage of usage
per certificate DN per VO should help in that matter.
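As a purely illustrative sketch of such a metric, the function below computes the share of a VO's CPU usage attributable to certificates issued by the French CA; the record fields and the CA distinguished name are assumptions, not an actual accounting schema.

def french_usage_ratio(records, vo, french_ca="/C=FR/O=CNRS/CN=GRID2-FR"):
    # french_ca is an assumed issuer DN prefix, used here only as an example
    vo_records = [r for r in records if r["vo"] == vo]
    total = sum(r["cpu_hours"] for r in vo_records)
    french = sum(r["cpu_hours"] for r in vo_records
                 if r["issuer_dn"].startswith(french_ca))
    return french / total if total else 0.0

records = [
    {"vo": "biomed", "issuer_dn": "/C=FR/O=CNRS/CN=GRID2-FR", "cpu_hours": 120.0},
    {"vo": "biomed", "issuer_dn": "/C=DE/O=GermanGrid/CN=CA", "cpu_hours": 80.0},
]
print(french_usage_ratio(records, "biomed"))   # 0.6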
This approach can be limiting with regard to some usages of the grid, such as pilot
jobs in the scope of international communities; we expect, though, that this starting
point in our estimation of resource usage will improve with time.

5 Next Steps
In parallel with this work on a national resource allocation strategy, we are currently in the
process of implementing a national VO for France Grilles. Establishing such a VO
addresses the need for an easier integration of new users by establishing a VO
supported nationwide and open to all. Our usage scenario is to add this national VO to
the already available offer provided by local and regional VOs, so as to provide the
French community with a larger spectrum of possibilities to answer their needs. This
way we also build upon the existing structure and manpower set-up in regional grids
to remain as close as possible to the end-user.
Resource allocation through the national VO can be seen as a possible implementation
of the NGI based resource allocation agreement described earlier (see Fig.2).
As mentioned in section 4, a deeper study of the modalities of an a posteriori
analysis is also needed to make any further progress. Part of our effort in the months
to come will be dedicated to that.
Also, as explained earlier, the use of a tool to monitor and follow negotiations
between resource providers and user communities is currently under study. This
could lead to the set-up of an a posteriori usage dashboard and the possibility to drive the
process of a priori allocation of resources through the national VO. Such
assessments could also be used for the real-time implementation of the resource
allocation.

References
1. EGI-Inspire web site, http://www.egi.eu/projects/egi-inspire/
2. EGEE web site, http://www.eu-egee.org
3. France Grilles web site, http://www.france-grilles.fr
4. Ferrari, T.: EGI Operations Architecture, EU deliverable D4.1,
https://documents.egi.eu/public/ShowDocument?docid=218
5. GENCI web site, http://www.genci.fr
6. Rivière, C.: GENCI: Grand Equipement National de Calcul Intensif. Rencontre GENCI
ORAP, Paris (2007), http://www.genci.fr/spip.php?article13
7. Rivière, C.: Rapport annuel 2009 de GENCI,
http://www.genci.fr/spip.php?article92
8. WLCG web site, http://lcg.web.cern.ch
9. WLCG MoU, Annex 9 “Rules of Procedure for the Resources Scrutiny Group (RSG)”,
http://lcg.web.cern.ch/LCG/mou.htm
10. PL-Grid web site, http://www.plgrid.pl
11. Szepieniec, T., Radecki, M., Tomanek, M.: A Resource Allocation-centric Grid Operation
Model. In: Proceedings of the ISGC 2010 Conference, Taipei, Taiwan (2010)
12. Bazaar Project Web Page, http://grid.cyfronet.pl/bazaar
On Importance of Service Level Management
in Grids

Tomasz Szepieniec1 , Joanna Kocot1 , Thomas Schaaf2 ,
Owen Appleton3 , Matti Heikkurinen3 , Adam S.Z. Belloum4 ,
Joan Serrat-Fernández5, and Martin Metzker2
1 ACC Cyfronet AGH, Krakow
2 Ludwig-Maximilians-Universitaet, Munich
3 Emergence Tech Limited, London
4 University of Amsterdam
5 Universitat Politècnica de Catalunya, Barcelona

Abstract. The recent years saw an evolution of Grid technologies from
early ideas to production deployments. At the same time, the expecta-
tions for Grids shifted from idealistic hopes — buoyed by the successes
of the initial testbeds — to disillusionment with available implementa-
tions when applied to large-scale general purpose computing. In this pa-
per, we argue that a mature e-Infrastructure aiming to bridge the gaps
between visions and realities cannot be delivered without introducing
Service Level Management (SLM). To support this thesis, we present an
analysis of the Grid foundations and definitions that shows that SLM-
related ideas were incorporated in them from the beginning. Next, we
describe how implementing SLM in Grids could improve the usability
and user-experience of the infrastructure – both for its customers and
service providers. We also present a selection of real-life Grid application
scenarios that are important for the research communities supported by
the Grid, but cannot be efficiently supported without the SLM process in
place. In addition, the paper contains an introduction to SLM, a discussion
of what introducing SLM to Grids might mean in practice, and a review of
the current efforts already made in this field.

Keywords: SLM, service delivery, Grids, ITIL.

1 Introduction

Since the 1990’s, when the term ‘Grid’ was coined, Grids have changed from early
prototype implementations to production infrastructures. However, despite ma-
turing considerably during this time, Grids still suffer from the lack of service
management solutions that would be suited to an infrastructure of the size and
user base of the current Grid. The maturing Grid technologies need to incor-
porate understanding of the business models of the users and service providers.
When possible, they should be composed from standard business solutions that
support service management and delivery.

In parallel with the evolution of the infrastructures, the understanding of
what the Grid should be was subject to a change that was by no means less
significant and rapid. The users of the Grids, as well as the specialists from the
distributed computing domain, were at first fascinated with the potential of the
Grid technology. However, they gradually became disappointed with what was
really offered to them, and shifted towards new trends and paradigms (with the
same elevated hopes). This change is somewhat alarming, as these new technolo-
gies will be (or already are) facing the same problems [1] related to the provision of
computational and storage resources to the users. There is a danger of repeating
the vicious enthusiasm-disillusionment cycle, as long as the users are looking for
“miracle cures” and there are over-optimistic proponents of untested solutions.
The authors believe that the Grid technologies are in fact mature enough
to meet most of the user needs. However, the quality of service provision and
management needs much more attention. A professional service management
approach is the key to engaging with users and improving their satisfaction with the ser-
vices. It also acts as a tool for capturing and transferring requirements and best
practices that can be used for more informed evaluation of new e-Infrastructure
services (and more efficient uptake, eventually). This process becomes even more
critical when an e-Infrastructure intends to serve more and more demanding and
complex projects. For customers engaged in such initiatives, warranties related to
resource provisioning and service level are crucial. The common de facto assump-
tion of the e-Infrastructure service providers, which sees any vague, qualitative
service level or best-effort operation (beyond what is provided by the software it-
self) as sufficient, is no longer valid. Grid Computing and other e-Infrastructures
must follow similar paths towards maturity as the general solutions available for
IT services. The realization of this goal can be sped up by basing it on doc-
uments such as ITIL [2], which provide best practices for implementation and
management of processes important to contemporary Grids.
In this paper, we argue that mature and competitive e-Infrastructures im-
plementing the Grid ideas cannot be delivered efficiently without implementing
processes of Service Level Management (SLM). To prove this thesis, we provide
a range of arguments – starting from the elements of Grid theory in Sec. 4,
through the analysis of benefits for Grid customers and providers in Sec. 5 that
SLM might bring, to actual scenarios of real computations using Grids in Sec. 6.
Additionally, we give a short introduction to SLM in Sec. 2, and to ideas on how
SLM can be applied to Grids in Sec. 3. Related works and implementations are
presented in Sec. 7 and in Sec. 8.

2 Background: On the Relevance of SLM

The efficient delivery of high-quality IT services — especially in a constantly
changing environment, with ever-growing customer and user demands — poses
a major challenge for the IT service providers. To rise to this challenge, more
and more (commercial) providers are adopting IT service management (ITSM)
processes as described by the IT Infrastructure Library (ITIL) [2]1 , or ISO/IEC
20000 [3]2 .
ITSM can be regarded as a set of organisational capabilities and processes
required by an IT (service) provider to keep its utility and warranty promises and
commitments. In this context, Service Level Management and Service Delivery
Management are the most important sub-disciplines:
Service Level Management (SLM) describes the processes of:
– defining a catalogue of IT service offerings;
– specifying services and service components, including their dependencies
and available service level options;
– negotiating and signing Service Level Agreements (SLAs) with customers,
underpinning SLAs with internal Operational Level Agreements (OLAs)
and suitable contracts with external suppliers,
– monitoring and reporting on the fulfilment of SLAs as well as (early)
notifications of SLA violations.
Service Delivery Management (SDM) provides guidelines for managing
the delivery of SLA-aware IT services through their lifecycle including:
– planning of details of service delivery;
– monitoring and reporting on capacity, availability, continuity, and secu-
rity;
– managing changes and releases in a controlled manner;
– maintaining accurate information on the infrastructure and its configu-
ration;
– handling incidents and user requests, and resolving and avoiding prob-
lems.
Following a process approach in the implementation of Service Level Manage-
ment and Service Delivery Management means providing a clear definition of
tasks, activities and procedures. This must be supported by unambiguous dele-
gation of responsibilities, identification of all interfaces, as well as steps to ensure
adequate documentation, traceability and repeatability of all processes.
The main focus of this paper is Service Level Management, since it forms the
foundation for effective Service Delivery Management. In general, SLM is a vital
part of customer-oriented provision of high-quality IT services. It is important
for achieving an improved relationship between IT service providers and their
customers, as well as for aligning “what the IT people do” with “what the busi-
ness requires”. In the relationship management domain SLM provides common
understanding of expectations, mutual responsibilities and potential constraints
between different domains.
Various approaches for supporting effective SLM have evolved from research
and practice, mostly focused on business IT, throughout the last decade. Still,
it should be noted that SLM in general is evolving beyond the “traditional” IT
service provisioning scenarios – hence, introducing it to such infrastructures as
Grids can be seen as part of a natural progression.
1
A set of handbooks describing good practices for ITSM.
2
An international standard for ITSM which features a process framework, which is
in many aspects aligned with ITIL.
3 Model of SLM for Grids

Before specifying how the Service Level Management principles can be applied
to Grid infrastructures, the main actors and relations between them have to be
identified and described. The main actors considered in an SLM model of a Grid
infrastructure are:
– Virtual Organisation (VO) is a set of individuals and organisations (i.e.
users) that cooperate by sharing resources according to formal or informal
contract which defines the rules of cooperation. We understand that a Virtual
Organisation is the customer of a Grid Initiative.
– Grid Initiative (GI) is an approved body that provides Grid computing ser-
vices or represents Grid providers in a region, country or group of countries.
Grid Initiatives may be organised in larger bodies, creating a hierarchical
structure, with primary GIs federated in secondary GIs etc. In Europe, for
example, the primary GIs are created at national level – forming National
Grid Initiatives (NGIs), and are federated in the European Grid Initiative
(EGI.eu). The infrastructure and middleware supporting the GIs on all levels
constitute a Grid. The GI is a single point of contact for a VO, representing
the Grid as a whole. The added value of a GI may range from a simple ag-
gregation (GI as “mediator”) to full integration (GI as “service provider”)
of the underlying resources.
– Site is an infrastructure provider that offers computing and storage infras-
tructure available through Grid protocols; they usually do not provide the
whole set of technical services to support Grid Computing.
– An External Partner/Supplier supports any of the above-mentioned
primary actors in the fulfilment of their duties.
The SLM model for Grids assumes that not all the actors interact with each other
directly, and that the interaction between the parties can be formalised using
a set of agreements. The model for these interactions is presented in Fig. 1. It
allows the relationships to form a hierarchical SLA and OLA framework that is com-
patible with a general ITSM approach. However, the model was designed to be
applicable to different types of GIs in terms of the amount of extra warranties added
at higher levels of OLAs or SLAs. The model allows the following interactions
(a schematic data-structure sketch follows the list below):

– VO – GI: The GI is responsible for provision of a Grid service to the VO.
Formalisation of such a relationship is done through a Service Level Agreement
(SLA). The SLA describes the Grid service, documents Service Level Targets, and
specifies the responsibilities of the GI and the VO.
– GI – Site: The Site is responsible for delivering services to the GI customers
(VOs). These relationships are formally described with an Operation Level
Agreement (OLA). The OLA framework within a GI supports the fulfilment
of the targets agreed in the SLAs between the GI and its VOs. Hence, OLAs
may be established for one of the two purposes: in order to support one or
more specific existing or intended SLAs, or as a general and/or preparatory
basis for establishing new services/SLAs.
Fig. 1. OLA/SLAs defining relations between actors in Grids. 1y GI and 2y GI stand
for Primary and Secondary GI, respectively.

– Primary GI – Secondary GI: The nature of this relation is similar to the
interaction between GI and a Site, and may be described with the same
formalisms (OLAs).
– External Partner/Supplier – GI: The relationship between External Part-
ner/Supplier and a GI (primary or secondary) or any other actor is formalised
through an Underpinning Contract (UC). As UCs are formal contracts with
external bodies, they may contain references to general terms and conditions
or specification of commercial and legal details.
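A schematic sketch of these agreements as data structures, assumed purely for illustration (the field names and targets are invented, and this is not an implementation belonging to the model above), might look as follows.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Agreement:
    provider: str
    customer: str
    service_level_targets: Dict[str, float]   # e.g. {"availability": 0.95}

@dataclass
class OLA(Agreement):
    pass    # GI <-> Site, or primary GI <-> secondary GI

@dataclass
class UC(Agreement):
    pass    # underpinning contract with an external partner/supplier

@dataclass
class SLA(Agreement):
    underpinning_olas: List[OLA] = field(default_factory=list)
    underpinning_contracts: List[UC] = field(default_factory=list)

sla = SLA(provider="GI", customer="VO-biomed",
          service_level_targets={"availability": 0.95},
          underpinning_olas=[OLA("Site-A", "GI", {"availability": 0.97})])
print(sla.service_level_targets)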

4 Elements of SLM in Grid Theory

The term Grid was introduced to describe a federated infrastructure, providing
computing resources to its users. Ian Foster, considered as one of the original
authors of the Grid concept, required delivering nontrivial qualities of service as
one of the three main characteristics of the Grid in his most commonly cited defi-
nition of the Grid [4]. This feature was also explained as various qualities of service
which are set up to meet complex user demands. Even if Foster’s definitions
do not explicitly mention a need for negotiating and signing an agreement –
an SLA – it is obvious that the quality of service needs to be described using
measurable metrics.
On the other hand, Plaszczak and Weiner [5] claim that one of three main
advantages of Grid Computing is on-demand provisioning as opposed to clas-
sical resource provisioning realised by purchasing and installing hardware and
software. If this process is to be reliable for the user, a kind of warranty that the
provisioned resources are available when they are needed is crucial. Therefore,
such a warranty has to be a subject of an agreement between the provider and
the customer.
A similar conclusion can be drawn from [6], where the authors introduce a
distinction between customers with specified expectations and customers that
can accept any (unspecified) quality of service (QoS). According to the authors,
the former need to be serviced by so-called “commercial” Grids, which require
an SLA framework. The latter are just a limited class of users and applica-
tions – which means that without SLM, Grid technologies will become niche
technologies of very limited usage.
However, the history of Grid Computing, apart from these ambitious theo-
ries, also provides an explanation of why the current infrastructures provide their
resources only on a best-effort basis. The first implementations of Grid-like tech-
nologies were built by voluntary computing based on desktop machines, like the
Seti@Home Project3, in which only best-effort QoS approaches were possible.
Many people still believe that Grid Computing is simply voluntary computing
and it will always remain of low QoS4 . We consider this view a stereotype. One
can note that technologies aimed at federating resources are orthogonal to single
resource reliability. It applies to federated resources with low reliability (vol-
untary computing), as well as to resources with high reliability (professionally
managed computer centres) – both types of resources have their own groups
of users. Obviously, reliability and other QoS parameters of federated infras-
tructures are strictly related to the same parameters with which single services
are provided and it means that providing federated resources with high level of
quality is possible.

5 Actors Perspective
In this section we analyse how each actor of the SLM model for Grids would
benefit from introducing Service Level Management solutions. We will also assess
the cost of such an operation.

5.1 Customer Perspective

The groups most directly interested in the quality of the Grid services are their
users. Here we focus on issues typically raised by them, which can be solved by
introducing SLM to the Grid infrastructure. This analysis is partially based on a
survey the authors performed on users of different Grid infrastructures and the
Virtual Organisation managers gathered at EGI User Forum 20115.
Any user’s activity in the infrastructure is usually a stage of a scientific plan,
project or experiment that needs to be accomplished in a limited time. This is
strictly related to a need for a warranty of availability of certain resources, ful-
filling certain requirements (parameters) within the requested time. Therefore,
a crucial benefit of introducing the SLM mechanisms to Grid is the possibility
of planning ahead the activities that require services. This requirement was con-
firmed by the aforementioned survey results – the users perceived “no or poor
warranty of obtaining resources in reasonable (finite) time” to be the second6
most discouraging issue with Grid technology. Also, “improving warranty” was
the most desired improvement suggested by the questioned users.
3
http://setiathome.berkeley.edu/
4
This observation was confirmed recently by a survey performed on participants of
the International Supercomputing Conference ISC’11.
5
http://uf2011.egi.eu/
Naturally, the introduction of any kind of warranty introduces additional man-
agerial overhead for users who need to apply and negotiate for such warranties.
The balance between the benefits and the costs of this additional effort seems to be an important suc-
cess factor of the SLM deployment. In our survey, 60% of respondents indicated
that they are ready to invest in more strict and complex procedures in exchange
for improvements in the Grid quality issues.
The motivation for improving the management of Grid services can also be
drawn from how the users evaluate the infrastructures they have had experience
with. It is significant that the quality of the resource provision and management
in most infrastructures, according to their users, is considerably lower than the
quality of the resources themselves. The disparity of the average grade spans from
0.52 to 1.35 on a 1-5 scale for the larger, international infrastructures, while for
national infrastructures the same parameters are perceived as slightly better.
All this agrees with a fundamental psychological fact: users’ satisfaction strongly
depends on the predictability of resource characteristics.

5.2 Sites Perspective

Usually, sites (or resource providers in general) tend to be reluctant to adopt
SLM, as they see themselves as the side that is forced to promise and give
warranties. However, deeper analysis shows some important benefits for them
too, coming from the adoption of SLM.
Primarily, in the SLM process, a provider obtains a detailed specification of the
user needs, usually some time in advance. This gives them the opportunity to bet-
ter manage the resource provision and perform capacity planning – by allowing the
providers to better estimate the parameters of the resources that would be needed
by customers. That, in consequence, leads to optimising the resource costs. Based
on the known requirements, the provider can better handle prioritization also in
terms of executing internal policies and preferences. What is more, introducing
SLM facilitates (or enables) accounting, which, especially in academia, usually
requires justification in terms of reported results of scientific research. With SLM
the previous result reports may serve in negotiating new SLAs.
In a longer-term perspective, SLM stimulates and strengthens the relations with cus-
tomers, which naturally results in an evolution of the maturity of the providers, who
better know customer needs and can assess their satisfaction. SLM may also serve
to improve the communication with the customers to keep the providers better
informed about the user needs, distribute offers and provide means for marketing
solutions. The latter actions are now usually neglected by the resource providers,
close collaboration with the users.
6
The first were “technical difficulties” – which are out of scope of this document.


close collaboration with the users.
The main cost of introducing SLM, from the resource providers’ point of view,
is the additional effort of maintaining SLM-related processes, which include ne-
gotiations, reconfiguration of resources, usage monitoring, and accounting.

5.3 Grid Infrastructure Provider Perspective

The Grid Infrastructure can be perceived as a virtual resource provider, since
what it offers are resources operated by other (physical) resource providers.
So, the benefits and costs of adopting SLM are similar to those of a provider described
above. However, the perspective of a GI is broader, as it can handle many resource
providers.
In ITIL, all activities should be focused on delivering more value to the users.
Value comes from two elements: the resources themselves and the warranties. In terms
of resources, a GI usually does not provide anything that cannot be delivered by
the sites. However, by maintaining OLAs with sites, the GI can deliver more
warranty than any specific task. In that sense, a GI that implements SLM can
provide more value to customers, and, in this way, be competitive among the
Grid infrastructures. Otherwise, the GI’s role is limited to providing technical
solutions for integration.

6 Application Scenarios That Require SLM

6.1 Large Collaboration Case

From the beginning of the realisation of the Grid concepts, the key Grid customers
have been large-scale projects with worldwide collaboration. For such projects,
implementation of at least some processes from Service Level Management seems
unavoidable. Using the example of the main customers of the European Grid
Initiative (EGI) [7], we show how SLM was necessary for them and how it was
realised.
The most representative example of a large European project using Grids is the
Large Hadron Collider (LHC) built at CERN. The LHC is the largest research device
worldwide, gathering thousands of researchers in four different experiments.
Each of these experiments requires petabytes of storage space and thousands of
CPU cores to process its data, which is produced continuously while the LHC is
running. Therefore, enough resources capable of handling large data volumes and
throughput must be supplied, both in the short and in the long term. This includes
computational and storage resources, as well as network facilities. Thus, the
long-term goals require a special focus on infrastructure planning.
The LHC approach to defining resource-related contracts was to launch a process
of Memorandum of Understanding (MoU) preparation and signing. The process was
extremely hard and problematic: it required many face-to-face meetings and took
several months, and an MoU had to be signed by each Resource Provider. It was
possible to agree only on very general metrics related to the capacity of resources
provided in the long term. Signing the MoUs was planned as a one-time action; in
addition, however, the MoUs required fulfilling other quality metrics defined by
OLAs acknowledged by the sites entering the Grid infrastructure. These metrics
were not related to any specific customer.
Even with these simple means, the result of signing the MoUs was a considerable
increase in the job success rate (a factor describing the fraction of tasks
submitted to the Grid that complete normally) [9].

6.2 Data Challenges


Other important scenarios for Grid computing are experiments that need to
mobilise large resources for a relatively short period of time. Usually, such
experiments, also known as data challenges, are planned around a public event or
the tight schedule of some research effort; the International Telecommunication
Union Regional Radio Conference in 2006 may serve here as an example. During this
event, representatives from 120 countries negotiated a radio-frequencies plan7.
The conference lasted about one month and required weekly major revisions of the
global plan and daily overnight refinements for certain regions. Both processes
were computationally intensive and had to be completed within a defined period of
time, as their results were needed to continue the negotiations.
Scenarios like this clearly show the need for a warranty that the required
resources will be available on time and in sufficient amounts. Technically, for
such well-described and scheduled computations, a Service Level Agreement would
specify resource reservations, and the reservation schedule can be a subject of
negotiation before the SLA is signed.

6.3 Urgent Computing


Urgent computing is a class of applications that typically request a large amount
of computing power at a specific time. Early Warning Systems (EWS) are a good
example of such applications. New generations of EWS frameworks, such as the one
targeted by the UrbanFlood project8, extend existing systems with new Internet and
sensor-network technologies. The EWS targeted in the UrbanFlood project runs as an
Internet service able to host multiple EWSs, corresponding to various environmental
issues and belonging to different organizations and authorities. In such systems,
data streams from sensors need to be processed in order to analyze the current
status of the monitored systems, make a prediction, validate a model, or recommend
an action. Sophisticated, computationally intensive simulation models are used to
process the collected data.

7
https://twiki.cern.ch/twiki/pub/ArdaGrid/ITUConferenceIndex/
C5-May2006-RRC06-2.ppt
8
http://www.urbanflood.eu

In case of an emergency, the data processing needs to be delivered in real time;
urgent computations might then be triggered automatically and require rapid
access to large computing power.
Even though Grids have the potential, in terms of available computing power, to
handle such emergency cases, there is no guarantee that the required computing
power will be available at the time it is needed: the current state of Grid
administration is based on best effort, and the queuing time of jobs submitted to
the Grid varies from a couple of minutes to a couple of hours, and sometimes more.
Reservation of computing resources is currently the only way to guarantee the
availability of computing resources in the Grid, but it is not applicable in the
case of urgent computing, as it is not known in advance when the resources will be
needed. Furthermore, in some emergency scenarios, such as flooding, the Grid
infrastructure itself could be hampered, leaving Grid sites isolated and
unreachable. For this reason it would be ideal to use resources close to the sensor
data so as to limit the points of failure. In major emergencies it would also be
ideal to have mutable resources, i.e. computing resources that stop whatever they
are doing and focus all their attention on the emergency.
The solution to urgent computing cases clearly lies in the area of SLM. The
scenario also shows how different SLAs may influence each other, e.g. one SLA may
include the option of killing jobs run under other SLAs in case of an emergency.
It also shows that an SLA should be reflected in the configuration of the site and
linked with a proper authorization procedure [8]. These remain valid research
tasks.

7 Related Works

The enhanced Telecom Operations Map (e-TOM) [10] is a reference framework,
promoted by the TeleManagement Forum, for processes to be conducted within
network operators and service providers. e-TOM is hierarchical, in the sense that
processes are grouped into categories, or levels. Among other things, it puts
special emphasis on service delivery and service level management. In the
following paragraphs we summarize the structure of that framework with respect
to SLM.
Service Level Management is covered by a level-2 process called Service Quality
Management. This, in turn, is decomposed into seven level-3 processes, which are
meant to monitor, analyse, improve and report the service quality. In addition,
in case of service degradation, these processes track and manage the resolution
of service quality issues.
SLAs, OLAs and SLSs are defined within an (informal) model in [11]. These
concepts are related to each other and to a set of actors and metrics, which
allows potential performance degradations to be determined according to what is
established in the above contracts.
Management of SLAs is also of particular relevance within e-TOM. A closely
related process in that field is Customer QoS/SLA Management, a level-2 process
that is decomposed into six level-3 processes aimed at assessing and reporting
SLA fulfilment to the customer. These level-3 processes also cover the lifecycle
of managing and resolving eventual SLA violations.
Performance management is considered not only in the relationship between the
service provider and its customers, but also between the service provider and its
partners/suppliers. In this respect, it is worth mentioning the level-2 process
called Supplier/Partner Performance Management. It decomposes further into five
level-3 processes covering aspects such as performance assessment, its reporting,
and the actions to be undertaken in case the contracted quality drops below
established thresholds. The performance of the service to be provided by a supplier
or a partner is also captured in SLAs (Supplier/Partner SLAs).

8 Examples of Implementation of SLM Elements in Grids


Although there has been no coordinated effort to introduce Service Level
Management in the main European Grid initiatives (the gSLM project is the first
one), there have been several attempts to implement some of its aspects in the
infrastructures, mainly (but not only) at the national level.
Examples of such attempts are the projects SLA@SOI9 and SLA4D-Grid10. The first
project is concerned mainly with service-oriented infrastructures and is aimed at
industrial use cases. Its main concern is assuring predictability and dependability
for the business partners. These features are achieved by introducing a framework
for automatic SLA negotiation and management, which may not be feasible in
infrastructures as large as Grids. The aim of the second project is to design and
implement a Service Level Agreement layer in the middleware stack of the German
national Grid initiative D-Grid. The SLAs targeted in the project provide
warranties of quality of service and of fulfilment of the negotiated business
conditions. The SLA4D-Grid project focuses on tools for automatic SLA creation and
negotiation, offering also support for monitoring and accounting; it does not,
however, provide a model for an integrated SLA framework enabling interaction with
Grid infrastructures other than D-Grid. An important part of these efforts is the
standardisation of SLA negotiation protocols based on WS-Agreement [12].
Within the PL-Grid Project11, a project called Grid Resource Bazaar [13] has
developed a framework for Service Level Agreement negotiation, designed for
resource allocation in the Polish NGI. In this framework, an NGI can act as a
mediator between user groups and sites. Users can apply for resources, specifying
several optional metrics that are later translated into computational and storage
resource configurations. Sites can optionally delegate negotiation of some SLAs
to the NGI, based on special types of Operational Level Agreements. The process is
maintained using a specialized Web-based platform that facilitates complexity
management.
9
http://www.sla-at-soi.eu/
10
http://www.sla4d-grid.de/
11
http://www.plgrid.pl/en

9 Summary

In this paper the authors advocate introducing Service Level Management to Grid
computing. Extensive analysis shows the need for such solutions and the benefits
they could provide. However, it is also clear that adoption of SLM processes in
federated infrastructures is challenging and requires considerable effort to
deliver and maintain. Today, IT infrastructure services are moving closer to other
industrial standards in their approach, in order to meet the business requirements
of their customers. Grid infrastructures cannot ignore this trend, lest they lose
their users as a consequence of poor levels of user satisfaction.

Acknowledgments. This work was funded by the EU FP7 gSLM project. Tomasz
Szepieniec thanks the PL-Grid Project for its support of this work.

References
1. Schwiegelshohn, U., et al.: Perspectives on Grid computing. Future Generation
Computer Systems 26(8), 1104–1115 (2010)
2. Taylor, S., Lloyd, V., Rudd, C.: ITIL v3 - Service Design, Crown, UK (2007)
3. ISO/IEC 20000-1:2011 IT Service Management System Standard
4. Foster, I.: What is the Grid: A Three Point Checklist, Grid Today, July 20 (2002)
5. Plaszczak, P., Wellner, R.: Grid computing: the savvy manager’s guide. Elsevier
(2006) ISBN: 978-0-12-742503-0
6. Leff, A., Rayfield, J.T., Dias, D.M.: Service-Level Agreements and Commercial
Grids. IEEE Internet Computing 7(4), 44–50 (2003)
7. Candiello, A., Cresti, D., Ferrari, T., et al.: A Business Model for the Establishment
of the European Grid Infrastructure. In: The Proc. of the 17th Int. Conference on
Computing in High Energy and Nuclear Physics (CHEP 2009), Prague (March
2009)
8. Kryza, B., Dutka, L., Slota, R., Kitowski, J.: Dynamic VO Establishment in Dis-
tributed Heterogeneous Business Environments. In: Allen, G., Nabrzyski, J., Seidel,
E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2009, Part II. LNCS,
vol. 5545, pp. 709–718. Springer, Heidelberg (2009)
9. Moscicki, J., Lamanna, M., Bubak, M., Sloot, P.: Processing moldable tasks on
the Grid: late job binding with lightweight User-level Overlay. Accepted for pub.
in FGCS (2011)
10. Business Process Framework, Release 8.0, GB921, TMForum (June 2009)
11. SLA Management Handbook, Release 3.0, GB917. TMForum (May 2010)
12. WS-Agreement Specification, version 1.0. Approved by OGF,
http://forge.gridforum.org/
13. Szepieniec, T., Tomanek, M., Twarog, T.: Grid Resource Bazaar. In: Cracow Grid
Workshop 2009 Proc., Krakow (2010)
On-Line Monitoring of Service-Level
Agreements in the Grid

Bartosz Balis1,2 , Renata Slota1 , Jacek Kitowski1 , and Marian Bubak1,3


1
AGH University of Science and Technology,
Department of Computer Science, Krakow, Poland
2
ACC Cyfronet AGH, Krakow, Poland
3
University of Amsterdam, Institute for Informatics, Amsterdam, The Netherlands
balis@agh.edu.pl

Abstract. Monitoring of Service Level Agreements is a crucial phase of


SLA management. In the most challenging case, monitoring of SLA ful-
fillment is required in (near) real-time and needs to combine performance
data regarding multiple distributed services and resources. Currently ex-
isting Grid monitoring and information services do not provide adequate
on-line monitoring capabilities to fulfill this case. We present an applica-
tion of Complex Event Processing principles and technologies for on-line
SLA monitoring in the Grid. The capabilities of the presented SLA mon-
itoring framework include (1) on-demand definition of SLA metrics us-
ing a high-level query language; (2) real-time calculation of the defined
SLA metrics; (3) advanced query capabilities which allow for defining
high-level complex metrics derived from basic metrics. SLA monitoring
of data-intensive grid jobs serves as a case study to demonstrate the
capabilities of the approach.

Keywords: on-line monitoring, SLA monitoring, Grid computing, com-


plex event processing.

1 Introduction and Motivation

Grid infrastructures federate resources from different providers [11], hence Ser-
vice Level Agreements between computing centers comprising the Grid, and
users running jobs, are needed to ensure the desired quality of service [10,7]. An
essential phase in SLA management is the monitoring of SLA fulfillment. The
prevailing approach is off-line SLA monitoring: data about resource usage and
performance is periodically sampled, stored, and subsequently analyzed for SLA
violations, like in the European EGI/EGEE infrastructure [13]. In on-line SLA
monitoring, on the other hand, resource usage and performance are analyzed
on the fly which allows for immediate alerts or corrective actions when an SLA
violation is detected or predicted.
We present a framework for on-line monitoring of SLA contracts in the Grid.
The solution is based on leveraging Complex Event Processing for on-line mon-
itoring in the Grid – GEMINI2 [1]. In this approach, basic SLA performance


metrics are collected on-line, while complex SLA metrics can be defined on demand
as queries in a general-purpose continuous query language (EPL) and calculated in
real time in a CEP engine. Advanced query capabilities are afforded
by this approach: value aggregations, filtering, distributed correlations, joining
of multiple streams of basic metrics, etc. Furthermore, client-perspective SLA
monitoring is made possible. The capabilities of the solution are demonstrated
in a case study: SLA monitoring of data-intensive Grid jobs.
This paper is organized as follows. Section 2 presents related work. Section 3
describes the framework for on-line SLA Monitoring in the Grid. In section 4,
SLA monitoring of data intensive jobs is studied. Section 5 concludes the paper.

2 Related Work

On-line monitoring of large-scale infrastructures is essential for many purposes


such as performance steering [16], system intrusion detection [12] or self-healing
[3]. There are a few approaches for on-line SLA monitoring in the Grid [4,5,15].
Menychtas et al. [5] propose a QoS provisioning approach which takes into
account real-time monitoring information about jobs and resources. A generic
mapping mechanism is employed in order to map low-level metrics to high-level
QoS parameters.
Litke et al. [4] present an execution management framework for OGSA-based
Grids which, given a set of client requirements expressed in SLAs, finds can-
didate services satisfying these requirements, executes them, and monitors the
SLA fulfillment. The monitoring service is tightly coupled with the framework
and provides basic QoS metrics (such as CPU / memory / disk usage, network
bandwidth). On-line monitoring boils down to periodical notifications of QoS
metrics.
Truong et al. [15] describe a framework for monitoring and analyzing QoS
metrics of Grid services. On-line monitoring of QoS is based on the SCALEA-G
framework [14]. In comparison to CEP, query capabilities of SCALEA-G are
limited. A client essentially can choose which entities to monitor, select desired
metrics, and optionally specify XQuery or XPath filters (data is represented in
XML). Moreover, unlike CEP, XQuery/XML have not been designed for real-
time queries over data streams.
There have been some efforts to support on-line SLA monitoring for Web
Services [6,9]. Michlmayr et al. [6] present an approach similar to our work in
that it leverages event-based monitoring and Complex Event Processing. How-
ever, the way CEP is used has some restrictions in comparison to our approach.
Basically only three event streams exist which represent QoS properties at the
level of service, service revision, and service operation, respectively. The use
of CEP query constructs is limited to sliding windows, aggregations and filter-
ing. Overall the approach is strictly oriented to Web Services. In contrast, we
propose a generic framework in which event streams represent individual per-
formance metrics which can be combined into high-level composite metrics. The
way these metrics are mapped into SLA obligations is out of scope of this paper.

In [9], the authors propose the timed automata formalism to express SLA
violations, and automatically generate monitors for these violations. Exactly the
same can be achieved with Complex Event Processing: a continuous query language
enables one to express SLA violations, while installing a query in a generic CEP
engine is equivalent to creating a new monitor. However, CEP has the advantage of
the availability of mature and efficient technologies. Moreover, a continuous query
engine is more user-friendly and arguably no less expressive than timed automata.
In fact, automata are formalisms often used in the implementation of CEP
engines [8].

3 On-Line SLA Monitoring Framework

3.1 Architecture

Fig. 1 presents a high-level view of the architecture of the on-line SLA
monitoring framework. The SLA Monitoring Service and the Resource Information
Registry are the core components of the framework. Also shown are Resources of
the Grid Infrastructure (computers, storage devices, software services), a
Resource Provider, and a Service-Level Management Service, which uses the SLA
Monitoring Service to define SLA metrics and takes corrective actions when an SLA
violation takes place or is predicted.
The resources of the Grid infrastructure provide event streams of basic SLA
metrics, such as current CPU load, current memory consumption, current data
transfer rate, response time to the latest client service request, etc. Additional
metrics can also be provided by the client side (response times, transfer rates
measured by the client, etc.).

Fig. 1. Architecture of Grid On-Line SLA Monitoring Framework



The streams of basic metrics are consumed by the SLA Monitoring Service
wherein they can be transformed into composite metrics derived from one or
more basic streams. The composite metrics are defined on demand using the
continuous query language, and calculated in real-time in the CEP engine. Ex-
amples of composite metrics include:

– Value aggregations: average / minimum / maximum values in a specified time
window, etc. For example: average CPU load on every monitored host within the
last 5 minutes.
– Stream joining: combination of values from different streams joined by a
common attribute value. Example: return all host names whose average CPU load
over the last 5 minutes exceeds 90% AND the top 10 processes on those hosts in
terms of CPU consumption.
– Distributed correlations: event patterns, such as an event not followed by
another one within a specified time, the occurrence of either of two events,
etc. (see the example sketched below).

Additional query mechanisms available for defining composite metrics include


value filtering, results grouping and ordering, etc.
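
As an illustration of the distributed-correlation type of composite metric mentioned
above, the following EPL sketch expresses the condition "a job was submitted but did
not start within 10 minutes". The stream and attribute names (JobSubmittedMs,
JobStartedMs, jobId) are hypothetical and used only for illustration; they are not
part of the framework described in this paper.

select sub.jobId
from pattern [ every sub=JobSubmittedMs ->
     ( timer:interval(10 min) and not JobStartedMs(jobId = sub.jobId) ) ]

Each match of this pattern yields the identifier of a job whose start event did not
arrive in time, which a Service-Level Management Service could treat as an
SLA-relevant incident.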
The functionality of the SLA Monitoring Service is complemented by the
Resource Information Registry. While the SLA Monitoring Service deals with
dynamic metrics of the resources, the Registry stores their static attributes (OS
info, total memory, CPU type, total storage capacity, etc.), as well as long-term
metrics (monthly, yearly, all-time average, etc.). Information about static at-
tributes is published by the resources using advertisements – special messages
sent periodically to the SLA Monitoring Service which in turn updates the Reg-
istry if necessary. If the advertisement is not received for a certain period of time,
the resource is considered unavailable and its corresponding entry is marked inac-
tive. The SLA Monitoring Service can also be configured to update the long-term
metrics in the Registry based on the values from the event streams.
Such an architecture enables one to monitor long-term SLA metrics (e.g.
monthly availability), and to define even more complex composite metrics which
combine calculations based on dynamic metrics and constraints imposed on static
attributes. The values from the Registry can be joined in a continuous query with
other real-time streams. Example: return all host names whose average CPU load
exceeds 90% within the last 5 minutes, and whose operating system is a Linux
distribution.

3.2 Design and Implementation

The SLA Monitoring Service is designed and implemented on the basis of the
GEMINI2 monitoring system [1]. GEMINI2 provides a framework for on-line
monitoring which encompasses a CEP-based monitoring server (GEMINI2 Mon-
itor) and local sensors (GEMINI2 Sensors). Monitoring data is represented as
events (collections of name – value pairs) which typically contain at least a unique
resource identifier (e.g. a host name), and a set of associated metrics (e.g. current
CPU load on the host).

Sensors are responsible for measuring the metrics and publishing the associ-
ated events to a Monitor. The Monitor contains a CEP engine (Esper [2]) and
exposes a service to formulate queries in the Event Processing Language (EPL).
The event streams from Sensors are processed against the queries in the CEP
engine which results in derived complex metrics returned to the requester.
Besides monitoring event streams, Sensors also periodically publish Advertise-
ment events in the Monitor. These events register a resource with the Monitor,
along with their static attributes.
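
As a minimal sketch of how such event types could be declared (assuming a recent
Esper release that supports the create schema EPL statement; the advertisement
attribute names below are assumed for illustration, while hostName and cpuLoad
follow the examples used in the next sections):

create schema HostMs (hostName string, cpuLoad double)
create schema HostAdvertisement (hostName string, osName string,
                                 totalMemoryMB long, cpuType string)

In practice the event types may equally well be declared programmatically, e.g. by
registering Java classes or name–value maps with the engine configuration; the
declarations above only illustrate the shape of the basic-metric and advertisement
events.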

3.3 Defining Composite SLA Metrics Using EPL


Composite SLA metrics are defined as EPL queries over streams of basic metrics.
Let us consider a relatively complex composite metric which demonstrates the
query capabilities of the EPL language: return host names whose average CPU
load exceeds 95% in the last 5 minutes, and top 10 processes on those hosts in
terms of CPU usage. Expressed in EPL:

select host.hostName, avg(host.cpuLoad),
       proc.pid, avg(proc.cpuUsage)
from HostMs.win:time(5 min) as host,
     ProcessMs.win:time(5 min) as proc
where proc.hostName = host.hostName   /* join the 2 streams */
group by host.hostName, proc.pid
having avg(host.cpuLoad) > 95
output all every 2 minutes
order by avg(proc.cpuUsage) desc      /* sort results */
limit 10                              /* display top 10 results */

This request selects attributes from two streams: HostMs (which contains host
name and host metrics such as the current CPU load), and ProcessMs (which
contains a process identifier, host name on which the process is running, and
metrics, such as the CPU usage). The streams are joined with the value of the
common attribute: the host name.

3.4 Registry
The Registry is a database associated with a Monitor which contains information
about resources, specifically their static attributes (metadata), which are not
published in the monitoring event streams.
In order to combine data from the event streams and the Registry, an EPL request
can contain an SQL query. For example, the request from Section 3.1 is expressed
in EPL as follows:

select rh.host_name, avg(host.cpuLoad)
from HostMs.win:time(5 min) as host,
     sql:Registry [" select host_name from Host
                     where host_name = ${host.hostName}
                     and host_os = 'Linux' "] as rh
having avg(host.cpuLoad) > 90

Fig. 2. Example deployment of resources and SLA Monitoring system components for
the monitoring of data-intensive jobs scenario

4 SLA Monitoring for Data-Intensive Computations


The capabilities of the presented SLA Monitoring solution will be demonstrated
in a case study which involves data storage and data-intensive computations. The
main entities involved in this case study are: (1) storage resources (local disks,
disk arrays and hierarchical storage management (HSM) devices); (2) jobs pro-
cessing large volumes of data, running on worker nodes of the Grid infrastructure;
(3) user interface.
An example deployment of these entities is shown in Fig. 2. In this case,
a job running on a worker node retrieves data from a disk array in order to
run a simulation, whose results are visualized on a graphical user interface.
Furthermore, the simulation is interactive: the user can steer it on the fly.
For the purpose of SLA monitoring, storage resources publish streams of
performance metrics. However, the client host and GUI application are also
instrumented in order to publish client-side performance metrics, such as re-
sponse times and inbound data transfer rates. This allows for SLA monitor-
ing also from the client perspective. The monitored entities, their attributes
(only applicable for storage resources), and SLA metrics are summarized in
Table 1.

Table 1. Monitored entities, their attributes and basic SLA metrics

Entity (event stream)            | Static attributes & long-term metrics             | Basic SLA metrics
---------------------------------+----------------------------------------------------+-----------------------------------
Local disk (LDMs)                | average read/write transfer rate; total capacity   | current read/write transfer rate;
                                 |                                                    | free capacity
Disk array (DAMs)                | average read/write transfer rate; total capacity;  | current read/write transfer rate;
                                 | raid level; strip size                             | free capacity
Hierarchical storage management  | average read/write transfer rate; total capacity;  | current read/write transfer rate;
(HSM) device (HSMMs)             | average mount time; average load time;             | free capacity
                                 | average position time; number of libraries,        |
                                 | drivers and tapes                                  |
Client GUI (ClientPerfMs)        | N/A                                                | response time of steering requests
Client host (DataTransferPerfMs) | N/A                                                | inbound data transfer rate;
                                 |                                                    | outbound data transfer rate

Let us consider a number of examples of composite SLA metrics formulated


in the EPL query language. The first three metrics rely on storage resource
performance metrics.

1. Return the average read transfer rate for a disk array with a particular ID for
the last 80 minutes.
select avg(currentReadTransferRate)
from DAMs(id='IP:mountDir').win:time(80 min);

2. Every 5 minutes return average read transfer rate for those disk arrays for
which it exceeded 100MB/s within the last 40 minutes.
select serverName, id, avg(currentReadTransferRate)
from DAMs.win:time(40 min)
group by serverName, id
having avg(currentReadTransferRate) > 100
output all every 5 minutes;

3. Return the current free capacity and average write transfer rate for all disk
arrays managed by server zeus.cyfronet.pl. This request may be useful, e.g., to
predict running out of disk space.
select id, freeCapacity, avg(currentWriteTransferRate)
from DAMs(serverName='zeus.cyfronet.pl').win:time(5 min)
group by id
output all every 5 minutes;

The next example shows a metric which combines data from event streams and
the Registry. The request selects HSM devices which currently undergo high
write transfer rates. In addition, the historical average for the device is returned.
select hsm.id, avg(hsm.currentWriteTransferRate), hsmreg.avgWriteTransferRate
from HSMMs.win:time(5 min) as hsm,
     sql:Registry [" select avg_write_transfer_rate as avgWriteTransferRate
                     from HSM
                     where res_id = ${hsm.id} "] as hsmreg
having avg(hsm.currentWriteTransferRate) > 60

Finally, the following example demonstrates SLA monitoring that includes
client-side metrics. Let us assume that the user running and steering the
simulation would like two requirements to be satisfied:
– The simulation is sufficiently responsive to user steering actions.
– The simulation results are delivered to the GUI with a transfer rate large
enough for real-time visualization.
Consequently, the following SLA could be requested: (a) the average response time
of user interactions does not exceed 100ms, AND (b) the average data transfer rate
from the processing job to the GUI does not drop below 128KB/s. Expressed in
EPL:
select avg(a.responseTime), avg(b.inTransferRate)
from pattern [ every ( a=ClientPerfMs(appId='app1') or
                       b=DataTransferPerfMs(port='1111') )
             ].win:time(5 min)
having avg(a.responseTime) > 100 or
       avg(b.inTransferRate) < 128

This request consumes the two event streams mentioned earlier: ClientPerfMs, which
contains, among other things, the response time of the latest simulation steering
request, and DataTransferPerfMs, which contains performance metrics of data
transfers to/from a host. The first stream also contains the attribute appId,
which identifies the particular simulation session and is used to filter the
stream. The second stream is filtered against port number 1111, on which the GUI
receives the simulation results. The request defines an event pattern 'A or B',
fulfilled if either of the two events happens.
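
A further composite metric, not part of the case study above but expressible with
the same streams and attributes (the appId and port values are reused from the
previous example), could detect a complete stall of result delivery: a steering
request after which no inbound data transfer event is observed within 30 seconds.
A possible EPL sketch:

select a.appId
from pattern [ every a=ClientPerfMs(appId='app1') ->
     ( timer:interval(30 sec) and not DataTransferPerfMs(port='1111') ) ]

Unlike the average-based metric above, such an absence pattern reacts to each
individual stall rather than to a degradation of the average, which may be useful
for triggering immediate corrective actions.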

5 Conclusion
This paper presents a novel and generic solution for efficient, near real time mon-
itoring of Service Level Agreements in the Grid. This solution is based on the
application of Complex Event Processing principles and supporting technologies.
We have elaborated a generic framework in which event streams represent indi-
vidual performance metrics which, in turn, can be combined into high-level com-
posite metrics. The main features of the monitoring framework are: on-demand
definition of SLA metrics using a high-level query language, real-time calcula-
tion of the defined SLA metrics and advanced query capabilities which allow for
defining high-level complex metrics derived from basic metrics. The Resource
Information Registry complements the functionality of the framework by providing
a space for storing historical or long-term metrics, as well as resource metadata.
The information from the Registry can also be used in continuous queries, further
enhancing the capabilities of the framework in terms of defining complex SLA
metrics. The case study of a data-intensive application has demonstrated the
feasibility of this approach.
Future work involves the investigation of an efficient way of mapping high-level
metrics onto SLA obligations, improvement of the performance of the framework,
and investigation of other on-line SLA monitoring use cases.

Acknowledgments. This work is partially supported by the European Union


Regional Development Fund, POIG.02.03.00-00-007/08-00 as part of the PL-
Grid Project.

References
1. Balis, B., Kowalewski, B., Bubak, M.: Real-time Grid monitoring based on complex
event processing. Future Generation Computer Systems 27(8), 1103–1112 (2011),
http://www.sciencedirect.com/science/article/pii/S0167739X11000562
2. Bernhardt, T., Vasseur, A.: Complex Event Processing Made Simple Using Esper
(April 2008), http://www.theserverside.com/news/1363826/
Complex-Event-Processing-Made-Simple-Using-Esper
(last accessed June 30, 2011)
3. Gorla, A., Mariani, L., Pastore, F., Pezzè, M., Wuttke, J.: Achieving Cost-Effective
Software Reliability Through Self-Healing. Computing and Informatics 29(1), 93–
115 (2010)
4. Litke, A., Konstanteli, K., Andronikou, V., Chatzis, S., Varvarigou, T.: Manag-
ing service level agreement contracts in OGSA-based Grids. Future Generation
Computer Systems 24(4), 245–258 (2008)
5. Menychtas, A., Kyriazis, D., Tserpes, K.: Real-time reconfiguration for guarantee-
ing QoS provisioning levels in Grid environments. Future Generation Computer
Systems 25(7), 779–784 (2009)
6. Michlmayr, A., Rosenberg, F., Leitner, P., Dustdar, S.: Comprehensive QoS moni-
toring of Web services and event-based SLA violation detection. In: Proceedings of
the 4th International Workshop on Middleware for Service Oriented Computing,
pp. 1–6. ACM (2009)
7. Moscicki, J., Lamanna, M., Bubak, M., Sloot, P.: Processing moldable tasks on
the Grid: Late job binding with lightweight user-level overlay. Future Generation
Computer Systems 27(6), 725–736 (2011),
http://www.sciencedirect.com/science/article/pii/S0167739X11000057
8. Mühl, G., Fiege, L., Pietzuch, P.: Distributed Event-Based Systems. Springer (Au-
gust 2006)
9. Raimondi, F., Skene, J., Emmerich, W.: Efficient online monitoring of web-service
slas. In: Proceedings of the 16th ACM SIGSOFT International Symposium on
Foundations of Software Engineering, pp. 170–180. ACM (2008)
10. Sahai, A., Graupner, S., Machiraju, V., van Moorsel, A.: Specifying and Monitoring
Guarantees in Commercial Grids through SLA. In: CCGRID 2003: Proceedings of
the 3st International Symposium on Cluster Computing and the Grid, p. 292. IEEE
Computer Society, Washington, DC (2003)
11. Schwiegelshohn, U., Badia, R.M., Bubak, M., Danelutto, M., Dustdar, S.,
Gagliardi, F., Geiger, A., Hluchy, L., Kranzlmüller, D., Laure, E., Priol, T., Reine-
feld, A., Resch, M., Reuter, A., Rienhoff, O., Rüter, T., Sloot, P., Talia, D., Ull-
mann, K., Yahyapour, R., von Voigt, G.: Perspectives on grid computing. Future
Generation Computer Systems 26(8), 1104–1115 (2010),
http://www.sciencedirect.com/science/article/pii/S0167739X10000907
12. Smith, M., Schwarzer, F., Harbach, M., Noll, T., Freisleben, B.: A Streaming Intru-
sion Detection System for Grid Computing Environments. In: HPCC 2009: Pro-
ceedings of the 2009 11th IEEE International Conference on High Performance
Computing and Communications, pp. 44–51. IEEE Computer Society, Washing-
ton, DC (2009)

13. Szepieniec, T., Tomanek, M., Twaróg, T.: Grid Resource Bazaar: Efficient
SLA Management. In: Proc. Cracow Grid Workshop 2009, pp. 314–319. ACC
CYFRONET AGH, Krakow (2009)
14. Truong, H.L., Fahringer, T.: SCALEA-G: a Unified Monitoring and Performance
Analysis System for the Grid. Scientific Programming 12(4), 225–237 (2004)
15. Truong, H., Samborski, R., Fahringer, T.: Towards a framework for monitoring and
analyzing QoS metrics of grid services. In: Second IEEE International Conference
on e-Science and Grid Computing, e-Science 2006, p. 65. IEEE (2006)
16. Wright, H., Crompton, R., Kharche, S., Wenisch, P.: Steering and visualization:
Enabling technologies for computational science. Future Generation Computer Sys-
tems 26(3), 506–513 (2010)
Challenges of Future e-Infrastructure
Governance

Dana Petcu

e-Infrastructure Reflection Group,


and West University of Timişoara, Romania

Abstract. A shift of interest of both providers and consumers, from
resource provisioning to a system of infrastructure services and to a
governance system for e-Infrastructures based on a user-centric approach,
can be observed nowadays. Applying service level management tools and
procedures in e-Infrastructure service provision practices allows users,
service providers and funding agencies to investigate e-Infrastructure
services in view of individual use cases. The shift should be sustained by
legal structures, strategic and financial plans, as well as by openness,
neutrality and diversity of resources and services. e-IRG, as an
e-Infrastructure policy forum, envisioned these trends and needs and
expressed its position in its recent white paper, which is briefly
presented here and discussed from the perspective of building the future
research agendas of individual teams.

Keywords: e-Infrastructures, governance, service-orientation.

1 Introduction
The e-Infrastructure landscape is changing to comply with the service-oriented
paradigm, which enables increased innovation potential and cost-efficient access
for a widening range of users, thereby strengthening the socio-economic impact.
On the other hand, the sustainability of current e-Infrastructures has become a
global concern, and the key role here is played by their governance. Efficient,
effective, transparent and accountable operations are nowadays the main topics of
e-Infrastructure governance. These trends are recognized at national and European
levels, with forceful e-Infrastructure agendas or strategies to promote efficient
governance of the research ecosystem. Further strategic development of
e-Infrastructures should respond to the demand for and necessity of Green IT, the
need for massive computational power (exascale computing), the increasing amount
of data, seamless access to services for users, the internationalization of
scientific research, and the involvement of user communities in the governance of
e-Infrastructures. Aligned with these efforts and requirements, the
e-Infrastructure Reflection Group (e-IRG) has recently analyzed the structures as
well as the organizational and relational aspects of current e-Infrastructures
together with the governance process, distinguishing strategic processes from
operational management and covering the various functional aspects of governance,
e.g. the supporting legal and financing structures.


As remarked in the European Digital Agenda [1] (which aims to deliver sustainable
economic and social benefits from a European digital single market based on fast
and ultra-fast Internet and interoperable applications), services are moving from
the physical into the digital world, universally accessible on any device.
Attractive content and services need to be made available in an interoperable and
borderless Internet environment. This stimulates demand for higher speeds and
capacity. In the e-IRG vision, the achievement of an open e-Infrastructure that
enables flexible cooperation and optimal use of all electronically available
resources will help narrow the digital divide in Europe and support cohesion by
enabling an improved inter-regional digital flow of ideas and technology. This
vision is sustained at least by the current European Commission programmes
FP7-ICT [2] and FP7-Infrastructures [3], as well as by ESFRI [4].
Details about the e-IRG vision and recommendations, as reflected in its recent
white paper [5], are presented and discussed in what follows. The e-IRG
recommendations are mainly intended for national and international policy makers,
to support the further advancement of e-Infrastructures, but they also capture the
state of the art by expressing the needs of the user communities and the evolution
of the markets for information and communication services. In this paper these
recommendations are translated for the research communities with the aim of
suggesting topics for their future research agendas.

2 e-IRG Recommendations
The topics presented in the e-IRG white paper address several questions related to
e-Infrastructures, such as: (1) what are the appropriate governance models for
e-Infrastructures; (2) how to advance research networks; (3) how to facilitate
access; (4) how to deal with the increasing energy demands of computing; (5) what
software is needed to fully harness the power of future HPC systems; (6) how to
adopt and implement new e-Infrastructure services; (7) how to discover and share
large and diverse sources of scientific data. Each question is treated in what
follows.

2.1 e-Governance Management


Governance policies are needed for the further strategic development of
e-Infrastructures, and they should support the free movement of knowledge across
the world. An e-Infrastructure ecosystem is needed in order to meet the challenge
of governing such a system effectively and efficiently. This requires deliberation
on the strategies and the involvement of all relevant stakeholders in order to
realize a solid basis for further developments. Moreover, the shift from mere
resource provisioning to a system of infrastructure services will have a
considerable impact on how such infrastructures are funded and financed. Users
currently need to be able to choose the best available services regardless of
national boundaries and of whether they are public or commercial commodity
services, as well as to actively participate in strategic governance decisions
concerning e-Infrastructures. Therefore e-Infrastructure governance should shift
towards a user-driven approach. Different technical, political and commercial
developments, such as the virtualisation of services, cloud computing, and the
constantly increasing need of leading-edge user communities for services far
beyond what the current e-Infrastructure can offer, drive this process.
In this context, e-IRG envisioned a user-centric approach in which timely
e-Infrastructure innovation to serve user communities (ahead of what the
commercial markets can provide) remains a public responsibility, while the use of
e-Infrastructure services should be paid for out of the budgets of users and their
projects.
More precisely, the e-IRG recommendations related to e-governance management are
summarized in the white paper as follows:
1. Establish a user-community-centric approach in strategic e-Infrastructure
governance, including the appropriate funding mechanisms making distinc-
tion between the funding of service provision and of innovation activities.
2. Define the long-term financial strategy for e-Infrastructures aimed at a sus-
tainable operation of services in a flexible and open environment that in-
cludes offers from commercial service providers.
3. Address the problems of barriers to cross-border service delivery and quickly
remove as many of these as possible.
4. Introduce governance models that provide efficient and effective coordina-
tion mechanisms at all levels (regional, national, European, global) while
providing the possibility for public and private research and cooperation.
5. Encourage important players in the use of e-Infrastructures, to investigate
the impact of strategic changes in e-Infrastructure governance and financing
on the operation of and access to international research infrastructures.
6. Investigate the effectiveness of legal structures for e-Infrastructures.

2.2 Future of Research Networks


e-IRG recognized that research networks are already available as a service,
but the drive towards seamless access to all services, including the connected
e-Infrastructure elements for computing or data storage as a fully integrated
ecosystem, is new. Moreover, the availability of new technologies calls for the in-
novation of the networking infrastructure and its services. This is complemented
by the emergence of new stakeholders in the research arena, creating a more
competitive environment and a market opening for innovative actors such as
brokers or the associations of users with similar interests.
Openness, neutrality and diversity should be the basic principles in develop-
ing the future networking infrastructures. Networking is inherently multi-domain
and should be built in a federative and open approach, supplying connectiv-
ity based on globally accepted standards. Network services should be made
available via a common user interface to allow integrated access to different
e-Infrastructure services. In this context, e-IRG recommends:
1. Innovate in network provisioning and network governance to satisfy user
demand and stay competitive at the global level.

2. Draft an innovation agenda for research networking usable by stakeholders.


3. Build the networks as a federative and open system, giving flexibility and
worldwide connectivity to public and private researchers and with seamless
integration with other e-Infrastructure service providers.
4. Rigorously investigate the causes of the digital divide between European
researchers and combat this with the appropriate instruments.

2.3 Authentication, Authorization and Accounting

In the context of an open ecosystem, one of the objectives of the governance of an
authentication and authorization infrastructure (AAI) is to establish and maintain
the level of mutual trust amongst users and service providers.
The current requirements are, according to the e-IRG study: (1) improved
usability, lowering the threshold for researchers to use the services; (2)
improved security and accountability (often conflicting with the usability
requirement); (3) leveraging of existing identification systems; (4) enhanced
sharing, allowing users to minimize the burden of policy enforcement; (5) reduced
management costs, freeing resources for other service or research activities, and
providing a basis for accounting; (6) improved alliance with the commercial
Internet, which also improves interaction between scientists and society.
In the case of identity recognition, there are several models. European NRENs
operate identity federations, and provide services to a large number of users
within academic and research communities. Based on open standards, these na-
tional identity federations focus on providing access to web-based resources, such
as data repositories. The user typically acts as a consumer. A full e-Infrastructure
should also allow the user to act as a producer of information. In this con-
text, clear and simple mechanisms for accessing and managing authorization
policies are required. Moreover, connecting the different national identity
federations into a common identity space that supports real-time access to web
resources across Europe is an ongoing task, as the maturity of the national AAIs
differs substantially between countries. On the other hand, players outside
academia include providers of user-centric identity management models (like
OpenID, used in Web 2.0 applications), as well as governments offering identity
infrastructures rooted in a legally recognized and authoritative framework.
Several other technical problems need to be solved quickly: (a) support for the
management of distributed dynamic virtual organizations; (b) robust and open
accounting solutions to monitor e-Infrastructure services; (c) integration of
user-centric and governmental infrastructures with academic AAIs.
In this context, e-IRG recommends:

1. Improve national infrastructures and their alignment with agreed standard


procedures for identity management, accounting and assurance, with the
objective of technical interoperability between all national AAIs.
2. Integrate different identity technologies.
3. Define access control policies and mechanisms in accordance with the stan-
dards and best practices adopted by the community.

4. Draw up a roadmap for all stakeholders to track progress in replacing existing
authentication and authorization infrastructures based on national AAIs.

2.4 Energy and Green IT


While the major goals of Green IT are to reduce energy consumption, increase
energy efficiency and minimize the influence on the environment, current work
lacks a consistent vision of how to proceed globally. This is also due to the
large number of stakeholders who need to be involved in solving the problems:
policy makers, hardware vendors, hardware/services providers, and end users.
Trying to provide some guidance in this context, e-IRG recommends:
1. Decrease the energy consumption of e-Infrastructure components by optimizing
the architectures and designing more efficient software management procedures.
2. Develop more efficient ways of using energy by increasing the efficiency of
the cooling systems and reusing the heat energy.
3. Analyze environmental impact of different energy maintenance approaches.
4. Provide more service management procedures.
5. Work out and promote Green IT standards at an international level.
6. Locate data centres at optimum locations in terms of the balance between
green energy and energy efficiency.

2.5 Exascale Computing and Related Software

Several requirements for making exascale computing available were identified by
e-IRG: (a) design of new hardware and software architectures efficient enough for
exascale; (b) reduction of power consumption by using new technologies and
heterogeneous architectures; (c) an increase in concurrency to comply with
gies and heterogeneous architectures; (c) increase in concurrency to comply with
the change in scale at the level of parallelism that must be exploited by the soft-
ware; (d) resilient architectures, programming models and applications, which
will ensure that the system produces acceptable results even in the presence of
hardware failures; (e) development of new programming paradigms allowing the
effective use of an exascale machine (better compilers, monitoring tools, hiding
software complexity).
Moreover, a paradigm shift is foreseen in software for exascale computing.
The main components of this shift are, according to the e-IRG studies: (i) design
a new programming model, beyond MPI; (ii) establish a performance indicator over
differing architectures that considers multiple parameters of a configuration
beyond flops and execution time, e.g. cost per execution, memory usage, bandwidth;
(iii) support heterogeneous computing in operating systems, software libraries,
compilers, toolkits etc.; (iv) establish testing procedures to verify the
correctness of highly parallel implementations; (v) set technical, logistic and
legal standards for community-based development; (vi) establish a practical
approach to data safety and security.

In this context, the recommendations are the following:

1. Develop European hardware technology in order to compete and cooperate


with the current leading countries in HPC.
2. Study of new programming models, algorithms and languages, porting soft-
ware libraries and software tools to exascale environments, and preferring
open source software solutions to leverage existing know-how.
3. Identify new grand challenges able to utilise the exascale platforms.
4. Establish collaborations between users of exascale computing, industry, com-
puter scientists and programming experts.
5. Create training materials, including robust and easy to use books for users
who are not computer scientists.
6. Ensure knowledge dissemination, and engagement with the public, policy
makers and industry, for promoting exascale computing.

2.6 e-Infrastructure Services

The emergence of e-Infrastructure as a service is requested and accepted by the
users due to the additional benefits it promises. The main challenge faced by the
e-Infrastructure providers is to offer their services to users in a reliable,
scalable, customised and secure setting. They face at least the following
challenges: (a) upgrade/refine the present services; (b) develop and introduce new
services; (c) improve the governance/management of e-Infrastructure operations
offered as services; (d) extend/intensify cooperation and collaboration in the
e-Infrastructure area; (e) establish and gradually introduce a sustainable
business model for e-Infrastructure operation and services. Further challenges are
discussed in more detail in the next section.
In short, e-IRG's recommendations in this field are the following:

1. Involve user communities in the definition and exploitation of services.


2. Use virtualisation and service-orientation when developing and introducing
new services wherever this is efficient.
3. Apply simplified access, transparent service offerings, customized support,
standardization, improved governance models and sustainable business mo-
dels in the definition and deployment of services.
4. Promote cooperation between public sectors in the e-Infrastructure arena,
like government and healthcare, to exploit economies of scale and intensify
the contribution of e-Infrastructures in facing societal challenges at large.
5. Boost innovation by public-private partnership activities through the joint
creation of a market for e-Infrastructure resources and services.

2.7 Data Infrastructures

The massive increase in the quantity of digital data leads to the urgent need to
integrate data sources in order to build a sustainable way of providing a good
level of information and knowledge – a feature that is currently missing from the
available e-Infrastructures. The vision of a Global Data Research Infrastructure,
supported by e-IRG, is that of a cost-effective, efficient, collaborative data
research environment built on an interoperable and sustainable governance model
fulfilling user needs across geographical borders and disciplines. This ecosystem
of data infrastructures should be composed of regional, disciplinary and multi-
disciplinary elements such as libraries, archives and data centres, offering data
services for both primary datasets and publications, and should support data-
intensive science and research.
Design, implementation, operations, funding, governance and sustainability models
need to be defined to promote (a) new data management, exchange and protection
paradigms and approaches; (b) the process of embedding data infrastructures into
e-Infrastructures; (c) cooperation between data providers and users in exchanging
information, for better governance of data gathering and management or to fulfill
legal and sharing requirements.
The technical issues related to assembling, securing, managing, preserving and
making interoperable the huge amounts of scientific data have to be solved. One
such problem is how to address the data explosion by assuring the infrastructure's
scalability in terms of storage space, number of data objects stored, number of
users concurrently accessing the data, and performance of data access and
handling. Another problem is how to address the complexity, i.e. how to deal with
different, domain-specific data organisations, formats, handling policies, etc.
Last but not least, reaching reliability and robustness, which have a specific
meaning in the context of data exchange, sharing and long-term preservation, in a
geographically distributed and complex infrastructure is also a big challenge.
In this context, the recommendations of e-IRG are the followings:
1. Develop a European data infrastructure gradually, addressing basic issues
such as data persistency, accessibility and interoperability first, and leav-
ing complicated issues such as privacy and legal matters (like cross-border
exchange of sensitive data) for subsequent stages.
2. Implement strategy at different levels, including low-level services such as
bitstream data storage, exchange in data infrastructures, content-related cu-
ration, preservation and data exploitation services, as well as activities aimed
at interoperability and data access federation and openness.
3. Involve stakeholders of the data infrastructure including resource providers,
existing infrastructures and initiatives and user communities in order to build
reliable and robust data services suitable to real needs.

3 Service-Orientation of the Future Open e-Infrastructures

Services are an important part of the e-Infrastructure offer. Users are not interested
in the pure infrastructure part but rather in the services that are provided by
the e-Infrastructures (which services are delivered and with what quality). If users
request not simply the resources, but rather a combination of services running on
various resources spread world-wide, this creates the premises to bring researchers
together in international, virtual teams and organizations.
Basic e-Infrastructure services, such as computing, security and authentica-
tion, communication and conferencing, have been provided for more than two
decades. These services were developed as individual services based on dedicated
equipment and unique software components, and their interoperability has become
a problem. The changing requirements, like the increasing need for shared
international access to remote resources, increased security, economies of scale
for shared use, and more recent emergence of virtualization techniques gradually
led to federated services in the Grid, service-oriented architectures, and the pro-
vision of sophisticated on-demand access to different shared resources, like hard-
ware, software, infrastructure etc. Infrastructure-as-a-Service (IaaS) is emerging
in both academic research and industry to exploit the opportunities provided by
the Cloud paradigm. It provides an on-demand provision of requested resources
for a widening spectrum of applications, and also stimulates a service-oriented
approach to software development and deployment. An important feature of the
on-demand provisioning is that most of the higher-level complex services are
based on well-defined interoperable and distributed lower level services.
As e-IRG has identified, a major implication of the shift to services appears in
the changing division of responsibilities between the user and the supplier: the
responsibility of linking the service demand to the user need is moved to the
supplier (which widens the distance between the users and the physical resources).
The e-IRG white paper has underlined in different chapters several needs,
challenges and recommendations in building e-Infrastructure services:

1. The governance system should be supported by an elaborate system of met-


rics to establish the value and costs of the services and delivery systems.
2. Formalizing the quality and management aspects of service provision prac-
tices and complementing these with tools and procedures from the estab-
lished IT service management discipline is urgently requested.
Cross-organizational service level management needs to be supported by
governance structures.
4. Open and adaptable standards for using the heterogeneous e-Infrastructure
services should be developed, promoted and supported on all functional levels
and in all application areas.
5. Integrated user access to the various international e-Infrastructure services
should be provided.
6. Services need to be application-oriented, easily accessible, open and flexible,
so as to be able to adapt to technological changes and evolving user needs.
7. Network services should be made available via a common user interface to
allow integrated access to different e-infrastructure services.
8. Robust and open accounting solutions for the e-Infrastructure are needed to
monitor the services and allow for comprehensive service level management.
9. Virtualisation should be used to build virtual research environments and
virtual research communities.
10. Improved friendliness of access, adapted customisation of services, and tai-
lored support and training are needed to attract new user communities.

11. Multi-tenancy of services should enable sharing of e-Infrastructure resources


and costs across a large number of users, improved resource utilization, in-
creased peak-load capacities, operating resources in locations with low costs.
12. Special services are to be offered by establishing service portals or centres
dedicated to specific user communities, specialized service providers and spe-
cific large-scale projects.
13. Coordination should result in exchanging services and sharing service port-
folios among co-operating e-Infrastructure providers, as well as in joint ten-
dering or licensing by them.
14. Stability and sustainability of the infrastructure are to be improved by de-
veloping and gradually introducing fair and straightforward business models,
business standards and charging practices.
15. Contentious governance issues that impact the adoption of IaaS must be
addressed: these include transparency, privacy, security, availability, performance,
data protection, and the adoption of open standards.
16. Applying service level management tools and procedures in service provision
practices allows users, service providers and funding agencies to investigate
e-Infrastructure services from the perspective of individual use cases.
17. Exchanging services, sharing service portfolios, and other forms of improved
cooperation by and between national e-Infrastructure service providers should
be exploited for better geographic and disciplinary coverage.
18. Innovative development of e-Infrastructure services should be protected by
involving research and education users in the development of services.
19. Non-commercial e-Infrastructure providers should be proactive, rather than
simply copying commodity services already offered by commercial providers.
20. Fair and transparent business models are to be introduced in order to in-
crease integration and sustainability of e-Infrastructure services and to gua-
rantee a fair distinction between commercial and non-commercial services.
21. Similarities between e-Infrastructure services and services required by other
sectors, such as government and healthcare, should be investigated by exchanging
experiences and transferring knowledge from the research sector to others.
22. Appropriate services and mechanisms should extend, improve and facilitate
(automate) the data handling, preservation, curation and exploitation lead-
ing to (a) one-stop shop delivery of data services, (b) federated access to
data, (c) assurance that valuable data will be accessible, protected, pre-
served and curated over decades, (d) reduced costs by exploiting economies
of scale thanks to a critical mass of resources providers, data storage and
processing resources and users.

4 Conclusions
While the topics presented in this paper refer to a variety of concerns related
to future e-Infrastructures, a general trend towards service orientation can be
observed. Only this orientation can ensure that future e-Infrastructures will reach
a wider European community of users. This vision has been captured in e-IRG's
recent white paper, which has been presented and re-interpreted in this paper from
the perspective of the researchers who will be involved in developing, delivering or
using the future e-Infrastructures.

References
1. European Commission, A Digital Agenda for Europe (2010),
http://ec.europa.eu/information_society/digital-agenda/
2. European Commission, Work Programme 2011-2012. Cooperation. Theme 3. ICT -
Information and Communication Technologies (2011),
http://cordis.europa.eu/fp7/ict/
3. European Commission, Work Programme 2011. Capacities. Part 1: Research Infras-
tructures (2010), http://cordis.europa.eu/fp7/ict/e-infrastructure/
4. European Strategy Forum on Research Infrastructures, Strategy Report and
Roadmap Update (2010), http://ec.europa.eu/research/infrastructures/
5. e-Infrastructure Reflection Group, White paper (2011), http://www.e-irg.org
Influences between Performance Based Scheduling
and Service Level Agreements

Antonella Galizia1, Alfonso Quarati1, Michael Schiffers2,4, and Mark Yampolskiy3,4


1
Institute for Applied Mathematics and Information Technologies,
National Research Council of Italy, Genoa, Italy
{antonella.galizia,alfonso.quarati}@ge.imati.cnr.it
2
Ludwig-Maximilians-Universität München, Germany
schiffer@nm.ifi.lmu.de
3
Leibniz Supercomputing Centre, Garching, Germany
Mark.Yampolskiy@lrz.de
4
Munich Network Management (MNM) Team

Abstract. The allocation of resources to jobs running on e-Science infrastruc-
tures is a key issue for scientific communities. In order to improve the efficiency
of computational jobs we propose an SLA-aware architecture. The core of this
architecture is a scheduler relying on resource performance information. For
performance characterization we propose a two-level benchmark that includes
tests corresponding to specific e-Science applications. In order to evaluate the
proposal we present simulation results for the proposed architecture.

Keywords: resource allocation, benchmarks, scheduling, SLA.

1 Introduction

A proper resource-to-job matching is of paramount importance for a better


exploitation of e-Science environments where heterogeneous resources are shared for
coordinated problem solving in multi-institutional virtual organizations [1]. In
addition, specific requirements are often associated with compute intensive scientific
jobs, e.g., weather prediction WRF1, or molecular dynamics GROMACS2, which may
lead to further efficiency issues. In such computation intensive applications, a better
resources-to-job matching can lead to significant improvements in the computation
speed [2]. A performance aware job execution can be realized if there is adequate
information available regarding the resource capabilities and the qualities of the
services provided over the resources. A generally accepted method to evaluate and
compare the performance of computer platforms is through benchmarking and
benchmarks based metrics [3] [4].

1
http://www.wrf-model.org/
2
http://www.gromacs.org/


It is common practice to express service quality expectations in Service Level


Agreements (SLA). SLAs are negotiated between customers of a service and service
providers. This practice has proven to be an effective means not only for enforcing
providers to the desired quality but also to reorganize the complete service provision-
ing in order to use available resources more efficiently. In this context, the optimal
exploitation and the semantic definition of supported quality ranks, e.g., gold, silver,
bronze, is still an unsolved problem.
We focus here on performance as a single quality parameter. In our research we
consider SLAs as a description of performance objectives to be achieved and main-
tained during the job execution. The main idea is to apply the congruent policy, where
resources are characterized by considering several performance ranks and jobs are
allocated to the most suitable resource according to the performance rank specified
in their submission. To enable the description of both jobs and resources, a
proposal for Grid environments has been presented [5].
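As a purely illustrative reading of this policy, the following minimal Python sketch (resource names, rank values and queue lengths are invented, not taken from the test bed of Section 4) dispatches a job to the least-loaded resource whose advertised rank for the job's class satisfies the rank requested in the submission.

# Hypothetical congruent-policy allocation; ranks use 1 = best, 5 = worst.
RESOURCE_RANKS = {
    "clusterA": {"linear_algebra": 1, "isosurface": 3},
    "clusterB": {"linear_algebra": 4, "isosurface": 1},
}
QUEUE_LENGTH = {"clusterA": 2, "clusterB": 5}   # current local queue sizes

def allocate(job_class, requested_rank):
    # keep only the resources whose rank is at least as good as the requested one
    candidates = [r for r, ranks in RESOURCE_RANKS.items()
                  if ranks.get(job_class, float("inf")) <= requested_rank]
    if not candidates:
        return None                              # no resource can honour the request
    return min(candidates, key=lambda r: QUEUE_LENGTH[r])

print(allocate("isosurface", requested_rank=2))  # -> clusterB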
In this paper, we abstain from discussions about SLA negotiation and how parame-
ters can be specified in an SLA or a Service Level Specification (SLS). Instead, in
Section 2 we propose an SLA-aware architecture incorporating a novel scheduling
mechanism which takes into account fine grained knowledge about resource capabili-
ties, information about job preferences, knowledge about the load of involved re-
sources, and requirements specified in the SLA. In Section 3 we present the bench-
marks used to rank resources with respect to specific metrics. In Section 4 we simu-
late the behavior of the proposed job allocation policy based on performance aware
SLAs. In Section 5 we conclude the paper and discuss future plans.

2 An SLA Aware Job Allocation Architecture

A Service Level Agreement (SLA) is a contract between customers of a service and


its provider. This contract specifies all service related commitments, i.e., with which
quality the particular service will be provided to the customer and how this quality
can be measured in order to verify the fulfillment of the contract. In some cases SLAs
also specify penalties which will be due in case the committed service quality cannot
be achieved. Further, since the quality parameters committed to the customer cannot
always be measured directly on the infrastructure, the provider usually associates an
SLS with an SLA. The purpose of an SLS is to specify how the provider’s infrastruc-
ture is monitored and how the monitored parameters are used in order to calculate
quality parameters committed to the customer.
In this paper we do not discuss the SLA negotiation process and issues related to
the specification of parameters in SLAs or SLSs. Instead, we are interested in architec-
tural considerations necessary for predicting a job’s quality and for scheduling of jobs
to resources the performance of which is sufficient for the fulfillment of commit-
ments. In this work, we consider SLAs as a source for the end users specific require-
ments which should be fulfilled. For instance, a user could specify in an SLA that his
submitted application should be scheduled to be executed in the next half hour and the
job processing should not take longer than two hours.

Figure 1 shows the general principle of job submissions in the context addressed
here (see also [1]). The job submitted by a customer/user is placed in the queue of a
global scheduler. The main goal of the global scheduler is to decide on which infra-
structure component this job should be computed. As of now, this is often done taking
into account only the current filling state of local queues of all available resources and
the very coarse grained classification of these resources, e.g., CPU- or GPU-based
computation unit. After the decision is taken, the job is moved from the global queue
to the local queue of the selected computation unit.

(Figure 1 shows jobs entering a global queue handled by the global scheduler and then being dispatched to the local queues of the individual computing resources.)
Fig. 1. Two-layer job scheduling

In order to better support performance-aware SLA requirements, we see the
necessity to extend this model significantly. This is particularly useful in Grid
environments, where most of the existing meta-schedulers, such as the Maui/Moab
scheduling suite [6], Condor-G [7], and GridWay [8], mainly focus on resource
requirements, queue policies, and average load. We argue that for this purpose the
global scheduler should incorporate two complex components: 1) a fine grained
analysis of the performance of the available resources based on an evaluation of
different (artificial) computational tasks; and 2) a scheduling mechanism which
takes into account fine grained knowledge about resource capabilities, information
about job preferences, knowledge about the load of involved resources, and
requirements specified in SLAs.
We propose using benchmarks as an approved and broadly accepted technique for
such a fine grained assessment of resource qualities. Figure 2.a) outlines this strategy.
A set of well-prepared benchmarks can be defined in advance and stored as a part of
this unit. Generally, two benchmark scheduling strategies can be used. First, bench-
marks can be scheduled event-based, e.g., if some hardware/software change events

were encountered. However, this will require either a notification system or the
benchmarks must be started manually. An alternative strategy is to start the bench-
marks periodically. This eliminates the necessity of an event messaging system, but it
bears the risk of possible interferences with productive jobs. Therefore, this strategy is
often combined with additionally defined policies, e.g., to schedule benchmarks only
in the case of empty local queues. For our work, both approaches could be adopted
and we abstain from recommendations and further discussions of this topic.
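A minimal sketch of the periodic strategy combined with the empty-queue policy mentioned above could look as follows; the period, the queue probe and the benchmark launcher are illustrative placeholders, not part of any real scheduler interface.

import time

BENCHMARK_PERIOD = 24 * 3600              # assumed re-benchmarking period: once a day

def local_queue_is_empty(resource):       # placeholder probe of the local queue state
    return True

def run_benchmarks(resource):             # placeholder: launch micro/application benchmarks
    print("benchmarking", resource)

def periodic_benchmarking(resources):
    # runs forever; benchmarks only idle resources to avoid interfering with productive jobs
    while True:
        for r in resources:
            if local_queue_is_empty(r):
                run_benchmarks(r)
        time.sleep(BENCHMARK_PERIOD)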

Fig. 2. – a) Fine grained resources evaluation – b) Benchmark driven job allocation

The extended scheduling engine is outlined in Figure 2.b). The result prediction
component is the core of the engine. In the first place, it takes into account the
information about fine grained resource performance, the states of the local queues,
and the job description. During the submission phase, the job description should
include a specification of the class of computations to which this particular job
belongs. This information is needed in order to perform a better match with the
benchmark tests used for the resource ranking. Based on this information and the
scheduling policies, the device for executing the job is selected.
performance evaluator component is in charge of qualitatively monitoring the job
execution. This information can be used for the verification of performance goals as
stated in SLA. Further, the evaluation of the job execution performance – together
with the previous predictions – should be used in the prediction verification compo-
nent. The purpose of this component is to determine the deviation of the results from
their predictions. The deviation in turn can be used in the result prediction component
to reduce the prediction error before signing any SLA.
Therefore, in order to fulfill the end-user requirements specified in the SLA, it is
necessary to take into account two main pieces of information: the estimated
execution time at the different available resources and the estimated waiting time
of the related queues. For both estimations we consider the results provided by the
prediction component, which in turn is based on the use of benchmarks.
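For instance, assuming the prediction component returns, per resource, an estimated waiting time and an estimated execution time for the submitted job class, the feasibility of an SLA such as the one mentioned in Section 2 (start within half an hour, finish within two hours) could be checked as in the following illustrative snippet; the resource names and the estimates are invented.

estimates = {            # resource -> (estimated wait, estimated execution) in seconds
    "res1": (600, 5400),
    "res2": (3000, 3600),
}
MAX_WAIT = 30 * 60       # SLA: the job must start within half an hour
MAX_EXEC = 2 * 3600      # SLA: processing must not take longer than two hours

def feasible_resources(estimates):
    return [r for r, (wait, execute) in estimates.items()
            if wait <= MAX_WAIT and execute <= MAX_EXEC]

print(feasible_resources(estimates))   # -> ['res1']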
The remainder of this paper focuses on the benchmark part of the proposed
architecture, i.e., the core components depicted in Figure 2.b). In order to explain
the principles of this component we abstain from a discussion of the job allocation
in its full extent. Instead, we simulate the benchmark driven job allocation without
the feedback loop including the prediction and verification components.

3 Benchmarks Characterizing Resource Performance

The rank of resources on a performance basis may be obtained by expanding the


description of computational resources with some indicator that characterizes their
reaction under different workloads [5].
To this aim, we integrate two complementary approaches: 1) the use of micro-
benchmarks, to supply basic information derived from low-level performance metrics;
2) the exploitation of application-driven benchmarks, to get a closer insight into the
behaviour of resources for a class of applications under more realistic conditions. In
particular, we considered the following tools for micro-benchmarks: I) Flops [9] re-
turns Millions of Floating-point Operations Per Second (MFLOPS) to measure CPU
performance, II) STREAM [10] and CacheBench [11] measure the bandwidth re-
quired for writing and reading operations, expressed as Bytes per second, to evaluate
respectively main memory and cache, III) MPPTest [12] measures the Latency and
Bandwidth to evaluate the machine's interconnection, and IV) b_eff_io [13] returns
Bandwidth to estimate I/O systems. These metrics are well established and generally
used to evaluate resource performance capacities; moreover we use freely available
tools that could be widely deployed and run [14]. Application-driven benchmarks are
more suitable to mimic the real job workload because of their proximity with the ap-
plication at hand. In the following we consider, as case studies, two applications of
our interest, i.e., linear algebra and isosurface extraction. For the first class of applica-
tions, we selected the well-known High Performance Linpack (HPL) benchmark [15].
For the second, we realised a lightweight version of the application [16], character-
ized by a reduced computational cost, but still capable of maintaining a representative
run of the real application (ISO). An in-depth discussion about the definition and effec-
tiveness of a two-level benchmark methodology has been presented in [17].
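To illustrate how the two levels can be combined into per-metric ranks, the sketch below (the values are invented, not the measurements of Section 4.1) turns raw figures into relative ranks, where higher is better for MFLOPS and memory bandwidth and lower is better for latency and application wall clock time.

# Invented example values; real values would come from Flops, STREAM, MPPTest,
# b_eff_io and from HPL/ISO runs on the actual test bed.
micro = {     # resource -> (MFLOPS, memory bandwidth GB/s, network latency in microseconds)
    "res1": (9000, 5.0, 3.0),
    "res2": (4000, 2.5, 60.0),
}
app_wct = {   # resource -> wall clock time (s) of the application benchmark (e.g. HPL)
    "res1": 120.0,
    "res2": 300.0,
}

def rank(values, lower_is_better=False):
    # map each resource to its rank (1 = best) for one metric
    ordered = sorted(values, key=values.get, reverse=not lower_is_better)
    return {r: i + 1 for i, r in enumerate(ordered)}

mflops_rank  = rank({r: v[0] for r, v in micro.items()})
latency_rank = rank({r: v[2] for r, v in micro.items()}, lower_is_better=True)
hpl_rank     = rank(app_wct, lower_is_better=True)
print(mflops_rank, latency_rank, hpl_rank)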

4 Evaluation of a Performance-Based Job Allocation

To evaluate the effectiveness of our architecture we simulated the job allocation


policy based on performance SLAs and supported by benchmark results. We
considered different application scheduling scenarios to appreciate the actual impact
on SLA commitments. In particular, we compared the performance-based SLAs, i.e.,
taking into account the congruent policy, with a general global scheduler, depicted in

Figure 1. It is reasonable to base the job allocation strategy on the classical round-
robin procedure. We further considered the rank of the resources based on an
established application benchmark, i.e., ISO and HPL ranks.
To test the two components added to the global scheduler, we collected perform-
ance values of five resources under our domain/access, considering both levels of
benchmarks. To simulate the chosen scenarios and to compare the scheduling strate-
gies we employed the Java Modelling Tools [18], an open source tool for perform-
ance evaluation and workload characterization of computer and communication sys-
tems based on queuing networks. In the remainder of this section, we present the re-
sources and experimental results. Please note that in order to focus on the evaluation
of the overall concept we simplify the job allocation component by removing the
feedback loop consisting of prediction and verification components.

4.1 Characterizing the Test Bed

We collected the performance information of five resources under our domain/access.


The aim is to consider different architectures to test the effectiveness of the first com-
ponent added to the global scheduler, i.e., the fine grained analysis of the perfor-
mance, and the improvement we achieved because of the second component, i.e., the
benchmark driven job allocation. The resources are described in Table 1, which
highlights the architectural heterogeneity of our test bed, especially regarding the compu-
ting power (number of CPUs), the type of interconnection and the memory size.

Table 1. Test bed infrastructure

Resource       Proc. Type                            N° Core   Network            RAM
Ibm            2 Quad Core Xeon 2.5 GHz              32        Infiniband         64 GB
michelangelo   2 AMD Opteron 275 2.2 GHz dual core   64        Gigabit Ethernet   424 GB
SC1458         Proprietary                           372       proprietary        1.9 TB
Paperoga       dual 3 GHz Intel Xeon                 8         Gigabit Ethernet   16 GB
Cluster1       2.66 GHz Pentium IV                   16        Gigabit Ethernet   16 GB

The two-level benchmark was run to gain a precise description of the actual
performance offered by the computational systems along different metric axes.
Figures 3 and 4 depict the performance values of the micro and application
benchmarks respectively; we briefly discuss them in the following.
As Figure 3 outlines, the resources provide different performances with respect to
the considered benchmarks. For example, SC1458 achieves almost the best ranks for
the aggregated values and interconnection performance but performs poorly
considering the ranks of the single cores. For the latter, michelangelo and ibm
perform better.

Fig. 3. Ranking of resources based on micro-benchmarks

Figure 4 reports the relative performance of ISO and HPL; each resource is tagged
with a value in the range [1,…,5], where greater values correspond to worse
performance (e.g., ibm and SC1458 rank first according to ISO and HPL respectively).
The ranking was based on the execution Wall Clock Time (WCT).

Fig. 4. Test bed ranking according to HPL and ISO benchmarks

Figures 3 and 4 show that, as expected, none of the resources is the best in all
cases; an accurately designed performance-aware scheduling of the jobs is therefore
essential for fulfilling the SLAs.

4.2 Simulating the Architecture


In order to evaluate the benefit of a fine grained description of the available
resources for different computation tasks, combined with information about job
preferences, we model our system as a queuing network composed of 5 nodes,
corresponding to our heterogeneous test bed, plus a scheduler which dispatches
arriving jobs to the resources. In the global scheduler depicted in Figure 1, different
scheduling strategies can be used, e.g., a round-robin job allocation. However, for
the performance-based SLA architecture we favor the usage of the Congruent
Policy job allocation, which takes into account the appropriate resource properties.
Moreover, we considered two more job allocation strategies based on information
derived using the established ISO and HPL benchmarks respectively. Our objective
is to minimize the Response Time of the system, which takes into account the time
that a job takes to be executed (service time) plus the time spent in the queue
(waiting to be executed).
In the simulation we considered a workload composed of two parallel applications
(linear algebra and isosurface extraction) that have been modelled as two open classes
with exponentially distributed inter-arrival and service times [19]. Service times are
obtained through a real experimentation on the base of the benchmark values as re-
ported in Table 2. They can be considered as the results of the prediction component.

Table 2. Mean service times of each application class


(in parentheses the number of processors spawned for each resource)

        IBM (32)   Michelangelo (32)   SC1458 (128)   Paperoga (8)   Cluster1 (16)
ISO     2.4        3                   35             13             7
HPL     33         25                  4.5            55             62
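As an indication of how such a comparison can be reproduced outside JMT, the following simplified discrete-event sketch (a rough approximation written for illustration, not the authors' JMT model) draws exponentially distributed inter-arrival and service times, uses the mean service times of Table 2, and compares a round-robin dispatcher with a congruent policy that sends each class to its fastest resource.

import random

SERVICE = {   # mean service times (s) from Table 2
    "ISO": {"IBM": 2.4, "Michelangelo": 3, "SC1458": 35, "Paperoga": 13, "Cluster1": 7},
    "HPL": {"IBM": 33, "Michelangelo": 25, "SC1458": 4.5, "Paperoga": 55, "Cluster1": 62},
}
RESOURCES = list(SERVICE["ISO"])

def simulate(policy, arrival_rate=0.1, n_jobs=20000, seed=1):
    rng = random.Random(seed)
    free_at = {r: 0.0 for r in RESOURCES}   # time at which each resource becomes idle
    t, total_resp, rr = 0.0, 0.0, 0
    for _ in range(n_jobs):
        t += rng.expovariate(arrival_rate)              # Poisson arrivals
        job_class = rng.choice(["ISO", "HPL"])
        if policy == "round_robin":
            res = RESOURCES[rr % len(RESOURCES)]; rr += 1
        else:   # congruent: fastest resource for this class of application
            res = min(SERVICE[job_class], key=SERVICE[job_class].get)
        start = max(t, free_at[res])                    # FCFS queue per resource
        free_at[res] = start + rng.expovariate(1.0 / SERVICE[job_class][res])
        total_resp += free_at[res] - t                  # waiting time + service time
    return total_resp / n_jobs

print("round robin:", simulate("round_robin"))
print("congruent  :", simulate("congruent"))

Under these assumed arrival rates the congruent policy yields the lower mean response time, consistent with the trend reported in Figure 5.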

Fig. 5. Response times according to different scheduling strategies at increased workload



In Figure 5 the response times of each strategy at increasing workloads are shown.
It is immediately clear that the proposed performance-based SLA approach
outperforms the other schedulers. This is not surprising, since each resource is
exploited at its best with respect to the incoming workloads, i.e., each application is
allocated to the resources that execute its code in the most efficient way, in our
analysis with the lowest execution time. This leads to faster execution and lower
waiting times; both parameters impact (in this case positively) on the response time.
An increase of computation-intensive workloads also affects our scheduling
mechanism; however, the growth of the response time is moderate compared with
the other tested strategies.

5 Conclusion

In this paper we proposed a performance-based SLA-aware architecture. The main


idea is to characterize resources on the basis of specific benchmarks in order to allow
suitable job allocations. We have presented simulation results which show clear
benefits and which give an indication of what can be expected if our proposed
architecture is implemented for job scheduling.
In particular, we have analyzed and tested just a first part of the proposed architec-
tural concept. We plan to spend further efforts in the elaboration and analysis of the
performance prediction and evaluation components. This will include an evaluation of
different methods for the prediction of expected job execution performance as well as
for the correction based on the deviation between expected and measured results.

Acknowledgements. The authors would like to thank the members of the Munich
Network Management (MNM) Team for their support and many useful discussions.
As a group of researchers from the Ludwig-Maximilians-Universität München, the
Technische Universität München, the University of the German Federal Armed
Forces, and the Leibniz Supercomputing Centre of the Bavarian Academy of Science
and Humanities, the MNM Team focuses on computer networks, IT management,
High Performance Computing, and inter-organizational distributed systems. The team
is directed by Prof. Dr. Dieter Kranzlmüller and Prof. Dr. Heinz-Gerd Hegering. For
more information please visit http://www.mnm-team.org.
This work has partially been funded by the Seventh Framework Program of the
European Commission (Grants 246703 (DRIHMS) and 261507 (MAPPER)), and by
the project REsource brokering for HIgh performance, Networked and Knowledge
based applications (RE-THINK), P.O.R. Liguria FESR 2007-2013.

References
1. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure, 2nd
edn. Elsevier (2004)
2. Distributed European Infrastructure for Supercomputing Applications (May 10, 2011),
http://www.deisa.eu/science/benchmarking

3. Hockney, R.W.: The science of computer benchmarking. Software, environments, tools.


SIAM, Philadelphia (1996)
4. Simmhan, Y., Ramakrishnan, L.: Comparison of Resource Platform Selection Approaches
for Scientific Workflows. In: 19th ACM International Symposium on High Performance
Distributed Computing, HPDC 2010, pp. 445–450 (2010), doi:10.1145/1851476.1851541
5. Clematis, A., Corana, A., D’Agostino, D., Galizia, A., Quarati, A.: Job–resource mat-
chmaking on Grid through two-level benchmarking. Future Generation Computer Sys-
tems 26(8), 1165–1179 (2010)
6. Bode, B., et al.: The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters.
In: 4th Annual Linux Showcase and Conference, Atlanta, USA (2000)
7. Frey, J., Tannenbaum, T., Livny, M., Foster, I., Tuecke, S.: Condor-G: A Computation
Management Agent for Multi-Institutional Grids. Cluster Computing 5(3), 237–246 (2002)
8. Huedo, E., Montero, R., Llorente, I.: A framework for adaptive execution in grids. Soft-
ware Practice and Experience 34(7), 631–651 (2004)
9. Flops Benchmark (May 10, 2011),
http://home.iae.nl/users/mhx/flops.html
10. McCalpin, J.D.: Memory Bandwidth and Machine Balance in Current High Performance
Computers. In: IEEE Technical Committee on Computer Architecture (TCCA) Newsletter
(1995)
11. Mucci, P.J., London, K., Thurman, J.: The CacheBench Report, University of Tennessee
(Cachebench Home Page) (May 10, 2011),
http://icl.cs.utk.edu/projects/llcbench/cachebench.html
12. Gropp, W., Lusk, E.: Reproducible Measurements of MPI Performance Characteristics. In:
Margalef, T., Dongarra, J., Luque, E. (eds.) PVM/MPI 1999. LNCS, vol. 1697, pp. 11–18.
Springer, Heidelberg (1999), http://www-unix.mcs.anl.gov/mpi/mpptest/
13. Rabenseifner, R., Koniges, A.E.: Effective File-I/O Bandwidth Benchmark. In: Bode, A.,
Ludwig, T., Karl, W.C., Wismüller, R. (eds.) Euro-Par 2000. LNCS, vol. 1900, pp. 1273–
1283. Springer, Heidelberg (2000)
14. Tsouloupas, G., Dikaiakos, M.: GridBench: A Tool for the Interactive Performance Explo-
ration of Grid Infrastructures. Journal of Parallel and Distributed Computing 67, 1029–
1045 (2007)
15. The High Performance LINPACK Benchmark (May 10, 2011)
http://www.netlib.org/benchmark/hpl/
16. D’Agostino, D., Clematis, A., Gianuzzi, V.: Parallel Isosurface Extraction for 3D Data
Analysis Workflows. Distributed Environments, Concurrency and Computation: Practice
and Experience (2011), doi:10.1002/cpe.1710
17. Clematis, A., D’Agostino, D., Galizia, A., Quarati, A.: Profiling e-Science Infrastructures
with Kernel and Application Benchmarks. Submitted for publication in the Journal of
Computer Systems Science and Engineering
18. Casale, G., Serazzi, G.: Quantitative System Evaluation with Java Modeling Tools. In:
ICPE 2011, Karlsruhe, Germany, March 14-16 (2011)
19. Lazowska, E.D., Zahorjan, J., Scott Graham, G., Sevcik, K.C.: Quantitative System Per-
formance - Computer System Analysis Using Queuing Network Models. Prentice-Hall,
Inc. (1984)
User Centric Service Level Management
in mOSAIC Applications

Massimiliano Rak, Rocco Aversa,


Salvatore Venticinque, and Beniamino Di Martino

Dipartimento di Ingegneria dell’Informazione, Seconda Università di Napoli


massimiliano.rak@unina2.it

Abstract. Service Level Agreements (SLAs) aim at offering a simple
and clear way to build up an agreement between the final users and the
service provider in order to establish what is effectively granted by the
cloud providers. In this paper we show the SLA-related activities
in mOSAIC, a European funded project that aims at exploiting a new
programming model, which fully acquires the flexibility and dynamicity
of the cloud environment, in order to build up a dedicated solution for
SLA management. The key idea of SLA management in mOSAIC is that
it is impossible to offer a single, static, general purpose solution for SLA
management of any kind of application, but it is possible to offer a set of
micro-functionalities that can be easily integrated with one another in
order to build up a dedicated solution for the application developer's
problem. Due to the mOSAIC API approach (which enables easy
interoperability among mOSAIC components) it is possible to build up
applications enriched with user-oriented SLA management from the very
early development stages.

1 Introduction

Cloud Computing is the emerging paradigm in distributed environments. In
cloud solutions everything from the hardware to the application layers is delegated
to the network. Following this main idea, many different technologies and solutions
have emerged and adopted the cloud label.
Cloud Computing technologies are moving along two orthogonal directions:
on one side they aim at transforming existing datacenters into Cloud Providers,
which offer everything in the form of a service (as a service paradigm); on the
other side they aim at building applications and solutions that, as much as
possible, fit the users' needs.
As an example, one of the more important effects of the Cloud Computing
paradigm is the extreme flexibility in application and system re-configurability: it
is incredibly simple to acquire new resources and to release them if they are not
required. Due to this flexibility and capability of self-adaptation to the request,
cloud applications become User Centric. It means that the role of the final user
is central. The quality metrics (availability, performance, security) should aim at

This research is partially supported by FP7-ICT-2009-5-256910 (mOSAIC).


improving the quality perceived by users, while resource administration and
optimization assume the role of acquiring the right amount of resources, which
are compliant with the needs of the final users, instead of optimizing the usage
of the already acquired ones.
Service Level Agreements (SLAs) aim at offering a simple and clear way to
build up an agreement between the final users and the service provider in order
to establish what is effectively granted. A Service Level Agreement (SLA) is
an agreement between a Service Provider and a Customer, that describes the
Service, documents Service Level Targets, and specifies the responsibilities of
the Provider and the Customer.
From the user's point of view, a Service Level Agreement is a contract that
guarantees what he will effectively obtain from the service. From the application
developer's point of view, SLAs are a way to have a clear and formal definition
of the requirements that the application must respect.
But, in such a context, how is it possible for an application developer to take
into account the quality perceived by EACH final user of his application? This
problem is solved by the adoption of SLA templates (i.e. predefined agreements
offered to final users). This means that developers must identify at design time
the constraints to be fulfilled, the performance indexes to be monitored and the
procedures to be activated in risky situations (i.e. when something happens
that may lead to a violation of the agreement).
At the state of the art, many research efforts have been spent in order to define
standards for SLA description (WS-Agreement [1], WSLA [6]) or operative
frameworks for SLA management (SLA@SOI [10,2], WSAG4J [11]). As shown
in more detail in the related work section (Section 2), the need for SLA
management in the cloud context is fully recognized, but there are, at the state
of the art, no clear proposals about an innovative approach to SLA management
that takes into account the User Centric view, which is typical of the cloud
environment.
In this context the mOSAIC project [4,7] proposes a new, enhanced program-
ming paradigm that is able to exploit the cloud computing features, building
applications which are able to adapt themselves as much as possible to the avail-
able resources and to acquire new ones when needed (more details in section
3). mOSAIC offers both an API for the development of cloud applications which
are flexible, scalable, fault tolerant and provider-independent, and a framework for
enabling their execution and access to different technologies.
The key idea of SLA management in mOSAIC is that it is impossible to offer
a single, static, general purpose solution for SLA management of any kind of
applications, but it is possible to offer a set of micro-functionalities that can be
easily composed in order to build up a dedicated solution for the application
developer's problem. In other words, thanks to the mOSAIC API approach (which
enables easy interoperability between mOSAIC components) it will be possible
to build up applications with user-oriented SLA management features, from the
very early development stages.
The remainder of this paper is organized as follows: the next section (Section 2)
summarizes the state of art of SLA management solutions, while the following

one briefly summarizes the main concepts related to mOSAIC API and how it
is possible to develop applications using mOSAIC. Section 4 proposes the vision
of the SLA problem in the context of cloud applications, which is detailed in
the section dedicated to the architectural solution 5. A brief section dedicated
to examples (section 6) shows how the approach has been applied in simple case
studies. The paper ends with a section dedicated to the current status, future
work and conclusions.

2 Related Work

As anticipated in the introduction a lot of research work exists on the problem


of Service Level Management in the cloud. The biggest part of this work focuses on
building solutions which make it possible for a cloud provider to offer (using the
SLA) its services with a guaranteed quality. In this direction the most interesting
results come from SLA@SOI [10,2]. It is a European project that aims (together
with other relevant goals) at offering an open source based SLA management
framework. The SLA@SOI results are extremely interesting and offer a clear
starting basis for SLA management in complex architectures. The main target
of the project goes in the direction of enriching the Cloud Provider offer: they
aim at developing a solution that can be integrated with Cloud technologies
(like OpenNebula) in order to offer Cloud Services (mainly infrastructure and
image management services) through an SLA-based approach. SLA@SOI offers
solutions to design SLAs together with the offered services and to generate and
manage them through many different representations.
[3] proposes a complex architecture for building cloud providers for both Infras-
tructures and Platform as a services solutions. In their proposal they assumed an
heavy role for the SLA management: each service request is enriched with a SLA,
and service evolution in the architecture is traced together with its own SLA.
They are reusing and applying many of the SLA@SOI technologies in order to
build up their solutions. Even in the OPTIMIS project [8], which aims at build-
ing a Platform as a service solution on the basis of federated clouds (i.e they
offer a complex dynamic system able to acquire virtual machines and storage
resources from many different providers and deploy over them enterprise solu-
tions), the role of SLA is very relevant. In Optimis, moreover, they have special
care in managing the dynamicity and elasticity of acquired resources. OPTIMIS
needing puts in evidence an additional aspect of the SLA management in cloud
environment: the autonomicity, the needing of building solutions which are able
to self-manage themselves. SLA and Autonomic approaches are strictly related
each other, but partially in conflict: how to grant an agreement for a given
specific service on resources that dynamically adapt their behavior due to the
dynamic workload changes (how many requests, which kind of requests, ...) in a
completely independent way? This topic is completely open and, at the best of
author’s knowledge there are no available solutions. mOSAIC Project assumes
that these two requirements are complementary and equally needed: the appli-
cation developer needs solutions able to self-adapt themselves and, at the same

time, to have a clear and full notion of the state of the system allowing him to
take decisions, which lead to offer the grants needed in SLA management.

3 mOSAIC API
In mOSAIC a Cloud Application is developed as a composition of inter-connected
building blocks. A Cloud ”Building Block” is any identifiable entity inside the
cloud environment. It can be the abstraction of a cloud service or of a software
component. It is controlled by the user and configurable, it exhibits a well defined
behavior, it implements functionalities and exposes them to other application
components, and its instances run in a cloud environment consuming cloud resources.
Simple examples of components are: a Java application runnable in a plat-
form as a service environment; or a virtual machine, configured with its own
operating system, its web server, its application server and a configured and
customized e-commerce application on it. Components can be developed fol-
lowing any programming language, paradigm or in-process API. An instance of
a cloud component is, in a cloud environment, what an instance of an object
represents in an object oriented application.
Communication between cloud components takes place through cloud re-
sources (like message queues – AMQP, or Amazon SQS) or through non-cloud
resources (like socket-based applications).
Cloudlets are the way offered to developers by the mOSAIC API to create
components. A Cloudlet runs in a cloudlet container that is managed by the
mOSAIC Software Platform. A Cloudlet can have multiple instances, but it is
impossible at run-time to distinguish between two cloudlet instances. When a
message is directed to a cloudlet, it can be processed by any one of the cloudlet
instances. The number of instances is under the control of the cloudlet container
and is managed in order to guarantee scalability (with respect to the cloudlet
workload). Cloudlet instances are stateless.
Cloudlets use cloud resources through connectors. Connectors are an abstraction
of the access model of cloud resources (of any kind) and are technology
independent. Connectors control the cloud resource through technology-dependent
Drivers. As an example, a cloudlet is able to access Key-Value store systems
through a KVstore Connector, which uses an interoperability layer in order to
control a Riak or a MemBase KV driver.
Therefore a Cloud Application is a collection of cloudlets and cloud compo-
nents interconnected through communication resources.
Details about the mOSAIC programming model and about the Cloudlet con-
cept, whose detailed description is out of the scope of this paper can be found
in [4,5,7].
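The following is a minimal, purely illustrative sketch (written in Python; the class and method names are assumptions and do not correspond to the real, Java-based mOSAIC API) of the separation described above: a stateless cloudlet reacts to messages delivered by its container and touches cloud resources only through a connector, which hides the technology-specific driver.

class InMemoryDriver:                     # stand-in for a technology-specific KV driver
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value

class KVStoreConnector:
    # technology-independent facade; a real connector would delegate to a
    # Riak, MemBase, ... driver through an interoperability layer
    def __init__(self, driver):
        self.driver = driver
    def put(self, key, value):
        self.driver.put(key, value)

class EchoCloudlet:
    # stateless: all state goes to the KV store, so any instance of the
    # cloudlet can process any incoming message
    def __init__(self, kv):
        self.kv = kv
    def on_message(self, message):        # invoked by the cloudlet container
        self.kv.put(message["id"], message["payload"])

kv = KVStoreConnector(InMemoryDriver())
EchoCloudlet(kv).on_message({"id": "42", "payload": "hello"})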

4 Vision and Approach


Consider a service developer who is using the mOSAIC project solution in order
to build up a new cloud application that offers complex services to its own users.
Providers offer simple SLAs, just granting the availability of the resources
and (in some cases) the bandwidth and latency of the access network (see
Parameters). The cloud application will be accessed by a web interface, whose


main functionalities are complex operations which may need some seconds to be
performed. The developer would like to offer to its users an SLA, in order to grant
average and maximum response times, together with a given availability. The SLA
problem for a Cloud Application can now be described in the following terms:
how can the developer ensure the user requirements in terms of SLA (parameters
like average and maximum response time, service availability, ...) using resources
which are not under its control and whose SLAs grant different kinds of service
properties (e.g. availability of the VM)? What should the mOSAIC Platform
offer to developers in order to solve this problem?
The problem of Service Level Agreements in mOSAIC can be modeled by two
layers. At each layer there is a different agreement: (1) the agreement between the
final users and the Application; (2) the agreement between the Application and
the Cloud providers.
The agreement between users and the application (point 1) means that the
application and the user share an SLA. Inside the cloud application a negotiator
concludes the agreement that will be shared with the final user. Moreover, from
the developer's point of view, we need tools able to predict the application
behavior, to monitor its evolution, and to modify the application behavior and/or
the acquired cloud resources in order to: (a) accept/refuse the agreements, (b)
identify risky situations for stipulated agreements, (c) apply the actions needed
to guarantee the stipulated agreements.
Concerning the cloud resources to be acquired (point 2), the developer has to
search for and choose the Cloud Providers, negotiating the SLA with them (acting
as a client) and monitoring the promised service levels. Note that the application is
forced to search for a provider which allows it to grant as much as possible the
service levels offered to the user. Moreover, at the state of the art, no Cloud
Provider offers negotiable SLAs; they just offer a set of static SLAs. It is the
mission of a federation framework (like mOSAIC) to build up an SLA negotiation
system on top of the existing ones. Agreement enforcement, instead, implies
a large set of different problems: monitoring the state of the resources in order
to identify risky situations, execution of recovery procedures in order to react
to dangerous states and, last but not least, agreement mapping or delegation.
Agreement mapping (or delegation) means the definition of the SLA terms which
can be offered to the final users on the basis of the SLAs offered by the providers.
Agreement enforcement implies the capability of predicting the application
behavior. This is strictly dependent on the application and can be performed only
by the service developer, using application-dependent techniques and tools
(e.g. Petri Nets, simulation techniques or analytical models). From the mOSAIC
point of view the above presented problem can be managed as follows: the first
agreement problem (user-application) should be solved at the Software Platform
level, i.e. by including in the API a set of components which let the final
application easily build up a negotiation system, and by using tools able to change
the kind or quantity of acquired cloud resources. The second agreement problem
(application-providers) is solved by the Cloud Agency.
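As a toy illustration of agreement mapping (the figures and the mapping rule are assumptions, not part of mOSAIC or of any provider's offer), the user-facing availability that can be safely promised may be derived from the availability granted by the providers of the resources the application depends on:

# Hypothetical mapping: if the application needs all listed resources to be up,
# the availability it can promise is at most the product of the providers' grants,
# reduced by a safety margin chosen by the developer.
provider_availability = {"vm_provider": 0.995, "storage_provider": 0.999}
SAFETY_MARGIN = 0.003                     # developer's own buffer (assumption)

offered = 1.0
for a in provider_availability.values():
    offered *= a
offered -= SAFETY_MARGIN

print("user-facing availability that can be offered:", round(offered, 4))   # ~0.991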

5 General Overview of the SLA Architecture


The main assumptions in SLA management for mOSAIC applications can be
summarized as follows:
– microfunctionality approach: mOSAIC will offer a set of cloudlets and com-
ponents whose interface is defined in terms of exchanged messages. Appli-
cation developers simply have to use the default components, connect them
through queues and develop the ones needed to build up their own custom
solution.
– User Centric Approach, which means that it should be possible for an appli-
cation developer to maintain information for each user, taking into account
his requirements and what the application promised to him.
– Support for autonomicity of the components. In the cloud environment a lot of
effort is spent on the development of applications, resources and services with
elastic features, i.e. able to change their behavior according to the execution conditions.
The key idea of the microfunctionalities approach is to provide a set of
components, even for SLA management, to be composed together with the
functional ones according to the application requirements and its logic. This choice
is mandatory because there is no single solution that solves the general SLA
problem independently of the application.
The mOSAIC global architecture is composed of the mOSAIC API, which
includes the SLA components, and the framework, which uses a provisioning
system (the Cloud Agency) and the tools needed to run mOSAIC applications
(cloudlet containers, application deployers, etc.).
Negotiation of SLA with cloud providers on a federation basis is completely
solved by the Cloud Agency [9]. Here the Cloud Agency is used as a black box
that can book, on the user's behalf, the best set of cloud resources for his
application, after the developer has defined the requirements of the desired
resources. The Cloud Agency will be able to offer negotiable SLAs on top of many different com-
mercial cloud providers. The Cloud Agency solves the problem of Agreement
between applications and Cloud Providers, offering the Negotiator, which im-
plements the SLA negotiation towards a large set of different CPs and, once the
SLA is approved, delivers the resources to the application.

SLA User Negotiation. This module contains all the cloudlets and compo-
nents which enable interactions between the user and the application in terms
of SLA negotiation.
SLA Monitoring/Warning System. This module contains all the cloudlets
and components needed to detect warning conditions and generate alerts
about the difficulty of fulfilling the agreements (a minimal sketch of such a
check is given after this list). It should address both resource and application
monitoring. It is connected with the Cloud Agency.
SLA Autonomic System. This module includes all the cloudlets and com-
ponents needed to manage the elasticity of the application, and modules
that are in charge of making decisions in order to guarantee that the acquired
resources fulfill the agreements.
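A minimal sketch of the kind of check performed by the Monitoring/Warning module is given below; the metric, the warning threshold and the alert structure are illustrative assumptions, not part of the mOSAIC components.

# Illustrative warning rule: raise an alert when the measured response time
# approaches or exceeds the value committed in the SLA.
WARNING_FACTOR = 0.8                      # warn at 80% of the committed value (assumption)

def check_slo(measured, committed, alert_queue):
    if measured >= committed:
        alert_queue.append(("violation", measured))
    elif measured >= WARNING_FACTOR * committed:
        alert_queue.append(("warning", measured))

alerts = []
check_slo(measured=1.7, committed=2.0, alert_queue=alerts)
print(alerts)   # [('warning', 1.7)]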

Fig. 1. mOSAIC SLA Module Organization

6 An Example of mOSAIC Approach


In the following we aim at showing in practice how the approach works on a
simple example, derived from the mOSAIC case studies. The mOSAIC Service
Developer has an application developed on top of a job submission system
and aims at developing a cloud application which offers SLAs and controls the
job submission system, varying parameters like the number of virtual machines
running in the cloud. The user's behaviour can be easily described: each user has a
job description that is filled in a User Request file; the user invokes the cloud
application passing the file as a parameter. On the Service Provider portal the
user is able to access the state and available results of the application execution.
The mOSAIC developer aims at building an application which acquires virtual
clusters from commercial cloud providers, configured in order to be able to execute
the requested jobs. Moreover, the developer would like to offer the same service
at different quality levels (varying guarantees like the exclusive use of a machine or
the maximum CPU consumption allowed), expressed through SLAs. The developer
needs the ability to monitor the SLAs and to know what is granted to each different
user.
Adopting an SLA based approach, the user’s behaviour changes as follows:
(a) User subscribes to the application signing an Agreement (SLA) (Agreement
use case) (b) User submits his job request to the application (Submit use case)
(c) User queries the portal about the status of his requests (Check).
In order to manage the SLAs, the application has the following duties:
– maintains the list of all the agreements signed;
– maintains the list of all the requests made;
– acquires and maintains resources from cloud providers in order to execute the
requests of the users;
– monitors the SLA fulfillment;
– activates procedures in risky situations.

The complexity of the application depends on the guarantees offered by the SLA
and on the kind of target application running on top of the job submission
system (as an example: is it easy to predict its response time?). The complexity
due to the application behaviour (its predictability, the actions to take in order
to guarantee application-dependent parameters, etc.) cannot be defined in general.
On the other hand, the management of the SLA toward the user (negotiation), the
monitoring of the resource status and the management of the SLA storage are
common to all applications. In the following we will focus on the components
offered in mOSAIC for such requirements.
As a first step we design the SLA that the developer aims at offering to
the users; in order to simplify the approach, we model it by just two simple
parameters: the maximum amount of Credits the final user wants to pay and
the maximum number of requests the user is allowed to submit. Moreover, the
application assures that the services will be offered on dedicated resources (it
will not sell the same resources to two users). This agreement is represented as
a WS-Agreement template, some pieces of which are shown in Listing 1.1.

Listing 1.1. Example of User SLA request in WS-Agreement


[..]
<wsag:VariableSet>
  <wsag:Variable wsag:Name="UserRequests" wsag:Metric="xs:integer">
    <wsag:Location>
      $this/wsag:Terms/wsag:All/wsag:ServiceDescriptionTerm
        [@wsag:Name='Term1']/mosaic:JobSubmission/MaxRequests
    </wsag:Location>
  </wsag:Variable>
  <wsag:Variable wsag:Name="UserCredit" wsag:Metric="xs:integer">
[..]
<wsag:GuaranteeTerm wsag:Name="MaxCredit">
  <wsag:ServiceLevelObjective>
    <wsag:KPITarget>
      <wsag:KPIName>MaxCreditLevel</wsag:KPIName>
      <wsag:CustomServiceLevel>UserCredit GT 0</wsag:CustomServiceLevel>
    </wsag:KPITarget>
  </wsag:ServiceLevelObjective>
</wsag:GuaranteeTerm>
<wsag:GuaranteeTerm wsag:Name="MaxRequests">
[..]

Note that the monitoring of the SLA can be done independently of the state
of the acquired resources, but just by tracing the requests. This means that we
will not use the autonomic and monitoring modules of the SLA architecture.

Once the SLA has been defined, we briefly show how to design the application,
whose behaviour includes the SLA negotiation and the storing of the agreement
once it has been signed. For each user request the application evaluates the acquired
resources and the available credit, possibly starts new resources, and then submits
the job to the acquired VC.
Following the microfunctionalities approach, the application can be designed
as in Figure 2. The mOSAIC API offers a simple SLAgw component, which
implements the WS-Agreement protocol (toward the final user) and sends messages
on predefined queues in order to update the application. As a consequence the
programmer has to develop a few cloudlets: an Agreement Policy Cloudlet,
which has the role of accepting or rejecting an SLA; a Request Cloudlet, which has
the role of forwarding the user requests to the job submission system; and two
cloudlets, the Resource Policy Cloudlet and the Guarantee Policy Cloudlet, which
have respectively the roles of tracing the acquired resources and generating
warnings for risky conditions. Cloudlets cooperate only through message exchange,
coordinating their actions. As an example, the Agreement Policy Cloudlet receives
messages from the SLAgw each time a new SLA request takes place. Moreover, it
sends messages to the SLAgw in order to update the agreement state and to
query about the status of the agreements. Message data are represented in
JSON (that helps when data need to be stored in a KV store). As an example
the messages sent by the SLAgw are JSON representations of ServiceTypes and
GuaranteeTerms extracted from the WS-Agreement. Note that they can be cus-
tomized by the final user (WS-Agreement standard is open to this) and only final
user knows how to represent them. The Monitoring cloudlet regularly checks the
status of each user and eventually applies penalties (not reported in the WSAG
for simplicity and sake of space).
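To make the message format concrete, the following sketch shows what a JSON message from the SLAgw to the Agreement Policy Cloudlet might look like; the field names are not prescribed by mOSAIC and are chosen here only for illustration.

import json

# Hypothetical example of a message the SLAgw could place on the queue of the
# Agreement Policy Cloudlet: a JSON rendering of the ServiceType and the
# GuaranteeTerms extracted from the WS-Agreement. Field names are illustrative.
sla_request_message = {
    "agreementId": "agreement-42",
    "serviceType": {"name": "JobSubmission", "maxRequests": 100},
    "guaranteeTerms": [
        {"name": "MaxCredit", "kpi": "MaxCreditLevel", "objective": "UserCredit GT 0"},
        {"name": "MaxRequests", "kpi": "RequestLevel", "objective": "UserRequests LE 100"},
    ],
}

# Serialized form as it would travel on the queue (and could be stored in a KV store)
payload = json.dumps(sla_request_message)
print(json.loads(payload)["guaranteeTerms"][0]["objective"])  # -> "UserCredit GT 0"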

Fig. 2. Example of SLA-based application

7 Conclusions
As outlined in Section 2, the management of Service Level Agreements is a hot topic in cloud environments. In the mOSAIC project, which aims at designing and developing a cloud-provider-independent API, we propose a set of features that should help the application developer to integrate SLA management in his applications. In this paper we have outlined the vision and approach proposed in mOSAIC, the organization of the framework dedicated to SLA management, and a simple example in which we have designed an SLA management system around a simple cloud application.

Service Level Management for Executable
Papers

Reginald Cushing 1, Spiros Koulouzis 1, Rudolf Strijkers 1,3, Adam S.Z. Belloum 1, and Marian Bubak 1,2

1 University of Amsterdam, Institute for Informatics, The Netherlands
2 AGH University of Science and Technology, Department of Computer Science, Poland
3 TNO Information and Communication Technology, The Netherlands

Abstract. Reproducibility is considered one of the main principles of the scientific method and refers to the ability of an experiment to be accurately reproduced by a third party; in a complex experiment every detail matters to ensure correct reproducibility. In the context of ICCS 2011, Elsevier organized the Executable Paper Grand Challenge, a contest to improve the way scientific information is communicated and used. While during this contest the focus was on developing methods and techniques to realize the idea of executable papers, in this paper we focus on the operational issues related to the creation of a viable service with a predefined QoS.

1 Introduction
The idea of interactive papers is not new; the very first steps in this field were introduced with the HyperText Markup Language [1]. A reader of a web page was able to navigate from page to page by simply clicking on the link associated with a certain concept. The technical details of the systems supporting the HyperText Markup Language are rather complex; however, the way it is exposed to both the readers and the writers of web pages is intuitive: for the reader it is just colour-encoded text, while for the writer it is just a simple line of code with a very simple syntax. When applets and ECMAScript (http://en.wikipedia.org/wiki/ECMAScript) were introduced, the concept of hypertext was pushed further: readers of web documents could execute small applets and client-side scripts to run simple applications. The Executable Paper (EP) Grand Challenge, organized by Elsevier in the context of the International Conference on Computational Science (http://www.iccs-meeting.org/), aims to push this concept one step further to include scientific publications. However, this is not a trivial transition, as many scientific publications are about complex experiments, which are often computing and data intensive, or require special software and hardware. Proprietary software used by experiments is also subject to strict licensing rules. The papers published in the grand challenge workshop propose various solutions to realize the executable paper concept [6,7,8,9]. These papers focus on the technical details and technology choices but give little attention

to the operational aspects associated with the deployment of such a service, and to what the impact on the stakeholders would be of providing a reliable and scalable service allowing the re-execution of published scientific experiments.
The rest of the paper is organized as follows: Section 2 describes the executable paper lifecycle, Section 3 discusses the exploitation of executable papers, Section 4 describes the implementation of executable papers using a Cloud approach, and Section 5 discusses the SLM needed to achieve a certain QoS.

2 Executable Papers Lifecycle

The concept of EPs is feasible only if the lifecycle governing this concept is clear and the role of the different actors is well defined throughout the entire lifecycle of the production of the executable paper. This lifecycle starts from the time the authors decide to write the paper, goes through the review process, and ends with the publication of the paper. The role of the authors, in the current publication cycle, finishes when the paper is accepted for publication. The publisher is the second actor, as he makes the paper available and accessible to potential readers. The third actor is not directly active in the creation of the paper but is still very important, as it provides the infrastructure the author needs to perform the experiment to be included in the paper. The third actor is usually the institution to which the author belongs at the time he is writing his paper. After the publication of the paper, maintaining the infrastructure needed to reproduce scientific experiments is not the primary interest of research institutions. A very important question is then posed: which actor will take the role of providing the logistics needed to keep the EP alive? We believe that the publisher is the only actor capable of taking over this task. However, providing a service that allows a reader to re-run experiments is completely different from providing a service that just gives access to a digital version of the paper. In this case the publisher will have to maintain a rather complex computing and storage infrastructure that might be beyond the scope of the publisher's actual interests and expertise. Outsourcing this task to a specialized computing service provider might be a possible solution, where Service Level Agreements (SLAs) play a vital role in maintaining an EP and re-running experiments in a timely fashion so as to maintain an acceptable reader experience. We will develop this solution further in the rest of the paper.

3 Exploitation of Executable Papers

Reproducibility is considered one of the main principles of the scientific method and refers to the ability of an experiment to be accurately reproduced by a third party; in a complex experiment every detail matters to ensure correct reproducibility. Dissemination of the knowledge contained in a scientific paper often requires details that can hardly be described in words and that, if added to the paper, would make it more difficult to read.
Fig. 1. Lifecycle of an EP: experiment results trigger the writing of scientific papers; it is thus important that readers of these papers are able to explore and, if needed, re-execute these experiments

There are a couple of everyday scenarios in science where the concept of the executable paper is indeed needed. The first one is the review process of scientific publications: reviewers selected by conference organizers and publishers to assess the quality of newly submitted papers often have to verify the published results. For that, they need to trace back the path to the initial data, or to verify parameters used in a specific step of the proposed method, and in certain cases even re-run part of the experiment.
The second most common scenario arises when scientists are reading an already published paper. Often they are interested in reusing part of the published results, whether these results are algorithms, methods or tools. Currently this is done by contacting the authors and trying to get the needed information, but often the authors are not reachable or their current research topics differ from the one published in the paper.
From these exploration scenarios, we can identify the actors active during the
various phases of the lifecycle of the executable paper (Table 1).
With the emergence of reliable virtualization technologies, which are capable of hiding the intricacies of complex infrastructures, publishers can offer more than just static access to scientific publications [5,4]. The reader of a published scientific publication should be able to re-execute part of the experiment. Figure 2 illustrates the interactions between the various entities in the EP scenario. SLAs between readers and the publisher exist which define a certain QoS expected by the reader, such as a maximum time for re-running experiments. Readers are often affiliated with institutions, for which an SLA between the institution and the publisher could exist. The publisher manages a set of SLAs with service providers for outsourcing the re-execution of the experiment. Since experiments vary in complexity, the SLAs define which provider is capable of executing the experiment within the QoS parameters.
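A minimal sketch of this selection step, assuming the publisher keeps a catalogue of provider SLAs with the QoS they can guarantee (provider names, fields and numbers are hypothetical):

# Illustrative sketch: the publisher picks a service provider whose SLA covers
# the QoS requested by the reader for re-running an experiment.
provider_slas = [
    {"provider": "cloud-A", "max_rerun_hours": 2,  "cost_per_run": 40.0},
    {"provider": "cloud-B", "max_rerun_hours": 24, "cost_per_run": 5.0},
]

def select_provider(requested_max_hours, budget):
    """Return the cheapest provider whose SLA satisfies the requested QoS."""
    candidates = [
        sla for sla in provider_slas
        if sla["max_rerun_hours"] <= requested_max_hours and sla["cost_per_run"] <= budget
    ]
    if not candidates:
        return None  # no SLA can deliver the requested QoS
    return min(candidates, key=lambda sla: sla["cost_per_run"])

print(select_provider(requested_max_hours=4, budget=50.0))  # -> the cloud-A entry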
Fig. 2. Interaction of various entities during the lifecycle of an EP. In (1) the author creates an EP; (2) the reviewer reviews the paper and possibly re-runs the experiment; (3) a reader reads the EP after publication and can also re-run the experiment; (4) the publisher, upon request from the reviewer or reader, can outsource the execution of the experiment. Depending on the SLA between reviewer, reader and the publisher, the publisher can choose amongst a set of SLAs to pick the best service provider which can deliver the QoS requested by the reviewer or reader.

Table 1. Main actors involved in the realization and executable paper lifecycle

Actor                          | Role                                            | Active in phase
Actor 1: Scientist             | author of a scientific publication              | experimentation and writing
Actor 2: Scientist affiliation | provides the computing infrastructure           | experimentation
Actor 3: Publisher             | publishes scientific publications               | review and publication
Actor 4: Reviewer              | assesses the quality of scientific publications | review
Actor 5: Provider              | provides the computing infrastructure           | publication
Actor 6: Scientist (customer)  | any scientist who wants to re-run experiments   | publication
4 Challenges Facing the Implementation of Executable Papers
The implementation of the executable paper concept faces a number of challenges of different natures: administrative, intellectual property and technical.
– Administrative issues are related to the role of the actor which will provide the computing infrastructure to re-run part of or the entire experiment. As we have pointed out in Section 3, when the paper is published there is no guarantee that the infrastructure used to produce the results is available for re-runs.
– Intellectual property issues: most scientific experiments use third-party software which is licensed to the institution of the author of the EP at a certain time and under certain conditions, which might change over time. In certain cases even the data used in a scientific experiment is subject to licensing and privacy issues.
– Technical issues are related to the environment in which the experiment has been performed: CPU architecture, operating system, and third-party libraries.
The technical issues, even if they might in some cases be complex, are still the easier ones to solve, as virtual machine technology is nowadays able to create a self-contained and reliable system platform which supports the execution of a complete operating system. Virtual images can be started on demand to re-run a certain application; this approach is widely used in Cloud computing [10].
While the virtual machine approach can solve the problems of the working environment and library dependencies, it still leaves the IP issues open. Jeff Jones explains in his blog how tracking software assets on virtual images is gaining momentum [2]. Even if it is possible to re-run an experiment published in an EP, there are still IP issues that need to be solved. Whoever provides a service able to re-run published scientific experiments has to acquire licenses for common software in a certain scientific domain, which might partially solve the IP problem.
Publishers are the potential actors able to provide a service which implements the executable paper concept: using the virtualization technique, they can provide, without having to know the details of a given experiment, a service which is able to re-run the published experiment. To implement such a solution, publishers either have to develop in-house the expertise and the infrastructure needed to re-execute EPs or have to outsource the provisioning of the needed infrastructure to Cloud and Grid providers. Technically, Cloud providers such as Amazon, Microsoft, Google etc. are able to provide the needed infrastructure against a fixed cost [3].

5 Discussions
Any solution for the EP has to be intuitive and should not add much further
burden on the actors involved in the EP lifecycle. A number of tools and services
have to be developed to support all these actors in accomplishing their respective tasks. From the author's point of view, the services needed are: a service for collecting provenance information while he/she is doing the experimentation, a service for creating annotations when writing the paper, and a framework to create a virtual image of the environment in which the experiment has been performed. From the reviewer's point of view, services are needed to interact with the paper, query details when needed, and re-run experiments.
The publisher, as a service provider, will play a key role in the realization of the EP. Currently publishers provide access to scientific papers; such a service has to be extended to upload the created virtual images needed to re-run the published experiments.
Because in the proposed approach publishers will outsource the provisioning of the needed infrastructure to an independent service provider, service level management is needed. In the case of EPs the usage of the resources may vary a lot, from a couple of computing nodes to a much larger infrastructure. Publishers may offer a whole spectrum of EP categories covering a wide range of features, from fast and immediate to slow or scheduled at a later time. The publisher acts as a composite service provider whereby externally provided services are integrated at run-time into end-to-end composite services.
From the provider's (publisher's) point of view, each service can be provided by a different infrastructure provider, in different implementations, and with different functional characteristics. The provider has to determine at runtime which supplier to use in the composite service and has to manage the service provision in an automated fashion. If the supplier fails to meet the SLA between the provider and the supplier, a fail-over supplier is provisioned and the execution is re-established automatically.
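A sketch of this automated fail-over is given below; it assumes that each supplier exposes a (hypothetical) run function and signals SLA violations as exceptions, which is not prescribed by the paper but serves to make the behaviour concrete.

# Illustrative sketch of the fail-over behaviour: the composite service provider
# tries the preferred supplier first and, if it fails to meet its SLA, provisions
# the next one and re-establishes the execution transparently for the customer.
class SlaViolation(Exception):
    """Raised when a supplier cannot fulfil the agreed SLA (e.g. outage, timeout)."""

def execute_with_failover(experiment, suppliers):
    for supplier in suppliers:          # suppliers ordered by preference/cost
        try:
            return supplier.run(experiment)
        except SlaViolation:
            continue                    # provision the next supplier
    raise RuntimeError("no supplier could fulfil the customer SLA")

class DummySupplier:
    def __init__(self, name, healthy):
        self.name, self.healthy = name, healthy
    def run(self, experiment):
        if not self.healthy:
            raise SlaViolation(self.name)
        return f"{experiment} executed at {self.name}"

print(execute_with_failover("paper-1234-experiment",
                            [DummySupplier("site-A", False), DummySupplier("site-B", True)]))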
The SLA between the customer and the provider is fulfilled without the customer being aware of failures and of the interaction with other SLAs that exist in the architecture; i.e., the data, QoS and outage requirements between the customer and the service provider are fulfilled, or the SLA consequence occurs. In Table 2 we identify the steps needed to decide whether or not to publish a paper in an executable form. These steps describe the interaction between two actors: the publisher and the authors. The publisher initiates this use case after the paper has been accepted for publication. Not all papers can be published as executable papers, because they are either very expensive to reproduce, need special hardware or software (intellectual property issues), or request access to private data that is not likely to be provided (privacy issues). In Table 3 we identify the steps needed to execute an executable paper. These steps describe the interaction between two actors: the publisher and the provider of the computing infrastructure. This use case is initiated by a scientist who wants to re-execute a published experiment.

Table 2. Executable paper use case steps

DESCRIPTION | Step | Action
            | 1    | Publisher asks the authors to describe the list of requirements needed to reproduce the experiment in terms of CPU, memory, input data, software and special devices, and to list all intellectual property issues
            | 2    | Publisher decides, based on the input received from the authors, whether or not to deliver the paper in an executable form
            | 3    | Publisher informs the authors that the paper is going to be published in an executable version and asks them to prepare the virtual machine
            | 4    | The authors generate a virtual machine to re-run the experiment and all the data needed for the re-execution
EXTENSIONS  | Step | Branching Action: Publisher does not accept to publish the paper in an executable form
            | 5    | Publisher publishes the paper in a static form

Table 3. Executable paper use case steps

DESCRIPTION | Step | Action
            | 1    | Scientist requests the re-execution of an experiment published in an executable paper
            | 2    | Publisher offers different ways of re-executing the experiment: fast, immediate, slow, scheduled (each with a given cost)
            | 3    | Scientist selects one way to re-execute
            | 4    | Publisher contacts the infrastructure provider and asks it to run the experiment based on the SLA established between the publisher and the infrastructure provider
EXTENSIONS  | Step | Branching Action: Scientist does not accept any of the proposed ways for re-execution
            | 5    | Publisher drops the request for execution and closes the case

6 Conclusion
We have identified a number of challenges facing the implementation of the EP concept and have classified them into three categories: technical, administrative, and intellectual property. In this positioning paper we have described one approach to address the technical challenges and identified the role of each actor involved in the lifecycle of an EP. Among other issues, we have stressed the issue of provisioning the needed infrastructure once the EP is published, and pointed out a technique that can help to solve this problem: the use of new virtualization techniques to provide a working environment for the published experiments. We discussed the feasibility of this technique and described two scenarios related to the operational aspects associated with the deployment of an executable paper service and the role of each actor throughout the executable paper lifecycle.
We believe that the publisher can play a key role in implementing the EP concept. In our proposal the publisher does not have to develop in-house the expertise to maintain the infrastructure needed to re-run published experiments. The publisher can outsource this task to providers of computing infrastructure like Grid and Cloud providers. In order to achieve a certain QoS of the EP service, the publisher has to have well-established service level management with the infrastructure providers.
However, there are open IP questions which have to be solved. Typically a license is acquired for a single copy of software running on specific hardware. For server software like Microsoft Windows Server a license is needed even for each client who uses the Microsoft server technology. With the IaaS approach it is not easy to track which software is used for what and how many times. This is a main reason why IaaS providers encourage their users to use free open-source solutions on cloud resources. In order to give the authors the legal possibility to use proprietary software and tools for their EPs, a new license strategy which combines the two approaches (IaaS and SaaS) has to be developed.

References
1. Markup Languages, http://en.wikipedia.org/wiki/Markup_language
2. Jones, J.: Tracking Software Assets on Virtual Images Gains Momentum for Soft-
ware Asset Management Professionals,
http://blogs.flexerasoftware.com/elo/2010/07/
tracking-software-assets-on-virtual-images-gains-
momentum-for-software-asset-management-professional.html
3. Basant, N.S.: Top 10 Cloud Computing Service Providers of 2009 (2009),
http://www.techno-pulse.com/2009/12/
top-cloud-computing-service-providers.html
4. Hey, B.: Cloud Computing. Communications of the ACM (51) (2008)
5. Armbrust, et al: A View of Cloud Computing. Communications of the ACM 4(53)
(2010)
6. Strijkers, R.J., Cushing, R., Vasyunin, D., Belloum, A.S.Z., de Laat, C.,
Meijer, R.J.: Toward Executable Scientific Publications. In: ICCS 2011, Singa-
pore’s Nanyang Technological University, June 1–3 (2011)
7. Limare, N., Morel, J.M.: The IPOL Initiative: Publishing and Testing Algorithms
on Line for Reproducible Research in Image Processing. In: ICCS 2011, Singapore’s
Nanyang Technological University, June 1–3 (2011)
8. Kauppinen, T.J., Mira de Espindola, G.: Linked Open Science - Communicating,
Sharing and Evaluating Data, Methods and Results for Executable Papers. In:
ICCS 2011, Singapore’s Nanyang Technological University, June 1–3 (2011)
9. McHenry, K., Ondrejcek, M., Marini, L., Kooper, R., Bajcsy, P.: Towards a Univer-
sal Viewer for Digital Content. In: ICCS 2011, Singapore’s Nanyang Technological
University, June 1–3 (2011)
10. Strijkers, R., et al.: AMOS: Using the Cloud for On-Demand Execution of e-Science
Applications. In: IEEE e-Science 2010 Conference, December 7–10 (2010)
11. Kertesz, A., et al.: An SLA-based resource virtualization approach for on-demand
service provision. In: Proceedings of the 3rd ACM International Workshop on Vir-
tualization Technologies in Distributed Computing (2009)
12. Belloum, A., Inda, M.A., Vasunin, D., Korkhov, V., Zhao, Z., Rauwerda, H., Breit,
T.M., Bubak, M., Hertzberger, L.O.: Collaborative e-Science Experiments and Sci-
entific Workflows. In: IEEE Internet Computing (August 2010)
Change Management in e-Infrastructures
to Support Service Level Agreements

Silvia Knittl 1, Thomas Schaaf 2, and Ilya Saverchenko 2

1 Ludwig-Maximilians-Universität München (LMU), Oettingenstraße 67, D-80538 Munich
knittl@mnm-team.org, http://www.mnm-team.org/
2 Leibniz Supercomputing Centre (LRZ), Boltzmannstr. 1, D-85748 Garching near Munich
{schaaf,saverchenko}@mnm-team.org

Abstract. Service Level Agreements (SLAs) are a common instrument for outlining the responsibility scope of collaborating organizations. They are indispensable for a wide range of industrial and business applications. However, until now SLAs have not received much attention from the research organizations that are cooperating to provide comprehensive and sustainable computing infrastructures, or e-Infrastructures (eIS), to support the European scientific community. Since many eIS projects have left their development state and are now offering highly mature services, the IT service management aspect becomes relevant.
In this article we concentrate on the inter-organizational change management process. At present, it is very common for eIS changes to be managed autonomously by the individual resource providers. Yet such changes can affect the overall eIS availability and thus have an impact on the SLA metrics, such as performance characteristics and quality of service. We introduce the problem field with the help of a case study. This case study outlines and compares the change management processes defined by PRACE and LRZ, which is one of the PRACE eIS partners and resource providers. Our analysis shows that each of the organizations adopts and follows a distinct and incompatible operational model. Following that, we demonstrate how the UMM, a modeling method based on UML and developed by UN/CEFACT, can be applied to the design of an inter-organizational change management process. The advantage of this approach is the ability to design both internal and inter-organizational processes with the help of uniform methods. An evaluation of the proposed technique and a conclusion end our article.

Keywords: Change Management, SLA, Maintenance, e-Infrastructures.

1 IT Service Management and e-Infrastructures


An e-Infrastructure (eIS) is an environment intended to support the scientific
community by providing digital information processing and computational tech-
nologies [14]. Examples of eIS are the pan-European network GÉANT, grids and

distributed high performance computing infrastructures [4]. The common characteristic of these eIS is that they are established as a cooperation of service providers, each delivering a fraction of the services to support a common goal.
Until recently it was uncommon for partners participating in eIS projects to endorse service provisioning contracts (Service Level Agreements (SLAs)). SLAs, however, are a common instrument for outlining the scope of responsibility of
collaborating organizations. According to the IT Infrastructure Library (ITIL)
an SLA provides means of assuring a defined level of service quality delivered by
a Service Provider [12]. To achieve this kind of warranty, it is necessary to mon-
itor and measure availability and performance of all services against the targets
defined in an SLA and to produce a corresponding service report. Information
about all relevant assets, their interrelationship, as well as the associated SLAs
are stored in a tool called Configuration Management System (CMS) [12].
Now that some of the projects have matured into established organizations offering sustainable services to the scientific community, they pay more attention to SLAs as the mechanism for guaranteeing the necessary quality of the provided services. Service characteristics commonly addressed in SLAs are assurances about specific levels of availability. To meet the desired availability levels, each participating member has to align its internal operational processes with the global eIS requirements. The challenge here relates to the fact that every participating organization is itself an autonomous entity. As a consequence it cannot be assumed that the operational processes, for instance change management, within each organization are comparable. There might be differences
in, for example, the role models, workflows or tools used.
The change management process according to ITIL defines procedures for ef-
ficient and prompt handling of all changes and thus is able to minimize service
disruptions caused by incidents. Changes associated with the maintenance of an
established eIS or the need to enhance it with new features requested by its users
are a possible cause of disruptions that result in an outage of one or more ser-
vices. Since it is common for SLAs to describe required availability parameters,
changes can affect these agreements. Therefore, there is a strong interrelationship
between the service level management and the change management processes.
The change manager, for example, needs to know the corresponding availabil-
ity parameters to be able to schedule changes in such a way as to minimize the
negative impact on the eIS availability. The service level manager is responsible
for generating service reports and thus needs information from the change man-
agement about upcoming and past changes that might affect or have affected the
availability. All relevant information is stored in the Configuration Management
System (CMS).
While ITIL gives good guidance for establishing IT service management (ITSM) within a single organization, a new approach needs to be taken for the alignment of inter-organizational ITSM. In Figure 1 we demonstrate the problem
field on an abstract level. As common in eIS projects, participating organizations
(depicted as I and Z in the figure) define and follow internal ITSM processes.
To facilitate collaboration and information exchange (depicted as A and A’)
between the organizations, an interface for inter-organizational ITSM is required. Such an interface should cover organizational, informational, functional and communicational aspects.
Fig. 1. Collaborative processes and information exchange in the style of [2]

In this paper we will give an overview of our approach to address the chal-
lenges in the area of inter-organizational change management (ioCM). Our goal
was to design an ioCM process that can be adopted by all partners of PRACE,
a persistent European eIS. Thus, our concept incorporates extensions of well-
established best practices frameworks like ITIL for inter-organizational use and
adapts collaborative standards in the modeling field. In the following section
PRACE, a European eIS project, is introduced. Before presenting the major
concept areas of the proposed management process, we will give a brief overview
of related work in the area of ioCM. For the process design we have adapted the
UN/CEFACT Modeling Methodology (UMM), developed by the UN/CEFACT
(United Nations Center for Trade Facilitation and Electronic Business) to sup-
port the development of inter-organizational business processes [17]. We conclude
with an overview of our future plans in section 5.

2 Case Study: e-Infrastructure for High Performance Computing
In this section we describe the importance of ioCM based on the example of
PRACE and LRZ. The unique characteristic of this scenario is that both or-
ganizations have independently defined their own internal change management
processes. Due to the tight integration of the LRZ and PRACE eIS effective co-
ordination of operation and administration activities is required. The challenge
is to establish a collaborative inter-organizational change management process to support the maintenance announcement task within the PRACE environment. Figure 2 depicts the principal setup of the PRACE environment.

Fig. 2. Small excerpt of the PRACE infrastructure

The Partnership for Advanced Computing in Europe (PRACE) is a Euro-


pean project aimed at deployment and operation of a persistent pan-European
research infrastructure for high performance computing. PRACE brings together
European HPC centers with the focus on the coordinated system selection and
design, coherent management of the distributed infrastructure, software deploy-
ment, optimization of applications and promotion of the state of the art appli-
cation development methodologies. The Leibniz Supercomputing Centre of the
Bavarian Academy of Sciences and Humanities (LRZ) is one of the leading Ger-
man high performance computing centers. LRZ offers a wide range of services,
including data storage, visualization facilities and high performance computing
among others, to the German universities and research institutes. LRZ is in-
volved in many national and international IT research projects, PRACE being
one of them.
The change management processes defined at LRZ are based on industry standards such as ITIL and ISO 20000. The processes define organizational
roles, specify tasks and responsibilities and outline workflows that have to be fol-
lowed for implementation of an infrastructure change. Figure 3 shows the Test,
Plan and Implement phases of the LRZ’s change management process. Other
phases, such as Approval, Authorization or Rollback, are omitted for reasons
of brevity.
Let us consider a situation in which LRZ has to upgrade one of the backbone routers. This change will have a major impact on the LRZ infrastructure and will result in a downtime of IT services offered by LRZ. As such, following the change management process, the router upgrade will be thoroughly documented and tested. The implementation date will be discussed with all affected parties and selected so as to minimize the impact on the LRZ infrastructure.

Fig. 3. Excerpt of the LRZ change management process according to [13]
The change will also affect availability of the PRACE eIS since services hosted
by LRZ, one of the partner sites, will be temporarily unavailable. However, since
LRZ and PRACE change management processes are independent, the change
will not be coordinated with the PRACE management and operational staff.
Under some circumstances, PRACE might not even be aware of the changes im-
plemented by its partners. Lack of coordination between PRACE and its part-
ner sites can, potentially, result in a large scale disruption of the PRACE eIS.
To avoid this negative impact on the eIS and facilitate information exchange
and coordination of activities the change management processes implemented at
LRZ have to be integrated with the corresponding internal processes defined by
PRACE.

3 Related Work

While many articles covering the area of change management are available, hardly any related work addresses change management in eIS. In [16] there is a discussion on inter-organizational change management in publicly funded projects.
But in this article the authors mainly focus on the sociological aspects like the
need for communication between the public and the participating project part-
ners. Aspects of inter-organizational ITSM are not considered in this paper.
Also in [11] communication is identified as one vital concern in e-Government
projects. Within their analysis the authors concentrate on the structures that
need to be established to foster changes in the sense of innovations within the domain of e-Government. For that, the key issues identified are the communication
concept, competencies of stakeholders, the ambiguity of goals and the collabo-
rative form as such. The authors believe that methods of enterprise architecture
will provide support for such kind of projects.
In [7] change management is, as well, stated to have the strongest relationship
with the inter-organizational information and communication technology. Thus
the advise is to invest in the organizational change management, i.e. dedicate
resources and change management communication. There is no recommendation
on how the inter-organizational change management process can be established.
Although hardly any article addresses ioCM processes within eIS, there are technical platforms in place that are able to implement inter-organizational workflows, which are the technical representation of specific processes, as for example in [10]. In [5] cross-organizational workflows are implemented based on contracts between the cooperating partners, while [1] concentrates on the interaction protocols between the collaboration partners and uses semantic
web technologies for their implementation. These approaches can be used as a
technical foundation for the SLA-based inter-organizational change management
process as we specify it in the next section.

4 Design of an Inter-organizational Change-Management


To develop an interface for the ioCM process within an eIS project, the infras-
tructure requirements need to be captured and the corresponding models need
to be designed. For that we adopt the UN/CEFACT Modeling Methodology (UMM), which was originally developed in the B2B environment to support international trade [17]. This method is based
on the Unified Modeling Language (UML), which has been proven to fit best for
inter-organization workflow modeling [3]. The advantage of using UMM is that
it is a standardized method that enables information exchange in a technology-
neutral, implementation-independent manner. As such, the collaborating part-
ners can share the common models independent of the locally selected imple-
mentation technology. We adapt the UMM to our scenario and demonstrate that
it is possible to address both the requirements of the local private process as well
as the global project goals by using the same methodical approach.
For implementing an interface within our scenario the following design goals
have to be met (structured according to the dimensions organization, informa-
tion, function and communication model according to [6]):
Organizational Model: The organizational model addresses the operational
and organizational structure of the organization. In case of an eIS, cooperative
structures, roles and groups are modeled. In Figure 4 the cooperating partners LRZ and PRACE and their corresponding roles, such as Change Manager or operational staff, are shown (for more details see [15]).
Fig. 4. Partner View of the inter-organizational Change Management process

Information Model: The information model contains information about configuration items (CIs) and their interrelationships. In [9] we have already described the development process of an inter-organizational information model, which we are using here as well. For integrating the Maintenance Announcement Service (MAS) into the already established local change procedures, the corresponding information model has to be extended. Figure 5 shows some parts of the resulting model. The IT-Service entity needs to be enhanced with an attribute that marks the relevance of changes to the corresponding inter-organizational services; in our case the attribute is called PRACE-relevant. If there is a change on a CI that has the setting PRACE-relevant = yes, then the change to this CI has to be announced with MAS. For this, a further CI is needed for the description of the MAS contents, as can be seen in the figure. As described above, the information system containing the details about the CIs is called the CMS; in the inter-organizational setting a corresponding inter-organizational CMS (ioCMS) has to be in place.

Fig. 5. Information Model for MAS (entities: SLA_Contract, IT_Service with the PRACE-relevant attribute, Request-for-Change_Document, Availability_Parameter, Maintenance-Announcement_Service)
Functional Model: Local processes are autonomously designed; they need to be extended by appropriate interfaces. Our analysis of the above use case resulted in the conclusion that MAS has to be integrated into the LRZ-internal step of planning activities (cf. Section 2). As part of the planning activity a decision regarding the activity's scope has to be taken: if the value of the flag is yes, i.e. PRACE-relevant = yes, the planned activity affects the PRACE e-Infrastructure and the MAS process has to be started; otherwise the local process of LRZ is followed. The main outcome of these planning activities is an updated Forward Schedule of Changes (FSC). Figure 6 outlines a small excerpt of this integration. In case the MAS process is launched, the corresponding communication mechanisms have to be in place as described below.
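The decision taken in the planning step can be expressed in a few lines. The sketch below is only an illustration: it reuses attribute names from the information model in Figure 5 (IT-Service-Name, PRACE-relevant, Change-Identification, Change-Category), while the mas_announce hook and the surrounding function are hypothetical and not part of the LRZ or PRACE tooling.

# Illustrative sketch of the planning-step decision: if the CI touched by a
# change is flagged as PRACE-relevant, a maintenance announcement is built and
# handed to the MAS; in both cases the local LRZ change process continues.
def plan_change(rfc, ci, mas_announce, local_process):
    if ci.get("PRACE-relevant") == "yes":
        mas_announce({
            "changeId": rfc["Change-Identification"],
            "service":  ci["IT-Service-Name"],
            "category": rfc.get("Change-Category", "unspecified"),
        })
    return local_process(rfc)   # the local process updates the FSC

# Example usage with stub callbacks
ci = {"IT-Service-Name": "backbone-router", "PRACE-relevant": "yes"}
rfc = {"Change-Identification": "RfC-2011-117", "Change-Category": "major"}
print(plan_change(rfc, ci, mas_announce=print, local_process=lambda r: "FSC updated"))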

Fig. 6. Process Area: Maintenance announcement integrated in local procedures

Communication Model: The goal of the first integration stage is to enable the propagation of change announcements from a PRACE execution site to the global PRACE service platform. For that, messages have to be sent from the site's information system to the global PRACE platform via a push mechanism. Since PRACE does not have an adequate ioCMS in operation at the moment, the initial MAS implementation will be based on web services. The MAS framework and data exchange interfaces will be deployed at LRZ. The corresponding technical specification will be shared with all PRACE execution sites in order to evaluate the framework in their environments. With a growing maturity of the inter-organizational ITSM process, we expect that bi- and multilateral communication channels supporting complex interactions between various stakeholders will also become necessary in the future, for instance if PRACE needs to coordinate maintenance tasks across multiple partner sites. Therefore, we will further evaluate the inter-organizational workflow systems discussed in Section 3 for application in our scenario.
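As a rough illustration of such a push mechanism (the endpoint URL and the message fields are placeholders, not the actual PRACE interface), a site could post a maintenance announcement to the global platform as follows:

import json
import urllib.request

# Illustrative sketch of the push mechanism: the site's information system posts
# a maintenance announcement to the global PRACE service platform via a web
# service. URL and message layout are placeholders.
def push_announcement(announcement, endpoint="https://example.org/prace/mas"):
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(announcement).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # success if the platform accepted the announcement

announcement = {
    "site": "LRZ",
    "service": "HPC system",
    "outageStart": "2011-09-02T06:00:00Z",
    "outageEnd": "2011-09-02T10:00:00Z",
    "reason": "backbone router upgrade",
}
# push_announcement(announcement)  # not executed here; the endpoint is a placeholder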

5 Summary
In this article we have presented a framework for inter-organizational change
management and described an application scenario based on an international
e-Infrastructure (eIS) project. The goal of change management is to establish
mechanisms for coordination of activities for maintenance of existing and imple-
mentation of new services in an eIS. Change management provides means for
exchange of information about planned, ongoing and completed changes that
affect availability of eIS components and thus is essential for successful Service
Level Management (SLM). In the majority of eIS providing services to the sci-
entific community the areas of SLM and change management still receive very
little attention. However, since eIS projects are becoming mature in their service
offering, the overall ITSM needs to be professionalized.
To address the challenge we have applied standards, both in the modeling and ITSM fields, to our problem domain. The selected standards include the UMM modeling method, originally developed for the B2B environment and adapted to inter-organizational provider networks, and the ITIL process framework. This methodology has a number of advantages. International, well-established standards can be applied to the design of both the intra- and inter-organizational ITSM processes. The models that result from this approach can be easily shared and applied by all partners within an eIS, which we will demonstrate in the future by implementing a model repository accessible to all eIS partners. Having defined the design concepts, we are going to implement them in the PRACE environment described in our case study. In the following stages of our work we intend to implement our framework in other eIS projects we are involved in. Within this article we have focused on the operational process of change management. Even though, at present, not every collaborating partner within the eIS project has implemented basic ITIL processes, we think that there is a high potential for standardization, which we will present in [8] based on an analysis of the ITIL adoption rate of three different eIS projects.

Acknowledgement. The authors thank the members of the Munich Network


Management Team for valuable comments on previous versions of this paper. The
MNM-Team, directed by Prof. Hegering and Prof. Kranzlmüller, is a group of
researchers from the University of Munich, the Technische Universität München,
the German Federal Armed Forces University in Munich, and the Leibniz Su-
percomputing Centre of the Bavarian Academy of Sciences and Humanities. The
web server of the MNM Team is located at http://www.mnm-team.org/.

References
1. Andonoff, E., Bouaziz, W., Hanachi, C.: Protocol management systems as a
middleware for inter-organizational workflow coordination. IJCSA 4(2), 23–41
(2007)
2. Dietrich, J.: Nutzung von Modellierungssprachen und -methodologien stan-


dardisierter B2B-Architekturen für die Integration unternehmensinterner
Geschäftsprozesse. Ph.D. thesis, Technische Universität Berlin (2007)
3. Dussart, A., Aubert, B.A., Patry, M.: An Evaluation of Inter-Organizational Work-
flow Modelling Formalisms. Journal of Database Management (JDM) 15(2), 74–104
(2004)
4. e-sciencetalk: Mapping the e-infrastructure landscape (November 2010),
http://www.e-sciencetalk.org/briefings/
EST-Briefing-15-Landscape-Newt.pdf
5. Grefen, P.W.P.J., Aberer, K., Hoffner, Y., Ludwig, H.: CrossFlow: Cross-
Organizational Workflow Management in Dynamic Virtual Enterprises. Tech. Rep.
CTIT Technical Report 00-05, University of Twente (2000)
6. Hegering, H.G., Abeck, S., Neumair, B.: Integrated Management of Networked
Systems - Concepts, Architectures and their Operational Application. Morgan
Kaufmann (1999)
7. Kallioranta, S.M., Vlosky, R.P.: Inter-organizational information and communi-
cation technology adoption in the business-to-business inferface. Tech. Rep. 84,
Louisiana State University Agricultural Center (September 2008),
http://www.lsuagcenter.com/NR/rdonlyres/
8F013DDA-9114-47C5-8BAC-71CEEA6BE0BF/53401/wp84.pdf
8. Knittl, S., Beronov, K.: E-infrastructure projects from a bavarian perspective: Po-
tentials of standardization. In: eChallenges 2011, Florence, Italy (2011)
9. Knittl, S., Brenner, M.: Towards a configuration management system for hybrid
cloud deployments. In: 6th IFIP/IEEE International Workshop on Business-driven
IT Management, BDIM 2011 (June 2011)
10. Meng, J., Su, S.Y.W., Lam, H., Helal, A., Xian, J., Liu, X., Yang, S.: DynaFlow: a
dynamic inter-organisational workflow management system. International Journal
of Business Process Integration and Management 1(2), 101–115 (2006)
11. Nilsson, A.: Management of technochange in an interorganizational e-government
project. In: 41st Hawaii International International Conference on Systems Science
(HICSS-41 2008), Waikoloa, Big Island, HI, USA, p. 209 (2008)
12. (OGC): ITIL V3 complete suite - Lifecycle Publication Suite. The Stationery Office
Ltd. (2007)
13. Schaaf, T., Hartmannstruber, N.: Change Management Prozessbeschreibung. LRZ-
internal Process Documentation, Version 0.8 (October 2010)
14. e-SciDR: Toward a European e-infrastructure for e-science digital repositories -
a report for the European Commission. Tech. Rep. 2006 S88-092641, The Digital
Archiving Consultancy Limited, Twickenham (2008),
http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/e-scidr.pdf
15. Somborn, M.: Entwicklung eines Informationsmodelles als Grundlage für den
PRACE Maintenance Announcement Service. Master’s thesis, Technische Univer-
sität München (2011)
16. Sutanto, J., Kankanhalli, A., Tay, J., Raman, K.S., Tan, B.C.Y.: Change Man-
agement in Interorganizational Systems for the Public. Journal of Management
Information Systems 25(3), 133–175 (2008)
17. UN/CEFACT: Un/cefact’s modeling methodology (umm): Umm meta model -
foundation module version 1.0. Online,
http://www.unece.org/cefact/umm/UMM_Foundation_Module.pdf
PROPER 2011: Fourth Workshop
on Productivity and Performance Tools for HPC
Application Development

Michael Gerndt

Technische Universität München, Fakultät für Informatik I10, Boltzmannstr. 3, 85748 Garching, Germany

Foreword
The PROPER workshop addresses the need for productivity and performance
in high performance computing. Productivity is an important objective during
the development phase of HPC applications and their later production phase.
Paying attention to the performance is important to achieve efficient usage of
HPC machines. At the same time it is needed for scalability, which is crucial in
two ways: Firstly, to use higher degrees of parallelism to reduce the wall clock
time. And secondly, to cope with the next bigger problem, which requires more
CPUs, memory, etc. to be able to compute it at all.
Tool support for the user is essential for productivity and performance. There-
fore, the workshop covers tools and approaches for parallel program development
and analysis, debugging and correctness checking, and for performance measure-
ment and evaluation. Furthermore, it provides an opportunity to report success-
ful optimization strategies with respect to scalability and performance.
This year’s contributions reflect this spectrum nicely. The invited presentation
by Mitsuhisa Sato about Challenges of programming environment and tools for
peta-scale computers (programming environment researches for the K computer)
takes place during the first session ”Programming Interfaces”, chaired by Felix
Wolf. The second session is about ”Performance Analysis Tools” and guided by
Michael Gerndt. The topic of the last session is ”Performance Tuning” and the
chair is Allen Malony.
We would like to thank all the authors for their very interesting contributions and their presentations during the workshop. In addition, we thank all the reviewers for reading and evaluating all the submitted papers. Furthermore, we would like to thank the Euro-Par 2011 organizers for their support and for the chance to offer the PROPER workshop in conjunction with this attractive conference. We are most grateful for all the administrative work
of Petra Piochacz. Without her help the workshop would not have been possible.
The PROPER workshop was initiated and is supported by the Virtual In-
stitute - High Productivity Supercomputing (VI-HPS), an initiative to promote
the development and integration of HPC programming tools.

September 2011
Michael Gerndt, Workshop Chair
Scout: A Source-to-Source Transformator
for SIMD-Optimizations

Olaf Krzikalla, Kim Feldhoff, Ralph Müller-Pfefferkorn, and Wolfgang E. Nagel

Technische Universität Dresden, Germany
{olaf.krzikalla,kim.feldhoff,ralph.mueller-pfefferkorn,wolfgang.nagel}@tu-dresden.de

Abstract. We present Scout, a configurable source-to-source transformation tool designed to automatically vectorize C source code. Scout provides the means to vectorize loops using SIMD instructions at source level. Our main focus during the development of Scout is maximum flexibility of the tool in two ways: being capable of vectorizing a wide range of loop constructs and being capable of targeting various modern SIMD architectures. Scout supports several SIMD instruction sets like SSE or AVX and is easily extensible to upcoming ones.
In the second part of the paper we present results of applying Scout's vectorizing capabilities to CFD production codes of the German Aerospace Center. The complex loops used in these codes often inhibit automatic vectorization by ordinary C compilers. In contrast, Scout is able to vectorize most of these loops. We measured the resulting speedup on SSE and AVX platforms.

1 Introduction

Most modern CPUs provide SIMD units in order to support data-level paral-
lelism. One important method of using that kind of parallelism is the vectoriza-
tion of loops. However, programming using SIMD instructions is not a simple
task. SIMD instructions are assembly-like low-level intrinsics and often steps like
finalization computations after a vectorized loop become necessary. Thus tools
are needed in order to efficiently exploit the data-level parallelism provided by
modern CPUs.
In the context of the HI-CFD project [4] we needed a means to comfortably
vectorize loops written in C. We are going to target various HPC platforms with
different instruction sets and different available compilers.

2 Related Tools

Naturally, a vectorization tool is best built into a compiler. Indeed, all current C compilers provide auto-vectorization units. But a compiler must reason about
the correctness of the vectorized program automatically. This reasoning can be
done by an extensive dependency and aliasing analysis and a lot of approaches

are available to vectorize various forms of codes, especially loops [7]. However in
practice it is not possible to always reason about the absence of dependencies
(e.g. in a loop with indirect indexing). Thus means are needed in order to provide
meta information about a particular piece of code. For instance the Intel compiler
allows a programmer to augment loop statements with pragmas to designate the
absence of inner-loop dependencies.
We have tested some compilers with respect to their auto-vectorization ca-
pabilities. For some loops in our codes the available means to provide meta in-
formation were insufficient (see Sect. 3.3). Sometimes subtle issues arose around
compiler-generated vectorization. For instance in one case a compiler suddenly
rejected the vectorization of a particular loop just when we changed the type of
the loop index variable from unsigned int to signed int. A compiler expert
can often reason about such subtleties and can even dig in the documentation
for a solution. But an application programmer normally concentrates on the
algorithms and cannot put too much effort in the peculiarities of each used com-
piler. The vectorization of certain (often more complex) loops was rejected by all compilers regardless of inserted pragmas, given command-line options, and so on.
We have checked other tools specifically targeting loop vectorization. In [6]
a retargetable back-end extension of a compiler generation system is described.
Being retargetable is an interesting property (see also Sect. 3.2) but for our
project it did not come into consideration due to its tight coupling to a particular
compiler system. SWARP [9] seems to depend solely on a dependency analysis
– something we could not rely on.

3 The Vectorizing Source-to-Source Transformator Scout


We decided to develop a new tool in order to comfortably exploit the parallel
SIMD units. The tool shall transform C source code. The output is also C source
code, but with vectorized loops augmented by SIMD intrinsics. The respective
SIMD instruction set is configurable. Thus the tool is usable as a universal vectorizer, and it is aimed to become an industrial-strength vectorizing preprocessor.
We have called this vectorization tool Scout.
Scout exposes a command line interface as well as a graphical user interface. Internally it uses the clang parser [1] to transform C source code to
an abstract syntax tree (AST). The vectorization and other optimizations are
then performed on that AST. Eventually the transformed AST is rewritten to
C code. Scout is published under an Open Source license and available via
http://scout.zih.tu-dresden.de
We have opted for a strict semi-automatic vectorization. That is, as with
compilers, the programmer has to annotate the loops to be vectorized with
#pragma directives. The directive #pragma scout loop vectorize in front of a
for statement triggers the vectorization of that loop. Before the actual vector-
ization starts, the loop body is simplified by function inlining, loop unswitching
(moving loop-invariant conditionals inside a loop outside of it [3]) and unrolling
of inner loops wherever possible. The resulting loop body is then vectorized using the unroll-and-jam technique.
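To illustrate the loop unswitching step mentioned above, the following hand-written sketch (not Scout output; all names are invented for illustration) moves a loop-invariant conditional out of a loop:

    /* before unswitching: use_offset is loop-invariant */
    void copy_maybe_offset(float *a, const float *b, int n, int use_offset, float off) {
        for (int i = 0; i < n; ++i) {
            if (use_offset) a[i] = b[i] + off;
            else            a[i] = b[i];
        }
    }

    /* after unswitching: the conditional is hoisted, leaving two plain loops */
    void copy_maybe_offset_unswitched(float *a, const float *b, int n, int use_offset, float off) {
        if (use_offset)
            for (int i = 0; i < n; ++i) a[i] = b[i] + off;
        else
            for (int i = 0; i < n; ++i) a[i] = b[i];
    }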

3.1 Unroll-and-Jam
Various approaches to vectorizing loops exist. Traditional loop-based vectorization transforms a loop so that every statement processes a possibly variable-length vector [5]. With the advent of the so-called multimedia extensions in commodity processors the unroll-and-jam approach became more important [8]. In [7] this approach is described mainly as a means to resolve inner-loop dependencies. However, we use this approach in a more general way. First, we partially unroll each statement in the loop according to the vector size. Then we test whether the unrolled statements can be merged into a vectorized statement. Unvectorizable statements (e.g., if-statements including their bodies) remain unrolled; only their memory references to vectorized variables are adjusted accordingly. All other statements are vectorized by decomposing them into vectorizable expressions. Scout allows the user to vectorize arbitrarily complex expressions (see Sect. 3.2).
A nice consequence of using the unroll-and-jam approach is the possibility to vectorize different data types (e.g. float and double) in one loop simultaneously. The vector sizes of the vectorized data types may differ, but the largest vector size has to be a multiple of all other used vector sizes. The loop is then unrolled according to that largest vector size; vectorizable statements of other data types are then only partially merged and remain partially unrolled.
Listing 1 demonstrates the vectorization of different data types for an SSE platform. The vector size for float is 4 and for double it is 2. Hence the loop is unrolled four times. Then all operations for float values can be merged together (in the example only the load/store operations). In contrast, only two consecutive unrolled operations on double values (one load and the division) are merged into a vectorized operation, leaving the double operations partially unrolled. Vectorized conversion operations are generated automatically whenever needed.

    /* original loop */
    float a[100]; double b[100];
    double x;
    #pragma scout loop vectorize
    for (int i = 0; i < 100; ++i) {
        x = a[i];
        x = x / b[i];
        a[i] = x;
    }

    /* vectorized loop generated by Scout (SSE) */
    float a[100]; double b[100];
    __m128 av;
    __m128d xv1, xv2, bv1, bv2;
    for (int i = 0; i < 100; i += 4) {
        av  = _mm_loadu_ps(a + i);
        xv1 = _mm_cvtps_pd(av);
        xv2 = _mm_cvtps_pd(_mm_movehl_ps(av, av));
        bv1 = _mm_loadu_pd(b + i);
        bv2 = _mm_loadu_pd(b + i + 2);
        xv1 = _mm_div_pd(xv1, bv1);
        xv2 = _mm_div_pd(xv2, bv2);
        av  = _mm_movelh_ps(_mm_cvtpd_ps(xv1), _mm_cvtpd_ps(xv2));
        _mm_storeu_ps(a + i, av);
    }

Listing 1. Mixing types in vectorization



3.2 Configuring Scout


A central requirement is the configurability of Scout with respect to existing as
well as upcoming SIMD architectures. This aspect is controlled by supplying a
configuration file to Scout which describes the properties of the target SIMD
platform. A configuration file is written in pure C++. Actually C++ is not
designed as a configuration language and thus we had to stretch the semantics.
However, the choice of C++ has a lot of advantages: it is not necessary to learn yet another configuration syntax, it is possible to use the usual preprocessing means (conditional compilation, includes) in the configuration, thus reducing maintenance costs, and the AST of the configuration file can be generated
and processed by clang. The actual intrinsics are wrapped up in string literals
making the configuration valid C++ even if the headers for the actual target
SIMD platform are not available on the translation machine. Listing 2 shows
an excerpt of a configuration file for the data type float targeting the SSE
architecture.

    namespace scout {

    template <class, unsigned> struct config;

    template <>
    struct config<float, 4> {
        typedef __m128 type;            // target SIMD type
        enum { align = 16 };            // alignment requirement

        static void store_aligned(float*, type) {   // function name
            "_mm_store_ps(%1%, %2%)";                // predefined by Scout
        }

        static float add(float a, float b) {         // expression mapping
            a + b;                                   // statement is an expression
            "_mm_add_ps(%1%, %2%)";
        }

        static float condition_lt(float a, float b, float c, float d) {
            a < b ? c : d;
            "_mm_blendv_ps(%3%, %4%, _mm_cmplt_ps(%1%, %2%))";
        }

        static float sqrt(float) {                   // function mapping
            float sqrtf(float);                      // statement is a function declaration
            "_mm_sqrt_ps(%1%)";
        }
    };

    } // namespace scout

Listing 2. Scout configuration for a typical SIMD architecture

For each supported data type the configuration provides a specialized class
template named config placed in the namespace scout. The first template

parameter denotes the underlying base type of the particular vector instruction
set. The second integral template parameter denotes the vector size of that set.
A set of predefined type names, value names, and static member functions is expected as class members of the specialization.
There are two general kinds of static member functions. If the function name
is predefined by Scout, then the function body consists of only one statement –
the string literal denoting the intrinsic. Load and store operations are defined in
this way.
If the function name of the static member functions is not predefined, then
the string literal in the function body is preceded by an arbitrary number of
expressions and/or function declarations. In that case expressions and function
calls in the original source code are matched against these configuration expres-
sions and functions and are vectorized according to the string literal if they fit.
This option adds great flexibility to Scout. Indeed, it is not only possible to use the various instruction sets in their atomic form but also to combine them a priori into more complex or idiomatic expressions.
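As a hypothetical illustration of such an idiomatic mapping (this entry is not part of the configuration excerpt in Listing 2), a minimum expression could be mapped directly to the corresponding SSE intrinsic inside config<float, 4>:

    static float min(float a, float b) {   // idiomatic expression mapping
        a < b ? a : b;                     // pattern to be matched in the source
        "_mm_min_ps(%1%, %2%)";            // single intrinsic replacing the whole expression
    }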
Listing 3 demonstrates the vectorization capabilities of Scout by using the condition_lt and sqrt functions of Listing 2.

    /* original loop */
    float a[100], b[100];
    float x;
    #pragma scout loop vectorize
    for (int i = 0; i < 100; ++i) {
        x = a[i] < 0 ? b[i] : a[i];
        a[i] = sqrtf(x);
    }

    /* vectorized loop generated by Scout (SSE) */
    float a[100], b[100];
    __m128 a_v, b_v, x_v, c0_v;
    c0_v = _mm_set1_ps(0.0);
    for (int i = 0; i < 100; i += 4) {
        a_v = _mm_loadu_ps(a + i);
        b_v = _mm_loadu_ps(b + i);
        x_v = _mm_blendv_ps(b_v, a_v, _mm_cmplt_ps(a_v, c0_v));
        x_v = _mm_sqrt_ps(x_v);
        _mm_storeu_ps(a + i, x_v);
    }

Listing 3. Vectorization of complex expressions and function calls

    /* original loop */
    double a[100], c[100];
    int d[100];
    #pragma scout loop vectorize
    for (int i = 0; i < 100; ++i) {
        int j = d[i];
        double b = a[j];
        // computations
        // introduces an inner-loop
        // dependency if d[i] == d[i+1]:
        #pragma scout vectorize unroll
        c[j] += b;
    }

    /* partially vectorized loop generated by Scout (SSE) */
    __m128d b_v;
    int j_v[2];
    double a[100], c[100];
    int d[100];
    for (int i = 0; i < 100; i += 2) {
        j_v[0] = d[i];
        j_v[1] = d[i + 1];
        b_v = _mm_set_pd(a[j_v[0]], a[j_v[1]]);
        // vectorized computations
        // compute every element separately:
        c[j_v[0]] += _mm_extract_pd(b_v, 0);
        c[j_v[1]] += _mm_extract_pd(b_v, 1);
    }

Listing 4. Partial vectorization



3.3 Partial Vectorization

Most loops in our codes follow very basic schemes: they read data from several arrays, do some heavy calculations and then either write or accumulate the result in a different array. Hence, under the reasonable assumption that there are no pointer aliasing issues, pure writes normally do not introduce any dependencies. Accumulation operations, however, involve a read and a write operation to the same memory location and can therefore introduce dependencies, especially if indirect indexing is involved. Such dependencies could prevent whole loops from being vectorized. But actually most of the calculation can be performed in parallel; just the accumulation process itself needs to remain serial. Thus we introduced a pragma directive forcing a statement to compute each vector element separately (Listing 4).

4 Practical Results

Besides the usual test cases we have applied Scout to two different CFD production codes used in the German Aerospace Center. Both codes are written in C using the usual array-of-structures approach. That approach is rather unfriendly with respect to vectorization, because vector load and store operations have to be composed from scalar accesses. Nevertheless we did not change the data layout but used the source code as is, only augmented with the necessary Scout pragmas. The presented measurements were mainly done on an Intel Core 2 Duo P8600 processor with a clock rate of 2.4 GHz, operating under Windows 7 and using the Intel compiler version 11.1. The AVX measurements were done on an Intel Sandy Bridge processor, using the Intel compiler version 12.
The first code computes interior flows in order to simulate the behavior of jet turbines. In the loops direct indexing is used, meaning that array indices are linearly transformed loop indices. We have split the code into four computation kernels and present the split results in Fig. 1 for a better understanding of the overall and detailed speedup. It shows typical speedup factors of the vectorized kernels produced by Scout compared to the originals.
As expected, we gain more speedup with more vector lanes, since more computations can be executed in parallel. Kernel 2 even outperforms its theoretical maximum speedup, which is a result of the other transformations (in particular function inlining) performed implicitly by Scout.
Table 1 shows the effects of AVX on the performance of a complete run. The
first row shows the average time of one run including the computation kernels
and some framework activity. Naturally, this measurement method reduces the
overall speedup gained due to the vectorization but leads to very realistic results. After all, the application of Scout reduces the runtime automatically by
about 10%. We expected a much better speedup by stepping up from SSE4 to
AVX because the vector register size has doubled on AVX.
However, the additional gains were rather negligible. The second row shows
the main reason for this behavior. The CPI metric (Clockticks per Instructions

[Fig. 1: two plots of speedup (y-axis) over problem size 30-48 (x-axis) for kernel1, kernel2, kernel3, kernel4, and the total.]

Fig. 1. Speedup of CFD kernels on Intel Core 2 Duo due to the vectorization (left side: single precision, four vector lanes; right side: double precision, two vector lanes)

Retired) is an indication of how much latency affected the execution. Higher CPI values mean more latency. In our case the latency is caused mainly by cache misses. This comes as no surprise, because with a doubled vector size a doubled amount of data gets pumped through the processor during one loop iteration. Even though this effect is well documented [2], a CPI value of 2.0 still means there is a lot of room for improvement. In Sect. 6 we outline a possible approach to address that issue.

Table 1. Effects of Scout on a CFD production code on Intel Sandy Bridge

                        Intel 12   Intel 12   Scout + Intel 12   Scout + Intel 12
                        SSE4       AVX        SSE4               AVX
    avg. runtime [sec]  6.31       6.32       5.70               5.65
    CPI                 0.88       0.88       1.34               2.00

The second CFD code computes flows around an airplane. Unlike the other code it works on unstructured grids. That is, the loops mostly use indirect indexing to access array data elements. Most loops in that kernel could only be partially vectorized (see Sect. 3.3). Nevertheless we could achieve some speedup, as shown in Table 2. We had two different grids at our disposal as input data. First we vectorized the original code. However, the gained speedup of about 1.1

Table 2. Speedup of a partially vectorized CFD kernel on Intel Core 2 Duo (double precision, two vector lanes)

    Relation   original to   merged to               original to
               vectorized    merged and vectorized   merged and vectorized
    Grid 1     1.070         1.391                   1.489
    Grid 2     1.075         1.381                   1.484

was not satisfying. Then we merged some loops inside the kernel to remove the repeated traversal of the indirect data structures. This made the code more compute-bound and resulted in a much better acceleration of about 1.4 due to the vectorization alone. Eventually the overall speedup was nearly 1.5.

5 Summary and Conclusion


Traditionally, auto-vectorization is considered to be a compiler feature. However,
for various reasons compilers fail to vectorize a wide range of loops. Thus we have
introduced Scout in order to better exploit existing SIMD features supplied by
most modern processors. By using the unroll-and-jam approach we were able
to extend loop vectorization by some new and unique features and capabilities.
We are not aware of a compiler or another vectorization tool which provides the
means for a partial vectorization (Sect. 3.3). In addition, at least all compilers
available to us refused to vectorize loops with mixed data types (Sect. 3.1). Most
compilers have the capability to detect and vectorize common expressions into idiomatic vectorized counterparts. However, these capabilities are mostly hidden in the code of the compiler and cannot be configured by the user. In contrast, the configuration framework of Scout provides a great means to vectorize code for various target platforms, even for user-specific ones (Sect. 3.2).
Section 4 presents the use of vectorization technology from a practitioner's point of view. It is worth mentioning that by just augmenting the source code with pragmas and using Scout we could always achieve considerable speedups. Of course, further hand-tuning of the code may lead to even better results (Table 2). However, we emphasize that Scout is nowadays used nearly transparently in the software production process of the German Aerospace Center in order to speed up their codes automatically.

6 Future Work
While the acceleration presented in this paper was already rather good, it was not as high as the number of available vector lanes would suggest. Of course, Amdahl's law plays a rather large role in our results. We did not change the data layout and thus had to live with composite load and store operations. That in turn leads to a smaller parallel portion of code and hence less speedup.
But the presented AVX results, especially the rise of the CPI value, indicate memory accesses as another major obstacle to performant SIMD code. Actually, compute-bound code often becomes memory-bound due to vectorization. Of course, the cache pressure can be reduced by a carefully hand-crafted data layout. But the cache size is a hard limit, and even hand-crafting is sometimes not worth the rather huge effort. Thus, in order to regain a load balance between memory and computation, we will explore the energy-saving possibilities of memory-bound computations.

Our approach combines Scout and a performance event governor (pegov) [10]. pegov increases a CPU's p-state, thereby reducing its frequency and voltage, during the execution of memory-bound code. As presented in [10], this can lead to substantial energy savings. Thus, Scout first makes the code faster, but also increases the memory burden. Even though memory-bound regions are rarely sped up by vectorization, one can increase the performance metric "energy efficiency" by using a performance-aware governor like pegov. We expect that the combination of these approaches can produce faster and more energy-efficient code automatically.

Acknowledgments. This work has been funded by the German Federal Min-
istry of Education and Research within the national research project HI-CFD (01
IH 08012 C) [4].

References
1. clang: a C language family frontend for LLVM, http://clang.llvm.org (visited
on March 26, 2010)
2. Intel VTune Performance Analyzer Basics: What is CPI and how do I use it?
http://software.intel.com/en-us/articles/intel-vtune-performance-analyzer-basics-what-is-cpi-and-how-do-i-use-it/ (visited on June 6, 2011)
3. Loop unswitching, http://en.wikipedia.org/wiki/Loop_unswitching (visited
on July 19, 2011)
4. HICFD - Highly Efficient Implementation of CFD Codes for HPC Many-Core
Architectures (2009), http://www.hicfd.de (visited on March 26, 2010)
5. Allen, R., Kennedy, K.: Automatic translation of fortran programs to vector form.
ACM Trans. Program. Lang. Syst. 9, 491–542 (1987),
http://doi.acm.org/10.1145/29873.29875
6. Hohenauer, M., Engel, F., Leupers, R., Ascheid, G., Meyr, H.: A SIMD optimiza-
tion framework for retargetable compilers. ACM Trans. Archit. Code Optim. 6(1),
1–27 (2009)
7. Kennedy, K., Allen, J.R.: Optimizing compilers for modern architectures: a
dependence-based approach. Morgan Kaufmann Publishers Inc., San Francisco
(2002)
8. Larsen, S., Amarasinghe, S.: Exploiting superword level parallelism with multi-
media instruction sets. In: Proceedings of the ACM SIGPLAN 2000 Conference
on Programming Language Design and Implementation, PLDI 2000, pp. 145–156.
ACM, New York (2000), http://doi.acm.org/10.1145/349299.349320
9. Pokam, G., Bihan, S., Simonnet, J., Bodin, F.: SWARP: a retargetable preprocessor
for multimedia instructions. Concurr. Comput.: Pract. Exper. 16(2-3), 303–318
(2004)
10. Schöne, R., Hackenberg, D.: On-line analysis of hardware performance events for
workload characterization and processor frequency scaling decisions. In: Proceed-
ing of the Second Joint WOSP/SIPEW International Conference on Performance
Engineering, ICPE 2011, pp. 481–486. ACM, New York (2011),
http://doi.acm.org/10.1145/1958746.1958819
Scalable Automatic Performance Analysis
on IBM BlueGene/P Systems

Yury Oleynik and Michael Gerndt

Technische Universität München


Fakultät für Informatik I10, Boltzmannstr. 3, 85748 Garching, Germany

Abstract. Scientific endeavors are becoming more and more hungry for the computational power of state-of-the-art supercomputers. However, the current trend of increasing performance comes along with a tremendous increase in power consumption. One approach to overcoming this issue is the tight coupling of simplified low-frequency cores into a massively parallel system, such as the IBM BlueGene/P (BG/P), which combines hundreds of thousands of cores. In addition to a revolutionary system design, this scale requires new approaches to application development and performance tuning. In this paper we present a new scalable, BG/P-tailored design for the automatic performance analysis tool Periscope. The new design for porting Periscope to BG/P features optimal system utilization, minimal monitoring intrusion and high scalability.

Keywords: Performance analysis, Scalability of Applications & Tools,


Supercomputers.

1 Introduction

Traditional supercomputer design, which relies on high single-core performance delivered by high clock frequencies, has a natural scalability limit stemming from unaffordable power consumption and cooling requirements. The BlueGene [1] developers addressed this challenge from two directions: by utilizing moderate-frequency cores and by tightly coupling them at unprecedented scales, which allows power consumption to grow linearly with the number of cores. This leads to a high-density, low-power, massively parallel system design.
Unfortunately, the peak performance offered by modern supercomputers cannot be achieved by straightforward application porting; one has to invest significant effort to achieve reasonable execution efficiency. In order to make this effort affordable, new instruments supporting application development have to be developed. This is especially true for performance analysis tools. On the one hand, the performance analysis results of small runs often cannot be extrapolated to the desired number of cores due to new performance phenomena
This work is partially funded by BMBF under the ISAR project, grant 01IH08005A
and the SILC project, grant 6.


manifesting themselves only at large scales. On the other hand, the amount of raw performance data which has to be recorded for large real-world applications running on hundreds of thousands of cores is simply too big for commodity evaluation approaches. The way performance analysis is done has to be rethought, as do other aspects of extremely parallel computing. Among the challenges to be overcome are efficient recording, storing, analysis and visualization of the discovered results.
Periscope [4], being an automatic distributed on-line performance analysis tool, addresses the challenges of large scale performance analysis from multiple angles. The distributed architecture of Periscope allows it to scale together with the application by relying on multiple agents. In addition, on-line analysis of the profile-based raw performance data significantly reduces memory requirements. However, even then the amount of performance data collected for a large scale run is big enough to overwhelm the user with too much information. Periscope addresses this issue in two ways. First, the automatic search for performance inefficiencies dramatically decreases the amount of presented results by reporting only important potential tuning opportunities. Second, a scalable reduction based on clustering algorithms keeps the amount of reported results constant independently of the growing number of cores.
Historically, the development of Periscope was based on the architecture of commodity clusters, where the maximum scalability levels were considered to be in the order of tens of thousands of cores running standard Unix-like kernels. Therefore, porting Periscope to a new cluster was a matter of minor adjustments, while the overall architecture was preserved. However, with the introduction of BlueGene/P systems it was realized that the straightforward porting approaches would not work. The two main reasons for that were an order-of-magnitude increase in the number of cores and the limited operating system functionality. In order to adapt Periscope to the challenges posed by BlueGene/P, significant improvements to the tool's architecture were developed.
The rest of the paper is organized as follows: first we describe the architecture specifics of BG/P as well as the analysis model and architecture of Periscope. From the cross-analysis we derive three promising approaches, of which one was implemented and is discussed in more detail. Alternative tools are discussed in the related work section. In the evaluation section we apply Periscope to the NAS Parallel Benchmark running with 64k processors to demonstrate the achieved scalability levels.

2 BlueGene/P Architecture
The BlueGene/P [1] base component is a PowerPC 450 quad-core 32-bit microprocessor with a frequency of 850 MHz. One quad-core chip together with 2 or 4 GB of shared memory forms the next building block of BlueGene, a compute node ASIC. The compute nodes run under the IBM proprietary light-weight Compute Node Kernel (CNK) and are dedicated to running exclusively MPI/hybrid applications. CNK is, on the one hand, stripped down in order to minimize the system overheads when executing an application and, on the other hand, appears

to the application programmer as a Linux-like operating system, supporting the majority of the system calls. However, not all system calls are executed by CNK; instead, this functionality is forwarded to the dedicated I/O nodes. Due to the low OS jitter and a single application process per core, applications show excellent performance reproducibility.
I/O nodes are service nodes and are not intended to run user applications. Their hardware is identical to that of the compute nodes, with the only difference being the physical placement within the system, which leads to different network connections. I/O nodes use 10 Gigabit Ethernet to connect to the BlueGene/P frontend. They operate under a "standard" Linux kernel in a four-way SMP mode, providing file system access and socket communication. Each I/O node runs one Control and I/O Daemon (CIOD) which executes the function-shipped system call requests coming from the back-end compute nodes. Applications are not allowed to run on the I/O nodes; however, upon request, one tool process is allowed to be spawned by the CIOD [2]. This is a typical mechanism employed by debugging tools to control application execution [3]. The tool process normally gets started through the additional -start_tool argument of mpirun. It is also important to mention that an I/O node and all its associated compute nodes share the same network address.
32 compute nodes together with 0, 1, or 2 I/O nodes form the next building block, called a compute card. 16 compute cards in turn form one midplane. One BlueGene/P rack consists of 2 midplanes or 1024 compute nodes. 72 BlueGene/P racks are needed to achieve 1 PFlop of peak performance.
A distinguishing feature of BlueGene/P is its multiple high-speed networks coupling the numerous compute nodes. The most appreciated one is the torus network that wraps the compute nodes of one midplane into a 3D torus. It allows efficient low-latency nearest-neighbor communication between ranks of MPI_COMM_WORLD. MPI collective operations are carried out through the dedicated tree-like collective network. In total, each compute node has six connections to the 3-dimensional torus network with a bandwidth of 3.4 Gbps in each direction, three connections to the global collective network with 6.8 Gbps per link, four connections to the global interrupt network, and one connection to the JTAG control network.
As an alternative to the default High Performance Computing (HPC) mode described above, the High Throughput Computing (HTC) mode allows running a large number of non-MPI applications simultaneously. Each application can be started independently under a different user name, with separate stdin, stdout and stderr. In order to use HTC mode, a partition first has to be booted in this mode by the system administrator.

3 Periscope Design
Periscope [4] is an automatic distributed on-line performance analysis system
developed by Technische Universität München (TUM) at the Chair of Computer

Architecture. In comparison to other tools, Periscope relies on profiling rather than on tracing and automatically searches for a predefined set of performance inefficiencies during the application's execution. The scalability requirements, being the main design concern, resulted in a distributed tree-like reduction network of agents scaling together with the application. The coordination of the distributed components is carried out by Periscope's registry service running in the background.

3.1 Tool Architecture


Periscope consists of multiple agents, where the root agent is called the Frontend (FE). The FE is the only process which has to be started by the user. It is responsible for starting the instrumented application processes, computing the optimal agent hierarchy, mapping it to the underlying hardware, and starting the hierarchy. The FE coordinates the distributed search by taking global decisions on controlling the application's execution and, when necessary, restarting the application. The FE also receives the aggregated set of found performance properties and stores them in an XML file.
The second layer of the tree-like Periscope network consists of the High-Level (HL) agents. They are responsible for the efficient propagation of analysis decisions from the FE downwards and for the scalable aggregation of the results sent in the opposite direction. The results, namely the found performance inefficiencies, are clustered on-line within the HL agents on their way to the FE.
The bottom level of Periscope's network is represented by the Analysis Agents (AA), which are the leaf nodes and are directly connected to the application processes. AAs are responsible for executing the local search for predefined performance inefficiencies on the assigned subset of application processes. An AA instantiates performance hypotheses and creates and submits the monitoring requests required to evaluate them. The requests are submitted over sockets to the monitoring library linked into the application processes. After one execution of the application phase, which is typically one iteration of the main loop (usually a time loop in scientific simulations), the measured values for the submitted requests are analyzed in order to prove or disprove the candidate hypotheses. A refinement mechanism is then employed to drill the found performance inefficiency down to the specific line of code as well as to the specific problem source. An AA relies on a set of search strategies when searching for bottlenecks, which allow the efficient evaluation of numerous application code regions against the rich set of available performance hypotheses. All known and measurable performance bottlenecks are formalized following the APART Specification Language [7] and are implemented as Periscope properties.
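As an illustration of the property concept (a simplified sketch, not Periscope's actual C++ interface; all names are invented), an APART-style property combines a condition, a confidence and a severity expressed relative to the phase time:

    // Simplified sketch of an APART-style performance property.
    struct ExcessiveMpiTimeProperty {
        double mpiTime;     // measured MPI time of the call site
        double phaseTime;   // duration of one execution of the application phase

        bool   condition()  const { return mpiTime > 0.0; }
        double confidence() const { return 1.0; }                          // direct measurement
        double severity()   const { return 100.0 * mpiTime / phaseTime; }  // percent of phase
    };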
The application is instrumented and linked against Periscope's monitoring library, which measures time and hardware counters and is capable of automatically detecting MPI wait states. In addition, the monitoring library can control the
application execution according to the requests received from the AA.
In order to bring multiple distributed components of Periscope together,
Periscope’s registry service is used. Periscope agents as well as application

processes publish their network-address/identity-tag pairs and look up the addresses of their communication partners at the registry service.

3.2 Current Design Shortcomings


Although the scalability of the tool was an important concern, the overall design was targeting commodity clusters [6] with tens of thousands of cores. Even though Periscope was showing good scalability at this scale [5], several weak points that could become severe bottlenecks at higher scales were revealed.
One of the most severe potential bottlenecks was considered to be the registry service used by all the distributed components (agents and instrumented application processes) to register and find their corresponding communication partners. Since every instrumented application process has to be registered and then queried by the responsible analysis agent, this can severely affect the analysis time, which grows quadratically with the number of processes.
Another drawback of the current network startup implementation was that the computation of the agent hierarchy, the application process distribution, and the sequential spawning of the agents were performed by the frontend agent. This becomes a severe startup bottleneck when executed on a system like BlueGene/P.
Another issue for porting Periscope to BG/P comes from the limited OS support of the compute nodes, which restricts the compute nodes from running anything but an MPI application (except in the case of HTC mode or MPMD programs, which will be discussed later), therefore excluding Periscope agents from running on the same partition as the application processes.

4 BlueGene/P Tailored Design Alternatives


Taking into account the requirements of Periscope and the architectural limitations of BlueGene/P, several porting approaches were elicited. The main question to be answered here is the placement of the Periscope agents, which should allow the tool to overcome the scalability bottlenecks discussed before as well as achieve the best utilization of BlueGene/P while avoiding additional overhead and perturbation.

4.1 MPMD Approach


With the introduction of MPMD parallelization in the P generation of the BlueGene series, the possibility to run multiple MPI programs within the same communication domain became available. This allows overcoming the limitation that only application processes are allowed to run on the compute nodes of the same partition.
In this configuration the user starts the FE via mpirun on the allocated partition. The number of processes, which has to account for both the application processes and the Periscope agents, is specified using the -universe_size argument of mpirun. After being started, the FE opens a port using MPI_Open_port and publishes the pair of port and service description. This mechanism, also employed by

the other agents and the application processes, would allow us to drop the commodity registry service. However, this would not remove the bottleneck associated with the thousands of application process ports being published and then queried at the same time.
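A minimal sketch of this port-publishing mechanism (illustrative only, not Periscope's implementation; the service name is hypothetical) could look as follows:

    #include <mpi.h>

    // The frontend opens an MPI port and publishes it under a service name
    // that agents and application processes can later look up.
    void publish_frontend_port(void) {
        char port[MPI_MAX_PORT_NAME];
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("periscope-frontend", MPI_INFO_NULL, port);
        // connections are then accepted via MPI_Comm_accept(port, ...)
    }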
After successful registration, the FE computes the agent hierarchy; however, this is much simplified since all agents are started at once with a single MPI_Spawn command. The hierarchy, in this case, is determined according to the fan-out of the HL agents and the number of leaf agents, which is proportional to the number of application processes.
After setting up the agent hierarchy, the application would be started by the FE, and the agents connect to the application processes and execute the analysis as before. This design would also support the restart of the application, which is done by Periscope if the application has terminated but additional search steps are required.
However, several severe drawbacks of this design were identified. First, the collective network of the BlueGene cannot be properly utilized when the application is running in a sub-communicator of MPI_COMM_WORLD, which is the case when the application is started by Periscope in MPMD mode. Second, as mentioned before, the bottleneck of publishing and querying every application process and agent still significantly impacts the efficiency of Periscope's analysis at large scale. Finally, the additionally required complete reimplementation of the communication substrate of Periscope with MPI, as well as the need to port the AA to run under CNK, would require a significant amount of programming effort.

4.2 Implementation Based on the HTC Mode


Another design approach, which would allow placing the AAs on the compute nodes, relies on the High-Throughput Computing mode of BlueGene. In this mode, multiple non-MPI programs are allowed to be started independently and run simultaneously within the provided partition under the full Linux kernel. Based on this mode, the current approach of starting agents need not be changed.
The overall startup and analysis flow is as follows. The user starts the FE on the frontend node of BG/P. Then either the user or the FE submits a standard job to run the instrumented application in HPC mode. While the application is waiting for dispatch, the FE first computes the tree hierarchy with the explicit assignment of the child-parent relations and then starts the agents within the partition previously booted in HTC mode. The rest of the analysis follows the current procedure, including the registration of all the distributed components, which would again be a bottleneck. Additional complications come from the fact that on the majority of BG/P installations privileged access rights are required to boot the HTC partition, which seriously limits the tool's usability. On the other hand, this design can be realized with little porting effort.

4.3 I/O Node Agent Placement


Following this approach, the tool distribution would utilize the compute nodes exclusively for the application processes linked with the monitoring library, the I/O nodes for running the AAs, and the frontend node for Periscope's FE. There is one important advantage emerging from such a setup: the agents would neither waste computational resources by occupying additional cores nor disturb the application processes during execution, as in the case of the MPMD or HTC approaches.
The placement of the HLs is, however, a more complicated issue, since it is not allowed to run more than one tool process on an I/O node. The two solutions here are either to run the HLs on the frontend node or to merge the reduction functionality of the HL into the AAs and reuse the AAs to form the tree hierarchy.
It is also worth mentioning that the number of compute nodes affiliated with one I/O node is installation-defined and fixed and, more importantly, that they share the same network address. This simplifies agent placement and removes the need for the registry service, since the addresses are known by definition. Thus one of the most severe scalability issues of the current design is eliminated when the AAs run on the I/O nodes.
Usually I/O nodes are not accessible to an unprivileged user. The only process a regular user is allowed to run is a debug process spawned by the CIOD, normally started via the -start_tool argument of the mpirun command. mpirun will spawn the tool process on all of the I/O nodes affiliated with the allocated partition. The startup will no longer be done sequentially by the FE, which removes another scaling issue. However, there is an important drawback: the AAs will automatically be killed when mpirun exits. Thus Periscope's application restart capability is no longer possible.

4.4 Comparison of Design Alternatives


Considering the current architecture of Periscope, its scalability issues and the BlueGene/P system specifics, a set of selection criteria was identified: amount of changes, system utilization during analysis, application intrusion, and tool usability. The amount of changes associated with porting is an important factor influencing both the porting effort and the further maintenance of an additional branch of Periscope. In addition, system utilization becomes very important when large scale experiments are considered. Even one additional process per few application processes might severely hit the computational budget of the user. A simple metric for estimating the system utilization is the number of additional cores needed to run Periscope's agents. Overhead and intrusion into the application execution are another critical factor which can potentially corrupt the measurements; therefore the design should not introduce any additional overhead. The last factor is the tool's usability. It is severely limited if privileged rights are required to run Periscope. The possibility of application restart within one Periscope run is also a valuable functionality which is important to preserve.
The summary comparison of the derived design alternatives against the selec-
tion criteria is presented in Table 1.

Table 1. Design alternatives comparison

    Selection criteria     MPMD      HTC mode   I/O node placement
    Amount of changes      high      low        medium
    Administrative issues  no        yes        no
    Application restart    possible  possible   not possible
    Additional user input  no        yes        no
    Additional cores       #agents   #agents    0
    Additional overheads   yes       no         no

The comparison table shows that the design relying on the MPMD functionality of the MPI 2.1 standard is the least preferable, failing to meet the majority of the selection criteria.
The design approach utilizing an HTC partition to run the Periscope agents features low porting and maintenance effort; however, it suffers from the fact that booting an HTC partition is a privileged operation. In addition, the system utilization is worse since additional cores are required to run the agents.
The best match with the selection criteria is the I/O node agent placement design, which was therefore chosen for implementation. This approach features the best system utilization, since it does not require any additional compute nodes to run Periscope's agents. Instead it runs them on the I/O nodes, which are not intended for computation by design. However, the effort to port Periscope following the described design is considered to be moderate. In order to prove the selected concept quickly and to minimize the associated porting risks, it was decided to split the porting effort into two phases. Within the first phase, the idea of running the AAs on the I/O nodes and the application processes on the affiliated compute nodes was evaluated and considered to be a low-effort task. The other agents are intended to run on the frontend node of BlueGene/P in this phase. The majority of the effort, though, comes from the task of merging the functionality of the AA and the HL in order to run them within the single user process allowed on the I/O nodes. Therefore this task was assigned to phase two, which will deliver the optimal tool distribution and the capability to operate on a full-scale 72-rack BlueGene/P.

5 Evaluation

The phase one porting task was implemented and Periscope was installed on the IBM BlueGene/P supercomputer operated by the King Abdullah University of Science and Technology (KAUST). The machine consists of 16 racks containing in total 65536 IBM PowerPC 450 cores delivering 222 TFlops of peak performance.
In order to prove the scalability of the new Periscope design, a large scale performance analysis run was carried out on the standard BT benchmark from the NAS Parallel Benchmark suite [9]. The benchmark is a block-tridiagonal solver for a synthetic system of nonlinear PDEs. The benchmark was built for the class E problem size, which corresponds to a 1020x1020x1020 grid. The MPI call sites

were instrumented by Periscope's instrumenter and the application was then compiled using the IBM mpxlf Fortran compiler with the -O3 optimization flag and the BlueGene/P MPI library. The application was run on all the available 65536 cores and was analyzed with Periscope's MPI strategy, which automatically detects MPI wait states.
The elapsed analysis time reported by the FE was 432 seconds, of which 382 seconds were spent on tool startup. However, it is important to mention that the majority of the startup time is spent on booting the partition, in this case the whole machine. The automatically computed agent hierarchy included 20 HL agents running on the service node of the BG/P and 128 Analysis Agents running on the I/O nodes. Only one iteration of the BT main loop was executed to complete the analysis for MPI wait states. For each process the monitoring library measured 7 relevant MPI inefficiency related metrics, resulting in 458752 measurements in total. From the received measurements, the Analysis Agents instantiated 393216 candidate properties, out of which only 196608 properties were evaluated as true. The found properties were then clustered while being propagated to the FE, and the final analysis report contained only 3 properties.
The found properties were "Excessive MPI communication time", reported for one MPI_Waitall and two MPI_Wait call sites with a maximum severity of 5.6%. This property identifies MPI communication overhead which amounts to 5.6% of one main loop iteration time. Synchronization properties were also checked but appeared to be below the threshold and were thus not reported.

6 Related Work
There are only a few performance analysis tools available on BlueGene/P, and even fewer of them are specifically designed for large scales. SCALASCA [8], being one of them, is an open-source performance analysis toolset specifically designed for the evaluation of codes running on hundreds of thousands of processors. The tool performs a parallel trace analysis searching for MPI bottlenecks, which allows it to handle, in a scalable way, the trace size that increases linearly with the number of cores. However, it was found that the time spent on the analysis as well as the report size grow linearly with the employed parallelism scale. In contrast, Periscope performs an on-line, profile-based search, thus omitting tracing. Also, the on-line reduction allows it to keep the report size independent of the number of processes.

7 Conclusion and Outlook


The new scale of supercomputing is currently developing rapidly, with more and more petaflop-capable machines being installed worldwide. However, achieving reasonable levels of sustained performance has become a tremendously complicated task for application developers. New instruments are needed to support programmers, in particular in the area of performance analysis. Tool development, however, faces challenges induced by the growing number of cores, such

as proportional growth in analysis times, measurement volumes and final report sizes.


In this paper we have presented a tailored adaptation of the Periscope toolkit for the IBM BlueGene/P supercomputer. Based on the Periscope requirements, the revealed scalability issues, and the limitations of the BG/P architecture, several porting approaches were elicited. The design following the idea of placing the Analysis Agents on the I/O nodes was considered to be superior to the other alternatives due to its optimal system utilization, absence of additional overheads, and high usability, and was therefore implemented.
The large scale performance analysis experiment with the NPB BT benchmark has shown very promising scalability of the Periscope analysis. Due to the optimal placement of the multiple Periscope agents and the optimized analysis flow, the analysis time was kept constant and independent of the number of cores. The size of the report was also shown to be independent of the employed parallelism scale due to the on-line result reduction.

References
1. IBM BlueGene team: Overview of the IBM BlueGene/P project. IBM Journal of Research and Development 52(1/2), 199–220 (2008)
2. Sosa, C., Knudson, B.: IBM System Blue Gene Solution: Blue Gene/P Application
Development. International Technical Support Organization, 4th edn. (August 2009)
3. DelSignore, J.: TotalView on Blue Gene/L. Presented at the Blue Gene/L: Applications, Architecture and Software Workshop, http://www.llnl.gov/asci/platforms/bluegene/papers/26delsignore.pdf
4. Gerndt, M., Fürlinger, K., Kereku, E.: Advanced techniques for performance anal-
ysis. NIC, vol. 33, pp. 15–26 (2006)
5. Benedict, S., Brehm, M., Gerndt, M., Guillen, C., Hesse, W., Petkov, V.: Automatic
Performance Analysis of Large Scale Simulations. In: Lin, H.-X., Alexander, M.,
Forsell, M., Knüpfer, A., Prodan, R., Sousa, L., Streit, A. (eds.) Euro-Par 2009.
LNCS, vol. 6043, pp. 199–207. Springer, Heidelberg (2010)
6. Gerndt, M., Strohhäcker, S.: Distribution of Periscope analysis agents on ALTIX
4700. In: Proceedings of the International Conference on the Parallel Computing
(ParCo 2007). Advances in Parallel Computing, vol. 15, pp. 113–120. IOS Press
(2007)
7. Fahringer, T., Gerndt, M., Riley, G., Träff, J.: Knowledge specification for automatic
performance analysis. APART Technical Report (2001),
http://www.fz-juelich.de/apart
8. Wylie, B.J.N., Böhme, D., Mohr, B., Szebenyi, Z., Wolf, F.: Performance analysis of Sweep3D on Blue Gene/P with the Scalasca toolset. In: IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW 2010), pp. 1–8. IEEE (2010)
9. NPB: NAS Parallel Benchmark, http://www.nas.nasa.gov/resources/software/npb.html
An Approach to Creating Performance
Visualizations in a Parallel Profile Analysis Tool

Wyatt Spear, Allen D. Malony, Chee Wai Lee,


Scott Biersdorff, and Sameer Shende

Department of Computer and Information Science,
University of Oregon, Eugene, Oregon, 97403

Abstract. With increases in the scale of parallelism, the dimensionality and complexity of parallel performance measurements have placed greater challenges on analysis tools. Performance visualization can assist
in understanding performance properties and relationships. However, the
creation of new visualizations in practice is not supported by existing par-
allel profiling tools. Users must work with presentation types provided
by a tool and have limited means to change its design. Here we present
an approach for creating new performance visualizations within an ex-
isting parallel profile analysis tool. The approach separates visual layout
design from the underlying performance data model, making custom vi-
sualizations such as performance over system topologies straightforward
to implement and adjust for various use cases.

1 Introduction
The performance measurement and analysis of large-scale parallel applications
requires means for understanding the features within the multi-dimensional
performance datasets and their relation to the computational and operational
aspects of the application and its execution. While automatic analysis of perfor-
mance behavior and diagnosis of performance problems is desired, performance tools today invariably involve the user in the interpretation of performance results. The presentation of performance information has been regarded as an opportunity for visually conveying characteristics and traits in the data. However, it has always been challenging to create new performance visualizations, for three reasons. First, it requires a design process that integrates properties of the performance data (as understood by the user) with the graphical aspects for good visual form. This is not easy if one wants effective outcomes. Second, unlike visualization of
physical phenomena, performance information does not have a natural semantic
visual basis. It could utilize a variety of graphical forms and visualization types
(e.g., statistical, informational, physical, abstract). Third, with increasing ap-
plication concurrency, performance visualization must deal with the problem of
scale. The use of interactive three-dimensional (3D) graphics clearly helps, but
the visualization design challenge is still present.
In addition to these challenges, there are also practical considerations. Because
of the richness of parallel performance information and the different relationships


to the underlying application semantics, it is unreasonable to expect just a few


performance visualizations to satisfy all needs. Where visualization of large per-
formance information does exist, it is generally embedded as “canned” displays
in a profile or trace analysis tool. If a user has a different concept in mind, they
have very limited ability to make changes. There are ways around the dilemma.
One approach might be to use a visualization environment (e.g., VisIt [3] or
ParaView [2]) which provides robust support for building and producing visual-
izations. These environments have a targeted user group, and it is not parallel
performance analysts.
An alternate approach is to work within a performance analysis tool and build
capabilities for visualization that allow more user design input and interaction.
This could be accomplished simply by providing an interface to query performance data from within the tool and passing the data to a component implemented with a visualization library that the user programs directly (e.g., VTK [4]). Unfortunately, few users would have the expertise to take advantage of this. Instead,
we took a path to constrain how visual layout and user interface components
are specified. The tool then constructs a performance visualization from the
specifications.

2 Design Approach

The approach we followed for performance visualization design was motivated by


our experience with the 3D visualization in the ParaProf profile analysis tool [7]
provided by the TAU performance system [17]. ParaProf’s primary visualiza-
tion mode presents 2D charts of a metric for each event measured in profiled
application. Metrics can be execution time (inclusive, exclusive), event count, or
hardware counters. The 2D charts include bar plots, histograms, communication
“heat maps,” among others. In addition, ParaProf presently provides two types
of 3D performance views:

– Full profile: A landscape visualization is used to show the entire parallel


profile for all events and threads of execution for a single performance metric.
The view is drawn as a triangle mesh or bar plot with events on the X-axis,
threads on the Y-axis, and (event,thread) metric value on the Z-axis.
– Event correlation: A scatter plot visualization is used to show the re-
lationship between four events (each with its own performance metric) for
all threads of execution. The first three event/metric pairs set the spatial
coordinates for each thread with the fourth determining color.

For both visualizations, the UI allows the user to select how the performance
event/metric pair is displayed. Both visualizations are implemented in JOGL,
Java’s interface to OpenGL, and interactive rotation and zooming are provided.
Both of these 3D views were developed to target specific use cases. Without
any additional support, any new visualization would also be implemented that
way. Thus, when we wanted to develop a new visualization for a single event and

metric that was based on a layout of threads according to topological informa-


tion, it became clear that a more general methodology was required; otherwise,
a separate implementation would be needed for each different layout.
To move towards a more general method of creating 3D visualizations, we
considered the salient components of the existing versions and the need to specify
aspects of a 3D design in a more flexible manner. Two ideas resulted: 1) separate
the visualization layout design from visualization user interface (UI) design, and
2) allow the properties of each to be specified by the user. The new design
methodology is shown in Figure 1.

Fig. 1. Visualization architecture

Visualization layout design is concerned with how the visualization will ap-
pear. Our approach allows the visual presentation to be specified with respect to
the parallel profile data model (events, metrics, metadata) and possible analysis
of this information. Two basic layout approaches we support are mapping to
Cartesian coordinates provided by MPI and filling a space of user-defined dimensions in order of MPI rank. We have also worked to develop a specification
language for describing more complex layouts of thread performance in a 3D
space. In our initial implementation of these custom layouts, mathematical for-
mulae define the coordinates and color value of each thread in the layout. The
formulae are based on variables provided by the profile data model. These input
variables include event and metric values for the current thread being processed
as well as global values such as the total number of threads in the profile. The
specification is applied successively to each thread in the profile to determine X,
Y and Z coordinate values and color values which are used to generate the vi-
sualization graphics. Our initial implementation for expression analysis uses the
MESP expression parser library[5]. MESP provides a simple syntax for express-
ing mathematical formulae but is powerful enough to allow visualization layouts
based on architecturally relevant geometries or the mathematical relationship
of multiple performance variables.
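As a rough illustration of how such a specification is applied, the Python sketch below binds each thread's profile values to variables and evaluates the X, Y, Z and color expressions once per thread. The expression strings, variable names and use of Python's eval are illustrative stand-ins of our own, not ParaProf's actual MESP-based implementation.

import math

# Hypothetical layout: a rank-ordered fill of an xdim x ydim x ... box,
# colored by the selected metric value.
layout = {
    "x": "mod(rank, xdim)",
    "y": "mod(floor(rank / xdim), ydim)",
    "z": "floor(rank / (xdim * ydim))",
    "color": "metric",
}

def apply_layout(profile, xdim=8, ydim=8):
    """profile maps thread rank -> value of the selected event/metric pair."""
    points = []
    for rank, metric in profile.items():
        env = {"rank": rank, "metric": metric, "xdim": xdim, "ydim": ydim,
               "maxRank": len(profile),
               "floor": math.floor, "mod": lambda a, b: a % b}
        points.append(tuple(eval(layout[k], {"__builtins__": {}}, env)
                            for k in ("x", "y", "z", "color")))
    return points

# 64 threads with a synthetic metric, laid out as an 8x8x1 block.
print(apply_layout({r: float(r % 7) for r in range(64)})[:3])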
Visualization UI design is concerned with how the visualization will be con-
trolled. The key insight here is to have the UI play a role in “binding” data
model variables used in the layout specification. This approach implements the
functionality present in the current ParaProf views, where the user is free to
select events and metrics to be applied in the visualization as inputs to layout
formulae. However, for large performance profiles of many threads/processes,
the specified layout can result in a dense visualization that obscures internal
structures. The current ability to zoom and rotate the topology in the UI
partially ameliorates this issue. Our model for visualization UI further allows
more sophisticated filtering techniques.

3 Examples
The performance visualization design approach is being developed in the Para-
Prof profile analysis tool. Here we demonstrate our current prototype for three
applications: Sweep3D, S3D and GCRM/ZGrd. Our initial focus is on topol-
ogy visualization. In addition, we illustrate the flexibility of these techniques by
recreating ParaProf’s event correlation view.

3.1 Sweep3D
For development and testing of our 3D visualization approach we used data from
the Sweep3D[6] particle transport code. The Sweep3D performance data set we
used was generated from a 16k core run on an IBM Blue Gene/L system and
contains Cartesian coordinates of each MPI rank from the MPI system [19].
The most obvious topology mapping scheme is to take the rank-to-coordinate
mapping and use it to lay out the points representing the ranks in a 3D space.
Figure 2 shows this performance view for the exclusive time in MPI Barrier.
The layout specification is defined with respect to MPI ranks while event and
metric variables are selected in the UI.

Fig. 2. Sweep3D BG/L 16k-core mapping as provided by MPI
Fig. 3. Sweep3D BG/L 16k-core user-defined mapping

When information about the physical layout of a system is provided, it is
straightforward to write the layout expression and to interpret the results. How-
ever there are a number of cases where basic MPI provided coordinate mapping
will not suffice. Just as mapping schemes vary between MPI implementations,
the means of accessing topology mapping data is also variable. The relevant
mapping information may not be available in performance data from which a
topological display is desired. Even when coordinate data is available there are
potential issues which may render it inappropriate for the performance analysis
task at hand. For example, the underlying, machine-level topology may not be
the topology of interest. Higher-level topologies, relating to how work is allocated,
may have different or no ready means of programmatically associating ranks with
topological coordinates. Another issue that can arise is the need to incorporate
another dimension, such as thread (core) ID, in the display. In such situations it
is necessary to find other means of rank mapping. In general, greater flexibility
in how ranks are visualized allows for more complete analysis of an application
with respect to topology.
Figure 3 shows a user-specified visualization defining a topology with mean-
ingful spatial context for performance data based on a block-wise layout of MPI
ranks, in this case in two dimensions. This shows that even a basic linear stacking
with respect to rank ID can produce valuable interpretive effects.
More general topological renderings produced by mathematical expressions
can serve a number of purposes, including defining more complex hardware
topologies and other spatial representations of computational activity. To demon-
strate the power of the layout specification, Figure 4 illustrates a spherical visu-
alization of the same Sweep3D performance data.

Listing 1.1. Topology configuration file for sphere transformation

BEGIN VIZ=sphere
rootRanks=sqrt(maxRank)
theta=2*pi()/rootRanks*mod(rank,rootRanks)
phi=pi()/rootRanks*(floor(rank/rootRanks))
x=cos(theta)*sin(phi)*100
y=sin(theta)*sin(phi)*100
z=cos(phi)*100
END VIZ

Fig. 4. Sweep3D 16k-core mapping with spherical topology

Listing 1.1 shows the expressions mapping ranks to points on the surface of
a sphere. The X, Y, and Z formulae are required, but additional helper func-
tions may be provided to simplify the expressions. Several variables, such as
maxRank and rank are provided internally. The topology formulae are defined in
a standard text file using MESP’s syntax. The file may be loaded and refreshed
from within ParaProf, allowing rapid development and adjustment of application
or purpose specific topologies, as well as easy sharing of topological definitions
between collaborators.
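For reference, the short Python sketch below (ours, not part of ParaProf) evaluates the same expressions as Listing 1.1 to show where a few ranks land on the sphere; the radius of 100 matches the listing.

import math

def sphere_layout(rank, max_rank, radius=100.0):
    # Same formulae as Listing 1.1: ranks are spread over the surface of a sphere.
    root_ranks = math.sqrt(max_rank)
    theta = 2 * math.pi / root_ranks * (rank % root_ranks)
    phi = math.pi / root_ranks * math.floor(rank / root_ranks)
    return (math.cos(theta) * math.sin(phi) * radius,
            math.sin(theta) * math.sin(phi) * radius,
            math.cos(phi) * radius)

# Coordinates of a few of the 16k Sweep3D ranks.
for r in (0, 1, 128, 8191, 16383):
    print(r, sphere_layout(r, 16384))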
Once a visual layout is defined, the UI can control selection and filtering.
A particularly effective setting defines a displayable range based on minimum
and maximum values. Thread points are not displayed if the value of the metric
event combination representing color exceeds the maximum or is less than the
minimum. If the minimum is set above the maximum, only threads with values
above the minimum or below the maximum are shown, meaning only high and
low values are displayed. This is very important for identifying the topological
patterns of performance outliers, and is shown for Sweep3D in Figure 5.
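A small sketch of this range filter (a hypothetical helper of ours, not ParaProf code) makes the two cases explicit:

def visible(value, vmin, vmax):
    # Normal range (vmin <= vmax): keep points whose color value lies inside it.
    # Inverted range (vmin > vmax): keep only the outliers, i.e. values above
    # vmin or below vmax, which hides the middle of the distribution.
    if vmin <= vmax:
        return vmin <= value <= vmax
    return value >= vmin or value <= vmax

# With vmin=8.0 and vmax=2.0 only very high and very low values remain.
print([v for v in (0.5, 3.0, 5.0, 9.0) if visible(v, 8.0, 2.0)])   # [0.5, 9.0]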

Fig. 5. Sweep3D topology with middle values excised
Fig. 6. Sweep3D topology slice along the X axis

The UI also allows exclusion by locality. For example, if a value along the X
axis is selected, only points appearing along that “slice” will be visible. Excluding
by one, two or three axes results in the visualization of a plane, line or point
respectively. The average value of the metric for the ranks in a selected area will
be displayed in each case, with three selected axes displaying the actual value
for the single selected rank. This is demonstrated in Figure 6 for Sweep3D.

3.2 S3D Use Case


The S3D application [9] is a massively parallel turbulent combustion simulator
developed by Sandia National Laboratories. The core DNS (Direct Numerical
Simulation) solver code of S3D is parallelized using three dimensional domain
decomposition over a Cartesian mesh. We examined data collected by TAU from
several scaling runs of S3D on the Intrepid IBM Blue Gene P system. Previous
performance analysis of S3D [16] has suggested topology dependent performance
behavior centered around MPI communication. The communication-topology
dependent nature of the code made it a strong candidate for re-analysis with
topological performance visualization.
The Cartesian topology coordinates provided by MPI for the S3D run on in-
trepid cover the three spatial dimensions of nodes within the system’s layout.
However, on BG/P, each node contains two processors with two cores per pro-
cessor. The default behavior of the topology view, given pre-defined coordinates
with a 4th thread or core-level dimension, is to take the average of the thread or
core values on each node and display that value in the 3D node level topology.

Listing 1.2. Topology configuration file for processor blocks

BEGIN VIZ=4Px16Block
xdim=8,ydim=8,zdim=16
x=mod(rank,xdim)+16*floor(rank/1024)
y=mod(floor(rank/xdim),ydim)
z=mod(floor(rank/xdim/ydim),zdim)
END VIZ

Fig. 7. Time in MPI Allreduce for S3D 4K-core run on BG/P with core-based topological layout

We wanted to break down the visualization further to display core level activ-
ity. We elected to use the mathematical expression topology definition system to
display each of the four cores as its own point in a distinct node level topology.
The result is shown in Figure 7. Each of the four blocks represents the activity
of one of the four core IDs laid out in the specified node-level topology.
Discussing the internal topologies of each block is outside the scope of this
paper. However, an interesting high-level phenomenon is immediately visible. In
each node, overall, the cores are operating in pairs of high and low utilization.
That is, for each chip, one core is spending significantly more time in the routine
under observation (MPI Allreduce) than the other. This core-wise breakdown is
likely related to the way individual cores are assigned to handle communication.
The formulae used to distinguish the topology by core are defined in Listing 1.2.
Note that this topology definition is only applicable to topologies in which the
thread rank is the last to repeat.
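To see what the Listing 1.2 expressions do, the sketch below (illustrative Python of our own, assuming the 4,096-rank run where core ID equals rank/1024) evaluates them for a few ranks; each group of 1,024 ranks is shifted by 16 in X, producing the four side-by-side blocks of Figure 7.

def core_block_layout(rank, xdim=8, ydim=8, zdim=16):
    # Same expressions as Listing 1.2.
    x = rank % xdim + 16 * (rank // 1024)
    y = (rank // xdim) % ydim
    z = (rank // (xdim * ydim)) % zdim
    return x, y, z

for r in (0, 1, 8, 1024, 2048, 3072):
    print(r, core_block_layout(r))
# rank 0 -> (0, 0, 0); rank 1024 -> (16, 0, 0); rank 3072 -> (48, 0, 0)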
There are numerous alternative thread-conscious, four dimensional layouts.
The ideal layout will vary by the application selected and the topological be-
havior being observed. For example, we have also had success with grouping the
threads or cores that comprise a single node and arranging each of these groups
in the context of the greater node-level system topology.
By opening the node and thread layout to formulaic definition we have ex-
panded the scope of topological performance visualization from machine dictated
layouts to arbitrary node configurations. This is especially useful for mapping
ranks to program domain decompositions which may have no direct relationship
with the hardware topology.

3.3 GCRM/ZGrd Use Case: 3D Correlation Plot


The Global Cloud Resolving Model (GCRM) [1] is an atmospheric circulation
simulator. We selected a 10240 core run of its ZGrd sub-application to demon-
strate our visualization system’s flexibility by duplicating ParaProf’s existing
3D scatter plot functionality, which shows the correlation of four event/metric
combinations specified in the UI. Three values are shown spatially and one using
color.
The spatial dimensions and color of each rank are set by distinct event/metric
pairs. For example in Figure 8 the values used to calculate the Y axis position
of each point are determined by the exclusive time spent in MPI Allreduce. As
shown in Listing 1.3 the three spatial dimensions are associated with internally
defined values. The event/metric pairs used to populate these values are specified
by the user in the ParaProf UI. The restrictDim value set to 1 causes the 3D
visualization to normalize dimensions, ensuring a cubic rendering space even if
the different event/metric pairs use values with different orders of magnitude.

Listing 1.3. Topology configuration file for scatterplot

BEGIN VIZ=ScatterTest
restrictDim=1
x=(event0.val-event0.min)/(event0.max-event0.min)
y=(event1.val-event1.min)/(event1.max-event1.min)
z=(event2.val-event2.min)/(event2.max-event2.min)
END VIZ

Fig. 8. 3D correlation plot of 10240-core ZGrd run

4 Related Work

The interest in parallel performance visualization truly began as parallel systems
began to scale. The classic ParaGraph [13] used network topology to display per-
formance of message passing programming. The Prism [18] programming envi-
ronment for the CM-5 demonstrated the benefits of topological layout for display
of debugging, data, and performance information for data parallel programs.
Previously studied concepts and methods for parallel performance visualization
[10,14,15] emphasize the importance of using various forms of semantics in effec-
tive presentation. Three-dimensional graphics techniques have also proven use-
ful for scalable performance visualization [12,11], again with structural elements
providing a necessary context for interpretation.
As the scale of parallel systems grew, the challenges of performance data size,
analysis complexity, and presentation intensified. The Cube application provided
by the Scalasca project is a performance analysis tool whose features
include visualization of performance data in a Cartesian topology layout. Cube
populates this view based on the coordinates provided by Scalasca, collected
from the MPI system at runtime.
The aforementioned tools, where topology based visualization is provided at
all, generally support topological displays mediated by communication patterns
or literal hardware layouts. In contrast, ParaProf’s 3D topology visualization sys-
tem introduces a number of novel features to increase the scope of topological
layouts which can be applied to performance data inputs, including site and
application-specific topologies. Unlike proprietary or architecture-specific tools,
the ParaProf 3D topology visualizer provides a consistent, portable interface for
topological analysis regardless of platforms and programming models that pro-
duced the data. Additionally, the scope of the performance data made available
by the TAU framework allows for topological analysis with respect to a wide
array of metrics and program decompositions.
Significant work has been done by the Charm++ project [8] among others
on topology mapping algorithms. These solutions generally focus on selection
of ideal layouts for a given application and are complemented by diagnostic
performance analysis with respect to the topological configuration.
There are some similarities in graphical presentation between our topological
performance display and systems for configuring and monitoring HPC clusters.
These management tools are often focused on a specific subset of hardware and
generally are more concerned with static evaluation of system topologies than
post-mortem analysis. Such applications include XT3D [20], which presents 3D
visualizations of system topologies on Cray clusters.

5 Conclusion
Parallel performance visualization can be a useful technique for better under-
standing performance phenomena. However, it is important to integrate the
capabilities within a performance analysis framework. This paper describes a
performance visualization design methodology and its incorporation in the TAU
ParaProf tool. Its initial implementation concentrates on topology-oriented lay-
out and examples are given for the Sweep3D and S3D applications.
However, the methods we present for visual layout and UI design are more
broadly applicable. To demonstrate their versatility, we have recently recreated
ParaProf’s event correlation view. In general, our goal is to allow the user the
full benefit of incorporating their concepts of visual presentation and semantics
to improve performance understanding.

Acknowledgements. This research is supported by the U.S. Department of En-
ergy, Office of Science, under contract DE-SC0001777. Resources of the Argonne
Leadership Computing Facility at Argonne National Laboratory were utilized in
the work.

References
1. Global cloud resolving model (gcrm), https://svn.pnl.gov/gcrm
2. Paraview, http://www.paraview.org/
3. Visit, https://wci.llnl.gov/codes/visit/
4. Visualization toolkit (vtk), http://expression-tree.sourceforge.net/
5. Math expression string parser (mesp) (2004),
http://expression-tree.sourceforge.net/
6. The ASCI Sweep3D code (October 2006),
http://www.llnl.gov/ascibenchmarks/asci/limited/Sweep3D/asciSweep3D.html
7. Bell, R., Malony, A.D., Shende, S.: A portable, extensible, and scalable tool for
parallel performance profile analysis. In: Proc. EUROPAR 2003 Conference, pp.
17–26 (2003)
8. Bhatele, A., Kale, L.V., Chen, N., Johnson, R.E.: A Pattern Language for Topology
Aware Mapping. In: Workshop on Parallel Programming Patterns, ParaPLOP 2009
(June 2009)
9. Chen, J., et al.: Terascale direct numerical simulations of turbulent combustion
using S3D. Computational Science and Discovery 2(1), 15001 (2009)
10. Couch, A.: Categories and Context in Scalable Execution Visualization. Journal of
Parallel and Distributed Computing 18(2), 195–204 (1993)
11. De Rose, L., Pantano, M., Aydt, R., Shaffer, E., Schaeffer, B., Whitmore, S., Reed,
D.: An approach to immersive performance visualization of parallel and wide-area
distributed applications. In: Proceedings of the Eighth International Symposium
on High Performance Distributed Computing, 1999, pp. 247–254 (1999)
12. Hackstadt, S., Malony, A., Mohr, B.: Scalable Performance Visualization of Data-
Parallel Programs. In: Scalable High-Performance Computing Conference, pp. 342–
349 (May 1994)
13. Heath, M., Etheridge, J.: Visualizing the Performance of Parallel Programs. IEEE
Software 8(5), 29–39 (1991)
14. Heath, M., Malony, A., Rover, D.: Parallel Performance Visualization: From Prac-
tice to Theory. IEEE Parallel and Distributed Technology: Systems and Technol-
ogy 3(4), 44–60 (1995)
15. Heath, M., Malony, A., Rover, D.: The Visual Display of Parallel Performance
Data. Computer 28(4), 21–28 (1995)
16. Jagode, H., Dongarra, J., Alam, S., Vetter, J., Spear, W., Malony, A.D.: A Holistic
Approach for Performance Measurement and Analysis for Petascale Applications.
In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot,
P.M.A. (eds.) ICCS 2009. LNCS, vol. 5545, pp. 686–695. Springer, Heidelberg
(2009), http://dx.doi.org/10.1007/978-3-642-01973-9_77
17. Shende, S., Malony, A.D.: The TAU Parallel Performance System. SAGE Publica-
tions (2006)
18. Sistare, S., Allen, D., Bowker, R., Jourdenais, K., Simons, J., Title, R.: A scalable
debugger for massively parallel message-passing programs. IEEE Parallel and Dis-
tributed Technology: Systems and Applications Distributed Technology: Systems
and Applications 2(2), 50–56 (1994)
19. Traff, J.: Implementing the mpi process topology mechanism. In: SC Conference,
p. 28 (2002)
20. Yanovich, J., Budden, R., Simmel, D.: XT3DMon 3D visual system monitor for PSC’s
Cray XT3 (2006), http://www.psc.edu/~yanovich/xt3dmon
INAM - A Scalable InfiniBand Network Analysis
and Monitoring Tool 

N. Dandapanthula1, H. Subramoni1, J. Vienne1 , K. Kandalla1 , S. Sur1 ,


Dhabaleswar K. Panda1 , and Ron Brightwell2
1
Department of Computer Science and Engineering,
The Ohio State University
{dandapan,subramon,viennej,kandalla,
surs,panda}@cse.ohio-state.edu
2
Sandia National Laboratories
rbbrigh@sandia.gov

Abstract. As InfiniBand (IB) clusters grow in size and scale, predicting the be-
havior of the IB network in terms of link usage and performance becomes an
increasingly challenging task. There currently exists no open source tool that al-
lows users to dynamically analyze and visualize the communication pattern and
link usage in the IB network. In this context, we design and develop a scalable
InfiniBand Network Analysis and Monitoring tool - INAM. INAM monitors IB
clusters in real time and queries the various subnet management entities in the IB
network to gather the various performance counters specified by the IB standard.
We provide an easy to use web-based interface to visualize performance counters
and subnet management attributes of a cluster on an on-demand basis. It is also
capable of capturing the communication characteristics of a subset of links in the
network. Our experimental results show that INAM is able to accurately visualize
the link utilization as well as the communication pattern of target applications.

1 Introduction
Across various enterprise and scientific domains, users are constantly looking to push
the envelope of achievable performance. The need to achieve high resolution results
with smaller turn around times has been driving the evolution of enterprise and super-
computing systems over the last decade. Interconnection networks have also rapidly
evolved to offer low latencies and high bandwidths to meet the communication require-
ments of distributed computing applications. InfiniBand has emerged as a popular high
performance network interconnect and is being increasingly used to deploy some of
the top supercomputing installations around the world. According to the Top500 [13]

This research is supported in part by Sandia Laboratories grant #1024384, U.S. Department
of Energy grants #DE-FC02-06ER25749, #DE-FC02-06ER25755 and contract #DE-AC02-
06CH11357; National Science Foundation grants #CCF-0621484, #CCF-0702675, #CCF-
0833169, #CCF-0916302 and #OCI-0926691; grant from Wright Center for Innovation
#WCI04-010-OSU-0; grants from Intel, Mellanox, Cisco, QLogic, and Sun Microsystems;
Equipment donations from Intel, Mellanox, AMD, Obsidian, Advanced Clustering, Appro,
QLogic, and Sun Microsystems.


ratings of supercomputers done in June’11, 41.20% of the top 500 most powerful super-
computers in the world are based on the InfiniBand interconnects. Recently, InfiniBand
has also started to make in-roads into the world of enterprise computing.
Different factors can affect the performance of applications utilizing IB clusters. One
of these factors is the routing of packets or messages. Due to static routing, it is impor-
tant to ensure that the routing table is correctly programmed. Hoefler et al. showed,
in [4], the possible degradation in performance if multiple messages traverse the same
link at the same time. Unfortunately, there do not exist any open-source tools that can
provide information such as the communication matrix of a given target application or
the usage of the various links in the network, in a user-friendly way.
Most of the contemporary network monitoring tools for IB clusters have an overhead
attached to them, which is caused by the execution of their respective daemons, which
need to run on every monitored device on the subnet. The purpose of these daemons is
to gather relevant data from their respective hosts and transmit it to a central daemon
manager which renders this information to the user. Furthermore, the task of profiling
an application at the IB level is difficult given that most of the network
monitoring tools are not highly responsive to the events occurring on the network. For
example, to reduce the overhead caused by constant gathering of information at the
node by the daemons, a common solution is to gather the information at some time
intervals which could be anywhere between 30 seconds to 5 minutes. This is called the
sampling frequency. Thus, the higher the sampling frequency, the higher the overhead
created by the daemons. This causes a tradeoff with the responsiveness of the network
monitoring tool. This method has an additional disadvantage in that it does not allow
us to monitor network devices such as switches and routers where we will not be able
to launch user specified daemon processes.
As IB clusters grow in size and scale, it becomes critical to understand the behav-
ior of the InfiniBand network fabric at scale. While the Ethernet ecosystem has a wide
variety of matured tools to monitor, analyze and visualize various elements of the Eth-
ernet network, the InfiniBand network management tools are still in their infancy. To
the best of our knowledge, none of the available open source IB network management
tools allow users to visualize and analyze the communication pattern and link usage in
an IB network. These lead us to the following broad challenge - Can a low overhead
network monitoring tool be designed for IB clusters that is capable of depicting the
communication matrix of target applications and the link usage of various links in the
InfiniBand network?
In this paper we address this challenge by designing a scalable InfiniBand Network
Analysis and Monitoring tool - INAM. INAM monitors IB clusters in real time and
queries the various subnet management entities in the IB network to gather the vari-
ous performance counters specified by the IB standard. We provide an easy to use web
interface to visualize the performance counters and subnet management attributes of
the entire cluster or a subset of it on the fly. It is also capable of capturing the com-
munication characteristics of a subset of links in the network, thereby allowing users
to visualize and analyze the network communication characteristics of a job in a high
performance computing environment. Our experimental results show that INAM is able

to accurately visualize the link usage within a network as well as the communication
pattern of target applications.
The remainder of this paper is organized as follows. Section 2 gives a brief overview
of InfiniBand and the InfiniBand subnet management infrastructure. In Section 3 we
present the framework and design of INAM. We evaluate and analyze the correctness
and performance of INAM in various scenarios in Section 4, describe the currently
available related tools in Section 5, and summarize the conclusions and possible future
work in Section 6.

2 Background
2.1 InfiniBand
InfiniBand is a very popular switched interconnect standard being used by almost 41%
of the Top500 Supercomputing systems [13]. InfiniBand Architecture (IBA) [5] defines
a switched network fabric for interconnecting processing nodes and I/O nodes, using
a queue-based model. InfiniBand standard does not define a specific network topol-
ogy or routing algorithm and provides the users with an option to choose as per their
requirements.
IB also proposes link-layer Virtual Lanes (VLs) that allow the physical link to be split
into several virtual links, each with their specific buffers and flow control mechanisms.
This possibility allows the creation of virtual networks over the physical topology. How-
ever, current generation InfiniBand interfaces do not offer performance counters for
different virtual lanes.

2.2 OFED
OFED, short for OpenFabrics Enterprise Distribution, is an open source software for
RDMA and kernel bypass applications. It is needed by the HPC community for appli-
cations which need low latency, high efficiency and fast I/O. A detailed overview of
OFED can be found in [11]. OFED provides performance monitoring utilities which
present the port counters and subnet management attributes for all the device ports
within the subnet. Some of the attributes which can be obtained from these utilities are
shown in Table 1.

Table 1. Sample of attributes provided by utilities inside OFED

Utility     Attribute        Description
perfquery   XmtData          The number of 32 bit data words sent out through that port since last reset
perfquery   RcvData          The number of 32 bit data words received through that port since last reset
perfquery   XmtWait          The number of units of time a packet waits to be transmitted from a port
smpquery    LinkActiveSpeed  The current speed of a link
smpquery    NeighborMTU      Active maximum transmission unit enabled on this port for transmit

OFED includes an InfiniBand subnet manager called OpenSM which configures an
InfiniBand subnet. It comprises the subnet manager (SM), which scans the fabric,
initiates the subnet and then monitors it. The subnet management agents (SMA) are
deployed on every device port of the subnet to monitor their respective hosts. All man-
agement traffic including the communication between the SMAs and the SM is done
using subnet management packets (SMP). IBA allocates the Virtual lane (VL) 15 for
subnet management traffic. The general purpose traffic can use any of the other virtual
lanes from 1 to 14 but the traffic on VL 15 is independent of the general purpose traffic.

3 Design and Implementation of INAM

We describe the design and implementation details of our InfiniBand Network Analysis
and Monitoring tool (INAM) in this section. For modularity and ease of portability, we
separate the functionality of INAM into two distinct modules - the InfiniBand Network
Querying Service (INQS) and the Web-based Visualization Interface (WVI). INQS acts
as a network data acquisition service. It retrieves the requested information regarding
ports on all the devices of the subnet to obtain the performance counters and subnet
management attributes. This information is then stored in a database using MySQL
methods [9]. The WVI module then communicates with the database to obtain the data
pertaining to any user-requested port(s) on an on-demand basis. The WVI is designed as
a standard web application which can be accessed using any contemporary web browser.
The two modes of operation of the WVI include the live observation of the individual
port counters of a particular device and the long term storage of all the port counters
of a subnet. This information can be queried by the user in the future. INQS can be
ported to any platform, independent of the cluster size and the Linux distribution being
used. INAM is initiated by the administrator and there exists a connection thread pool
through which individual users are served. As soon as a user exits the application, the
connection is returned to the pool. If all the connections are taken up, then the user has
to wait. Currently the size of this connection pool is 50 and can be increased.
As we saw in Section 1, a major challenge for contemporary IB network monitoring
tools is the necessity to deploy daemon processes on every monitored device on the
subnet. The overhead in terms of CPU utilization and network bandwidth caused by
these daemons often cause considerable perturbations in the performance of real user
applications that use these clusters. INAM overcomes this by utilizing the Subnet Man-
agement Agents (SMA) which are required to be present on each IB enabled device
on the subnet. The primary role of an SMA is to monitor and regulate all IB network
related activity on their respective host nodes. The INQS queries these SMAs to ob-
tain the performance counters and subnet management attributes of the IB device(s)
on a particular host. The INQS uses Management Datagram (MAD) packets to query
the SMAs. As MAD packets use a separate Virtual Lane (VL 15), they will not com-
pete with application traffic for network bandwidth. Thus, compared to contemporary
InfiniBand network management tools, INAM is more responsive and causes less
overhead.
INAM is also capable of monitoring and visualizing the utilization of a link within
a subnet. To obtain the link utilization, the XmtWait attribute alone or XmtData / Rcv-
Data and LinkActiveSpeed attributes in combination are used. The XmtWait attribute
corresponds to the period of time a packet was waiting to be sent, but could not be sent
due to lack of network resources. In short, it is an indication of how congested a link
is. The LinkActiveSpeed attribute indicates the speed of the link. This can be used in
combination with the change in XmtData or RcvData attribute to see whether the link is
being over utilized or not. In either case, we update a variable called the link utilization
factor to depict the amount of traffic in the link. There is also an option to use just the
INQS as a stand alone system to save the device port information and the link usage
information over a period of time (time can be varied depending on the memory avail-
able) to analyze the traffic patterns over an InfiniBand subnet. INQS initially creates a
dynamic MySQL database of all the available LID-Port combinations, along with the
physical links interconnecting these ports. The LID-Port combination signifies all com-
binations of the device LIDs in the subnet and their respective ports. This information
is updated periodically and thus adapts to any changes in the network topology. The
frequency at which the data is collected from the subnet and the frequency at which the
data is displayed on the WVI can both be modified as per the requirement of the user.
The overhead here would be associated with the WVI module. The display frequency
can be reduced to 1 second and this would serve the users for all practical purposes. If
this display frequency is less than 1 second, then we see a drop in the responsiveness of
the dynamic graphs generated.
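To make the counter-based utilization idea concrete, the sketch below polls the OFED perfquery utility twice and derives a utilization factor from the XmtData/RcvData deltas and the link speed. The output parsing, the field names and the exact formula are simplifications of ours, not INAM's actual implementation.

import re, subprocess, time

def read_counters(lid, port):
    # perfquery prints lines of the form "XmtData:....................12345".
    out = subprocess.run(["perfquery", str(lid), str(port)],
                         capture_output=True, text=True).stdout
    return {m.group(1): int(m.group(2))
            for m in re.finditer(r"(\w+):\.*(\d+)\s*$", out, re.M)}

def link_utilization(lid, port, link_bytes_per_sec, interval=1.0):
    before = read_counters(lid, port)
    time.sleep(interval)
    after = read_counters(lid, port)
    # XmtData and RcvData count 32-bit words, so multiply the deltas by 4 for bytes.
    moved = 4 * (after["XmtData"] - before["XmtData"] +
                 after["RcvData"] - before["RcvData"])
    return moved / (link_bytes_per_sec * interval)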
The WVI interacts with the database and displays the information requested by the
user in the form of a graphical chart. Dynamic graphs are generated by using High-
Charts Js [2]. This model currently does not support dynamic multiple data series dis-
played in the same diagram with a common y-axis. Thus the comparison is done using
multiple graphs as shown in Figure 1. We use a push model instead of a pull model
to update the data in the WVI. The connection between the MySQL database and the
WVI is kept open and the hosting server pushes data to the browser as soon as the database
is updated by INQS. This technique removes the overhead on the web server caused
by the browser constantly polling the database for new data. This is implemented using
a methodology called Comet [3]. This makes the web server stable and provides high
availability even when deployed on large InfiniBand clusters with heavy data flow. The
rest of the functionalities of the web server are implemented using Java 2 Platform En-
terprise Edition (J2EE) [12]. The communication pattern of an MPI job is created by
WVI by querying the database and then by using the canvas element of HTML5 [15] to
chart out the physical topology and connections between the ports on a subnet.

3.1 Features of INAM


INAM can monitor an InfiniBand cluster in real time by using the functionalities pro-
vided by Open Fabrics Enterprise Distribution (OFED) stack. It can also monitor the
link utilization on the fly and provide a post mortem analysis of the communication
pattern of any job running on the IB cluster.
The user can select the device he wants to monitor through a dynamically updated list
of all the currently active devices on the subnet. An option to provide a list of all the port
counters which need to be compared in real time, is given to the user. Only the counters
of one particular port can be monitored at a time. INAM shows the first derivative of
the counter values. A detailed overview of all the subnet management attributes of a
particular port in a subnet can also be obtained. The attributes are divided into four
main categories which are Link Attributes, Virtual Lane Attributes, MTU Attributes and
Errors and Violations. INAM also provides dynamic updates regarding the status of the
master Subnet Manager (SM) instance to the user. If there is a change in the priority
of SM or if the Master SM instance fails or if a new slave SM takes over as a Master
SM instance, the status is updated and the user is notified. This can help to understand
the fail-over properties of OpenSM. Furthermore, a user can ask INAM to monitor the
network for the time period of an MPI job and then it helps the user understand the
communication pattern of that job using a color coded link utilization diagram.

4 Experimental Results
4.1 Experimental Setup
The experimental setup is a cluster of 71 nodes (8 cores per node with a total of 568
cores) which are all dual Intel Xeons E5345 connected to an InfiniBand Switch which
has an internal topology of a Fat Tree. We use a part of this cluster to show the function-
ality of INAM. This setup comprises 6 leaf switches and 6 spine switches with 24
ports each and a total of 35 leaf nodes equipped with ConnectX cards. The functioning
of INAM is presented using a series of benchmarks in varied scenarios. The first set
of results are obtained using a bandwidth sharing benchmark to create traffic patterns
which are verified by visualizing the link usage using INAM. The second set of bench-
marks shows similar network communication patterns with MPI Bcast configured for
diverse scenarios. The third set of experiments verifies the usage of INAM using the
LU benchmark from the SpecMPI suite.

4.2 Visualizing Port Counters


The user can select the device they want to monitor through a dynamically updated list
of all the currently active devices on the subnet. The user can also provide a list of all
the port counters they want to compare in real time. Figure 1 depicts how INAM allows
users to visually compare multiple attributes of a single port. In this example, we show
how two attributes - transmitted packets and received packets, of user selected port can
be compared.

4.3 Point to Point Visualization: Synthetic Communication Pattern


We create custom communication patterns using the bandwidth sharing benchmark
mentioned in [14] to verify the functioning of INAM. The benchmark in question en-
ables us to specify the number of processes transmitting messages and the number of
processes receiving messages at leaf switch level and thus creating a blocking point to
point communication pattern. We created various test patterns, each incrementally more
communication intensive then the previous pattern, to help us notice a difference in the
pattern using INAM. Two of those patterns are mentioned in detail in the consequent
sections.

Fig. 1. Monitoring the XmtData and RcvData of a port

Fig. 2. INAM depiction of network traffic pattern for 16 processes
Fig. 3. INAM depiction of network traffic pattern for 64 processes

Test Pattern 1. The first test pattern is visualized in Figure 2. The process arrangement
in this pattern is such that 8 processes, one per each of the 8 leaf nodes connected to
leaf switch 84, communicate with one process, on each of the four leaf nodes connected
to the each of the two switches 78 and 66. The thick green line indicates that multiple
processes are using that link. In this case, it can be observed that the thick green line
originating from switch 84 splits into 2 at switch 110. The normal green links symbolize
that the links are not being over utilized, for this specific case.

Test Pattern 2. Figure 3 presents the network communication for test pattern 2. The
process arrangement in this pattern is such that 32 processes, four per each of the 8
leaf nodes connected to leaf switch 84, communicate with two processes, on each of
the eight leaf nodes connected to the each of the two switches 78 and 66. 32 processes
send out messages from switch 84 and 16 processes on each of the switches 78 and 66
receive these messages. This increase in the number of processes per leaf node explains
the exorbitant increase in the number of links being overly utilized. Figure 3 also shows
that all of the inter switch links are marked in thick lines, thus showing that each link is
being used by more than one process. The links depicted in red indicate that the link is
over utilized. Since each leaf node on switch 84 has four processes and each leaf node
on the other switches have two processes, the links connecting the leaf nodes to the
switch are depicted as thick red lines.

4.4 Link Utilization of Collective Operations: Case Study with MPI Bcast
Operation
In this set of experiments, we evaluate the visualization of the One-to-All broadcast
algorithms typically used in MPI libraries, using INAM. MVAPICH2 [8] uses the tree-
based algorithms for small and medium sized messages, and the scatter-allgather al-
gorithm for larger messages. The tree-based algorithms are designed to achieve lower
latency by minimizing the number of communication steps. However, due to the costs
associated with the intermediate copy operations, the tree-based algorithms are not suit-
able for larger messages and the scatter-allgather algorithm is used for such cases. The
scatter-allgather algorithm comprises two steps. In the first step, the root of the broad-
cast operation divides the data buffer and scatters it across all the processes using the bi-
nomial algorithm. In the next step, all the processes participate in an allgather operation
which can either be implemented using the recursive doubling or the ring algorithms.
We designed a simple benchmark to study the link utilization pattern of the
MPI Bcast operation with different message lengths. For brevity, we compare the link
utilization pattern with the binomial algorithm with 16KB message length and we study
the scatter-allgather (ring) algorithm with a data buffer of size 1MB. We used six pro-
cesses for these experiments, such that we have one process on each of the leaf switches,
as shown in Figure 4. In our controlled experiments, we assign the process on switch
84 to be the root (rank 0) of the MPI Bcast operation, the process on switch 126 to be rank 1, and so on
until the process on switch 66 is rank 5. Figure 4 shows a binomial traffic pattern for
a broadcast communication on 6 processes using a 16KB message size. The binomial
communication pattern with 6 processes is as follows (a sketch that reproduces this schedule appears after the list):
– Step1: Rank0 → Rank3
– Step2: Rank0 → Rank1 and Rank3 → Rank4
– Step3: Rank1 → Rank2 and Rank4 → Rank5
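The schedule above can be reproduced by a simple recursive split of the rank range. The sketch below is our own illustration and may differ from MVAPICH2's internal implementation, but it generates exactly the three steps listed.

def binomial_bcast_schedule(lo, hi, step=1, schedule=None):
    # The root of the range [lo, hi] sends to the rank at the midpoint, after
    # which both halves proceed independently in the next step.
    if schedule is None:
        schedule = {}
    n = hi - lo + 1
    if n > 1:
        mid = lo + n // 2
        schedule.setdefault(step, []).append((lo, mid))
        binomial_bcast_schedule(lo, mid - 1, step + 1, schedule)
        binomial_bcast_schedule(mid, hi, step + 1, schedule)
    return schedule

print(binomial_bcast_schedule(0, 5))
# {1: [(0, 3)], 2: [(0, 1), (3, 4)], 3: [(1, 2), (4, 5)]}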
In Figure 4, a darker color is used to represent a link that has been used more than
once during the broadcast operation. We can see that for processes with ranks 0 through 4,
the link connecting the compute nodes to their immediate leaf-level switches is used
more than once, because these processes participate in more than one send/recv oper-
ation. However, process P5 receives only one message and INAM demonstrates this
by choosing a lighter shade. We can also understand the routing algorithm used be-
tween the leaf and the spine switches by observing the link utilization pattern generated
by INAM. We also observe that the process with rank 4 uses the same link between
switches 90 and 110 for both its send and receive operations. Such a routing scheme is
probably more prone to contention, particularly at scale when multiple data streams are
competing for the same network link.

Fig. 4. Link utilization of binomial algorithm
Fig. 5. Link utilization of scatter-allgather algorithm

Figure 5 presents the link utilization pattern for the scatter-allgather (ring) algorithm
with 6 processes. We can see that the effective link utilization for this algorithm is con-
siderably higher when compared to the binomial exchange. This is because the scatter-
allgather (ring) algorithm involves a higher number of communication steps than the
binomial exchange algorithm. With 6 processes, the ring algorithm comprises 6 com-
munication steps. In each step, process Pi communicates with its immediate logical
neighbors, processes P(i − 1) and P(i + 1). This implies that each link between
neighboring processes is utilized exactly 6 times during the allgather phase.

4.5 Application Visualization: SpecMPI - LU

In this experiment, we ran the LU benchmark (137.lu - medium size - mref) from the
SpecMPI suite [10] on a system size of 128 processes using 16 leaf nodes with 8 nodes
on each of the two leaf switches. The prominent communication used by LU consists
of MPI Send and MPI Recv. The communication pattern is such that each process com-
municates with its nearest neighbors in either directions (p2 communicates with p1 and
p3). In the next step, p0 communicates with p15, p1 communicates with p16 and so
on. This pattern is visualized by INAM and is shown in Figure 6. It can be seen that a
majority of the communication is occurring on an intra-switch level.

Fig. 6. INAM depicting the communication pattern using LU benchmark

4.6 Overhead of Running INAM

Since we use the subnet management agents (SMAs), which act as daemons monitoring
all the devices of a subnet, we do not need to use any additional daemons installed on
every device to obtain this data. This is a major advantage as it avoids the overhead in
the contemporary approach caused by the daemons which are installed on every device.
The user just needs to have the service opensmd started on the subnet. Since the queries
used communicate through Virtual Lane 15 for the purpose of data acquisition, there is
no interference with the generic cluster traffic. For the verification of this, we compared
the performance of an IMB alltoall benchmark while toggling the data collection service
on and off by using messages of size 16 KB and 512 KB for a system size varying from
16 cores to 512 cores. The results, shown in Figure 7, indicate that the overhead is
minimal when the service is running and that it changes little as the message size
increases.

System size            16 cores   32 cores   64 cores   128 cores   256 cores   512 cores
Message size 16 KB       0.13%      0.11%      0.15%       0.09%       0.07%       0.14%
Message size 512 KB      0.19%      0.21%      0.16%       0.08%       0.21%       0.15%

Fig. 7. Overhead caused by running INAM

5 Related Tools

There is a plethora of free or commercial network monitoring tools that provide dif-
ferent kinds of information to the system administrators or the users. But only a few of
them provide specific information related to the IB network. We focus here on three popular
network monitoring tools: Ganglia [6], Nagios [1] and FabricIT [7].
Ganglia is a widely used open-source scalable distributed monitoring system for
high-performance computing systems developed by the University of California inside
the Berkeley Millennium Project. One of the best features of Ganglia is to offer an
overview of certain characteristics within all the nodes of a cluster, like memory, CPU,
disk and network utilization. At the IB level, Ganglia can provide information through
perfquery and smpquery. Nevertheless, Ganglia cannot show any information related to
the network topology or link usage. Furthermore, to get all the data, Ganglia needs to
run a daemon, called gmond, on each node, adding an additional overhead.
Nagios is another common open-source network monitoring tool. Nagios offers al-
most the same information as Ganglia through a plug-in called “InfiniBand Performance
Counters Check”. But, like Ganglia, Nagios cannot provide any information related to the
topology.
FabricIT is a proprietary network monitoring tool developed by Mellanox. Like
INAM, FabricIT is able to provide more information than Ganglia or Nagios, but the
free version of the tool does not give a graphical representation of the link usage or the
congestion.
INAM differs from the other existing tools in the richness of the provided informa-
tion and in its unique link usage information, giving users all the elements required
to understand the performance of applications at the IB level.

6 Conclusions and Future Work


In this paper, we have presented INAM - a scalable network monitoring and visualiza-
tion tool for InfiniBand networks which renders a global view of the subnet through a
web-based interface (WVI) to the user. INAM depends on many services provided by
the open-source OFED stack to retrieve necessary information from the IB network.
INAM also has an online data collection module (INQS) which runs in the background
while a job is in progress. After the completion of the job, INAM presents the commu-
nication pattern of the job in a graphical format. The overhead caused by this tool is
very minimal and it does not require the user to launch any special processes on the tar-
get nodes. Instead, it queries on the IB devices directly through the network and gathers
data.
In the future, we would like to extend this work to do an online analysis of the traffic pat-
terns on a cluster. If next generation InfiniBand devices offer performance counters for
each virtual lane, we could leverage it to study link utilization and network contention
patterns in a more scalable fashion. Another dimension would be to create a time line
graphical pattern to depict the exact amount of data being communicated in the subnet
during a particular interval. We would also like to extend the functionality of INAM
such that the user can monitor and compare various counters from different ports. We
would also like to show if the links are used multiple times simultaneously when the
communication matrix is generated.

References
1. Barth, W.: Nagios. System and Network Monitoring. No Starch Press, U.S. Ed edn. (2006)
2. Charts, H.: HighCharts JS - Interactive JavaScript Charting,
http://www.highcharts.com/
3. DWR: DWR - Direct Web Remoting, http://directwebremoting.org/dwr/
4. Hoefler, T., Schneider, T., Lumsdaine, A.: Multistage Switches are not Crossbars: Effects of
Static Routing in High-Performance Networks. In: Proceedings of the 2008 IEEE Cluster
Conference (September 2008)
5. InfiniBand Trade Association, http://www.infinibandta.org/
6. Massie, M.L., Chun, B.N., Culler, D.E.: The Ganglia Distributed Monitoring System: De-
sign, Implementation, and Experience. Parallel Computing 30(7) (July 2004)
7. Mellanox: FabricIT,
http://www.mellanox.com/pdf/prod_ib_switch_systems/pb_FabricIT_EFM.pdf
8. MVAPICH2, http://mvapich.cse.ohio-state.edu/
9. MySQL: MySQL, http://www.mysql.com/
10. Müller, M.S., van Waveren, G.M., Lieberman, R., Whitney, B., Saito, H., Kumaran, K.,
Baron, J., Brantley, W.C., Parrott, C., Elken, T., Feng, H., Ponder, C.: SPEC MPI2007 - an
application benchmark suite for parallel systems using MPI. Concurrency and Computation:
Practice and Experience, 191–205 (2010)
11. Open Fabrics Alliance,
http://www.openfabrics.org/
12. SUN: Java 2 platform, enterprise edition (j2ee) overview,
http://java.sun.com/j2ee
13. Top500: Top500 Supercomputing systems (November 2010),
http://www.top500.org
14. Vienne, J., Martinasso, M., Vincent, J.M., Méhaut, J.F.: Predictive models for bandwidth
sharing in high performance clusters. In: Proceedings of the 2008 IEEE Cluster Conference
(September 2008)
15. W3C: HTML5 - Canvas Element,
https://developer.mozilla.org/en/HTML/Canvas
Auto-tuning for Energy Usage in Scientific
Applications

Ananta Tiwari, Michael A. Laurenzano, Laura Carrington, and Allan Snavely

Performance Modeling and Characterization Laboratory,
San Diego Supercomputer Center
{tiwari,michaell,lcarring,allans}@sdsc.edu

Abstract. The power wall has become a dominant impeding factor in
the realm of exascale system design. It is therefore important to under-
stand how to most effectively create software to minimize its power us-
age while maintaining satisfactory levels of performance. This work uses
existing software and hardware facilities to tune applications to mini-
mize for several combinations of power and performance. The tuning is
done with respect to software level performance-related tunables and for
processor clock frequency. These tunable parameters are explored via
an offline search to find the parameter combinations that are optimal
with respect to performance (or delay, D), energy (E), energy×delay
(E×D) and energy×delay×delay (E×D²). These searches are employed
on a parallel application that solves Poisson’s equation using stencils.
We show that the parameter configuration that minimizes energy con-
sumption can save, on average, 5.4% energy with a performance loss of
4% when compared to the configuration that minimizes runtime.

1 Introduction
As the HPC community prepares to enter the era of exascale systems, a key
problem that the community is trying to address is the power wall problem.
The power wall arises because as compute nodes (consisting of multi/many-
cores) become increasingly powerful and dense, they also become increasingly
power hungry. The problems this creates are two-fold; it is more expensive to
run compute nodes due to the energy they require and it is difficult/expensive
to cool them.
Going forward, power-aware computing research in the HPC community will
focus in at least two main areas. The first is to develop descriptive and universal
ways of describing power usage, either by direct measurement or through ex-
planatory models. Inexpensive, commercially-produced devices such as WattsUp?
Pro [3] or more customized frameworks such as PowerMon2 [4] or PowerPack [13]
can help measure power and energy consumption. Modeling energy usage through
combinations of architectural parameters with performance counters [26] or other
resource usage information [21] also fall in this category. The second thrust,
which invariably depends on the first, is to attempt to minimize the amount
of energy required to solve various scientific problems. This includes the use of

Dynamic Voltage Frequency Scaling (DVFS) technique to exploit processor un-
derutilization due to memory stalls [12, 19] or MPI inter-task load imbalances in
large scale applications [12], improvements in the process and design of hardware,
or software-based techniques that change some feature of application behavior
in order to lower some energy-related metric [20].
In this work we explore the optimization space of energy usage and perfor-
mance on a stencil computation, which is an important kernel in many HPC
applications1 . We use a compiler-based methodology to generate and select
among alternative mappings of computational kernels that reduce energy con-
sumption. A unique feature of this work is the combined exploration of the
effect of code-transformation level tunables and CPU clock frequency scaling2
on the overall energy consumption of the system during an application run. In
other words, clock frequency is considered to be one of the tunables along with
code-transformation parameters such as tiling factors and unrolling factors. This
approach can be used to help answer the following questions:

1. Can simple loop transformation strategies such as loop blocking, unrolling,
data copy etc. have an impact on energy consumption?
2. Can we use search heuristics to find a code-variant that performs better in
terms of energy consumption?
3. Is there a trade-off between energy consumption and execution time of an
application? In other words, does the execution time of an application suffer
when our primary goal is to optimize for energy consumption?

In this paper we take a first concrete step towards answering these questions.
The study presented here takes a search-based offline auto-tuning approach. We
start by identifying a set of tunable parameters for different potential perfor-
mance bottlenecks in an application. The feedback driven empirical auto-tuner
monitors the application’s performance and power consumption and adjusts the
values of the tunable parameters in response to them. When the auto-tuner re-
quires a new code variant in order to move from one set of parameter values
to another, it invokes a code generation framework [8] to generate that code
variant. The feedback metric values associated with different parameter config-
urations are measured by running the target application on the target platform.
The methodology is thus offline because the tuning adjustments are made be-
tween successive full application runs based on the observed power consumption
for code-variants.

2 Motivation

In this section, we demonstrate that there are opportunities for power and energy
consumption auto-tuning. We use an implementation of the Poisson’s equation
1. The prevalence of stencil computations in DARPA Ubiquitous High Performance
Computing (UHPC) challenge applications is documented in [10].
2. Frequency scaling can be used to reduce energy consumption.
[3D surface plot: normalized energy consumption of the entire system as a function of the tiling factors TI and TJ; the Z-axis (Normalized Energy) spans roughly 0.7 to 1.05.]
Fig. 1. Normalized energy consumption of the entire system for the 8-core experiment.
The figure is best viewed in color.

solver (described in more detail in Section 3.4) as a test application. One of
the computational hotspots on this application is the relaxation function, which
uses a 7-point stencil operation within a triply nested loop. The two outer-most
(i and j) loops of this function are blocked (using blocking factors TI and TJ
respectively) for better cache usage following Rivera et al.’s tiling scheme [24].
Better cache utilization can reduce overall power usage because it reduces data
movement costs.
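For concreteness, the blocked relaxation loop looks roughly like the Python/NumPy sketch below; the solver itself is not written in Python, so this is only a simplified stand-in for the 7-point update with the two outer loops tiled by TI and TJ.

import numpy as np

def relax_blocked(u, f, h2, TI, TJ):
    # Jacobi-style 7-point update for Poisson's equation with Rivera-style
    # tiling of the two outermost (i, j) loops.
    n = u.shape[0]
    new = u.copy()
    for ib in range(1, n - 1, TI):
        for jb in range(1, n - 1, TJ):
            for i in range(ib, min(ib + TI, n - 1)):
                for j in range(jb, min(jb + TJ, n - 1)):
                    for k in range(1, n - 1):
                        new[i, j, k] = (u[i-1, j, k] + u[i+1, j, k] +
                                        u[i, j-1, k] + u[i, j+1, k] +
                                        u[i, j, k-1] + u[i, j, k+1] -
                                        h2 * f[i, j, k]) / 6.0
    return new

u = np.zeros((34, 34, 34)); f = np.ones_like(u)
u = relax_blocked(u, f, h2=1.0 / 33**2, TI=8, TJ=16)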
We run the solver with different combinations of TI and TJ and measure the
overall power usage for each combination. The experiment was conducted on
an Intel Xeon E5530 workstation (more description in Section 3). We ran the
application on 8 cores for different combinations of the blocking factors. Clock
frequencies for each of the cores were kept fixed at the highest available level,
2.4 GHz. We normalized the measured energy consumption for each combination of TI and TJ with respect to the energy consumed by the compiler-optimized original implementation (the non-blocked version compiled with -O3). Figure 1 shows these results. The interaction between energy consumption and tiling
optimization is interesting and complex. From the energy consumption point of
view, long slender tiles (with TI>TJ) are preferable. The best tiling configura-
tion uses 30% less energy than the compiler optimized original implementation.
Moreover, there exists a fairly large “good” energy consumption area in the fig-
ure. These results imply that relatively naive tuning does not result in anything
near optimal energy usage and that something is needed to guide application
execution toward the optimal solution. We show in the remainder of this work
that auto-tuning is a practical and useful methodology for doing this.

3 Experiments

To drive the tuning process we use Active Harmony [9, 27], which is a search-
based auto-tuner. Active Harmony treats each tunable parameter as a variable
in an independent dimension in the search (or tuning) space. Parameter config-
urations (admissible values for tunable parameters) serve as points in the search
space. The objective function values (feedback metrics) associated with points
in the search space are gathered by running and measuring the application on
the target platform. The objective function values are consumed by the Active
Harmony server to make tuning decisions.
For tunable parameters that require new code (e.g. unroll factors), Active
Harmony utilizes code-transformation frameworks to generate code. The exper-
iments reported in this paper use CHiLL [8], a polyhedral loop transformation
and code generation framework. CHiLL provides a high-level script interface
that auto-tuners can leverage to describe a set of loop transformation strategies
for a given piece of code. More details on offline auto-tuning using Active Har-
mony and CHiLL are described in [27]. Both Active Harmony and CHiLL are
open-source projects.
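To illustrate the offline methodology, the following sketch shows only the control flow of such a search-based tuner; the helper functions (suggest_next_point, generate_variant, run_and_measure) are hypothetical placeholders, not the actual Active Harmony or CHiLL interfaces.

#include <limits>

// A point in the search space: tile sizes, unroll factor and clock level.
struct Config { int TI, TJ, unroll, freq_level; };

// Hypothetical placeholders (NOT the Active Harmony or CHiLL APIs): one step
// of the search, generation of a code variant, and one measured application run.
static Config suggest_next_point(double /*last_feedback*/) { return Config{}; }
static void   generate_variant(const Config& /*c*/)        {}
static double run_and_measure(const Config& /*c*/)         { return 0.0; }

// Offline tuning: every step is a full application run whose feedback metric
// (e.g. E, ED, ED2 or T) steers the choice of the next parameter configuration.
Config offline_tune(int max_runs) {
  Config best_cfg{};
  double best = std::numeric_limits<double>::max();
  double feedback = best;
  for (int run = 0; run < max_runs; ++run) {
    Config c = suggest_next_point(feedback);
    generate_variant(c);            // new code variant only if parameters require one
    feedback = run_and_measure(c);
    if (feedback < best) { best = feedback; best_cfg = c; }
  }
  return best_cfg;
}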

3.1 Power/Energy Measurement

We measure the energy consumption of a system using the WattsUp? Pro power
meter [3]. The power meter is a fairly inexpensive device and, at the time of
this writing, costs less than $150. This device measures the AC power being
consumed by the entire system. We have implemented a command line interface
on top of the wattsup driver to monitor and calculate the overall energy usage
of an application.
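The paper does not show this command line interface itself; as a hedged sketch, assuming a fixed sampling interval, the overall energy can be accumulated from the meter's power readings as follows.

#include <vector>

// Whole-system energy accumulated from power samples taken at a fixed
// interval (the meter is read roughly once per second in this setup).
double energy_joules(const std::vector<double>& power_watts,
                     double sample_interval_s = 1.0) {
  double energy = 0.0;
  for (double p : power_watts)
    energy += p * sample_interval_s;  // E = sum of P_i * dt over the run
  return energy;                      // joules (watt-seconds)
}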

3.2 Auto-tuning Feedback Metric

The most common feedback metric used by auto-tuners is application execution time, which can also be expressed as runtime delay with respect to some baseline.
For energy auto-tuning, however, we need a feedback metric that combines power
usage with the execution time of a given program. There has been a lot of debate
about the appropriateness of different combinations of power and performance
[5, 7, 14] in investigating energy consumption reducing techniques in today’s
architectures. All of them hinge on how much the delay in execution time should
be penalized in return for lower energy.
In this work, we use four different feedback metrics: E (total energy), ED
(energy×delay), ED2 (energy×delay×delay) and T (execution time). Total en-
ergy (E) is derived by multiplying the average power usage by the application
execution time. E does not penalize execution time delay at all. T penalizes only
execution time delay with no credit for saving energy. Between these extremes,
ED and ED2 metrics put more emphasis on the total application execution time
than the total energy metric. The appropriateness of which metric to use depends

on the overall goal of the tuning exercise, and one could certainly consider a user-set delay penalty per job. We think these four metrics are enough to characterize our methods and the optimization space.
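As a small illustration, and assuming the common formulation in which the delay is simply the execution time (the paper does not spell the formula out), the four metrics can be computed from a run's measured energy and runtime as follows.

// The four feedback metrics derived from one measured application run.
struct Feedback { double E, ED, ED2, T; };

Feedback feedback_metrics(double energy_joules, double runtime_s) {
  return { energy_joules,                          // E  : total energy, no delay penalty
           energy_joules * runtime_s,              // ED : energy x delay
           energy_joules * runtime_s * runtime_s,  // ED2: energy x delay^2
           runtime_s };                            // T  : execution time only
}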

3.3 Experimental Platform


The experiments were conducted on an Intel Xeon E5530 workstation. The E5530
has 2 quad-core processors. Each core has its own 32KB L1 cache and 256KB L2
cache. Each of the quad-core processors has a shared 8MB L3 cache (for a total of
16MB of L3 for the 8 cores). Each of the 8 cores can be independently clocked at
1.60GHz, 1.73GHz, 1.86GHz, 2.00GHz, 2.13GHz, 2.26GHz, 2.39GHz or 2.40GHz.
Processor clock frequency is changed using the cpufreq-utils package [1] that
is available with many popular Linux distributions.
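As a hedged illustration of what such a frequency change boils down to on Linux, the sketch below writes a target frequency (in kHz) to the cpufreq sysfs files; it assumes the userspace governor is available and sufficient permissions, and it is not the tooling used in the paper.

#include <fstream>
#include <string>

// Requests a fixed clock frequency (in kHz) for one core via the cpufreq
// sysfs interface that the cpufreq-utils tools also manipulate.
bool set_cpu_khz(int cpu, long khz) {
  const std::string base =
      "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/cpufreq/";
  {
    std::ofstream gov(base + "scaling_governor");
    gov << "userspace";              // allow explicit frequency requests
    if (!gov) return false;
  }
  std::ofstream speed(base + "scaling_setspeed");
  speed << khz;                      // e.g. 2400000 for 2.40 GHz
  return static_cast<bool>(speed);
}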

3.4 Results for Poisson’s Equation Solver (PES)


Poisson’s equation is a partial differential equation that is used to character-
ize many processes in electrostatics, engineering, fluid dynamics, and theoretical
physics. To solve Poisson's equation on a three-dimensional grid, we use a modified version of the parallel implementation provided in the KeLP-1.4 [2] distribution. The application is written in C++ and Fortran. The implementation uses the red-black successive over-relaxation method to solve the equation. The
core of the computational time is spent on two kernels: the relaxation function,
which uses the stencil computation (described in Section 2), and the error cal-
culation function, which calculates the sum of squares of the residual over the
3D grid. The error calculation portion of the code is optional and can be turned
on or off using a command line parameter. These functions are tuned separately
in offline modes.
For the relaxation function, we use the tiling optimization scheme that we
described in Section 2. Active Harmony determines the dimension of the tiles
and the appropriate CPU frequency — a three dimensional search space. The
error function optimization requires new code generation. For this function, we
tile all three loops and the innermost loop is unrolled. Thus, the search space
for error function optimization is five dimensional — four code transformation
parameters and the CPU frequency.
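As a rough sketch of this second kernel, and under the assumption of a simple contiguous grid layout, the sum of squares of the residual with a fixed innermost unroll factor could look as follows; in the tuned version the three loops are additionally tiled and the unroll factor is itself a search parameter.

// Sum of squared residuals over an n x n x n grid, innermost loop unrolled by 4.
double l2_error_unroll4(const double* res, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      const double* row = res + ((long)i * n + j) * (long)n;
      int k = 0;
      for (; k + 4 <= n; k += 4)    // unrolled body
        sum += row[k] * row[k] + row[k + 1] * row[k + 1]
             + row[k + 2] * row[k + 2] + row[k + 3] * row[k + 3];
      for (; k < n; ++k)            // remainder iterations
        sum += row[k] * row[k];
    }
  return sum;
}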
Table 1 shows the results for the relaxation function tuning. For each of the feedback metrics, we conducted three auto-tuning runs. The table shows the results for the best parameter configurations and also the averages across the three runs. The data provided in the table are normalized with respect to the timing and energy usage of the original program (compiled with -O3) when run at the highest available frequency.
Empirical tuning via automatic generation of code alternatives and/or careful selection of parameters that govern the application of different optimization strategies has proven its merit in the last few decades [24, 27, 28]. Needless to say, the technique delivers on its promise in our experiments as well. However, we are more interested in the energy side of the application execution behavior. Active Harmony's

Table 1. PES-Relaxation kernel results (400³ grid, 8 cores)

E(μ) ED(μ) ED2 (μ) T (μ)


Best Configurations
speedup (time) 2.18 (2.13) 2.25 (2.39) 2.24 (2.39) 2.27 (2.40)
norm. ener. usage 0.42 0.43 0.43 0.44
Averages
norm. speedup (time) 2.16 2.22 2.22 2.24
norm. ener. usage 0.42 0.44 0.44 0.45
μ (clock frequency in GHz )

search favors a lower clock frequency for better energy usage. On average, energy
conscious parameter configurations save 58% energy and run 2.16× faster com-
pared to the baseline compiler optimized code. The auto-tuning runs that use ED,
ED2 and T metrics all favor high clock frequency and show similar performance and
energy characteristics. In terms of the best runtime improvement, auto-tuning runs
done with delay as the feedback metric achieves 2.24× improvement along with en-
ergy saving of 55%. This confirms the popular belief that auto-tuning for runtime
in scientific applications leads to better system-wide energy usage.
Often the best performing code is nearly the most energy efficient, as short runtimes shorten one component of the Power × Time product (energy). So finally,
we compared the performance and energy consumption measurements between
configurations that give best timing and best energy usage respectively. The
configuration that provides best energy usage suffers a delay of 4.1%; however,
the energy usage saving is 5.8%. The search heuristic used in these experiments
does not guarantee a globally best configuration with respect to timing or en-
ergy consumption, which means there can be other configurations in the search
space that can possibly demonstrate different behavior. However, the result indi-
cates that there are some non-trivial interactions between compiler performance
optimization strategies and energy usage.
Table 2 shows the results for the L2-Error function. The results for this function follow a similar pattern to that of the relaxation function. We then compared
the performance and energy consumption measurements between configurations

Table 2. PES-L2Error kernel results (400³ grid, 8 cores)

E(μ) ED(μ) ED2 (μ) T (μ)


Best Configurations
norm. speedup (time) 1.15 (2.13) 1.17 (2.39) 1.20 (2.40) 1.18 (2.40)
norm. ener. usage 0.80 0.82 0.85 0.86
Averages
norm. speedup (time) 1.15 1.15 1.18 1.17
norm. ener. usage 0.82 0.83 0.86 0.87
μ (clock frequency in GHz )

that give the best timing3 and the best energy usage, respectively. The configuration
that provides best energy usage suffers a performance loss of 3.9% and the energy
usage savings is 5%. This result further strengthens our earlier argument about
the need to investigate the interactions between compiler optimization strategies
and energy consumption.

4 Future Work

In this paper, we have relied exclusively on the WattsUp? Pro to gather AC power measurements at a 1-second granularity. Going forward, we would like to utilize more fine-grained DC power measurement tools such as PowerMon2 [4] that allow for measurements of individual system components at much higher sampling rates. Doing so would allow us to attribute fractions of the
total energy consumed to various components of the compute systems, such as
memory subsystem energy usage and CPU energy usage. This association will
allow us to target specific optimization techniques to reduce energy usage of
various components. Finally, we would like to extend this work to use online
tuning. That is, the application will be energy-tuned while it runs.

5 Related Work

Reducing power consumption has long been of great interest to embedded and
mobile systems architects [11, 18, 22]. The architectural properties of these sys-
tems are fundamentally different from those of the HPC systems, so the strategies
proposed for for them generally fail to translate into reducing energy consump-
tion for HPC systems and applications [16]. As such, power optimization has
received a fair amount of attention from the HPC community. Most previous re-
search on power optimization uses architectural simulation to estimate power or
energy usage by different components of the compute system [6]. More recently,
direct power and energy measurement hardware and software have been devel-
oped [4, 13, 15]. Bedard et al. [4] developed PowerMon2, a framework designed to
obtain fine-grained current and voltage measurements for different components
of a target platform such as CPU, memory subsystem, disk I/O etc. Power pro-
filing frameworks can be integrated within our power auto-tuning framework to
obtain greater understanding of the impact of different optimization techniques
on individual components of the target architecture.
Power or energy usage modeling and benchmarking is another relevant area.
The Energy-Aware Compilation (EAC) [17] framework uses a high-level en-
ergy estimation model to predict the energy consumption of a given piece of
code. The model utilizes architectural parameters and energy/performance con-
straints. The overall idea is to use the model to decide the profitability of different
3 Note that for this kernel, the ED2 tuning runs give the best timing, which is what we use for the comparison with the best energy usage configuration.

compiler optimization techniques. Singh et al. [26] derive an analytic, workload-independent piece-wise linear power model that maps performance counters and temperature to power consumption. Laurenzano et al. [19] use a benchmark-based approach to determine how system power consumption and performance are affected by various demand regimens on the system, and then use this to select the processor clock frequency.
Seng et al. [25] examine the effect of compiler optimization levels and a few specific compiler optimization flags on the energy usage and power consumption of the Intel Pentium 4 processor. Rather than relying on compiler optimization levels, we exercise greater control over how different code transformation strategies are applied. Moreover, our technique is general purpose and uses fairly inexpensive power measurement hardware to guide the exploration of the parameter search space.
Rahman et al. [23] use a model-based approach to estimate the power consumption of chip multiprocessors and use that information to guide the application of different compiler optimization techniques. This work is most closely related to the work that we have presented here. Power estimations for different code variants are obtained using the model described by Singh et al. [26]. Our work
uses power measurements rather than models and we simultaneously treat clock
frequency as a tunable parameter alongside the generation and evaluation of
different code variants.

6 Conclusion
In this paper, we showed that there are non-trivial interactions between com-
piler performance optimization strategies and energy usage. We used a fairly
inexpensive power meter and leveraged open source projects to explore energy
and performance optimization space for computation intensive kernels.

Acknowledgements. This work was supported in part by the DOE Office of


Science through the SciDAC2 award entitled Performance Engineering Research
Institute. This work was also funded in part by the Department of Defense and
received support from the Extreme Scale Systems Center, located at Oak Ridge
National Laboratory.

References
1. CPU Frequency Scaling, https://wiki.archlinux.org/index.php/Cpufrequtils
2. KeLP, http://cseweb.ucsd.edu/groups/hpcl/scg/KeLP1.4/
3. WattsUp? Meters, https://www.wattsupmeters.com/secure/products.php?pn=0
4. Bedard, D., Lim, M.Y., Fowler, R., Porterfield, A.: PowerMon: Fine-grained and
integrated power monitoring for commodity computer systems. In: Proceedings of
the IEEE SoutheastCon 2010 (SoutheastCon), pp. 479–484 (2010)
5. Bekas, C., Curioni, A.: A new energy aware performance metric. Computer Science
- Research and Development 25, 187–195 (2010)

6. Brooks, D., Tiwari, V., Martonosi, M.: Wattch: a framework for architectural-level
power analysis and optimizations. In: Proceedings of the 27th Annual International
Symposium on Computer Architecture, ISCA 2000, pp. 83–94. ACM, New York
(2000)
7. Brooks, D.M., Bose, P., Schuster, S.E., Jacobson, H., Kudva, P.N., Buyukto-
sunoglu, A., Wellman, J.-D., Zyuban, V., Gupta, M., Cook, P.W.: Power-aware
microarchitecture: Design and modeling challenges for next-generation micropro-
cessors. IEEE Micro 20, 26–44 (2000)
8. Chen, C.: Model-Guided Empirical Optimization for Memory Hierarchy. PhD the-
sis, University of Southern California (2007)
9. Chung, I.-H., Hollingsworth, J.: A case study using automatic performance tuning
for large-scale scientific programs. In: 2006 15th IEEE International Symposium
on High Performance Distributed Computing, pp. 45–56 (2006)
10. Ciccotti, P., et al.: Characterization of the DARPA Ubiquitous High Performance
Computing (UHPC) Challenge Applications. Submission to International Sympo-
sium on Workload Characterization, IIWSC (2011)
11. Flinn, J., Satyanarayanan, M.: Energy-aware adaptation for mobile applications.
In: Proceedings of the Seventeenth ACM Symposium on Operating Systems Prin-
ciples, SOSP 1999, pp. 48–63. ACM, New York (1999)
12. Freeh, V.W., Kappiah, N., Lowenthal, D.K., Bletsch, T.K.: Just-in-time dynamic
voltage scaling: Exploiting inter-node slack to save energy in mpi programs. J.
Parallel Distrib. Comput. 68, 1175–1185 (2008)
13. Ge, R., Feng, X., Song, S., Chang, H.-C., Li, D., Cameron, K.: PowerPack: En-
ergy Profiling and Analysis of High-Performance Systems and Applications. IEEE
Transactions on Parallel and Distributed Systems 21(5), 658–671 (2010)
14. Horowitz, M., Indermaur, T., Gonzalez, R.: Low-power digital design. In: IEEE
Symposium on Low Power Electronics, Digest of Technical Papers 1994, pp. 8–11
(October 1994)
15. Hotta, Y., Sato, M., Kimura, H., Matsuoka, S., Boku, T., Takahashi, D.: Profile-
based optimization of power performance by using dynamic voltage scaling on a
pc cluster. In: Proceedings of the 20th International Conference on Parallel and
Distributed Processing, IPDPS 2006, p. 298. IEEE Computer Society, Washington,
DC (2006)
16. Hsu, C.-H., Feng, W.-C.: A Power-Aware Run-Time System for High-Performance
Computing. In: Proceedings of the 2005 ACM/IEEE Conference on Supercomput-
ing, SC 2005, p. 1. IEEE Computer Society, Washington, DC (2005)
17. Kadayif, I., Kandemir, M., Vijaykrishnan, N., Irwin, M., Sivasubramaniam, A.:
Eac: a compiler framework for high-level energy estimation and optimization. In:
Proceedings of Design, Automation and Test in Europe Conference and Exhibition,
2002, pp. 436–442 (2002)
18. Kandemir, M., Vijaykrishnan, N., Irwin, M.J., Ye, W.: Influence of compiler opti-
mizations on system power. IEEE Trans. Very Large Scale Integr. Syst. 9, 801–804
(2001)
19. Laurenzano, M.A., Meswani, M., Carrington, L., Snavely, A., Tikir, M.M., Poole,
S.: Reducing Energy Usage with Memory and Computation-Aware Dynamic Fre-
quency Scaling. In: Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011, Part
I. LNCS, vol. 6852, pp. 79–90. Springer, Heidelberg (2011)
20. Li, D., de Supinski, B., Schulz, M., Cameron, K., Nikolopoulos, D.: Hybrid
MPI/OpenMP power-aware computing. In: 2010 IEEE International Symposium
on Parallel Distributed Processing (IPDPS), pp. 1–12 (April 2010)

21. Olschanowsky, C., Carrington, L., Tikir, M., Laurenzano, M., Rosing, T.S.,
Snavely, A.: Fine-grained energy consumption characterization and modeling. In:
DOD High Performance Computing Modernization Program User Group Confer-
ence (2010)
22. Pillai, P., Shin, K.G.: Real-time dynamic voltage scaling for low-power embedded
operating systems. SIGOPS Oper. Syst. Rev. 35, 89–102 (2001)
23. Rahman, S.F., Guo, J., Yi, Q.: Automated empirical tuning of scientific codes
for performance and power consumption. In: Proceedings of the 6th International
Conference on High Performance and Embedded Architectures and Compilers,
HiPEAC 2011, pp. 107–116. ACM, New York (2011)
24. Rivera, G., Tseng, C.-W.: Tiling optimizations for 3D scientific computations. In:
Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CDROM),
Supercomputing 2000. IEEE Computer Society, Washington, DC (2000)
25. Seng, J.S., Tullsen, D.M.: The Effect of Compiler Optimizations on Pentium 4
Power Consumption. In: Proceedings of the Seventh Workshop on Interaction be-
tween Compilers and Computer Architectures, INTERACT 2003, p. 51. IEEE
Computer Society, Washington, DC (2003)
26. Singh, K., Bhadauria, M., McKee, S.A.: Prediction-based power estimation and
scheduling for cmps. In: Proceedings of the 23rd International Conference on Su-
percomputing, ICS 2009, pp. 501–502. ACM, New York (2009)
27. Tiwari, A., Chen, C., Chame, J., Hall, M., Hollingsworth, J.: A Scalable Auto-
Tuning Framework for Compiler Optimization. In: 23rd IEEE International Par-
allel & Distributed Processing Symposium, Rome, Italy (May 2009)
28. Vuduc, R., Demmel, J.W., Yelick, K.A.: Oski: A library of automatically tuned
sparse matrix kernels. Journal of Physics: Conference Series 16, 521–530 (2005)
Automatic Source Code Transformation
for GPUs Based on Program Comprehension

Pasquale Cantiello and Beniamino Di Martino

Second University of Naples, Italy


{pasquale.cantiello,beniamino.dimartino}@unina.it

Abstract. In this work a technique is presented to transform sequential source code so that it can execute on parallel architectures such as heterogeneous many-core systems or GPUs. Source code is parsed and basic algorithmic concepts are discovered from it in order to feed a knowledge base. A reasoner, by consulting algorithmic rules, can compose these basic concepts to pinpoint code regions representing a known algorithm. This code can be annotated and/or transformed with a source-to-source process. A prototype tool has been built and tested on a case study to analyse the source code of a matrix multiplication. After recognition of the algorithm, the code is modified with calls to the NVIDIA cuBLAS GPU library.

Keywords: comprehension, GPU, manycore, source-to-source,


reengineering.

1 Introduction
The development of software for scientific applications has gone through different seasons over the years. Continuously growing performance demands to satisfy specific computational needs drove the birth of parallel machines and the related concurrent programming models. In the nineties, a lot of effort was spent on developing parallelization techniques to port applications to parallel, vector or super-scalar architectures.
Subsequently, continuous improvements of the hardware systems, and mainly of processor clock speeds, caused interest in research on parallelization to fade, since application performance grew naturally with each processor generation. In the last few years, however, the growth of processor clock speeds has stopped due to physical limits on junction dimensions and on the dissipated power. Processor improvements now follow a different path, multiplying the number of processing units on a chip (multi-core systems). Chip producers nowadays announce systems not with higher frequencies, but with an increased number of cores. Special-purpose devices such as GPUs, designed for graphics applications, can be used for parallel computations. Multi-core CPUs and GPUs are now found not only in systems for scientific applications, but also in common personal computers.
Writing parallel code is hard and requires skilled developers. Great effort is needed to port existing software, and manual parallelization of applications with very large numbers of lines of code is a critical and error-prone process.


Nowadays, research on automatic parallelization systems that migrate existing code, in a more or less assisted way, to heterogeneous architectures is again a hot topic. The next years of compiler research will be largely devoted to the generation and verification of parallel programs [12].
In this paper we show how to pinpoint potentially parallelizable code regions in source code, starting from the extraction of basic algorithmic concepts, composing them and reasoning upon them to find implementations of common algorithms. The identified regions of code are processed with a source-to-source transformation technique in order to exploit the target architecture. If an optimized library exists for the specific algorithm, the code is replaced with a function call.
The paper continues, after this introduction, with section 2, which analyses related work on code comprehension and translation. Section 3 introduces the technique to analyse code to find concepts, to reason on them and to translate the related code for heterogeneous architectures. The first implementation of a tool that drives the process and translates code to CUDA [17] code or cuBLAS calls is presented in section 4. A case study to validate the technique is shown in section 5, and the paper ends with section 6 with conclusions and future work directions.

2 Related Works

A description of the algorithmic recognition used here can be found in [7] and in [5], where the definition of algorithmic concept is introduced together with the technique to describe algorithms by means of an attributed grammar. Two tools for program comprehension were presented in [6]. That work was tailored to the Vienna Fortran Compiler [3]. The present work, even though it still uses the same recognition engine, differs in that it adds source-to-source capabilities for GPUs, support for object-oriented languages such as C++, and a new source code parsing technique.
Several papers have addressed clone detection, i.e. searching for similar code fragments in programs. Such clones are potential sources of faults due to the necessity of maintaining multiple copies of the same code. The work in this field follows several approaches: text-based [8], syntax-based [2], or graph-based [14]. All of these techniques can only detect duplicates that are nearly identical to each other and thus cannot identify implementation variants or code perturbations. They focus mainly on code that originates from copy actions (e.g. cut and paste) instead of investigating the functionality of the code. A semantic approach can be found in [10], where the authors extract sub-graphs from the Program Dependence Graph (PDG) [9], convert them to a simpler tree form and perform tree-similarity studies with a scalable method [13].
Source-to-source transformation for parallelization can be found in several works. In [16] the semantics of standard abstractions, such as the C++ Standard Template Library or array-based computation loops, drives the process of source code transformation with the support of OpenMP directives and clauses. In [15]

the transformation is applied to OpenMP code by analyzing parallel constructs and work-sharing constructs to extract candidate kernels and transform them to CUDA code. A framework for the optimization of affine loop nests, based on the polyhedral compiler model, is described in [1]. All these works start from code that is already parallel, or from user-annotated code that drives the transformation, and not from sequential code as ours does.

3 Source Code Transformation


The proposed technique, shown in Figure 1, begins with a static analysis of the code. The source file is processed by a front-end that translates it into an intermediate representation (IR), which is an enriched Abstract Syntax Tree (AST). The Extractor traverses this structure searching for patterns that can be recognized as basic concepts and emits Prolog facts corresponding to them. The Transformer can then submit queries to the reasoner to search for known algorithmic concepts. The reasoner, by using these facts and by consulting a set of rules, replies with any algorithm it has found. Answers include references to the code region related to the algorithm and to the data involved. The transformer can now pick from the repository an alternative implementation of the algorithm that is suitable for the target architecture, and modifies the AST accordingly. The user can interact in this phase by setting preferences on the selection of the alternatives. A final unparsing of the IR generates new source code for the target architecture.

3.1 Algorithm Recognition


The recognition strategy is based on a hierarchical parsing. Starting from the
intermediate representation of code, basic concepts are recognized. They can be
seen as building blocks of composed concepts in a recursive way as described in
[7] and in [5].

[Fig. 1 shows the process model: a front-end parses the source into the IR; the Extractor emits basic-concept facts; the Reasoner consults the algorithmic rules; the Transformer draws on the algorithm repository; a back-end unparses the transformed source.]
Fig. 1. The model of the process



For example the analysis of the statement: int i = 0; produces the facts in
listing 1.1.
scalar_var_def(i, def_list_1, elem_update_r, main).
scalar_var_inst(stp_1, i, elem_update_r, main).
val_inst(stp_2, 0, elem_update_r, main).
assign_r((def_list_1, stp_1), stp_1, stp_2, elem_update_r, main).

Listing 1.1. Facts produced by variable declaration

We can see that four facts are generated: a) the definition of a scalar variable; b) the usage of the scalar variable; c) the usage of a constant; d) the assignment statement. In detail, the second fact above indicates a basic concept named scalar_var_inst. Its instance number is 1 (stp_1), its parameter is i (the variable name), the rule it is recognized by is elem_update_r, and the function in which it is present is named main.
Similarly, in the last fact, we see the composition of the previous concepts in a tree.
Another example is the loop statement for (i = 0; i < 10; i++), which produces the facts in listing 1.2.
for_r(15, for(15, exit_115), init_6, exit_115, incr_7, elem_update_r, main).
scalar_var_inst(stp_11, i, elem_update_r, main).
val_inst(stp_12, 0, elem_update_r, main).
assign_r(init_6, assign(init_6, stp_11), stp_11, stp_12, elem_update_r, main).
scalar_var_inst(stp_13, i, elem_update_r, main).
val_inst(stp_14, 10, elem_update_r, main).
less(exit_115, stp_13, stp_14, elem_update_r, main).
scalar_var_inst(stp_15, i, elem_update_r, main).
post_incr(incr_7, stp_15, elem_update_r, main).

Listing 1.2. Facts produced by for loop

In this case the numbers are the pointers to the nodes of the AST.
Control dependence facts generated have a syntax like:
control_dep(dependant_id, depend_from_id, type, class, method).
Data dependence facts have a syntax like:
data_dep(type,dependant_id,depend_from_id,variable,class,method).
The concept recognition rules are the production rules of the parsing process; they describe the feature set that permits the identification of an instance of an algorithmic concept in the code. This feature set can be named an algorithmic pattern. The rules can be defined as the way in which abstract concepts, as groups of statements in the code, are organized under an abstract control structure. With this definition we include structural relationships such as Control and Data Flow, Control and Data Dependence, and function calling.

Each recognition rule identifies the concept by using a composition hierarchy,


specified with the set of composing sub-concepts, and a set of conditions and
constraints that sub-concepts must satisfy.
The main aspects of the method are:

– Basic concepts, the starting points of the hierarchical abstraction process, are chosen among the elements of the intermediate representation of the code. The properties and relationships that characterize them are still part of the representation [4]. Dependence information is found in the PDG that is built during the analysis phase.
– Properties and relationships are chosen so as to privilege structural features over syntax. Dependence relationships therefore assume a central role: they become the features that drive the abstract control structure among the concepts.

The chosen parsing strategy is top-down, so the concept parsing is descending and the recognition is demand-driven. This choice is motivated by our main objective, the transformation of code to support the parallelization of certain algorithms. We do not want to comprehend the entire code, but only to find whether instances of particular algorithmic patterns are present. The demand-driven approach can therefore be used, since it has a less complex search space than the code-driven approach.
A knowledge base with the definition of recursive composition rules permits an algorithm to be described. It relies on Prolog as a system shell and takes advantage of Prolog's deductive inference-rule engine to perform the hierarchical concept recognition.
The Prolog engine is queried for specific goals. When one of them is satisfied, the result contains the recognized algorithm, the references to the AST nodes involved, and the related input and output data.
After reasoning, composed concepts are recognized. Some examples of recognized concepts are:

– elem_update: This represents the update of the value of an element by an expression that depends on the previous value of the same element.
– count_loop: This represents a for loop where the init, test and update statements are based on expressions involving only constants, except for the loop variable.
– scan: This is the access (read or write) of a sequence of elements in an array.
– dot_product: This is the concept of the product of two dimensions of two arrays.

3.2 Source to Source Transformation

The information obtained after the recognition of the algorithm drives the trans-
former module. The source code region that implements the algorithm can be
replaced by optimized parallel code or by a call to an optimized library. The al-
gorithm repository contains, for each target architecture, one or more possible

implementations, stored in a parametric source code format. The parameters


should be mapped to input and output data involved in the algorithm. The user
can drive the selection of the code among the repository by setting preferences
on alternative implementations. The transformer directly manipulates the inter-
mediate representation of the analysed source program. By using the references
given by the reasoner, the abstract syntax tree is modified with the following
steps:

– The sub-tree corresponding to the code region is pruned from the AST and,
if desired, a comment block with the original code is inserted.
– A new sub-tree is generated with the transformed code. If needed (as for GPUs), it also contains: memory allocation on the device, memory transfer from the CPU to the device, the library invocation, memory transfer from the device back to the CPU, and memory deallocation.
– This tree is appended in the AST at the removal point, just after the com-
ment block.

After all the transformations have been applied to the AST, an unparsing operation generates code ready to be compiled on the target platform.

4 Prototype Tool
To test the technique a prototype tool has been built. The reasoner has been implemented with SWI-Prolog [19] as a stand-alone module with a shell interface. The rest of the work has been done using the Rose Compiler [18]. This is a complete compiler infrastructure tailored for source-to-source transformations. It uses two front-end modules, one that can parse C/C++ and the other Fortran 2003 and earlier. The intermediate representation used by Rose is very rich and preserves all the information from the source code (including source file references, code comments, macros and templates for C++). This is valuable in the unparsing process to produce source code that is still readable by humans. The programming interface of the Rose Compiler is C++, so our work was done in this language.
Starting from the intermediate representation obtained by the front-end, the AST is traversed in order to find basic concepts. We built a class that implements the Visitor design pattern [11] by extending the ROSE_VisitorPattern class and overriding the visit() methods for each node type we need to process. The AST is thus traversed and the series of facts corresponding to the basic concepts is produced in a text file. Similarly, control-dependence and data-dependence facts are produced by using the related Rose library functions. The reasoner is then invoked with a series of goals, each corresponding to a known algorithm present in the repository. If a goal is satisfied, the reasoner replies with the name of the algorithm, the references to the code region that implements it and the data involved. Since multiple queries can be issued to search for different algorithms and the reasoning is a time-consuming process, it can be done separately from the transformation, with the results saved in intermediate files.
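As a simplified, hedged sketch of this extraction step, the fragment below uses ROSE's AstSimpleProcessing traversal instead of the ROSE_VisitorPattern class mentioned above, and emits an illustrative, made-up fact for every for statement it visits; it assumes a ROSE installation providing rose.h.

#include "rose.h"
#include <fstream>

// Emits one Prolog-style fact per for statement found in the input files.
// The fact format is only an approximation of the listings shown later.
class FactExtractor : public AstSimpleProcessing {
public:
  explicit FactExtractor(const char* path) : out(path) {}
protected:
  void visit(SgNode* node) {
    if (SgForStatement* f = isSgForStatement(node))
      out << "for_stmt(" << static_cast<const void*>(f) << ").\n";
  }
private:
  std::ofstream out;
};

int main(int argc, char* argv[]) {
  SgProject* project = frontend(argc, argv);   // parse the sources into the AST
  FactExtractor extractor("facts.pl");
  extractor.traverseInputFiles(project, preorder);
  return backend(project);                     // unparse (and compile) the result
}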

The transformer, starting from that information, cuts the original code (or simply encloses it in comments, depending on the preferences of the user), builds new code from the templates for the platform the user has chosen, and modifies the AST accordingly. Before removing the code, however, a test to prove the legality of the transformation is done: all code that is enclosed in the AST sub-tree of the algorithm but is not mapped to basic concepts of the algorithm (e.g. extra added lines) is checked for data dependencies with the data involved in the algorithm.
To add new code and comments to the AST, the Rose Compiler furnishes the so-called Rewrite mechanism. It offers three simple functions, insert(), replace() and remove(), that can be used at different levels of abstraction: two low levels, which interact directly with the nodes of the tree and permit fine-grained control over the generated nodes but are extremely verbose; an intermediate level, which lets the user express the transformation with strings; and a higher level, which can be used during the traversal operations. We have used the mid level since it gave us the best compromise between complexity and power of use.
After all the transformations, a final call to the backend() function generates the source code from the AST in a new file.

5 Case Study
As a case study we used the source code for a sequential C implementation
of a matrix-matrix multiplication. This contains one of the algorithms we can
recognize at present.
Listing 1.3 shows a fragment of the code that is given as input to the tool.
double x[10][10];
double y[10][10];
double z[10][10];
double temp = 0;
int i = 0;
int j = 0;
int k = 0;

for (i = 0; i < 10; i++) {
  for (j = 0; j < 10; j++) {
    temp = 0;
    for (k = 0; k < 10; k++) {
      temp = temp + x[i][k] * y[k][j];
    }
    z[i][j] = temp;
  }
}

Listing 1.3. Sequential Matrix multiplication

Listing 1.4 shows a small excerpt of the Prolog facts with the basic concepts and dependence information produced for the code.

array_var_definition(def_list_1, double, 2, x, [10,10], simple_mmp, do_mmp).
array_var_definition(def_list_2, double, 2, y, [10,10], simple_mmp, do_mmp).
array_var_definition(def_list_3, double, 2, z, [10,10], simple_mmp, do_mmp).
scalar_var_def(i, def_list_4, simple_mmp, do_mmp).
scalar_var_inst(stp_1, i, simple_mmp, do_mmp).
val_inst(stp_2, 0, simple_mmp, do_mmp).
% .... omitted
control_dep(17, 19, true, simple_mmp, do_mmp).
control_dep(15, 17, true, simple_mmp, do_mmp).
control_dep(100011, 17, true, simple_mmp, do_mmp).
control_dep(100014, 15, true, simple_mmp, do_mmp).
%
data_dep(true, 100014, 100011, z, 0, simple_mmp, do_mmp).

Listing 1.4. Prolog Facts produced

In listing 1.5 we can see the response of the reasoner for a query of the goal
matrix_matrix_r.
% hierarchy of concepts: references omitted
matrix_matrix_product(
    simple_scan(...),
    matrix_vector_product(
        simple_scan(...),
        dot_product(...),
        simple_scan(...)
    ),
    simple_scan(...)
).

Listing 1.5. Prolog hierarchy response for the matrix_matrix_r goal

After that recognition, listing 1.6 shows the source code that is added, with the calls to the cuBLAS library, assuming the user has chosen that implementation. We have omitted the commented code block.
// .... omitted commented code ...
// ---> Added by Transformer ---
cublasHandle_t handle;
double* dptr_x;
double* dptr_y;
double* dptr_z;
const double alpha = 1.0;
const double beta  = 0.0;
// Memory allocation
cudaMalloc((void**)&dptr_x, 10 * 10 * sizeof(double));
cudaMalloc((void**)&dptr_y, 10 * 10 * sizeof(double));
cudaMalloc((void**)&dptr_z, 10 * 10 * sizeof(double));
cublasCreate(&handle);
// Data transfer CPU->GPU
cublasSetMatrix(10, 10, sizeof(double), x, 10, dptr_x, 10);
cublasSetMatrix(10, 10, sizeof(double), y, 10, dptr_y, 10);
// Matrix x Matrix multiplication
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 10, 10, 10,
            &alpha, dptr_x, 10, dptr_y, 10, &beta, dptr_z, 10);
// Data transfer GPU->CPU
cublasGetMatrix(10, 10, sizeof(double), dptr_z, 10, z, 10);
// Memory deallocation
cublasDestroy(handle);
cudaFree(dptr_x);
cudaFree(dptr_y);
cudaFree(dptr_z);

Listing 1.6. Code region added for Matrix multiplication with CUBLAS

6 Conclusion
In this work we have shown how to analyse source code in order to recognize basic algorithmic concepts, to reason on them and to drive a source-to-source transformation of the code so that it can execute on new parallel architectures such as GPUs. A prototype tool has been presented to validate the technique and a test on a case study has been shown.
The work must be understood as a starting point for future investigation. At present the rules can recognize basic linear algebra algorithms such as matrix and vector multiplication, dot product, maximum and minimum search, and reduction. One direction on which we are now working is the extension of the set of recognized algorithms and of their implementation variants (i.e. variants that use pointers and dynamic memory allocation). At the same time, since the reasoning is a time-consuming process, the recognition does not scale well with the increasing number of recognized algorithms. We are studying clone-detection techniques that may be adapted to extract basic concepts.
Another research path is to add performance investigation on the transformed code; at present the transformation is done with no regard to the size of the problem. We know that for small problems the overhead added by memory transfers can nullify the improvements obtained by the use of the parallel device. Conversely, large problems may not fit into the device memory. We are working on adding test points to the code so that they can be used to select, at runtime, different implementation variants depending on the size of the data involved. In this direction, an extension of the transformation to produce OpenCL code can be used to target heterogeneous architectures such as many-core systems.

References
1. Baskaran, M.M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev,
A., Sadayappan, P.: A compiler framework for optimization of affine loop nests for
gpgpus. In: Proceedings of the 22nd Annual International Conference on Super-
computing, ICS 2008, pp. 225–234. ACM, New York (2008)

2. Baxter, I.D., Yahin, A., Moura, L., Sant’Anna, M., Bier, L.: Clone detection using
abstract syntax trees. In: IEEE International Conference on Maintenance (ICSM
1998), p. 368 (1998)
3. Benkner, S.: Vfc: The vienna fortran compiler. Scientific Programming 7, 67–81
(1999)
4. Di Martino, B.: Algorithmic concept recognition to support high performance code
reengineering. Special Issue on Hardware/Software Support for High Performance
Scientific and Engineering Computing of IEICE Transaction on Information and
Systems E87-D, 1743–1750 (2004)
5. Di Martino, B., Iannello, G.: Pap recognizer: A tool for automatic recognition of
parallelizable patterns. In: International Workshop on Program Comprehension, p.
164 (1996)
6. Di Martino, B., Kessler, C.W.: Two program comprehension tools for automatic
parallelization. IEEE Concurrency 8, 37–47 (2000)
7. Di Martino, B., Zima, H.P.: Support of automatic parallelization with concept
comprehension. Journal of Systems Architecture 45(6-7), 427–439 (1999)
8. Ducasse, S., Rieger, M., Demeyer, S.: A language independent approach for detect-
ing duplicated code. In: IEEE International Conference on Software Maintenance
(ICSM 1999), p. 109 (1999)
9. Ferrante, J., Ottenstein, K.J., Warren, J.D.: The program dependence graph and its
use in optimization. ACM Transactions on Programming Languages and Systems
(TOPLAS) 9(3) (1987)
10. Gabel, M., Jiang, L., Su, Z.: Scalable detection of semantic clones. In: Proceedings
of the 30th International Conference on Software Engineering, ICSE 2008, pp.
321–330. ACM, New York (2008)
11. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design patterns: elements of
reusable object-oriented software. Addison-Wesley Longman Publishing Co., Inc.,
Boston (1995)
12. Hall, M., Padua, D., Pingali, K.: Compiler research: the next 50 years. Commu-
nunications of the ACM 52, 60–67 (2009)
13. Jiang, L., Misherghi, G., Su, Z., Glondu, S.: Deckard: Scalable and accurate tree-
based detection of code clones. In: Proceedings of the 29th International Confer-
ence on Software Engineering, ICSE 2007, pp. 96–105. IEEE Computer Society,
Washington, DC (2007)
14. Komondoor, R., Horwitz, S.: Using Slicing to Identify Duplication in Source Code.
In: Cousot, P. (ed.) SAS 2001. LNCS, vol. 2126, pp. 40–56. Springer, Heidelberg
(2001)
15. Lee, S., Min, S.-J., Eigenmann, R.: Openmp to gpgpu: a compiler framework for
automatic translation and optimization. SIGPLAN Not. 44, 101–110 (2009)
16. Liao, C., Quinlan, D., Willcock, J., Panas, T.: Extending Automatic Parallelization
to Optimize High-Level Abstractions for Multicore. In: Müller, M.S., de Supinski,
B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 28–41. Springer,
Heidelberg (2009)
17. NVIDIA. Cuda: Compute unified device architecture,
http://www.nvidia.com/cuda/
18. Quinlan, D.: Rose compiler, http://www.rosecompiler.org/
19. Wielemaker, J.: Swi-prolog, http://www.swi-prolog.org/
Enhancing Brainware Productivity
through a Performance Tuning Workflow

Christian Iwainsky1, Ralph Altenfeld2,*,


Dieter an Mey1 , and Christian Bischof1
1 Center for Computing and Communication, RWTH Aachen University
{iwainsky,anmey,bischof}@rz.rwth-aachen.de
2 Access e.V., RWTH Aachen University
r.altenfeld@access-technology.de

Abstract. Operation costs of high performance computers, like cooling


and energy, drive HPC centers towards improving the efficient usage of
their resources. Performance tuning through experts here is an indis-
pensable ingredient to ensure efficient HPC operation. This "brainware"
component, in addition to software and hardware, is in fact crucial to en-
sure continued performance of codes in light of diversifying and changing
hardware platforms. However, as tuning experts are a scarce and costly
resource themselves, processes should be developed that ensure the qual-
ity of the performance tuning process. This is not to dampen human
ingenuity, but to ensure that tuning effort time is limited to achieve a
realistic substantial gain, and that code changes are accepted by users
and made part of their code distribution. In this paper, we therefore for-
malize a service-based Performance Tuning Workflow to standardize the
tuning process and to improve usage of tuning-expert time.

1 Introduction

Researchers from many scientific backgrounds (e.g. biology, engineering, medicine) develop software to satisfy their simulation needs. We observe that software development projects pressed ahead by domain specialists typically lack the necessary knowledge, experience and manpower to effectively use modern many-core clusters. This results in insufficient or missing parallelization, little or no tuning and missing adaptation to the local HPC system, and as a consequence in limited scalability, less than optimal performance and bad hardware utilization.
Besides the ability to solve previously unsolvable problems, the operational cost in terms of power and cooling becomes an additional driving force for performance tuning and parallelization (from now on summarized as "tuning"). It is therefore in the interest of users, i.e. the users of a project's software, and of HPC centers to improve scalability and efficiency.

* Formerly employed at the Center for Computing and Communication.


For this reason many HPC sites employ tuning specialists to provide the necessary expertise. This can be seen most prominently at the Tier-0 and Tier-1 HPC sites, like the UK National Supercomputing Service (HECToR) with its Distributed Computational Science and Engineering support1 and the HPC Simulation Labs in Germany2. These projects provide the essential performance tuning expertise to enable "users", i.e. code developers and program users, to scale to tens of thousands of cores or more. Also the smaller Tier-2 sites (like ours), which focus more on productivity and throughput of compute jobs, especially benefit from such support.
Whilst funding for such experts is typically difficult for Tier-2 sites, we argued in previous work ("Brainware for Green HPC" [1]), summarized in section 2, that savings in the Total Cost of Ownership (TCO) obtained by the tuning activity could be used to fund these experts. This claim holds true as long as these experts achieve a sufficient improvement for a project in a definite time.
Under this assumption, we propose a Tuning Workflow, described in detail in section 3, to guide and monitor the tuning service effort of a performance tuning expert. This workflow aims to maintain the necessary balance between the tuning investment and the obtained savings. To our knowledge such a process has not been published before, though tuning experts may already follow it implicitly.
We discuss our work and, as this is our first implementation of the workflow, further improvement possibilities in section 4.

2 Brainware for Green HPC


To acquire an HPC system on the Tier-2 level, considerable funds are required. At RWTH Aachen University the TCO of about 5.5 million € per year covers the costs for the building, compute hardware, maintenance, power, software and staff. Figure 1 shows the percentage of a typical annual load of our system, broken down by the number of users causing the load. Using this figure and the TCO we can derive that the top 10 users consume roughly 40 % of our system, amounting to 2.2 million €.
From our past experience, and consistent with findings of the Computational Science and Engineering support of the HECToR UK National Supercomputing Service3, we claim that on average a tuning expert can achieve a performance improvement of at least 10 % to 20 %. This means that the project, i.e. code, dataset and software environment, executes faster on the same number of cores or uses the available nodes more efficiently by scaling to further cores. For several projects in the past we even experienced performance improvements of a factor of 2 to 5, with similar findings reported in the HECToR reports. The amount of work necessary for these improvements depends on the specific project. If one, for example, augments an MPI-only project with OpenMP (multi-threading), very little time is needed to identify the hotspot, parallelize it with OpenMP and to
1 http://www.hector.ac.uk/cse/distributedcse/
2 http://www.jara.org/de/research/jara-hpc/forschung/
3 http://www.hector.ac.uk/cse/distributedcse/

verify the results. However, if one performs a change of algorithms or other extensive code modifications, more time is necessary for these changes, but the potential outcome is larger. In our experience a tuning expert invests on average about two months' worth of work to achieve the aforementioned conservative 10 % to 20 % performance improvements.
Assuming a constant load, these funds are initially freed up by shutting down the respective surplus hardware, starting with older, less power-efficient systems, thus saving energy. In the long run a portion of the money can be diverted from procurements towards tuning, as the new machinery could be considered 10 % more productive. For a non-constant load, i.e. when the users always use all available resources, the system behaves as if one had obtained 10 % more hardware, which would cost more in terms of running costs and procurement. So either way the tuning conserves funds in the long run.

[Fig. 1 plots the accumulated CPU-time usage (in percent, 0 % to 100 %) over the number of users responsible, from the top user up to about 200 users.]
Fig. 1. Accumulated Cluster Usage
With this in mind and looking at the top users who consume 10 % of our system (approx. 550 k€), it is easy to see that by just improving such a user's project by 10 %, a full-time employee at 60 k€ per year could be funded. Looking at the total usage, a mere average improvement of 5 % on all projects would be sufficient to continuously support 3 HPC experts, if those experts tune one of the top projects every 2 months. For further detail and information please refer to "Brainware for Green HPC" [1].

3 A Performance Tuning Workflow


Performance experts themselves need to be concerned about the productivity of their efforts, i.e. they should try to obtain maximum gains in minimal time. Based on our experience in tuning several user codes, we determined the two most common issues, which we will focus on in this work: First, without outside stimulus a tuning process can continue for an indefinite time, as some more tuning or improvement is always possible. The second issue is a lack of user acceptance of code changes, preventing the tuning from taking effect.
To address these issues, we developed a systematic approach by defining a Tuning Workflow (see Fig. 2)4. In our experience, involving the user in the
4 In this workflow, we only incorporated those performance analysis tools in use at the RWTH Aachen University.

tuning process significantly increases the acceptance of tuning recommendations and code modifications at the end of the tuning effort. It should also be noted at this point that we consider a program or code with a specific data set and runtime environment as a single project. Therefore, multiple datasets or different execution sites call for multiple, though potentially abbreviated, tuning projects. From our experience with past and current projects (e.g. XNS [2], CDPLit [3] and MICRESS [4]) we identified five general stages for a tuning / parallelization
project:
1. Initial Meeting
2. Baseline
3. Pathology
4. Tuning Cycle
5. Finalization / Handover.
These stages are explained in the subsections to follow.

3.1 Initial Meeting


The tuning process should be initiated by the user; however, it is also possible for the expert to approach top users. We recommend an initial face-to-face meeting with all parties involved. From our experience this improves everyone's personal involvement and the acceptance of recommendations at the end of the tuning project. This meeting should establish clear expectations by the users and clear definitions of goals by the tuning expert. This practice helps to prevent overstated improvement expectations and unnecessarily prolonged expert involvement. The expert and user should also agree on the time frame and schedule a date for a later review, already preventing the expert time allotted for the project from getting out of hand.
Also in this meeting the scope of involvement, e.g. actual development work,
consultation, minor code changes, should be discussed as copyright, code- and
data-security are often of significance. Full access to the source code is desirable
for the best tuning outcome, though tuning may still be possible for binaries
by optimizing scheduling and environment parameters, like thread-counts and
MPI-parameters. If the expert is granted access to the source code, he should
be given a hands-on introduction to the build system and program execution
setup, to speed up the following tuning and experimentation phase. It may even
be possible, within reason, to request the user to modify the build-system to
resemble a site-specific standard, like using standard compilers, GNUmake and
a single configuration file. We also recommend a review of the algorithms used
in the program, to check later in the Pathology phase for known issues of given
algorithms.

Fig. 2. Proposed Performance Tuning Workflow (user and HPC-support roles: tuning request, Initial Meeting, Baseline, Pathology, Issue Report, issue-specific Serial/OpenMP/MPI/Hybrid analysis, Tune/Modify Code, repeated Baseline, and Modification and Improvement Report)

3.2 Baseline
The baseline is intended to be quick and easy-to-obtain information. Great
emphasis should be put on disturbing the target application as little as possible
in order to use this information to later detect undue instrumentation overhead.


It need not provide details about deficiencies or tuning opportunities,
though it should give a good overview of where the program does its work.
This baseline incorporates the runtime, peak memory usage, a simple runtime
profile, some basic hardware performance counters (e.g. FLOP/s (floating-point
operations per second), IOP/s (integer operations per second), memory accesses,
cache-miss rate and CPI (cycles per instruction)) and, if the application is
parallel, a few scalability samples. If possible, information on solver-convergence
rates and similar information should be preserved in order to spot deviations
from the expected behavior during the tuning and measurement phases.
The recorded runtime can be used to easily measure the progress of the tuning
effort as well as the perturbation induced by later performance measurements.
The memory consumption itself is typically not a performance-sensitive
point, though the available RAM on a machine is typically the limiting factor
for problem set size. The runtime-profile provides information about the gen-
eral program behavior and hot-spots and becomes important during the later
measurement and tuning stages, e.g. for adapting the measurement systems of
many performance analysis tools, like Scalasca[5], Vampir[6] and Tau[7]. The
hardware-performance counters mostly serve as easy-to-monitor progress scale.
The scalability samples (serial, two processes, many processes) serve the expert
as an indicator of the program's scalability limits.
The tools for the baseline should be a fixed and predefined tool set to facilitate
cross-project comparisons and prevent unnecessary experimentation during the
tuning project. For example, one could define gprof [8] for serial applications, Intel
Amplifier [9] for shared memory, and Scalasca with hardware performance counters
through PAPI [10] for message-passing applications. In general, the tool sets
recommended by the workflow should only be altered in extreme situations or
through a separate evaluation process between tuning activities.
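A minimal sketch of how such a fixed baseline could be recorded is shown below. The script, the JSON layout and the metric field names are illustrative assumptions and not part of the workflow definition; in practice the counter fields would be filled in by whichever predefined tools (e.g. gprof, Intel Amplifier or a PAPI-based tool) the site has standardized on.

# Hypothetical baseline record for a tuning project (a sketch, not the actual
# tooling used at RWTH Aachen). Wall time comes from a plain timed run; the
# counter fields are left to be filled by the site's standard tools.
import json, subprocess, time

def record_baseline(cmd, outfile="baseline.json"):
    start = time.time()
    subprocess.run(cmd, check=True)          # undisturbed reference run
    wall = time.time() - start
    baseline = {
        "command": cmd,
        "wall_time_s": wall,
        "peak_memory_mb": None,   # e.g. from /usr/bin/time -v (assumption)
        "flops": None,            # e.g. floating-point operation counter
        "cpi": None,              # cycles per instruction
        "llc_miss_rate": None,    # last-level cache misses (bandwidth proxy)
        "scalability": {},        # runtimes for 1, 2, N processes
    }
    with open(outfile, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline

# Example (hypothetical command): record_baseline(["./app", "input.dat"])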

3.3 Pathology

In the Pathology phase the expert uses the information from the initial baseline
to spot potential problems that the project code may have. The expert
should only survey the program for general, common issues, i.e. whether the
application is bound by algorithmic, communication, computation, memory-access
or I/O issues, without spending a lot of time on detailed analysis. Each of these
issue classes may require different tool sets for observation
and measurement. Whilst it is tempting in this phase to identify the cause of
a specific problem, we recommend against it, as an early focus on a specific
issue may lead to wrong conclusions. It is, in our experience, much more impor-
tant to get an overall picture. In addition, the expert should gather information
about the hardware-capabilities of the target platform, in order to later correlate
any observed issues. For example, if an application is spending most of its time
in floating-point-intensive routines (compare the profile), the peak floating-point
rate of the architecture may become important.
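As a hedged illustration of such a correlation, a rough compute-bound check could compare the measured floating-point rate from the baseline against the platform's peak; the numbers and the 50% threshold below are purely illustrative assumptions.

# Illustrative check (not from the paper): how close does the application get
# to the machine's peak floating-point rate?
measured_gflops = 12.0    # from baseline hardware counters (assumed value)
peak_gflops     = 150.0   # per-node peak of the target platform (assumed value)

fraction_of_peak = measured_gflops / peak_gflops
if fraction_of_peak > 0.5:            # threshold chosen arbitrarily for illustration
    print("likely compute-bound: little headroom left in the FP units")
else:
    print(f"only {fraction_of_peak:.0%} of peak: look at memory, communication or I/O first")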
It is also important in this phase to look for general scalability issues that
may derive from the algorithms used in the code, potentially leading to a coarse-grained
performance model for the application. However, as detailed performance
models have a different focus, we consider such a model a separate project or
service. Whilst the performance expert can recognize such algorithmic issues,
the precise identification and solution can usually only be found in conjunction
with the domain expert in an interdisciplinary effort that we cannot depict in
this workflow.
The Pathology phase should be concluded with a detailed report of problems
at a coarse level, their impact on the program behavior, and a recommendation
of which single issue to address first in the following tuning step. This report
should then be discussed with the user, including a risk-benefit assessment based on
experience with similar codes and issues in the past. A specific time frame
should also be determined in which the issue at hand is to be solved. In addition, a
process or method to verify the correctness of the code during the tuning process
has to be defined or established jointly by the user and the tuning expert, as
domain experts are very sensitive to numerical deviations in their code. This
verification method should later be used by the expert during
the tuning cycle to prevent work that invalidates the correctness and numeric
stability of the program.
If at this point no specific issue is found, the expert must decide if the potential
improvement of the code is worth his effort. For computationally demanding
codes, i.e. many execution instances at a high cost per execution, it may still be
a prudent choice to continue to invest additional time and effort.

3.4 Tuning Cycle


Once a specific issue, like reduction of wait-times at a specific barrier or load
imbalances, has been chosen as the tuning target, the expert has to decide how
to detect the root cause of the problem. Typically, performance analysis tools
are employed. For example, in Aachen we use tools like the Intel Amplifier, Oracle
Analyzer, or community-developed tools like Vampir-Trace [6], Scalasca [5]
or PAPI [10]. As these tools all focus on different properties and exhibit different
measurement methods, the expert initially has to gauge a tool's advantages
and disadvantages based on experience alone. Our workflow provides valuable
information for the selection of the right tool or tool chain from the beginning,
reducing the time spent on experiments.
We also recommend establishing rules and cookbook-like approaches for typically
occurring issues. Besides providing reminders and guidelines for specific checks,
these cookbooks can be used to monitor tuning quality and train new experts.
In our example workflow we have indicated this, with no claim of completeness,
by providing four place-holders for Serial Analysis, OpenMP Analysis, MPI
Analysis and Hybrid Analysis, along with a tool selection for each analysis.
We do not provide details on such cookbooks, but are of the opinion that a
process standardization would improve average tuning quality, in particular
for less experienced tuning staff. An exemplary cookbook could, for example,
based on a dominant wait time in MPI routines without network saturation
from the Pathology phase, recommend an initial MPI-only (i.e. using only the
wrapped MPI routines) Vampir-based analysis for confirmation, followed by a
Scalasca-based investigation to automatically identify the root cause. Of course,
branches for potentially different diagnoses and observations would have to be
defined, as well as detailed configuration steps.
The tuning phase itself contains many inherently iterative steps. Many avail-
able performance analysis tools require, as an initial step, an adaptation of the
measurement system to reduce overhead and program perturbation [11] and
to focus on the specific problem at hand. It is beneficial to use information
from the baseline as a "pre-conditioner" for this process. Once a measurement
of sufficient quality has been obtained, the expert has to analyze, investigate and
hypothesize based on the information to determine the cause for an issue and to
develop a solution.
This solution is then implemented or recommended to the user and ideally
verified using the performance tool in question. If the modification does not yet
deliver the agreed-on goal, then the process iterates again until the expert either
solves the problem or runs out of time.
The tuning activity is then concluded by a repetition of the baseline to document
the improvement, as well as to check for any side effects that may influence
the usability of the change. This may, for example, be an increased memory
footprint or different requirements on the data set.
The last crucial, but often forgotten step of this phase is the Modification and
Improvement Report. It should contain detailed reasons for the code change,
its effects and extent on the program's performance in a user-understandable
fashion. It should also contain a protocol of the resources used for the tuning phase,
be it work-time, tools or compute resources. We consider such a report to be
very important, as in our experience the information provided influences the
acceptability of the effort to a great extent.
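The iterative structure of this phase can be summarized in the following sketch; measure, analyze and apply_and_verify are hypothetical placeholders for the tool- and project-specific steps described above, not an actual implementation.

# Sketch of the tuning cycle as described above. All helper functions are
# hypothetical placeholders for tool- and project-specific work.
def tuning_cycle(issue, goal, time_budget_h, measure, analyze, apply_and_verify):
    spent = 0.0
    while spent < time_budget_h:
        data, hours = measure(issue)               # adapt measurement system, record run
        spent += hours
        change = analyze(issue, data)              # hypothesize root cause, derive a fix
        result, hours = apply_and_verify(change)   # implement/recommend, re-check correctness
        spent += hours
        if result.improvement >= goal:             # agreed-on goal reached
            return result                          # conclude with repeated baseline and report
    return None                                    # expert ran out of time; report what was done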

3.5 Results/Handover

The tuning cycle ends in a review meeting to discuss the modified code, in
conjunction with the final Modification and Improvement Report, in detail. This
provides the user with an opportunity to raise concerns regarding code changes
or the results of the last tuning cycle. These concerns, and the general acceptance
or reasons for dismissal, should be attached to the report. Lastly, the invested
tuning effort should be reflected against the achieved improvement to decide if
further tuning is worthwhile or desirable. If further open issues from the Pathology
phase exist, a new iteration of the tuning cycle is started with a new specific issue
(see the second part of Section 3.3); otherwise the tuning project concludes at this
point.

4 Conclusion
In this work we proposed a Tuning Workflow to improve the productivity of
tuning experts. We argued that the increasing complexity and diversity of
parallelism in clusters requires specialized expertise to use current hardware
efficiently. As domain experts typically do not have this expertise, it must be
provided, in particular at university installations, where a large diversity of
scientific applications is typically supported by a single HPC installation. Previous
work showed that such a service pays for itself if top users' projects are targeted
in a systematic fashion by performance tuning experts. We called this the "brainware"
component of an HPC operation. However, to maximize gains from brainware,
we need to develop standards and processes that govern performance tuning and
maximize its efficiency. If performance tuning is left only to "gurus", there will
simply not be sufficient staff available for this task.
Our tuning process itself is based on the notion of a service and distinguishes
the roles of users and experts. In reality this role separation is not so clear as
both sides may interact in every phase to assist each other with details when
tuning and adapting the code. Nevertheless, we still see a formal tuning process
description as necessary to ensure the quality of a tuning service: clear goals,
deadlines, avoidance of exaggerated expectations, limitation of wasted effort and
an indication of when to stop. A particularly important aspect is the documentation
of the tuning effort in a Modification and Improvement Report. Such a workflow,
together with the entailed documentation, can also provide a better argument
for funding to management, as the costs and benefits of tuning become
evident. Furthermore, this workflow can be used to train additional experts and
even integrate non-scientific staff in the tuning process.
The workflow itself is based so far solely on the collective experience of the
HPC group of RWTH Aachen University. We recognize that this workflow has
yet to see a thorough study and that additional input from other HPC sites must
still be incorporated. At the time of this work, the workflow was
only partially applied to one ongoing tuning effort, using only the Baseline and
Reporting with good acceptance by the users.
Some additional steps are conceivable in its current form. For example, the
topics of version control, data management and verification remain unanswered.
However, we consider the workflow in its current form to be already
quite complex, such that we plan to gain further use-case experience and feed-
back, before revising and adding additional steps.
Whilst we did not cover any specific performance tools, we would like to
raise the question of to what extent tools could generate external documentation
of performance issues.
The workflow we described as a guide for tuning processes is of course not
meant to be the last word, but rather to serve as a rough guide and incentive
for implementation and improvement of such tuning processes. It is also clear
that depending on local characteristics these processes may need to be modified.
Nonetheless, we believe that defined processes, "cook-books" for specific tasks
and the requirement of modification and improvement reports are important
ingredients that should be part of such a structured process. In the long run, we
hope that, similar in spirit to ITIL (www.itil.org) for general IT operations, a
structured body of best-practice knowledge will also develop for HPC code
development and tuning, to structure the increasingly complex task of ensuring
good HPC performance.

References
1. Bischof, C., an Mey, D., Iwainsky, C.: Brainware for Green HPC. In: Ludwig, T.
(ed.) Proceedings EnA-HPC 2011 (2011) (to appear)
2. Behr, M., Arora, D., Benedict, N.A., O’Neill, J.J.: Intel compilers on linux clusters.
Intel Developer Services online publication (October 2002)
3. Zeng, P., Sarholz, S., Iwainsky, C., Binninger, B., Peters, N., Herrmann, M.: Sim-
ulation of Primary Breakup for Diesel Spray with Phase Transition. In: Ropo, M.,
Westerholm, J., Dongarra, J. (eds.) PVM/MPI. LNCS, vol. 5759, pp. 313–320.
Springer, Heidelberg (2009)
4. Altenfeld, R., Apel, M., an Mey, D., Böttger, B., Benke, S., Bischof, C.: Parallelising
Computational Microstructure Simulations for Metallic Materials with OpenMP.
In: Chapman, B.M., Gropp, W.D., Kumaran, K., Müller, M.S. (eds.) IWOMP
2011. LNCS, vol. 6665, pp. 1–11. Springer, Heidelberg (2011)
5. Geimer, M., Wolf, F., Wylie, B.J.N., Ábrahám, E., Becker, D., Mohr, B.: The
Scalasca performance toolset architecture. Concurrency and Computation: Practice
and Experience 22(6), 702–719 (2010)
6. Knüpfer, A., Brunst, H., Doleschal, J., Jurenz, M., Lieber, M., Mickler, H., Müller,
M.S., Nagel, W.E.: The vampir performance analysis tool-set. In: Proceedings of
the 2nd HLRS Parallel Tools Workshop, Stuttgart, Germany (July 2008)
7. Shende, S.S., Malony, A.D.: The tau parallel performance system. The Interna-
tional Journal of High Performance Computing Applications 20, 287–331 (2006)
8. GNU: gprof, http://sourceware.org/binutils/docs/gprof/
9. Intel: Intel Parallel Amplifier (2011),
http://software.intel.com/en-us/articles/intel-parallel-amplifier/
10. London, K., Moore, S., Mucci, P., Seymour, K., Luczak, R.: The papi cross-
platform interface to hardware performance counters. In: Department of Defense
Users Group Conference Proceedings, pp. 18–21 (2001)
11. Iwainsky, C., an Mey, D.: Comparing the Usability of Performance Analysis Tools.
In: Cèsar, E., Alexander, M., Streit, A., Träff, J., Cèrin, C., Knüpfer, A., Kran-
zlmüller, D., Jha, S. (eds.) Euro-Par 2008 Workshops. LNCS, vol. 5415, pp. 315–
325. Springer, Heidelberg (2009)

Workshop on Resiliency in High Performance
Computing (Resilience) in Clusters, Clouds,
and Grids

Stephen L. Scott1,2 and Chokchai (Box) Leangsuksun3


1
Stonecipher/Boeing Distinguished Professor of Computing,
Tennessee Tech University, USA
2
Oak Ridge National Laboratory, USA
3
SWEPCO Endowed Associate Professor of Computer Science,
Louisiana Tech University, USA

Clusters, Clouds, and Grids are three different computational paradigms with
the intent or potential to support High Performance Computing (HPC). Cur-
rently, they consist of hardware, management, and usage models particular to
different computational regimes, e.g., high performance systems designed to sup-
port tightly coupled scientific simulation codes and commercial cloud systems
designed to support software as a service (SaaS). However, in order to support
HPC, all must at least utilize large numbers of resources and hence effective HPC
in any of these paradigms must address the issue of resiliency at large-scale.
Recent trends in HPC systems have clearly indicated that future increases in
performance, in excess of those resulting from improvements in single-processor
performance, will be achieved through corresponding increases in system scale, i.e.,
using a significantly larger component count. As the raw computational perfor-
mance of these HPC systems increases from today’s tera- and peta-scale to next-
generation multi peta-scale capability and beyond, their number of computational,
networking, and storage components will grow from the ten-to-one-hundred thou-
sand compute nodes of today’s systems to several hundreds of thousands of com-
pute nodes and more in the foreseeable future. This substantial growth in system
scale, and the resulting component count, poses a challenge for HPC system and
application software with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with non-
recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic, added
another major source of concern. The probability of such errors not only grows
with system size, but also with increasing architectural vulnerability caused by
employing accelerators, such as FPGAs and GPUs, and by shrinking nanometer
technology. Reactive fault tolerance technologies, such as checkpoint/restart, are
unable to handle high failure rates due to associated overheads, while proactive
resiliency technologies, such as migration, simply fail as random soft errors can’t
be predicted. Moreover, soft errors may even remain undetected resulting in
silent data corruption.
The goal of this workshop is to bring together experts in the area of fault
tolerance and resiliency for HPC to present the latest achievements and to discuss
the challenges ahead.
The Malthusian Catastrophe Is Upon Us!
Are the Largest HPC Machines Ever Up?

Patricia Kovatch, Matthew Ezell, and Ryan Braby

National Institute for Computational Sciences,


The University of Tennessee
{pkovatch,rbraby}@utk.edu, ezell@nics.utk.edu

Abstract. Thomas Malthus, an English political economist who lived


from 1766 to 1834, predicted that the earth’s population would be lim-
ited by starvation since population growth increases geometrically and
the food supply only grows linearly. He said, “the power of population is
indefinitely greater than the power in the earth to provide subsistence for
man,” thus defining the Malthusian Catastrophe. There is a parallel be-
tween this prediction and the conventional wisdom regarding super-large
machines: application problem size and machine complexity is growing
geometrically, yet mitigation techniques are only improving linearly.
To examine whether the largest machines are usable, the authors col-
lected and examined component failure rates and Mean Time Between
System Failure data from the world’s largest production machines, in-
cluding Oak Ridge National Laboratory’s Jaguar and the University of
Tennessee’s Kraken. The authors also collected MTBF data for a variety
of Cray XT series machines from around the world, representing over 6
Petaflops of compute power. An analysis of the data is provided as well as
plans for future work. High performance computing’s Malthusian Catas-
trophe hasn’t happened yet, and advances in system resiliency should
keep this problem at bay for many years to come.

Keywords: high performance computing, resiliency, MTBF, failures,


scalability.

1 Introduction
Conventional wisdom dictates that as supercomputers become larger, they also
become more complex with an increased number of parts. Each individual part
might have a long Mean Time Between Failure, but when many parts are com-
bined together, the chance for any one part to fail is great. The failure of specific
parts might cause an entire machine to fail, and even more likely it will cause
one or more running applications to fail. This seemingly makes it impossible
for large-scale applications to run to completion without interruption. This sit-
uation reminded us of the Malthusian Catastrophe: the idea that populations
grow geometrically, but the food supply only grows linearly, with population
size being limited by starvation. We decided to explore the parallels between
supercomputing and food supply.


There are four main reasons why population growth hasn’t been limited by
food supply:
1. economies of scale
2. national and world-wide markets mitigate shortages/problems
3. more efficient technologies, and
4. best practices.
Over time, agricultural resources became more concentrated, with companies
making larger infrastructure investments. In this way, more food could be pro-
duced more efficiently, taking advantage of economies of scale. This has also
happened with supercomputing, with fewer, larger machines becoming more
efficient at delivering cycles. This has been done with more shared physical in-
frastructure, for instance, at Oak Ridge National Laboratory (ORNL), where
three supercomputers (one for the Department of Energy, one for the National
Science Foundation, and one for the National Oceanic and Atmospheric Ad-
ministration) share the same computer room. These centers are more efficient
are delivering power and cooling, and concentrate and take advantage of the
intellectual expertise.
To cope with local shortages, agriculture developed wider markets and better
distribution networks. A drought in one area did not cause local people to starve
because they could get food from another location. This is also true within su-
percomputing, where users have access to different machines–for instance, the
TeraGrid [1] offers multiple supercomputers, and users have allocations at dif-
ferent sites, to help mitigate the situation when one site is down.
In agriculture, more efficient technologies were developed that took advantage
of new equipment and techniques. For instance, farmers started using tractors
instead of horses and plows. In supercomputing, vendors have used one large in-
dustrial strength fan per cabinet, instead of many PC-quality fans. This shared
physical infrastructure for the cabinet reduces the overall total number of com-
ponents, and makes the individual nodes and overall machine more reliable.
And lastly, farmers developed better practices, like canning, and storing food
in case of hard times. Supercomputing has taken similar action: implementing
application checkpointing in case of failure.
The University of Tennessee’s National Institute for Computational Sciences
(NICS) has employed methods to keep their machine available as much as pos-
sible. They have redundant power in the facilities, put the machines through a
rigorous acceptance test to ferret out as many bad parts as possible, perform
maintenance regularly to fix all the down nodes, and run regression tests to verify
the system after planned and unplanned outages. Cray, the vendor for their XT4
and XT5 machines, has incorporated power redundancy, shared cooling with
multiple liquid cooling cabinets, error correcting memory, reduced components
per cabinet (fan, for instance), and support for Berkeley checkpoint/restart. Cray
is also working on an MPI implementation to survive link failures. To see if these
strategies are working well for production supercomputing, this paper examines
if the largest machines “stay up” for reasonable amounts of time–long enough
for full machine jobs to be run routinely.

This paper explains some of the difficulties in defining and collecting resiliency
statistics, and presents initial findings. Failure rates for individual components
are examined over the lifetime of a machine, and conclusions are made based
on the data. Failure rate and mean time between failure data from several large
systems with similar architectures are examined and patterns and trends are
extracted from the data. Is the HPC industry on the cusp of the intersection of
system complexity and job length (that is, the inability to get through a single
day without an interrupt)?

2 Defining a Resilient System

Different people have different definitions of the qualities that make a system
resilient. Do you examine resiliency at the full system level, cabinet level, node
level, or processor level? Do you have to examine all of the above? Modern high
performance computing systems are often complex and hierarchical in nature,
meaning that failure of a single component may or may not affect the availability
of additional components.
Although it would be nice to have HPC systems that are 100% reliable,
“unbreakable” systems are not practical. System design is often based on a number
of factors, balanced according to the “project triangle” shown in Figure 1. The
wisdom of the project triangle states that you can build something fast (rapid
engineering design), you can build something good (high quality), and you can
build something cheap (low cost), but you only get to pick two. In HPC, a
goal is often to place well on the Top500 [4] list. Every dollar spent on high-reliability
parts takes away from the system’s peak performance. Designing a
high-performance reliable system becomes a complex balancing act.

Fig. 1. The Project Triangle [2]
ity, serviceability, and utilization based largely on the semiconductor indus-
try’s SEMI-E10 specification. Of particular interest, mean time between failure
(MTBF) for an entire system or node is defined as:

\[ \mathrm{MTBF}_{\mathrm{System}} = \frac{\text{production time}}{\text{number of system failures}} \qquad (1) \]

\[ \mathrm{MTBF}_{\mathrm{Node}} = \frac{\text{production time}}{\text{number of node failures}} \qquad (2) \]

Stearley notes that reliability is often defined as:

\[ R(t) = e^{-\lambda t} \qquad (3) \]

where

\[ \lambda = \frac{1}{\mathrm{MTBF}} \qquad (4) \]

is a constant failure rate, but Stearley and others [6] suggest a time-varying
failure rate may be more appropriate.
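For illustration, Equations (1), (3) and (4) can be applied directly; the input values in the following sketch are made up and serve only to show how the quantities relate.

# Illustrative use of Equations (1), (3) and (4); the input values are made up.
import math

production_time_days = 365
system_failures      = 55

mtbf_days = production_time_days / system_failures   # Eq. (1): ~6.6 days
lam       = 1.0 / mtbf_days                           # Eq. (4): constant failure rate

def reliability(t_days):
    """Eq. (3): probability of surviving t days without a system failure."""
    return math.exp(-lam * t_days)

print(f"MTBF = {mtbf_days:.1f} days, R(1 day) = {reliability(1):.2f}")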

3 Difficulties in Obtaining and Comparing Data


It seems that most supercomputing centers develop their own policies and meth-
ods for collecting, analyzing, and reporting resiliency data. The equations in
Section 2 from [3] provide high-level direction for reporting the data, but there
is no advice for a common data collection format. The USENIX Computer Fail-
ure Data Repository [5] contains a wealth of failure data, but no common data
format exists between the entries.
It is common to see a difference in vendor-specified failure in time (FIT)
rate compared to those observed in a large-scale HPC system [7]. These large
computer systems provide a unique source of reliability data that a vendor does
not get to see with the rest of their business. An integrated circuit vendor may
produce 10 million units of a certain part that gets shipped to 100 customers who
then integrate it into a product that gets sold to thousands of their customers.
This quickly becomes a fragmented field database where data can be lost or left
unreported. HPC vendors may purchase 250,000 of that part and place them in
a dozen systems that are closely tracked for failure types and trends.
Another factor that makes comparing field data to vendor data difficult is
the ever-present percentage of “no trouble found” failure cases. HPC centers
will count these as failures, but the vendor will likely ignore them. There are a
number of possible reasons for this:
1. Site staff misdiagnosed the failure.
2. Vendor component retest is unable to duplicate the specific operating con-
ditions.
3. Failure is of an intermittent nature and does not fail during retest.
4. The component was proactively replaced because of indications a failure was
imminent.

4 Reliability of Kraken at the National Institute


for Computational Sciences
4.1 Cray XT Architecture Description
Cray and Sandia National Laboratories designed a distributed memory super-
computer codenamed Red Storm that utilized commodity AMD Opteron pro-
cessors and a proprietary SeaStar network. In 2004, this technology was turned
into a commercial offering called the Cray XT3. Each Cray XT cabinet contains
24 blades, and the machine contains a mixture of compute and service blades.
Each compute blade holds 4 compute nodes, while each SIO blade has two ser-
vice nodes. SeaStar network chips (one for each node) live on a mezzanine card
and provide access to the three dimensional torus network.

4.2 NICS Kraken System Description


The National Institute for Computational Sciences (NICS) was granted a $65M
award from the National Science Foundation in September 2007, and began to
acquire, install, and place in production a series of Cray HPC systems. Each of
these systems has been called Kraken after the mythical sea monster of Norse
legend. The first dedicated NICS supercomputer was a Cray XT3, comprised
of 40 cabinets and with a peak computational rate of 38 teraflops using AMD
dual-core processors that went into production in June 2008. This system was
rapidly upgraded over the following month to a Cray XT4 with the addition
of 8 cabinets using the new AMD Budapest 2.3 GHz quad-core processors, and
the replacement of the existing dual-core processors by the new quad-core chips,
taking the peak performance to 166 teraflops from 18,048 cores.
The next incarnation of Kraken was a completely new Cray XT5 comprised of
88 cabinets and 608 Teraflops peak in February 2009. The “Barcelona” processors
at 2.3 GHz were used in the XT5, in preparation for an upgrade to one petaflop
of “Istanbul” chips. This new machine was actually installed in the ground-
floor machine room at ORNL, while the XT4 remained as a completely separate
system in the upper machine room. Both the old and new Kraken systems ran
as parallel production systems for two months, through February and March of
2009, while users migrated to the newer machine. In September of 2009 Kraken
was upgraded to 2.6 GHz six-core Istanbul processors, and entered production
ahead of schedule on Oct 5, 2009. In February of 2011, an additional 12 cabinets
were added to Kraken, for a total of 1.17 Petaflops of peak computing power
and 147 Terabytes of memory spread through 9408 compute nodes. Kraken has
a 3.3 PB parallel file system directly connected with 30 GB/sec bandwidth.
Kraken has a varied workload that supports scientists from numerous fields of
study. NICS has optimized scheduling policies to efficiently handle both capac-
ity and capability computing [8] while maintaining very high utilization numbers
(see Figure 2). Utilization that is nearly uniform over time is helpful when an-
alyzing resiliency data over time. Kraken has achieved over 95% uptime during
its entire lifetime.

Fig. 2. Kraken Utilization

4.3 Data Collection

At NICS, system outage data is manually entered into a database after every
downtime event. The database supports high level categories as well as
component-level failure reasons.
NICS uses the Simple Event Correlator with Cray-specific rules [9] to monitor
the hardware event logs for failures. Failures and anomalies are logged to a
central file, and events that require manual intervention from the Cray hardware
engineers are automatically entered into Cray’s case tracking system. Additional
records are added as necessary for any maintenance performed on the machine.
Cray uses this tool to record every hardware failure at every site with Cray
machines. They keep an extensive database of each failure, and are thus able to
calculate a variety of metrics.

4.4 Kraken Node Failure Data


Figure 3 shows Kraken’s node failure data over time. Events tagged as memory
errors correspond to bank 4 machine check exceptions. Opteron errors correspond
to all other machine check exceptions. While it is known that not all bank
4 machine check exceptions are due to DIMM failures, they do relate to the
memory subsystem and determining which component failed is not always easy.
Therefore, this method was chosen for the first pass to review the data. Power
events are due to voltage faults in various components. SeaStar errors, in this
context, were errors that caused nodes to fail. Due to the nature of the XT’s high
speed network, it is highly likely that a single failing SeaStar or mezzanine caused
many nodes to fail (in fact, it may cause the whole system to fail). Uncategorized
events correspond to ambiguous events where no root cause was identified or the
failure type is so rare that a new category was not created.
The graph shows expected “infant mortality” with the memory at the be-
ginning of the machine’s lifetime, but there is an interesting long-term drop in
memory errors starting in September and October of 2009. Reviewing the data,
it became clear that this coincides with the September 2009 processor upgrade
of Kraken to AMD hex-core Istanbul. These Istanbul processors have more so-
phisticated “chipkill” ECC. This explains the dramatic decrease in uncorrectable
memory errors. There is also a spike in memory failures in July 2009, to which
a cooling failure and emergency power-off (EPO) contributed. Early anal-
ysis of failure data for JaguarPF (which shares the machine room with Kraken,
and had the same processor upgrade performed) shows a similar spike in July
and a similar drop in failures in late 2009.
Fig. 3. Kraken XT5 Node Failure Causes by Month

After the initial burn-in period and processor upgrade, Kraken’s data fairly
consistently shows more Opteron failures than memory failures. Although CPUs
are a more complicated part, the authors expected to see more failures due to
memory than attributed to CPUs, as was observed in most systems in [10]. One
possible explanation is that more recent ECC memory technologies (Chipkill, SDDC)
and improved memory controllers may have significantly reduced the memory
failure rates. Upon analyzing the Opteron failure data, it was discovered that
most of the errors were attributed to ECC errors in the on-processor cache hier-
archy. Investigating this with Cray and AMD, it was discovered that a recently
released BIOS update should significantly reduce the number of these failures.
Similar to the improvements seen in DRAM over recent years, it is expected
that processor reliability will also improve. Improved error correction algorithms
are being developed and will be built into the processors at various levels, in-
cluding the cache hierarchies. We expect that the processor failure rates will see
drops similar to those observed with the memory. This should more than make
up for the potential increase in failures as processor cache sizes grow.

5 Comparison of Large Cray Systems Resiliency


5.1 Data Collection
We asked several sites with Cray hardware to provide failure statistics about their
machines. To simplify the data collection and avoid categorization ambiguity, the
authors asked for a month-by-month count of unscheduled full-system outages
over the lifespan of the machine, as shown in Figure 4 for Kraken.
Fig. 4. Kraken Unscheduled Reboots per Month

This data does not include scheduled maintenance periods or “environmental”
events that caused outages. With this data, an average monthly failure rate can be
computed by simply averaging all the data points provided. Assuming 30 days per
month, the mean time between failure can be computed by Equation (5).
\[ \mathrm{MTBF}_{\mathrm{days}} = \frac{30 \times \text{months of data}}{\sum \text{failures}} \qquad (5) \]

5.2 The Machines


The collaborating sites were very helpful, and the authors were more successful
than expected in receiving responses. In total, data exists from ten systems span-
ning several generations of Cray hardware. The results are shown in Table 1.

Table 1. Full System Reboot Data

Machine        Site   # Cabinets  TFLOPS  Reboots/Month  MTBF (days)
JaguarPF XT5   ORNL   200         2331    11.4           2.6
Franklin XT4   NERSC  102         356     5.2            5.7
Kraken XT5     NICS   88          1029    4.5            6.6
Jaguar XT4     ORNL   84          260     6.8            4.4
Hopper XE6     NERSC  68          1289    3.8            8.0
Athena XT4     NICS   48          165     2.8            10.7
Kraken XT4     NICS   48          165     3.0            10.0
Raptor XE6     AFRL   30          410     3.0            10.0
Hexagon XT4    NOTUR  15          51      0.6            52.1
Gaea XT6       NCRC   14          260     2.6            11.7
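Applying Equation (5) to the Kraken XT5 row above (4.5 reboots per month) reproduces the reported MTBF within rounding; a minimal sketch:

# Eq. (5) for a given per-month reboot count; 4.5 reboots/month is the
# Kraken XT5 average from Table 1.
def mtbf_days(reboots_per_month, months=1, days_per_month=30):
    return (days_per_month * months) / (reboots_per_month * months)

print(f"{mtbf_days(4.5):.1f} days")   # ~6.7 days; Table 1 reports 6.6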

Figure 5 shows the relationship between system size and failure rate. As one
might expect, increasing the size and complexity of a machine linearly increases
the likelihood of failure. The data also suggests that failure rate is more highly
correlated with the number of components than with the peak performance rating; as individual
components get more powerful, their failure rate does not significantly increase.
Fig. 5. Cray System Failure Rates

6 Future Work
There are several areas that warrant further investigation. The authors have ob-
tained the component-level “Failures per Month” data for Jaguar, and it appears
to match the rough shape of the Kraken data. Comparing it more closely to the
Kraken data would be interesting. A study comparing the individual compo-
nent (CPU, memory) MTBF numbers with what we’ve seen in the field at scale
would be interesting, but it requires obtaining data from vendors. Unfortunately,
vendors do not like to share this. Future work will hopefully include both node
and system failure data between Cray and other machine types, including IBM
BlueGenes, IBM Power series and conventional clusters. This would show if a
particular design or machine is more reliable, at scale, than others. Future work
might examine in more depth why some machines of the same kind that are
similar in size exhibit almost a 30% difference in failure rates (see Figure 5 for
Jaguar XT4 and Franklin XT4 failure rates).

7 Conclusions
In this paper, the authors have shared node and system-wide failure data from
the largest systems in the world and shown that although the number of com-
ponents in the systems is quite large, applications can regularly run full ma-
chine jobs. Vendors and centers have developed techniques, technologies and
approaches to help mitigate the geometrically growing number of parts in the
largest machines.

Acknowledgements. The authors would like to thank XTreme, the Cray sys-
tem administrators user group who facilitated the sharing of site specific machine
uptime data, and Jim Craw, its President during the time of data collection.
Thanks go out to the representatives from each site along with their manage-
ment: Tina Butler and Nick Cardo (NERSC), Joni Viranen (CSC), Hank Kuehn,
Don Maxwell and Buddy Bland (ORNL), Steve Andrews (HECToR/STFC
Daresbury), Lloyd Slonaker (AFRL/RCMT), and Alexander Oltu (Uni). The
authors would also like to thank Steve Johnson and Pete Ungaro from Cray for
their data, support and assistance. Several folks at ORNL helped analyze and ex-
plain the data, including Stephen L. Scott and Robert Harrison, and Rick Mohr
and Troy Baer from The University of Tennessee. Lastly, the authors would like
to thank Phil Andrews for recognizing and suggesting the comparison with the
ideas of Thomas Malthus.

References
1. TeraGrid, http://www.teragrid.org/
2. Piazzalunga, D.: Project Triangle. Figure in public domain, downloaded from,
http://en.wikipedia.org/wiki/File:Project_Triangle.svg
3. Stearley, J.: Defining and Measuring Supercomputer Reliability, Availability, and
Serviceability (RAS). In: 6th LCI Conference on Linux Clusters (April 2005)
4. Top500 Supercomputer Sites, http://top500.org/
5. The Computer Failure Data Repository, http://cfdr.usenix.org/
6. Gottumukkala, N., Nassar, R., Paun, M., Leangsuksun, C., Scott, S.: Reliability
of a System of k Nodes for High Performance Computing Applications. IEEE
Transactions on Reliability 59(1), 162–169 (2010)
7. Johnson, S.: Cray Inc. Personal Communication
8. Andrews, P., Kovatch, P., Hazlewood, V., Baer, T.: Scheduling a 100,000 core
Supercomputer for Maximum Utilization and Capability. In: 39th International
Conference on Parallel Processing Workshops (2010)
9. Becklehimer, J., Willis, C., Lothian, J., Maxwell, D., Vasil, D.: Real Time Health
Monitoring of the Cray XT3/XT4 Using the Simple Event Correlator (SEC). Cray
Users Group (2007)
10. Schroeder, B., Gibson, G.: A Large-Scale Study of Failures in High-Performance
Computing Systems
Simulating Application Resilience at Exascale

Rolf Riesen1, Kurt B. Ferreira2,∗, Maria Ruiz Varela3,


Michela Taufer3 , and Arun Rodrigues2,∗
1
IBM Research
rolf.riesen@ie.ibm.com
2
Sandia National Laboratories
{kbferre,afrodri}@sandia.gov
3
University of Delaware
mruiz@udel.edu,taufer@cis.udel.edu

Abstract. The reliability mechanisms for future exascale systems will


be a key aspect of their scalability and performance. With the expected
jump in hardware component counts, faults will become increasingly
common compared to today’s systems. Under these circumstances, the
costs of current and emergent resilience methods need to be reevaluated.
This includes the cost of recovery, which is often ignored in current work,
and the impact of hardware features such as heterogeneous computing
elements and non-volatile memory devices. We describe a simulation and
modeling framework that enables the measurement of various resilience
algorithms with varying application characteristics. For this framework
we outline the simulator’s requirements, its application communication
pattern generators, and a few of the key hardware component models.

1 Introduction
Parallel scientific applications frequently use coordinated checkpoint and restart
(CCR) to recover from system failures. Failures can be anything from loss of
power, human error, hardware component faults, to software bugs. For an ap-
plication using CCR, all of these failures force it to abort and, at a later time,
to restart from a previous checkpoint. Several studies have shown that this will
not scale much beyond the machines currently in existence [4,8,3,11].
For exascale systems, even if per-component reliability remains the same, the
sheer number of components will lead to frequent faults. Therefore, alternative
methods are needed to enable computational progress of large-scale applications.
Many alternative resilience algorithms have been proposed to replace CCR,
but few have been evaluated thoroughly at large scale, with differently behaving
applications, strong scrutiny of their cost – especially for recovery – and the
impact on application throughput. Recovery is often assumed to be infrequent
and neglected in performance studies. In exascale systems we expect failures

Sandia National Laboratories is a multi-program laboratory managed and operated
by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
for the U.S. Department of Energy’s National Nuclear Security Administration under
contract DE-AC04-94AL85000.


to be common and that cascading failures during recovery might change the
performance characteristics of resilience algorithms substantially.
Another aspect that is sometimes overlooked is that a given resilience algo-
rithm may not be suitable for all types of applications. For example, CCR works
well for self-synchronizing applications, since they already bear the synchroniza-
tion cost necessary to achieve coordination. Other applications do better without
introducing additional synchronization steps.
While exascale systems will not be radically different from today’s supercom-
puters, there are features such as massive multicore CPUs, Solid State Disks
(SSD), and non-volatile random access memory (NVRAM) that have impact on
the performance of resilience algorithms. Application characteristics may also
become different when they adapt to the larger scale and new programming
models. Yet, self-synchronizing legacy applications need to be supported as well.
To evaluate proposed and existing resilience algorithms at scale, simulation
and modeling is needed. In this paper we analyze the requirements for an evalu-
ation framework that lets us measure the performance and overhead of various
resilience algorithms with different application characteristics.
This perspective paper is meant to explore future exascale systems in terms of
modeling and other relevant aspects that have to be considered when studying
application recovery after failures. A goal is to generate a discussion that will
help define the taxonomy of future exascale systems and the tools that will
enable us to study them even before they become available.
We list the requirements we have identified in Section 2, describe our design
in Section 3 and 4, and report on the status of our implementation in Section 5.

2 Requirements
The compromises and restrictions we will have to put into our simulation will
prevent us from being able to make absolute and precise performance predictions.
However, the goal is to make relative performance comparisons among resilience
algorithms under various conditions. For that we need a somewhat accurate
model of data movement within the system, but not the data itself nor the
computations necessary to generate that data.
Before we can design an experiment, we need to get an idea of what a future
exascale system might look like [2,3]. Since we cannot simulate a complete system
at scale in full fidelity, we then need to identify the aspects of a system that have
a measurable impact on the performance of resilience algorithms.

2.1 Future Systems


Exascale machines are predicted to appear before the end of this decade. That
is too far out to make accurate predictions of what such a system will look
like, but not so far that we cannot make some educated guesses. It will not be
a quantum computer and most aspects of the system will be familiar to today’s
users of supercomputers.

We have about five more iterations of Moore’s law ahead of us and can expect
to see about 512 to 1,024 cores per socket in such a system. If the current trend
continues, each core will have relatively weak performance to help with power
consumption and enable the placement of that many cores onto a single die.
The current number one system on the top500 list employs 548,352 cores to
achieve 8 petaflops. The number of cores per CPU, as well as their total number,
will increase to reach an exaflop. These cores will be connected with each other
through a Network on Chip (NoC). Most likely there will be a complex hierarchy
of caches where some cores in the same “neighborhood” share lower-level caches
such as an L2, and groups of cores share L3 caches, and the memories shared by
these cores may not be shared coherently.
The memory hierarchy will be further complicated by some or all of main
memory becoming non-volatile (NVRAM). SSD with faster access times than
spinning media will also be prevalent. Some of that storage will be local, in the
same rack for example, while more of it will be farther away in a dedicated storage
server. Compute accelerators, such as Graphical Processing Units (GPUs), on
the same motherboard or integrated into CPUs will most likely also play a role
in achieving exascale performance by providing additional compute cycles and
processing stream-oriented application kernels.

2.2 Simulation and Modeling

With the above assumptions, it is not possible to simulate such a system with
high fidelity. There are simply too many components and not enough technolog-
ical certainty for a fully detailed simulation in a reasonable amount of time. In
order to make evaluation of different resiliency algorithms possible, we have to
make some compromises. We can leave out some less important aspects and still
arrive at results that are valid when comparing two different resiliency algorithms
for a given type of application.
The first thing we will abandon is an application’s computation. Obviously,
this will save a lot of simulation time by allowing us to dispense with a detailed
processor model or emulation framework. Furthermore, resilience algorithms are
dependent only on two aspects of the computation itself: How much data it
touches and changes over time, and the duration of compute phases between
data exchanges with other cores and nodes; i.e., externally visible state changes.
While we can dispense with computation, we cannot be quite so cavalier with
communication. Cores on a single die will communicate with each other over the
NoC and, nodes will communicate with each other over a system-wide network.
The exact form of communication is less important. Some of that data will be
transferred using MPI, while other data will be written directly into memory.
Because these are externally visible state changing events, resilience algorithms
depend on the timing of these transfers and the amount of data being moved.
Performance, frequency, and location of saving and restoring state depends on
data traffic. However, the actual content of these messages does not matter.
Because moving data is a large overhead and influences resilience algorithm
performance, a fairly accurate simulation of data flowing through a system is
necessary. The simulation needs enough resolution to detect congestion and mea-
sure its impact. The same applies to I/O. State needs to be saved into remote
memory, NVRAM, and SSD devices. While access times to individual memory
banks are too fine grained to track in a simulation of this scale, access competi-
tion to these devices and transfer times do need to be tracked.
Finally, but not least, the simulation needs to provide a method to inject
faults into the system. A form of notifying a resilience algorithm that a node,
socket, core, or link has failed, with the corresponding data loss, is necessary.
But the exact type of failure notification is not that important.

3 Design of Our Simulation Infrastructure


For our exascale fault resilience simulation we have to create several components:
A router that lets us configure a system wide network as well as the NoC for
each socket, a storage device we can use to simulate data flow into and out of
NVRAMs and SSDs, and an end-point component that generates the data traffic
we need. Because of the complexity of this simulation, we also need an automatic
way to generate the large configuration files.

3.1 The Structural Simulation Toolkit (SST)


We use SST, a parallel discrete event simulator developed at Sandia National
Laboratories [12]. SST is a C++ framework to integrate various simulators and
models and connect them via an event network. An XML file is used to configure
SST at startup time. That file specifies what components are to be used and how
these components are connected. Events travel along the links specified in the
configuration file.

3.2 Router Model


Network routers in supercomputers are very simple and small, when compared
to Ethernet routers in a data center. Supercomputer routers usually have few
ports, five or seven for example to create 2-D or 3-D meshes and tori. Often they
use source-based routing where the message itself contains information about
which output port to use. Commonly, they are wormhole routed, and they are
employed by the thousands to create the main network infrastructure of systems
like Cray’s XT series and IBM’s Bluegene machines.
We created an SST component that models such a router at a behavioral level.
A model is faster to compute and easier to write than a gate-level simulation.
And, since we are feeding an approximate data stream into the network, a more
accurate simulation would not provide us with more reliable results.
Figure 1 shows the concept of our router model. The model supports an
arbitrary number of ports. When we use it as a component to create the main
network of our exascale simulation, we configure it (in the SST configuration
file) to have five ports: Four to create the torus topology of the main network,
and a fifth port to connect to a compute node. Incoming traffic is handled in
FIFO order, to preserve message ordering. Messages arrive in the form of events.
The events themselves could contain message data, but since we are not gener-
ating that data, the events only contain the number of bytes the message would
contain. That message length, a configurable router latency, and bandwidth are
used to calculate how long an output port will be occupied. For that duration,
further messages destined for that output port, are queued.
Fig. 1. The router model

Omitting modeling of flow control between routers reduces synchronization
overhead. However, since incoming messages are queued when an output port
is busy, a message traveling through two or more routers cannot move faster than
the bandwidth and latency limitations – as well as other traffic in the network –
allow. Messages can be delayed at the input or output port. A quick stream of
short messages on an input port can be held up by a larger message using the
same output port. This mechanism gives the router model a crude approximation
of flow control. If a message is delayed due to a busy port, congestion statistics
in the delayed event are updated.
The router model accepts, via the SST configuration file, several parameters:
A hop delay specifies the minimum amount of time a message (event) is delayed
when passing through a router (the actual delay may be much larger if there is
congestion on the input or output ports). The bandwidth parameter, together with the
incoming message length, dictates how long a message occupies a port. The
number of ports is also a configurable parameter. Two more parameters are used
for power and thermal modeling. One is the hypothetical frequency this router
runs at, and another dictates which power model to use; SST supports several [5].
The router model can use the amount of traffic and the above parameters to
compute power dissipation.
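The timing behavior described above can be sketched as a small stand-alone model; the following Python fragment illustrates only the port-occupancy calculation, it is not the actual C++ SST component, and all parameter values are assumptions.

# Simplified sketch of the router's behavioral timing model (not the SST C++
# code): a message occupies its output port for hop_delay plus the
# serialization time length/bandwidth; later messages for that port queue up.
class RouterPort:
    def __init__(self, hop_delay_ns, bandwidth_gbps):
        self.hop_delay_ns = hop_delay_ns
        self.bandwidth_bytes_per_ns = bandwidth_gbps / 8.0   # X GB/s equals X bytes/ns
        self.free_at_ns = 0.0            # time at which the port becomes idle

    def forward(self, arrival_ns, length_bytes):
        start = max(arrival_ns, self.free_at_ns)       # wait if the port is busy
        occupancy = self.hop_delay_ns + length_bytes / self.bandwidth_bytes_per_ns
        self.free_at_ns = start + occupancy
        congested = start > arrival_ns                  # would update congestion statistics
        return self.free_at_ns, congested               # departure time of the event

# Example: two back-to-back 64 KiB messages on a 10 GB/s port (assumed values)
port = RouterPort(hop_delay_ns=50, bandwidth_gbps=80)   # 80 Gbit/s = 10 GB/s
print(port.forward(0, 65536))
print(port.forward(0, 65536))   # the second message queues behind the first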
Note that links configured between components, for example between two
routers in the SST configuration file, also have a delay assigned to them. SST
uses that when partitioning the graph of components between processes in a
parallel simulation and to compute event lookahead.
We use the same router model component described so far to also create the
NoC within each socket of our simulation. To keep things simple, we assume the
NoC is also a torus. However, instead of using five-port routers, we allow for addi-
tional ports to connect more than one CPU core to each router in a NoC. A bit in
events traveling between cores attached to the same router indicates local traffic
that moves at higher bandwidth than off-CPU traffic. We assume that these cores
will be communicating through a shared cache instead of making use of the NoC.

¹ The actual delay may be much larger if there is congestion on the input or output ports.

Additionally, when used for a NoC, the router model does not use wormhole
routing. This is a more realistic mode of operation when the NoC also connects
to random access memory, where multiple streams of data can overlap and be
destined for the same device.
Figure 2 is a diagram that shows one possible configuration of a node in our
exascale simulation. The router model is used to build the NoC as well as the
main network that connects the nodes in a system. Each node consists of multiple
SST components described in this section.
The router model is also used as an aggregator to coordinate data traffic to a
single resource. In that configuration, one port connects to the resource, and
all remaining ports connect to users of the resource. We use aggregators to
control access to the main network from each node. Each core can access the
main network, but has to compete with any other core on that node for that
resource. This is akin to multiple cores and CPUs on a node sharing a single
NIC. We also use aggregators to gate access to on-board NVRAM, which is a
shared resource for the cores on that board. Each core also has access to a
storage network to access a nearby SSD. We assume that access to a parallel
file system will happen through the main network, as it does on most of today's
machines. But, we also envision that each rack has some SSD devices for scratch
storage and that nodes in the same rack have access to that storage via a
separate, but local, storage network.

Fig. 2. Combining components for a node

3.3 Storage Component


A requirement to evaluate resilience algorithms is to simulate access to storage.
Two types of storage will be important for these algorithms. In each node we
will have some amount of NVRAM that is accessible to the cores on that node.
Figure 2 shows that we use an aggregator to coordinate access to each NVRAM.
The NVRAM itself is a very simple SST component. It accepts data write re-
quests and queues them. Based on a write speed parameter in the configuration
file, this queue is processed and write acknowledgements are sent back to the
requester when an item has been removed from the queue. There is a similar
queue for read requests and read data events are sent back to requesters based
on the read speed of the device and the number and size of pending requests.
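
A minimal sketch of this write path follows: requests are queued and each is serviced for a time derived from the configured write speed, after which the acknowledgement event would be scheduled. Class and member names are assumed for illustration; they are not the actual component.

#include <cstdint>
#include <queue>

// Sketch of the NVRAM component's write queue; names and units are assumed.
struct WriteRequest {
    uint64_t bytes;
    uint64_t requester;   // where the acknowledgement event would be sent
};

class NvramModel {
public:
    explicit NvramModel(double write_MBps) : write_MBps_(write_MBps) {}

    void enqueue(const WriteRequest& r) { pending_.push(r); }  // incoming write event

    // Remove the head of the queue and return its service time in microseconds;
    // a real component would schedule the acknowledgement event after this delay.
    double service_next() {
        if (pending_.empty()) return 0.0;
        WriteRequest r = pending_.front();
        pending_.pop();
        return r.bytes / write_MBps_;   // bytes / (MB/s) = microseconds
    }

private:
    double write_MBps_;
    std::queue<WriteRequest> pending_;
};
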

For resilience methods that access remote memory, data has to be transferred
using the main network first. Then, one of the cores on a node with direct access
to the local NVRAM has to handle these remote requests.
The second type of storage we provide in our simulation is a “nearby” SSD.
We assume that each rack has some amount of SSD storage that can be used for
temporary data. A rack-wide, local network provides access to that storage. We
use aggregators to build a two-level tree storage network for each rack.
We assume that rack SSD storage has more capacity and is more reliable
than the individual local NVRAMs. A rack will have multiple, redundant power
supplies, and the SSDs will be RAID devices. The NVRAM on a node is quicker
to access and has less contention. But, it also has a smaller capacity and may
become inaccessible if a node, or the network connection to a node, fails.
At this time we have no plans to simulate a remote parallel file server. We
assume that data from the rack SSD can be trickled off to such a server in the
background, if desired. Other research teams are working on full disk simulation
components for SST, including an SSD device, that we will be able to integrate
at a later time, if necessary.

3.4 Communication Pattern Generators and Applications


We have, in Section 2, mentioned that we cannot afford the cost of running or
simulating the application processes that are the endpoints of our network in-
frastructure. Yet, we do need these endpoints; i.e., the processes running on each
core, to transmit and receive data. For our evaluation of resilience algorithms, we
need the approximate timing of these transmissions and their causal ordering.
In other words, we need what we call communication pattern generators.
What we mean by a communication pattern is illustrated in Figure 3. It
shows an example from the NAS parallel benchmark suite. For each rank in the
computation, the diagram shows (approximately) how many messages each rank
sends to all other ranks. Similar plots can be generated for the amount of data
sent between ranks [10].
Generating such patterns for a variety of applications and benchmarks is
not difficult. For example, a five-point stencil computation using ghost (halo) cells
to exchange data with neighboring ranks has the following loop structure: Send
data to four neighbors, wait for data from these neighbors, perform computation,
repeat. Once in a while a collective operation, such as an allreduce, is inserted
to determine whether convergence of the result has been achieved.
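
Written out as plain pseudocode, the loop structure of such a pattern generator looks roughly like the sketch below; the printf calls stand in for simulator events and are not a real API.

#include <cstdio>

// Rough shape of the five-point-stencil ghost-cell pattern described above.
void ghost_cell_pattern(int iterations, int allreduce_every)
{
    const int num_neighbors = 4;  // up, down, left, right on a 2-D grid
    for (int it = 0; it < iterations; ++it) {
        for (int n = 0; n < num_neighbors; ++n)
            std::printf("send halo to neighbor %d\n", n);       // post sends
        for (int n = 0; n < num_neighbors; ++n)
            std::printf("wait for halo from neighbor %d\n", n); // receive halos
        std::printf("compute on local grid\n");                 // compute delay
        if ((it + 1) % allreduce_every == 0)
            std::printf("allreduce to test for convergence\n"); // occasional collective
    }
}
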
Many similar, fundamental patterns exist that are employed by applications
today. Our simulator is capable of producing all of them, as long as there is no data
dependency; i.e., as long as the communication pattern does not change based
on the content of these messages. Currently we have communication patterns
for a five-point stencil ghost cell exchange, and two micro benchmarks: a ping-pong
pattern that measures latency and bandwidth, and a message rate benchmark.
We also have a state machine that implements a barrier operation. The only
resilience algorithm implemented so far is CCR.

Additional communication patterns we are going to implement include the
behavior of the NAS parallel benchmark programs FT and IS, as well as some
patterns that originate from various parallel graph algorithms; e.g., [9].

Fig. 3. Communication patterns for NAS MG, class C, 64 nodes [10] (axes: Destination Node vs. Source Node; color scale: # messages)

3.5 Implementing Pattern Generators

SST is an event-driven parallel simulator. Each component that is integrated
into the SST framework needs to process the events it receives and then relinquish
control back to SST so that the overall simulation can proceed. A natural
way of expressing and implementing the communication patterns we need to
drive our simulations is state machines. Event processing is an integral part
of state machines. Therefore, that choice is simple.
What makes this choice a little bit difficult is the state explosion when we
combine a communication pattern generator, a resilience algorithm, collective
operations, and the handling of asynchronous I/O events and faults. What
starts out with a handful of states to express a nearest neighbor data ex-
change becomes much more complicated when some events from a collective
operation, that has already started on another rank, arrive early. Even more
states are needed to process the requirements of the resilience algorithm under
evaluation. The algorithm will generate I/O and completion events will arrive
asynchronously. A state machine describing all these possibilities will grow very
complex quickly.
In addition, we want to use the same communication pattern with different
resilience algorithms and, perhaps, different implementations of collective oper-
ations. The solution we have chosen is that of a gate keeper. It is a C++ SST
component from which all communication patterns inherit. Each instantiated
communication pattern component specifies what other services it needs; e.g.
which collective operations it will perform. The resilience algorithm is chosen
through the configuration file. All of these individual, relatively simple state
machines, register an event handler with the gate keeper.
In some respects, the individual state machines are all subroutines of the com-
munication pattern generator. At any given time, only one of those state machines
is active. When a new event arrives at the gate keeper, it determines which state
machine needs to receive that event. If that state machine is currently running,
then the event is delivered right away. For currently inactive state machines,
(early) events are queued. The gate keeper component provides functions for
state machines to call each other and to return to a previous caller. Whenever a
state machine change occurs, pending events for the newly active state machine
are delivered by the gate keeper.
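
The sketch below illustrates the gate-keeper mechanism just described: state machines register handlers, events for the active machine are delivered immediately, and early events for inactive machines are queued until those machines become active. It illustrates the idea only and is not the actual C++ SST component.

#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct Event { int machine_id; uint64_t payload; };   // target state machine and data
using Handler = std::function<void(const Event&)>;

class GateKeeper {
public:
    // Each simple state machine registers an event handler and gets an id.
    int register_machine(Handler h) {
        handlers_.push_back(std::move(h));
        queues_.emplace_back();
        return static_cast<int>(handlers_.size()) - 1;
    }
    // Switching the active machine delivers any events queued for it early.
    void activate(int id) {
        active_ = id;
        auto& q = queues_[id];
        while (!q.empty()) { handlers_[id](q.front()); q.pop(); }
    }
    // Deliver to the active machine right away; queue for inactive ones.
    void deliver(const Event& e) {
        if (e.machine_id == active_) handlers_[e.machine_id](e);
        else queues_[e.machine_id].push(e);
    }
private:
    int active_ = -1;
    std::vector<Handler> handlers_;
    std::vector<std::queue<Event>> queues_;
};
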

4 Tying It All Together

In the XML configuration file for SST, every component to be used, every link
between any two components, and the parameters for each component need to
be specified. The file structure allows for a common shared parameter, among a
set of components, to be specified only once. Nevertheless, these files, for a large
simulation, are too big to be created manually. Therefore, we wrote a separate
program to create configuration files specific to the subset of SST components
used for our experiments.
The configuration generator takes command line arguments, for example for main network
bandwidth, and inserts them into the appropriate places in the XML
file. The choice of which communication pattern to use is also a command line
option, while several other things are hard coded into the generator.
For example, the generator takes as command line parameters the X and Y
dimension of the main network and a separate pair of parameters for the NoC on
each node. But, it is currently hard coded to generate tori. This makes several
things a lot simpler, including source route generation, and should suffice for our
initial experiments.
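
The fragment below sketches what such a generator does for the router components of an X-by-Y torus: it expands the command-line parameters into one entry per router. The XML element and parameter names are placeholders for illustration and do not reproduce the actual SST configuration schema.

#include <cstdio>

// Emit one illustrative <component> entry per router of an X x Y torus.
// Element and parameter names are placeholders, not the real SST schema.
void emit_routers(std::FILE* out, int dim_x, int dim_y,
                  double bandwidth_GBps, double hop_delay_ns)
{
    for (int y = 0; y < dim_y; ++y) {
        for (int x = 0; x < dim_x; ++x) {
            std::fprintf(out,
                "  <component name=\"router_%d_%d\" type=\"routermodel\">\n"
                "    <param name=\"num_ports\">5</param>\n"
                "    <param name=\"bandwidth\">%g</param>\n"
                "    <param name=\"hop_delay\">%g</param>\n"
                "  </component>\n",
                x, y, bandwidth_GBps, hop_delay_ns);
        }
    }
}
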
The simulation design described in this paper has several limitations. Some
stem from the core design. These include the inability to create communication
patterns that are data dependent, computation delays between communications
that vary with the results of computation, and the inaccuracies introduced by
using models instead of more fine-grained simulation.
Other limitations are caused by our implementation choices and the configu-
ration generator. These include things like the fixed topologies and the architec-
ture of the I/O and memory subsystem. These are more easily corrected than
our design choices by adapting our code to the new requirements.

5 Work in Progress

Work in progress includes the study of simple resilience algorithms for exascale
systems, beyond the 256k cores we have already tested, with the support of the
framework proposed in this paper. Future work includes integrating a larger set
of resilience algorithms and a broader range of applications with their communi-
cation patterns in the framework.
For resilience algorithms and methods we plan to look at uncoordinated check-
point restart with message logging, log-based rollback-recovery mechanisms [6],
the RAID-like approach taken by SCR [7], and communication induced check-
pointing [1].
Validation of a complex simulation tool like ours is of course made extremely
difficult by the lack of existing exascale systems. Nevertheless, we plan to use
micro-benchmarks to calibrate various parameters and models built into our sim-
ulation by comparing them against existing systems. Then we will run bench-
marks and applications that our communication patterns are meant to mimic
on existing, large-scale systems, and compare the results with our simulations.
We will be able to do this using individual multicore CPUs, large clusters, and
clusters containing multicore CPUs. Viewing and comparing the actual results
with our simulations from these different angles will provide us with an indication
of the validity of our approach. Scaling experiments within the range of systems
available to us will further assist with validation and provide us with error bars
for simulations at exascale.
We are building the simulation infrastructure to evaluate resilience algorithms.
However, the same infrastructure will be suitable for evaluation of many different
aspects of exascale computing. We have started to investigate projects in the
area of programming models and application performance on a heterogeneous
network where not all components are (virtually) fully connected.
SST, including the components described in this paper, is open source and
freely available.

References
1. Alvisi, L., Elnozahy, E.N., Rao, S., Husain, S.A., Mel, A.D.: An analysis of com-
munication induced checkpointing. In: FTCS (1999)
2. Bergman, K., et al.: Exascale computing study: Technology challenges in achieving
exascale systems (2008)
3. Bianchini, R., et al.: System resiliency at extreme scale (2009)
4. Elnozahy, E., Plank, J.: Checkpointing for peta-scale systems: a look into the fu-
ture of practical rollback-recovery. IEEE Transactions on Dependable and Secure
Computing 1(2) (2004)
5. Hsieh, M., Thompson, K., Song, W., Rodrigues, A., Riesen, R.: A framework
for architecture-level power, area and thermal simulation and its application to
network-on-chip design exploration. In: 1st Intl. Workshop on Performance Mod-
eling, Benchmarking and Simulation of High Performance Computing Systems,
PMBS 2010 (November 2010)
6. Maloney, A., Goscinski, A.: A survey and review of the current state of rollback-
recovery for cluster systems. Concurrency and Computation: Practice and Experi-
ence (April 2009)
7. Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling, and
evaluation of a scalable multi-level checkpointing system. In: SC (2010)
8. Oldfield, R.A., Arunagiri, S., Teller, P.J., Seelam, S., Varela, M.R., Riesen, R.,
Roth, P.C.: Modeling the impact of checkpoints on next-generation systems. In:
24th IEEE Conference on Mass Storage Systems and Technologies (September
2007)
9. Ribeiro, P., Silva, F., Lopes, L.: Efficient parallel subgraph counting using g-tries.
In: Cluster Computing (2010)
10. Riesen, R.: Communication patterns. In: Workshop on Communication Architec-
ture for Clusters CAC 2006 (April 2006)
11. Riesen, R., Ferreira, K., Stearley, J.: See applications run and throughput jump:
The case for redundant computing in HPC. In: 1st Intl. Workshop on Fault-
Tolerance for HPC at Extreme Scale, FTXS 2010 (June 2010)
12. Rodrigues, A., Cook, J., Cooper-Balis, E., Hemmert, K.S., Kersey, C., Riesen,
R., Rosenfield, P., Oldfield, R., Weston, M., Barrett, B., Jacob, B.: The structural
simulation toolkit. In: 1st Intl. Workshop on Performance Modeling, Benchmarking
and Simulation of High Performance Computing Systems, PMBS 2010 (November
2010)
Framework for Enabling System Understanding

J. Brandt1, F. Chen1, A. Gentile1, Chokchai (Box) Leangsuksun2, J. Mayo1,
P. Pebay1, D. Roe1, N. Taerat2, D. Thompson1, and M. Wong1

1 Sandia National Laboratories, Livermore, CA, USA
{brandt,fxchen,gentile,jmayo,pppebay,dcroe,dcthomp,mhwong}@sandia.gov
2 Louisiana Tech University, Ruston, LA, USA
{box,nta008}@latech.edu

Abstract. Building the effective HPC resilience mechanisms required for
viability of next generation supercomputers will require in-depth under-
standing of system and component behaviors. Our goal is to build an inte-
grated framework for high fidelity long term information storage, historic
and run-time analysis, algorithmic and visual information exploration to
enable system understanding, timely failure detection/prediction, and trig-
gering of appropriate response to failure situations. Since it is unknown
what information is relevant and since potentially relevant data may be ex-
pressed in a variety of forms (e.g., numeric, textual), this framework must
provide capabilities to process different forms of data and also support the
integration of new data, data sources, and analysis capabilities. Further,
in order to ensure ease of use as capabilities and data sources expand, it
must also provide interactivity between its elements. This paper describes
our integration of the capabilities mentioned above into our OVIS tool.

Keywords: resilience, HPC, system monitoring.

1 Introduction

Resilience has become one of the top concerns as we move to ever larger high
performance computing (HPC) platforms. While traditional checkpoint/restart
mechanisms have served the application community well, it is accepted that they
are high-overhead solutions that won't scale much further. Research into alter-
native approaches to resilience includes redundant computing, saving checkpoint
state in memory across platform resources, fault tolerant programming models,
calculation of optimal checkpoint frequencies based on measured failure rates,
and failure prediction combined with process migration strategies. In order to
predict the probability of success of these methods on current and future sys-
tems we need to understand the increasingly complex system interactions and
how they relate to failures.

These authors were supported by the United States Department of Energy, Office of
Defense Programs. Sandia is a multiprogram laboratory operated by Sandia Corpo-
ration, a Lockheed-Martin Company, for the United States Department of Energy
under contract DE-AC04-94-AL8500.


The OVIS project [13], at Sandia National Laboratories, seeks to develop
failure models that, combined with run-time data collection and analysis, can
be used to predict component failure and trigger failure mitigating responses
such as process migration, checkpointing, etc. OVIS [2,3] is comprised of five
main components: 1) numeric data collection and storage infrastructure, 2) text
based information collection and storage, 3) numeric analysis and response, 4)
text based analysis, and 5) user interface and visualization. In this paper we
first describe the current OVIS infrastructure with respect to these components
including new capabilities for data/information gathering and interactive explo-
ration that have been recently developed, including integration of the Baler [5]
tool and development of Baler visualizations. Next we provide use case con-
text for its utility in enabling researchers to aggregate available system related
information into a single infrastructure and manipulate it to gain insight into
operational behaviors related both to normal and failure modes. We contrast
this with other currently available tools built for providing similar insight and
discuss their strengths and shortcomings.

2 Approaches

2.1 Numeric Information Collection and Storage

Numeric information is gathered from various sources in different ways, as described
in these subsections. Storage, however, is all accomplished through insertion
into a distributed database, where division of storage is done on the basis of
which compute node the information is being collected on behalf of (e.g., Node
A's numeric information will be stored into the same database whether it is
collected raw from the node itself or harvested from log files stored on other
servers). Scalability of the system is accomplished via this distributed storage architecture
in that there is no duplication of raw information across the databases.
User-Defined Samplers. OVIS provides an interface by which the user can
write information samplers. The OVIS framework handles invocation of the sam-
plers and insertion of the resultant data into the OVIS database. OVIS is released
with samplers that process, for example, /proc, lmsensors [11], and EDAC [8]
data. Recent advancements include flexible regular-expression matching for dy-
namic discovery of data fields within known sources and of relative system com-
ponents to ease user development of samplers.
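
As a generic illustration of what a sampler amounts to (not the OVIS sampler API), the sketch below reads one value from /proc/meminfo and emits a timestamped (time, metric, value) record.

#include <ctime>
#include <fstream>
#include <ostream>
#include <sstream>
#include <string>

// Generic illustration of a numeric sampler: read Active memory from
// /proc/meminfo and emit a (timestamp, metric name, value) record.
// This is not the OVIS sampler interface.
void sample_active_memory(std::ostream& sink)
{
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        if (line.compare(0, 7, "Active:") == 0) {
            std::istringstream fields(line.substr(7));
            long kb = 0;
            fields >> kb;                                     // value reported in kB
            sink << std::time(nullptr) << ",Active_kB," << kb << "\n";
            return;
        }
    }
}
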
Samplers can collect data at regular intervals or be prodded to sample on
demand. Examples, motivated by previous work where a system was insufficiently
releasing claimed memory [1], include prodding a sampler to collect Active
Memory utilization during the epilog script of a job (to determine if memory was
released) or before the resource manager places a new job on a node (to deter-
mine if that node should be allocated or instead placed offline). An additional
sampler exists that can record a string into the database on demand. While the
current OVIS analyses operate on numeric data, this timestamped text data can
be used as a marker by which to guide analyses. For example, one could insert
markers for the entry and exit of a particular part of a code and then use that
information to determine the times for an analysis.
While these samplers currently run a database client on the nodes, we have
recently implemented a distributed metric service which propagates information
to the database nodes itself for insertion.
Metric Generator. OVIS provides a utility called the Metric Generator that
enables a user to dynamically create new data tables and insert data. The in-
terface allows the user to specify optional input data, a user-defined script to
generate the additional data, and a new output data table. The script can oper-
ate on both metrics in the database and on external sources. Examples include
a) scripts that grep for error messages in log files and insert that occurrence into
the database, thus converting text to numeric representations and b) more so-
phisticated analyses, such as gradients of CPU temperature. This enables rapid
development of prototype analyses. These scripts can be run at run-time or after
the fact, still with the ability to inject data timestamped to be concurrent with
earlier data in the system.
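
A derived-metric script of the kind described above can be as small as the following sketch, which counts occurrences of an error string in a log file and prints the count as a timestamped numeric value; the output format is illustrative only.

#include <ctime>
#include <fstream>
#include <iostream>
#include <string>

// Count occurrences of a pattern in a log file and emit the count as a
// timestamped numeric metric (illustrative format: time,name,value).
int main(int argc, char** argv)
{
    if (argc < 3) {
        std::cerr << "usage: " << argv[0] << " <logfile> <pattern>\n";
        return 1;
    }
    std::ifstream log(argv[1]);
    std::string pattern = argv[2], line;
    long count = 0;
    while (std::getline(log, line))
        if (line.find(pattern) != std::string::npos)
            ++count;                                   // text occurrence -> number
    std::cout << std::time(nullptr) << "," << pattern << "_count," << count << "\n";
    return 0;
}
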
Resource Manager. Resource Manager (RM) data includes information about
job start and end times, job success/failure, user names, job names etc. OVIS
does not collect RM data, but rather intends to natively interface to a variety of
RMs’ native representations. Currently, OVIS natively interfaces to databases
produced by SLURM [15] (and associated tools) and enables search capabilities
upon that data as described in Section 2.4.

2.2 Text Based Information Collection and Storage


Text based information is currently collected from three sources: resource man-
ager logs, syslogs, and console logs.
As OVIS is intended to natively interface with system resource managers
(Section 2.1), the only reason storage would be needed is if, for performance
reasons, the system administrators would prefer the RM’s database to be repli-
cated. Replication can be accomplished either to a separate RM database node
or to one of the database nodes being used for numeric storage depending on
the relative amount of activity in each.
OVIS uses LATech’s Baler [5] log file analysis engine as well as custom metric
generator scripts to interact with both syslog and console log files. Thus collec-
tion and storage of this information is performed in two ways and incurs the
cost of some redundancy for performance purposes. These log files are typically
stored in flat file format on a server associated with a particular HPC system.
Thus we periodically copy these files from the server to one of the database nodes
in our system as flat files.

2.3 Numeric Analysis


In order to provide a standard interface for development of new analysis engines
for manipulating numeric information and to leverage existing high performance
parallel analysis engines, we adopted the Visualization Toolkit (VTK) statistical
analysis interface. Current analysis engines available in OVIS are: descriptive
statistics, multi-variate correlation, contingency statistics, principal component
analysis, k-means, and wear rate analysis. These analyses allow the user to un-
derstand statistical properties and relative behaviors of collected and derived
numeric information which can in turn provide insight into normal vs. abnormal
interactions between applications and hardware/OS. These analysis engines can
be accessed via a python script interface, or a high performance C++ interface
either directly or through OVIS’s GUI. Additionally the user can write scripts,
described in Section 2.1, that manipulate numeric information from any source.

2.4 Text Based Analysis


Resource Manager. RM information analysis capabilities include the ability
to search on various quantities such as user name, job end state, resources par-
ticipating in a job, etc.; drill-down graphical capabilities for displaying the
distribution of users, end states, etc. for a set of jobs; and some basic statistics
for a set of jobs, such as size and duration. In general, such infor-
mation is insufficient as a sole source of exploratory information. For example,
in the out-of-memory analysis [1], the onset of the problem condition actually
occurs during jobs that successfully complete, with job failure as a longer-term
effect. The Resource Manager analysis capability rather is intended to provide
information that can be used to guide analysis. For example, an analysis of CPU
Utilization may be relevant only over the processors of a given job and for the
duration of that job; job failure information can be used to narrow down times
and nodes of interest in which to search.
Baler Log File Analysis. Textual analysis is provided within OVIS via inte-
gration of the Baler [5], a tool for log file analysis. Baler discovers and extracts
patterns from log messages with the goal of reducing large log data sources into
a reasonable number of searchable patterns. Its algorithm is efficient compared
to existing log analysis tools in that it requires only one pass as opposed to con-
tingency statistics algorithms which require several passes over the data. Baler
generates patterns based on message context resulting in deterministic output
patterns irrespective of input data variance, unlike contingency statistics algo-
rithms which depend upon the variance. This results in a consistent pattern rep-
resentation for a given message over time, facilitating consistent understanding
over long-term data exploration. Baler stores pattern representations as unique
numbers; such time-pattern-number tuples are consistent with the OVIS metric
data storage. Thus complete integration of OVIS and Baler will enable the same
statistical analysis and visualization capabilities for log file analysis as OVIS
currently supports for its numerical data.

3 Applications
This section describes the use of OVIS's integrated, interactive capabilities to enable
system understanding.

Fig. 1. OVIS-Baler integration user interface. The OVIS-Baler integration provides in-
teracting capabilities for log pattern analysis and visualization, numerical data analysis
and visualization, and job log search.

System Diagnosis of Precursor to Failure. In previous work [1], we demonstrated
that a statistically discoverable precursor to failure exists for one of our clusters.
Proof that the discoverable condition was related to failure required cross
referencing with job and log data. At the time, this cross referencing was a manual
process, not integrated within OVIS. We have developed the additional capabil-
ities within the OVIS framework to enable the required cross referencing with
job and log data.
This is illustrated in Figure 1, which is a screenshot of the integrated ca-
pabilities addressing system data exploration. Zoom-ins on various sections are
presented in the subsequent figures. In the lower right and in Figure 2, the log
search is used to investigate the status of jobs. Note that while the highlighted
job completes, subsequent jobs on Glory 234 fail. In the upper left and in Figure 3
(top), the Baler patterns are displayed. Results for both the oom-killer and the
out of memory patterns are displayed. In previous work [5], the Baler pattern
for oom-killer was presented, contrasting with other tools where the pattern
was either missed or presented in a redundant fashion so that it was difficult
to determine the number of oom killer events. The ability of Baler to discover
patterns, as opposed to merely enabling filtering on pre-defined patterns makes
it possible to obtain understanding for this problem that might have otherwise
been missed. The occurrences of the patterns with node and time information are
displayed in the upper right and in Figure 3 (bottom). No error messages with
these patterns occur during the completed job, however the error messages occur
during the subsequent failed job, as is explicitly presented in the mouse over.
In the lower left and in Figure 4, the OVIS display shows system data on a
physically accurate representation of the cluster.

Fig. 2. OVIS Resource Manager (RM) view. Job information is searchable and is
shown. Selecting a job automatically populates an analysis pane and the 3D view
with job-relevant data (Figure 4).

Job-centric views are supported, as in this figure, where the highlighted (completed) job in
the job-search display is dropped upon the physical display, limiting the col-
ored nodes to only those participating in the job. It is seen that one of the nodes
(Glory 234, colored red and circled) has significantly higher Active Memory than
any of the other nodes participating in the job. Scrolling through time indicates
that the node has high Active Memory, even during idle times on the node, and
during the subsequent failed job.
Note that any one data source is insufficient to understand the entire situation.
The Resource Manager data shows that there may be a problem on Glory 234,
but it does not elucidate the cause of the problem. The log data shows that
a possible cause of job failure is an out of memory condition on Glory 234,
but it does not indicate the onset of the problem, nor whether this is due to a
naturally occurring large demand for memory. The physical visualization with
outlier indication shows the onset and duration of the abnormal behavior on
the node, but does not directly tie it to a failure condition. The combination of
all three pieces, each providing a different perspective and each working upon
a different data source, is necessary for overall system understanding. (This is in
contrast to the condition detection itself, which can be done purely by statistical
means.)

Alternative Views of Similar Information. In some cases the same or simi-


lar information is available, but through different sources or in different formats.
For instance the Error Detection and Correction (EDAC) [8] Memory Controller
(MC) driver module provides an interface to DIMM error information. This in-
formation is made available through system files, a command line interface, and in
the syslog files.

Fig. 3. Baler Pattern view (top) provides search capabilities and drill down viewing
of wild card values. Out of Memory related meta-patterns, determined by Baler,
are shown. Baler Event view (bottom) shows color coded events in time and space.
Mouseover highlights and displays message patterns. Some messages for the out of
memory condition on a node are shown.

In the system file representation a separate file is written for each row and
channel, which can be mapped to slots for DIMMs and/or physical CPU sockets.
Each of these files (e.g., .../edac/mc/mcX/csrowY/ue_count) contains counters
of errors that have occurred. The same information can be extracted via command
line calls. In the syslog output, such errors are reported as: Feb 20 12:41:22
glory259 EDAC MC1: CE page 0x4340a1, offset 0x270, grain 8, syndrome 0x44,
row 1, channel 0, label DIMMB 2A: amd64_edac. This presents the row, channel,
DIMM, and error categorization, but in a different format.
In OVIS, the same innate information is harvested from the two different
sources and processed in complementary but different fashions.

Fig. 4. OVIS physical cluster display. An outlier in Active Memory is seen (red, circled)
across nodes in this job (colored nodes). Job selection in the RM view (Figure 2)
automatically populates a) the analysis pane with relevant nodes and time and b) the
3D view with nodes.

Baler reduces specific occurrences of this pattern to: EDAC *: CE page * offset
* grain *, syndrome * row * channel * label * *: amd64_edac, with the * indicating
wild card entries. This enables the user to get the number and, in the display,
the node-level placement of the memory errors with respect to time. However,
it hides the exact mapping to the hardware. While that information could be
extracted, such a procedure would be antithetical to the goal of Baler, which
seeks to provide pattern information to the user without requiring input on the
message format. In contrast, OVIS samplers collect the EDAC information via
the file sources inherently with the hardware associations. OVIS is capable of
presenting this information at the socket or DIMM level. In general, OVIS will
collect and enable investigation at as high a fidelity of representation as the
user cares to provide.

4 Related Work

There has been much work done and various tools built within the HPC com-
munity with respect to information collection, analysis, and visualization, some
of which we describe here. In each case, however, only a portion of the wealth of
information available has been harvested and hence the understanding that can
be realized is limited. By contrast we seek through the OVIS project to create an
integration framework for available information and tools to, through knowledge
extraction and system interaction, build more resilient HPC platforms.
Both Ganglia [9] and VMware’s Hyperic [10] provide a scalable solution to
capturing and visualizing numeric data on a per-host basis using processes run-
ning on each host and retrieving information. While Ganglia uses a round robin
database [14] to provide fine grained historic information over a limited time win-
dow and coarser historic information over a longer time, Hyperic uses a server-hosted
database. Each retains minimal information long term. While Hyperic,
unlike Ganglia, supports job based context in order to present more job centric
analysis or views, neither has support for complex statistical analysis. Ganglia
is released under a BSD license making it an attractive development platform
while Hyperic releases a stripped down GPL licensed version for free and a full
featured version under a commercial license.
Nagios [12] is a monitoring system for monitoring critical components and
their metrics and triggering alerts and actions based on threshold based direc-
tives. It provides no advanced statistical analysis capability nor the infrastruc-
ture for performing detailed analysis of logs, jobs, and host based numeric metrics
in conjunction. Nagios Core is GPL licensed.
Splunk [16] is an information indexing and analysis system that enables ef-
ficient storage, sorting, correlating, graphing, and plotting of both historic and
real time information in any format. Stearley et al give a variety of examples
of how Splunk can be used to pull information from RM’s, log files, and nu-
meric metrics and present a nice summary including descriptive statistics about
numeric metrics [4]. Some missing elements though are numeric data collection
mechanisms, spatially relevant display, and a user interface that facilitates drag
and drop type exploration. Like Hyperic, Splunk provides a free version with
limited data handling capability and limited features as well as a full featured
commercial version where the cost of the license is tied directly to how much
data is processed.
Analysis capabilities outside of frameworks exist. Related work on algorithms
for log file analysis can be found in [5], as the algorithmic comparison is not
directly relevant to this work. Lan et al have explored use of both principal
component analysis and independent component analysis [7] [6] as methods for
identifying anomalous behaviors of compute nodes in large scale clusters. This
work shows promise and would be more useful if the analyses were incorporated
into a plug and play framework such as OVIS where these and other analysis
methods could be easily compared using the same data.

5 Conclusions
While there are many efforts under way to mitigate the effects of failures in
large scale HPC systems, none have built the infrastructure necessary to explore
and understand the complex interactions of components under both non-failure
and failure scenarios nor to evaluate the effects of new schemes with respect to
these interactions. OVIS provides such an infrastructure with the flexibility to
allow researchers to add in new failure detection/prediction schemes, visualize
interactions and effects of utilizing new schemes in the context of real systems
either from the perspective of finding when prediction/detection would have
happened and validating that it is correct or by comparing operation parameters
both with and without implementation of such mechanism(s).

References
1. Brandt, J., Gentile, A., Mayo, J., Pebay, P., Roe, D., Thompson, D., Wong, M.:
Methodologies for Advance Warning of Compute Cluster Problems via Statistical
Analysis: A Case Study. In: Proc. 18th ACM Int’l Symp. on High Performance
Distributed Computing, Workshop on Resiliency in HPC, Munich, Germany (2009)
2. Brandt, J., Gentile, A., Houf, C., Mayo, J., Pebay, P., Roe, D., Thompson, D.,
Wong, M.: OVIS-3 User’s Guide. Sandia National Laboratories Report, SAND2010-
7109 (2010)
3. Brandt, J., Debusschere, B., Gentile, A., Mayo, J., Pebay, P., Thompson, D., Wong,
M.: OVIS-2 A Robust Distributed Architecture for Scalable RAS. In: Proc. 22nd
IEEE Int’l Parallel and Distributed Processing Symp., 4th Workshop on System
Management Techniques, Processes, and Services, Miami, FL (2008)
4. Stearley, J., Corwell, S., Lord, K.: Bridging the gaps: joining information sources
with Splunk. In: Proc. Workshop on Managing Systems Via Log Analysis and
Machine Learning Techniques, Vancouver, BC, Canada (2010)
5. Taerat, N., Brandt, J., Gentile, A., Wong, M., Leangsuksun, C.: Baler: Determin-
istic, Lossless Log Message Clustering Tool. In: Proc. Int’l Supercomputing Conf.,
Hamburg, Germany (2011)
6. Lan, Z., Zheng, Z., Li, Y.: Toward Automated Anomaly Identification in Large-
Scale Systems. IEEE Trans. on Parallel and Distributed Systems 21, 174–187 (2010)
7. Zheng, Z., Li, Y., Lan, Z.: Anomaly localization in large-scale clusters. In: Proc.
IEEE Int’l Conf. on Cluster Computing (2007)
8. EDAC. Error Detection and Reporting Tool see, for example, Documentation in the
Linux Kernel (linux/kernel/git/torvalds/linux-2.6.git)/Documentation/edac.txt
9. Ganglia, http://ganglia.info
10. Hyperic. VMWare, http://www.hyperic.com
11. lm-sensors, http://www.lm-sensors.org/
12. Nagios, http://www.nagios.org
13. OVIS. Sandia National Laboratories, http://ovis.ca.sandia.gov
14. RRDtool, http://www.rrdtool.org
15. SLURM. Simple Linux Utility for Resource Management,
http://www.schedmd.com
16. Splunk, http://www.splunk.com
Cooperative Application/OS DRAM Fault
Recovery

Patrick G. Bridges1, Mark Hoemmen2,∗, Kurt B. Ferreira1,2,∗∗,
Michael A. Heroux2,∗∗, Philip Soltero1, and Ron Brightwell2,∗∗

1 Department of Computer Science, University of New Mexico, Albuquerque, NM 87131
{bridges,kurt,psoltero}@cs.unm.edu
2 Sandia National Laboratories, Albuquerque, NM 87123
{mhoemme,kbferre,maherou,rbbrigh}@sandia.gov

Abstract. Exascale systems will present considerable fault-tolerance
challenges to applications and system software. These systems are ex-
pected to suffer several hard and soft errors per day. Unfortunately, many
fault-tolerance methods in use, such as rollback recovery, are unsuitable
for many expected errors, for example DRAM failures. As a result, appli-
cations will need to address these resilience challenges to more effectively
utilize future systems. In this paper, we describe work on a cross-layer
application / OS framework to handle uncorrected memory errors. We
illustrate the use of this framework through its integration with a new
fault-tolerant iterative solver within the Trilinos library, and present ini-
tial convergence results.

Keywords: Fault Tolerance, DRAM Failure, Fault-Tolerant GMRES.

1 Introduction
Proposed exascale systems will present extreme fault tolerance challenges to
applications and system software. In particular, these systems are expected to
suffer soft or hard errors at least several times a day. Such errors include un-
correctable DRAM failures, I/O system failures, and CPU logic failures. Un-
fortunately, fault-tolerance methods currently in use by large-scale applications,
such as roll-back recovery from a checkpoint, may be unsuitable to address the
challenges of exascale computing. As a result, applications will need to address

This work was supported in part by a faculty sabbatical appointment from Sandia
National Laboratories and a grant from the U.S. Department of Energy Office of Sci-
ence, Advanced Scientific Computing research, under award number DE-SC0005050,
program manager Sonia Sachs.

Sandia National Laboratories is a multiprogram laboratory managed and operated
by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
for the U.S. Department of Energy’s National Nuclear Security Administration under
contract DE-AC04-94AL85000.


resilience challenges previously handled only by the hardware, OS, and run-time
system if they want to utilize future systems efficiently. Unfortunately, few in-
terfaces and mechanisms exist to provide applications with useful information
on the faults and failures that affect them. This is particularly true of DRAM
failures, one of the most common failures in current large-scale systems [19], but
for which only low-level models of failure and recovery currently exist.
In this paper, we describe work on a collaborative application / OS system to
handle uncorrected memory errors. We begin by reviewing the basics of DRAM
memory failures and how they are handled in current systems. We then discuss
specific models of memory failures that we are examining for application design
and system implementation purposes, how the application, OS, and hardware
can interact under these failure models, and how applications recover in this
scenario. Based on this, we then present a simple OS and hardware interface
to provide the information to applications necessary to handle these errors, and
we outline an implementation of this interface. We illustrate the use of this in-
terface through its integration with a new fault-tolerant iterative linear solver
implemented with components from the Trilinos solvers library [9], and present
initial convergence results showing the viability of this recovery approach. Fi-
nally, we discuss related and future work.

2 DRAM Failures
DRAM memory modules are one of the most plentiful hardware items in modern
HPC systems. Each node may have dozens of DRAM chips, and large systems
may have tens or hundreds of thousands of DRAM modules. The combination
of the quantity and the density of the information they store makes them par-
ticularly susceptible to faults. As a result, most HPC systems include some
built-in hardware fault tolerance for DRAM. The most common hardware mem-
ory resilience scheme has the CPU memory controller write additional checksum
bits on each block of data written (128-bit blocks are used on modern AMD
processors, for example). The controller uses these bits to detect and correct
errors reading these blocks of data back into the CPU. Most modern codes
use Single-symbol¹ Error Correction and Double-symbol Error Detection (SEC-
DED) schemes, allowing them to recover from the simplest memory failures and
at least detect more complex (and less frequent) ones.
Recent research has shown that uncorrectable errors (e.g., double-symbol
errors) are increasingly common in systems with SEC-DED memory protec-
tion [19], with uncorrectable DRAM errors occurring in up to 8% of DIMMs per
year. Such errors result in a machine check exception being delivered to the oper-
ating system, which then generally logs the error and either kills the application
or reboots the system depending upon the location of the error in memory. Some
systems exist for recovering from such errors, for example in Linux when they
occur in memory used for caching data or owned by a virtual machine [13], but
these systems are much too low-level to be useful to application developers.
¹ A symbol in modern DRAM systems typically comprises 4 or 8 bits of data.

3 Fault and Recovery Model


3.1 Fault Model
Because of their frequency, which is only expected to increase in future systems,
it is particularly important that applications be able to recover from DRAM
memory failures that hardware schemes cannot correct. Because these failures
can potentially destroy important application data, however, recovering from
them can be challenging. To address this issue, we examine a specific model for
memory failures that we believe is relevant to a range of applications, discuss
the application, OS, and hardware requirements of providing this failure model,
and discuss how applications can recover given this failure model.
For application design purposes, the initial model of failures we explore in
this work is that of transient memory failures. In this model, applications designate
portions of DRAM memory that may occasionally return incorrect values, and these
values are expected to revert to their original value in a finite amount of time.
We believe that this is a useful model for initial designs of robust numerical
algorithms, as it allows them to designate both essential memory and memory
that need only be usually correct. We illustrate the usefulness of this model to
a specific numeric algorithm later in this paper.
Because many DRAM memory errors are not in fact transient, the applica-
tion, OS, and hardware can cooperate to map the hardware-provided failure
model to one of transient DRAM memory failures. Specifically, a combination
of hardware-level memory error detection and either application- or system-level
checkpointing [6] can work together to provide the appearance of transient
DRAM failures, assuming an underlying model of detected DRAM failures.
This is particularly effective when the application and system rarely modify
the data being protected. For example, iterative methods for solving linear sys-
tems Ax = b generally do not modify the matrix A and any preconditioner(s),
which we exploit later in this work.

3.2 Recovery Model


To handle the failures discussed above, we present a recovery model oriented to-
wards iterative numerical methods where the algorithm can continue to converge
past faults that temporarily perturb its state. Because DRAM failures caused by
reading corrupted memory can be signaled by the hardware both synchronously
(due to an algorithm reading corrupted memory) and asynchronously (from the
memory scrubber), we focus on a model in which the algorithm running at the
time of a failure temporarily patches failures to allow progress to continue to be
made and then fully recovers from the detected error later.
Specifically, our recovery model comprises the following steps:
1. The application designates to the operating and runtime system the portions
of its memory in which it can tolerate a transient memory error. Errors in
portions of memory not so designated are deemed fatal and cause application
termination.

/* Register callback for handling failure in specific allocation of
 * failable memory at a specified byte offset and length. arg is an
 * opaque user-supplied argument. */
typedef void (*memfail_callback_t)( void *allocation, size_t off,
                                    size_t len, void *arg);
void memfail_recover_init( memfail_callback_t cb, void *arg );

/* Allocate resp. free failable memory */
void * malloc_failable( size_t len );
void free_failable( void *addr );

Fig. 1. Application / Library interface to handle DRAM memory failures

2. Upon receiving notification of an uncorrectable memory error in designated


memory, the OS signals the application that an error has occurred at a
specified address.
3. The application performs immediate but potentially partial fixes on the failed
memory sufficient to allow the current calculation to continue, possibly with
some degradation of accuracy.²
4. The application records that an error has perturbed the current calculation,
to prevent it from incorrectly succeeding and returning incorrect values (e.g.,
if the convergence metric has been corrupted).
5. The application resumes execution at the instruction that was executing
when the failure was detected.
6. At a well-defined point (e.g., every few iterations), the system or application
recovers the corrupted memory using, for example, an application-specific
recovery technique or a local checkpoint.
7. The application records that corrupted values have been fixed, potentially
allowing it to complete after an additional successful iteration.

4 Application / OS Interface
4.1 Design
We have designed an application / OS interface to support the fault and re-
covery models described in Section 3, and implemented a library to provide this
interface. Our key design goals were to provide a simple interface for applications
and algorithmic libraries, and to support existing OS-level interfaces to handling
memory errors such as those provided by Linux.
This application level of this interface, shown in Figure 1, focuses on run-time
memory allocation. In particular, the interface provides the application with
separate calls for allocating failable memory—memory in which failures will
cause notifications to be sent to the application. These calls work like malloc()
and free(). In addition, the application also registers a callback with the library.
² For example, a solver may replace corrupted matrix entries with averages of their uncorrupted neighbors.

The callback is called once for every active allocation when the library is notified
by the OS of a detected but uncorrected memory fault in that allocation.
In addition to this interface, we also provide a simple producer-consumer
bounded ring buffer that the application can use to queue up a sequence of
failed allocations when signaled by the library. This ring buffer is non-blocking
and atomic to allow asynchronous callbacks from the library to enqueue failed
allocations that will be fully recovered at the end of an iteration. The application
determines the size of this buffer when it is allocated; the number of entries
needed must be sufficient to cover all of the allocations that could plausibly fail
during a single iteration. For applications with relatively few failable allocations,
this should be a minimal number of entries.
At the OS level, the library first notifies the operating system that it wishes
to receive notifications of DRAM failures, either in general or in specific areas of
its virtual address space depending upon the interface provided by the operating
system. Second, the library keeps track of the list of failable memory allocated
by the application so that it can call the application callback for each failed allo-
cation when necessary. Finally, the library handles any error notifications from
the operating system (e.g., using a Linux SIGBUS signal handler) and performs
OS-specific actions to clear a memory error from a page of memory if necessary
prior to notifying the application of the error.
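
To make the intended use concrete, the sketch below combines the interface of Figure 1 with the queue of failed allocations described above: the callback applies an immediate partial patch, marks the calculation as perturbed, and records the region for full recovery at the next iteration boundary. The declarations are copied from Figure 1; the fixed-size log, its names, and the zero-fill patch are illustrative assumptions.

#include <atomic>
#include <cstddef>
#include <cstring>

extern "C" {   /* declarations from Fig. 1 */
typedef void (*memfail_callback_t)(void *allocation, size_t off, size_t len, void *arg);
void memfail_recover_init(memfail_callback_t cb, void *arg);
void *malloc_failable(size_t len);
void free_failable(void *addr);
}

struct FailedRegion { void* allocation; size_t off; size_t len; };
static FailedRegion failed_log[16];                // sized for one iteration (assumption)
static std::atomic<int> failed_count{0};
static std::atomic<bool> perturbed{false};         // step 4: a fault touched this calculation

// Invoked by the library for each failable allocation overlapping a failure.
static void on_mem_fail(void* allocation, size_t off, size_t len, void*)
{
    // Step 3: immediate, possibly partial patch so the iteration can continue
    // (zero-fill here; a solver would do something smarter, see footnote 2).
    std::memset(static_cast<char*>(allocation) + off, 0, len);
    perturbed.store(true);
    int slot = failed_count.fetch_add(1);
    if (slot < 16) failed_log[slot] = {allocation, off, len};  // full recovery deferred
}

void* setup_failable_matrix(size_t bytes)
{
    memfail_recover_init(on_mem_fail, nullptr);    // register once at startup
    return malloc_failable(bytes);                 // e.g. the matrix values of A
}

At an iteration boundary the application would walk failed_log, restore or recompute the recorded regions (step 6 of the recovery model), and clear perturbed only after a subsequent clean iteration (step 7).
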

4.2 Implementation
We added support for handling signaled memory failures as described in the
previous section to an existing incremental checkpointing library for Linux, lib-
hashckpt [8]. We chose this library because it helps track application memory
usage, and provides checkpointing functionality to recover from memory failures
for applications that cannot. Its ability to trap specific memory accesses eases
the testing of simulated memory failures, as described later in Section 4.3.
The modified version of the library adds the application API calls listed previ-
ously in Figure 1, with the failable memory allocator using malloc() to allocate
and free memory. This allocator also keeps a data structure sorted by allocation
address of failable memory allocations.
Linux notifies the library of DRAM memory failures, particularly failures
caught by the memory scrubber using a SIGBUS signal that indicates the ad-
dress of the memory page which failed. The library then unmaps this failed
page using munmap(), maps in a new physical page using mmap(), and calls the
application-registered callback with appropriate offset and length arguments for
every failable application allocation that overlapped with the page that included
the failure.
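
A stripped-down version of that handler path might look like the sketch below; it glosses over error handling, async-signal-safety, and the bookkeeping that maps the failed page back to failable allocations, and the function names are assumptions rather than the library's.

#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>

// Simplified sketch of the SIGBUS path: drop the poisoned page and back the
// same virtual address with a fresh physical page before notifying callbacks.
static void sigbus_handler(int, siginfo_t* info, void*)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uintptr_t addr = reinterpret_cast<uintptr_t>(info->si_addr);
    void* page = reinterpret_cast<void*>(addr & ~static_cast<uintptr_t>(page_size - 1));

    munmap(page, page_size);                          // discard the failed page
    mmap(page, page_size, PROT_READ | PROT_WRITE,     // map a new, zeroed page
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

    // ...invoke the registered callback for every failable allocation that
    // overlaps [page, page + page_size), as described in Section 4.1.
}

void install_sigbus_handler()
{
    struct sigaction sa = {};
    sa.sa_sigaction = sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, nullptr);
}
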
Note that Linux currently only notifies the application of DRAM failures de-
tected by the memory scrubber. When the memory controller raises an exception
caused by the application attempting to consume faulty data, Linux currently
kills the faulting application. In addition, Linux only notifies applications of the
page that failed and expects the application to discard the entire failed page.
This approach is overly restrictive in some cases, as the hardware notifies the
kernel of the memory bus line that failed, and some memory errors are soft and
could be corrected simply by rewriting the failed memory line.

4.3 Testing Support

To provide support for testing DRAM memory failures, we added support to the
incremental checkpointing library for simulating memory failures. In particular,
we added code that randomly injects errors at a configurable rate into the appli-
cation address space and uses page protection mechanisms, i.e., mprotect(), to
signal the application with a SIGSEGV when it touches a page to which a simu-
lated failure has been injected. The library then catches SIGSEGV and proceeds as
if it had received a memory failure on the protected page. We also implemented
a simulated memory scrubber in the library which can asynchronously inject
memory failures into the application by signaling the library when it scrubs a
memory location at which a failure has been simulated.
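
Conceptually, the injection side reduces to making a chosen page inaccessible so that the next touch raises a signal the library treats like a reported DRAM failure. The sketch below shows that step only; the SIGSEGV handler that clears the protection is not shown, and the names are assumptions.

#include <cstdint>
#include <cstdlib>
#include <sys/mman.h>
#include <unistd.h>

// Protect the page containing a randomly chosen offset of an allocation to
// simulate a memory failure. The next access raises SIGSEGV; the library's
// handler (not shown) removes the protection and proceeds as if a real
// uncorrected DRAM error had been reported for that page.
void inject_simulated_fault(void* base, std::size_t len)
{
    const std::uintptr_t page_size =
        static_cast<std::uintptr_t>(sysconf(_SC_PAGESIZE));
    std::size_t offset = len ? static_cast<std::size_t>(std::rand()) % len : 0;
    std::uintptr_t target = reinterpret_cast<std::uintptr_t>(base) + offset;
    std::uintptr_t page = target & ~(page_size - 1);      // page-align the address
    mprotect(reinterpret_cast<void*>(page), page_size, PROT_NONE);
}
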

5 Fault-Tolerant GMRES

We have used the interface described in Section 4 to implement Fault-Tolerant
GMRES (FT-GMRES), an iterative method for solving large sparse linear sys-
tems Ax = b. FT-GMRES computes the correct solution x even if the system
experiences uncorrected faults in both data and arithmetic [10]. The algorithm
will always either converge to the right answer, or (in rare cases) stop and report
immediately to the caller if it cannot make progress. The algorithm accomplishes
this by dividing its computations into reliable and unreliable phases. Rather than
rolling back any faults that occur in unreliable phases, as a checkpoint / restart
approach would do, FT-GMRES rolls forward and progresses through any faults
in unreliable phases. FT-GMRES can also exploit fault detection, including but
not limited to the ECC memory fault notification discussed above, in order to
decide whether to accept the result of each unreliable phase.
FT-GMRES is based on the Flexible GMRES (FGMRES) algorithm of Saad
[16]. FGMRES extends the Generalized Minimal Residual (GMRES) method
of Saad and Schultz [18], by “flexibly” allowing the preconditioner (also called
the “inner solve”) to change in every iteration. The key observation behind
FT-GMRES is that flexible iterations allow successive inner solves to differ arbi-
trarily, even unboundedly. This suggests modeling memory or arithmetic faults
in the inner solves as “different preconditioners.” Taking this suggestion results
in FT-GMRES. The algorithm uses any existing solver to “precondition” an
FGMRES-based outer iteration. Our experiments use as the existing solver an
iterative method such as GMRES with its own preconditioner, though it may
be any algorithm (direct or iterative, dense or sparse or structured) that solves
linear systems.
FT-GMRES expects that inner solves do most of the work, so inner solves run
in the less expensive unreliable mode. The matrix A, right-hand side b, and any
other inner solver data may change arbitrarily, and those changes need not even
be transient. However, each outer iteration of FT-GMRES must run reliably, and
requires a correct copy of the matrix A, right-hand side b, and additional outer
solve data (the same that FGMRES would use). Since FT-GMRES expects only
a small number of outer iterations, interspersed by longer-running inner solves,
we need not store two copies (unreliable and reliable) of A and b in memory.
Instead, we can save them to a reliable backing store, or even recompute them.
With fault detection, we can avoid recovering or recomputing these data if no
faults occurred, or even selectively recover or recompute just the corrupted parts
of the critical data.
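The control flow just described can be condensed into a structural sketch. This is not the authors' Trilinos implementation: the hooks below (unreliable_inner_solve, faults_detected, refresh_unreliable_data, reliable_outer_step, assemble_solution) are hypothetical placeholders, and the FGMRES machinery itself (Arnoldi step, least-squares update) is hidden behind reliable_outer_step, which is assumed to run in reliable memory.

#include <functional>
#include <utility>
#include <vector>

using Vec = std::vector<double>;

struct FtGmresHooks {
    std::function<Vec(const Vec&)> unreliable_inner_solve;   // z_j = "M^{-1}" v_j, runs in failable memory
    std::function<bool()>          faults_detected;          // e.g. via the OS fault notifications above
    std::function<void()>          refresh_unreliable_data;  // restore A, b, preconditioner from reliable storage
    // Reliable outer step: apply A to z_j, extend the Arnoldi basis and update the
    // small least-squares problem; returns the next basis vector and the residual norm.
    std::function<std::pair<Vec, double>(const Vec&)> reliable_outer_step;
    std::function<Vec()>           assemble_solution;        // form x from basis and coefficients, reliably
};

Vec ft_gmres(const Vec& v0, int outer_iters, double tol, FtGmresHooks& h) {
    Vec v = v0;                                   // first Krylov basis vector (normalized initial residual)
    for (int j = 0; j < outer_iters; ++j) {
        Vec z = h.unreliable_inner_solve(v);      // bulk of the work; may be corrupted by faults
        if (h.faults_detected())                  // roll forward: the outer state is never rolled back
            h.refresh_unreliable_data();          // selectively restore only the corrupted critical data
        auto [v_next, res] = h.reliable_outer_step(z);
        if (res < tol) break;
        v = std::move(v_next);
    }
    return h.assemble_solution();
}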

6 Initial Results

We have implemented FT-GMRES using components of the Trilinos solvers li-


brary [9], in particular Tpetra sparse matrices and vectors, Belos iterative linear
solvers (FGMRES for the outer iterations, and GMRES for the inner iterations),
and Ifpack2 incomplete factorization preconditioners. Trilinos already provides
a full hybrid-parallel (MPI + threads) implementation; we chose for simplicity
to limit initial experiments to a single MPI process. We only needed to modify
Trilinos to use the unreliable memory allocation interface described above. Our
entire solver prototype uses only 2500 lines of C++ code. For experiments, we
used the random fault injection feature of the incremental checkpointing library.
Detection of faults works entirely independently of fault injection, so the same
code would allow us to test actual hardware faults.
We tested FT-GMRES using the development (10.7) version of Trilinos, on an
Intel Xeon X5570 (8 cores, 2.93 GHz) CPU with 12 GB of main memory. We
chose for our test matrix an ill-conditioned Stokes partial differential equation
discretization Szczerba/Ill_Stokes from the University of Florida Sparse Ma-
trix Collection (UFSMC) [4]. It has 25,187 rows and columns, 193,216 stored
entries, and an estimated 1-norm condition number of 4.85 × 10^9. For initial
experiments, we chose a uniform [−1, 1] pseudorandom right-hand side.
We ran FT-GMRES with 10 outer iterations. Each inner solve used 50 iter-
ations of standard GMRES (without restarting), right-preconditioned by ILUT
(see e.g., Saad [17]) with level 2 fill, zero drop tolerance, 1.5 relative threshold and
0.1 absolute threshold. (These are not necessarily reasonable ILUT parameters,
but they ensure a valid preconditioner for the problem tested.) We compared
FT-GMRES with standard GMRES both with and without restarting: 500 it-
erations of each, restarting if applicable every 50 iterations. (This makes the
memory usage of the two methods approximately comparable.) We set no con-
vergence criteria except for iteration counts, so that we could fully observe the
behavior of the methods. Our initial experiments use random fault injection at
a rate of 1000 faults per megabyte per hour, which is high but demonstrates
the solver’s fault-tolerance capabilities. Faults were allowed to occur in floating-
point data belonging to the matrix and the ILUT preconditioner. Furthermore,
to demonstrate the value of algorithmic approaches, our restarted GMRES im-
plementation imitated FT-GMRES by also refreshing the matrix and ILUT
[Figure: FTGMRES vs. standard GMRES, Ill_Stokes; relative residual norm (log scale) vs. outer iteration number (1-11); series: FTGMRES(50,10), GMRES(500), GMRES(50) x 10.]
Fig. 2. FT-GMRES (10 outer iterations, 50 inner iterations each), 500 iterations of
non-restarted GMRES, and 10 restart cycles (50 iterations each) of restarted GMRES.
(Down is good.)

preconditioner from reliable storage before every restart cycle. (We optimized
by not refreshing if no memory faults were detected.)
Figure 2 shows our convergence results. FT-GMRES’ reliable outer itera-
tion makes it able to roll forward through faults and continue convergence. The
fault-detection capabilities discussed earlier in this work let FT-GMRES refresh
unreliable data only when necessary, so that memory faults appear transient to
the solver.

7 Related Work
There has been a wide range of research on application and OS techniques for
recovering from faults in HPC systems. This includes both algorithmic work
on allowing specific numerical codes to run through failures, and OS-level work on
handling memory faults both transparently and by directing them to the application. In the
remainder of this section, we describe related work in both areas.

7.1 DRAM Error Handling


DRAM errors have been intensively studied in both HPC and other large com-
puting systems recently. Recent research has characterized both the frequency
of such errors in large scale systems [19], as well as the low-level patterns with
which they occur [14].
OS-level handling of such faults has generally been either very limited or used
very heavyweight solutions. Linux and other operating systems, for example,
provide low-level techniques for handling, logging, and notifying the applica-
tion of such errors [13]. These techniques generally terminate the application,
potentially invoking higher-level recovery systems based on, for example, check-
pointing or redundancy. Some systems have attempted to provide additional
protection against memory faults both on CPUs [5] and GPUs [15], though with
substantial cost.

7.2 Fault Tolerant Algorithms


Most work in numerical linear algebra on fault tolerant algorithms falls in the
category of algorithm-based fault tolerance (ABFT) (see e.g., [12]). ABFT en-
codes redundant data into the matrices and vectors themselves, such that data
owned by failed parallel processors can be recomputed. Recent results using this
technique show this method can be used with a very low performance over-
head [3,1]. Other authors have empirically investigated the behavior of iterative
solvers when soft faults occur (e.g., [1,11]).
Inner-outer iterations with FGMRES have been used as a kind of iterative
refinement in mixed-precision computation [2], but as far as we know, this work
is the first to use flexible iterative methods for reliability and robustness against
possibly unbounded errors. Inexact Krylov methods generalize flexible methods
by allowing both the preconditioner and the matrix to change in every itera-
tion [20,7]. However, inexact Krylov methods require error bounds, and thus
cannot be used to provide tolerance against arbitrary data and computational
faults when applying the matrix A.

8 Conclusions and Future Work


In this paper we presented a cooperative cross-layer application / OS framework
for recovering from DRAM memory errors. This framework allows the applica-
tion to allocate failable memory and provides notification and callback mecha-
nisms for failures that occur within these failable allocations. We described the
fault and recovery model for this framework. Finally, we presented initial results
of this framework using the new fault-tolerant linear solver FT-GMRES, and
showed that the solver is able to converge in the presence of memory failures.
These initial results are promising, but more work is needed. Larger-scale
studies are needed to better characterize the cost of this framework at scale. Ad-
ditionally, we are in the process of identifying more algorithms that can benefit
from this DRAM failure framework.

References
1. Bronevetsky, G., de Supinski, B.: Soft error vulnerability of iterative linear al-
gebra methods. In: Proceedings of the 22nd Annual International Conference on
Supercomputing, ICS 2008, pp. 155–164. ACM, New York (2008)
2. Buttari, A., Dongarra, J., Kurzak, J., Luszczek, P., Tomov, S.: Computations to en-
hance the performance while achieving the 64-bit accuracy. Tech. Rep. UT-CS-06-
584, University of Tennessee Knoxville, LAPACK Working Note #180 (November
2006)
3. Chen, Z., Dongarra, J.: Algorithm-based checkpoint-free fault tolerance for paral-
lel matrix computations on volatile resources. In: 20th International Parallel and
Distributed Processing Symposium, IPDPS 2006 (April 2006)
4. Davis, T.A., Hu, Y.: The University of Florida Sparse Matrix Collection. ACM
Trans. Math. Softw. (2011) (to appear),
http://www.cise.ufl.edu/research/sparse/matrices
5. Dopson, D.: SoftECC: A System for Software Memory Integrity Checking. Master’s
thesis, Massachusetts Institute of Technology (September 2005)
6. Elnozahy, E.N., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-
recovery protocols in message-passing systems. ACM Computing Surveys 34(3),
375–408 (2002)
7. van den Eshof, J., Sleijpen, G.L.G.: Inexact Krylov subspace methods for linear
systems. SIAM J. Matrix Anal. Appl. 26(1), 125–153 (2004)
8. Ferreira, K.B., Riesen, R., Brightwell, R., Bridges, P., Arnold, D.: libhashckpt:
Hash-Based Incremental Checkpointing Using GPU’s. In: Cotronis, Y., Danalis,
A., Nikolopoulos, D.S., Dongarra, J. (eds.) EuroMPI 2011. LNCS, vol. 6960, pp.
272–281. Springer, Heidelberg (2011)
9. Heroux, M.A., Bartlett, R.A., Howle, V.E., Hoekstra, R.J., Hu, J.J., Kolda, T.G.,
Lehoucq, R.B., Long, K.R., Pawlowski, R.P., Phipps, E.T., Salinger, A.G., Thorn-
quist, H.K., Tuminaro, R.S., Willenbring, J.M., Williams, A., Stanley, K.S.: An
overview of the Trilinos project. ACM Trans. Math. Softw. 31(3), 397–423 (2005)
10. Heroux, M.A., Hoemmen, M.: Fault-tolerant iterative methods via selective relia-
bility. Tech. Rep. SAND2011-3915 C, Sandia National Laboratories (2011),
http://www.sandia.gov/~maherou/
11. Howle, V.E.: Soft errors in linear solvers as integrated components of a simula-
tion. Presented at the Copper Mountain Conference on Iterative Methods, Copper
Mountain, CO, April 9 (2010)
12. Huang, K.H., Abraham, J.A.: Algorithm-based fault tolerance for matrix opera-
tions. IEEE Transactions on Computers C-33(6) (June 1984)
13. Kleen, A.: mcelog: memory error handling in user space. In: Proceedings of Linux
Kongress 2010, Nuremburg, Germany (September 2010)
14. Li, X., Huang, M.C., Shen, K., Chu, L.: A realistic evaluation of memory hardware
errors and software system susceptibility. In: Proceedings of the 2010 USENIX
Annual Technical Conference (USENIX 2010), Boston, MA (June 2010)
15. Maruyama, N., Nukada, A., Matsuoka, S.: A high-performance fault-tolerant soft-
ware framework for memory on commodity GPUs. In: 2010 IEEE International
Symposium on Parallel Distributed Processing (IPDPS), pp. 1–12 (April 2010)
16. Saad, Y.: A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci.
Comput. 14, 461–469 (1993)
17. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. SIAM, Philadel-
phia (2003)
18. Saad, Y., Schultz, M.H.: GMRES: A generalized minimal residual algorithm for
solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput. 7, 856–869
(1986)
19. Schroeder, B., Pinheiro, E., Weber, W.D.: DRAM errors in the wild: a large-scale
field study. Communications of the ACM 54, 100–107 (2011)
20. Simoncini, V., Szyld, D.B.: Theory of inexact Krylov subspace methods and appli-
cations to scientific computing. SIAM J. Sci. Comput. 25(2), 454–477 (2003)
A Tunable, Software-Based DRAM Error
Detection and Correction Library for HPC

David Fiala1, Kurt B. Ferreira2, Frank Mueller1, and Christian Engelmann3
1 Department of Computer Science, North Carolina State University
{dfiala,fmuelle}@ncsu.edu
2 Scalable System Software, Sandia National Laboratories, Albuquerque, NM 87123
kbferre@sandia.gov
3 Oak Ridge National Laboratory
engelmannc@ornl.gov

Abstract. Proposed exascale systems will present a number of consid-


erable resiliency challenges. In particular, DRAM soft-errors, or bit-flips,
are expected to greatly increase due to the increased memory density of
these systems. Current hardware-based fault-tolerance methods will be
unsuitable for addressing the expected soft error rate. As a
result, additional software will be needed to address this challenge. In
this paper we introduce LIBSDC, a tunable, transparent silent data cor-
ruption detection and correction library for HPC applications. LIBSDC
provides comprehensive SDC protection for program memory by im-
plementing on-demand page integrity verification. Experimental bench-
marks with Mantevo HPCCG show that once tuned, LIBSDC is able to
achieve SDC protection with 50% overhead, less than the
100% needed for double modular redundancy.

1 Introduction
With the increased density of modern computing chips, components are shrink-
ing, heat is increasing, and hardware sensitivity to outside events is growing.
These variables combined with the extreme number of components expected to
make their way into computing centers as our computational demands expand
are posing a strong challenge to the HPC community. Of particular interest are
soft errors in memory that manifest themselves as silent data corruption (SDC).

Sandia National Laboratories is a multiprogram laboratory managed and operated
by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation,
for the U.S. Department of Energy’s National Nuclear Security Administration under
contract DE-AC04-94AL85000.

Research sponsored in part by the Laboratory Directed Research and Development
Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC
for the U.S. Department of Energy under Contract No. De-AC05-00OR22725. The
United States Government retains and the publisher, by accepting the article for pub-
lication, acknowledges that the United States Government retains a non-exclusive,
paid-up, irrevocable, world-wide license to publish or reproduce the published form
of this manuscript, or allow others to do so, for United States Government purposes.

SDCs are of great importance due to their ability to render invalid results in
scientific applications.
Silent data corruption can occur in many components of a computer includ-
ing the processor, cache, and memory due to radiation, faulty hardware, and/or
lower hardware tolerances. While cosmic particles are one source of concern,
another growing issue resides within the circuits themselves, due to miniatur-
ization of components. As components shrink, heat becomes a design concern
which in turn leads to lower voltages in order to sustain the growing chip den-
sity. Lower component voltages result in a lower safety threshold for the bits
that they contain, which increases the likelihood of an SDC occurring. Further,
as the densities continue to grow, any event that upsets chips (e.g., radiation) is
more likely to both interact with and be successful at flipping bits in memory.
Currently, servers that use memory with hardware-based ECC are capable of
correcting single-bit errors and detecting double-bit errors [1], but errors that re-
sult in three or more bit flips will produce undefined results, including silent data
corruption, which may yield invalid results without warning. Today, research
has been performed on the frequency and occurrence of single- and double-bit er-
rors [9], but data on the frequency of triple-bit errors remains inconclusive, even
though up to 8% of DIMMs will incur correctable errors and 2%-4% will incur
uncorrectable errors. Nonetheless, the overall occurrence of bit flips is expected
to increase as chip densities increase and computing centers move to millions of
cores.
To combat this growing problem, new methods to both detect and correct
faults that result in data corruption are essential. Specifically, it is critical to de-
velop a fault resilient framework that provides for SDC detection and continuous
execution in the face of faults. As applications increase in run time and scale
out, it is no longer feasible to rely on traditional checkpoint-restart solutions to
protect an application. Even setting aside the I/O bottlenecks of checkpoint/restart,
we cannot guarantee that an execution will complete fault-free without interruption,
given the low average time between failures. Follow-
ing this thought, we may not be able to reliably verify application results by
simply running it twice if we are prone to a very high probability that a fault
will render the results of both runs incorrect.
One method to address silent data corruption is in the field of algorithmic
fault tolerance where researchers have proposed methods to protect matrices
from SDCs that corrupt elements within a matrix [3]. While it is possible for this
work to protect some matrix operations such as multiplication, this form of fault
tolerance may not be able to protect all types of possible matrix operations even
if we disregard the fact that matrices are only one of numerous important types
of structures. Although promising in some regards, fault tolerant algorithms
can be incredibly difficult or simply impossible to design for any arbitrary data
structure or operation on data. Worse, this type of protection does not provide
comprehensive coverage of the entire application, which leaves anything outside
of the algorithm such as other data and instructions entirely vulnerable to SDCs.
For these reasons, there is a dire need to develop generic fault tolerance op-
tions that provide wide coverage to an application and its data while remaining
agnostic to the actual algorithms that applications utilize.
This paper outlines a generic memory protection library that increases the
resilience of all applications that it guards by protecting data at the page level
using a transparent, tunable on-demand verification system. The library pre-
sented within provides the following contributions:
– Provides transparent protection against SDC for all applications without the
need for any program modifications.
– Our solution is tunable to best match the data access patterns of an appli-
cation.
– Extensibility within the library provides for easy addition of new features
such as adding software-based ECC which can not only detect, but also
correct SDC that evades hardware ECC.

2 Design
In this paper we present LIBSDC, a transparent library that is capable of detect-
ing and optionally correcting soft-errors in system memory that cause corruption
in program data during execution. LIBSDC protects against SDCs by tracking
memory accesses at the virtual memory page level and verifies that the contents
of each accessed page have not unexpectedly been altered.
To ensure memory has not become corrupted, LIBSDC is responsible for mon-
itoring all read and write requests that an application incurs during execution
while simultaneously verifying these data accesses. Each memory access is hence-
forth assumed to be at the granularity of an entire page of virtual memory in-
stead of individual bytes. At a high level, each memory access that an application
makes will be intercepted by LIBSDC and the contents of the page in which the
memory address resides are verified against a previously known-good hash of
that page. If during execution an unexpected hash mismatch occurs between the
page and its last known value, then LIBSDC will terminate the process or roll
back to a previous checkpoint if available to ensure that the application does not
continue to compute and report invalid results. After a page’s integrity has been
successfully verified, the application is allowed to proceed with the memory
access and continue making forward progress.
Once a memory access completes verification, the entire page in which the
access resides will become available for use without further interception by LIB-
SDC. A page in this state will be referred to as unlocked. Likewise, all other pages
that have not yet been verified by LIBSDC will be considered locked. For each
additional locked memory access that occurs, LIBSDC will intercept the request
and verify the locked memory before unlocking it and allowing the application
to progress.
Page accesses (unlocking) proceed as follows:


On page request (initial read or write):
If page is locked:
Perform hash of page
Compare current hash with previously stored known-good hash
If any inconsistency found:
Notify the presence of SDC and report location
Terminate application / Rollback to previous checkpoint
Mark page as unlocked
Return control to application
As an application executes over time, it is inevitable that all needed pages
within an application’s address space will at some point become unlocked, which
means that no further page-level error checking will occur. Therefore it is neces-
sary for LIBSDC to occasionally put pages back in a locked state when they are
no longer being used so that they may be protected from SDCs while resident
in memory.
Page locking is shown as follows:

On page lock:
Calculate new hash of entire page
Store hash in a separate location
Mark page as locked
Return control to application

Managing locked and unlocked pages internally requires LIBSDC to hook mem-
ory allocation functions such as malloc, realloc, and memalign to learn of new
memory addresses that should receive protection. When a new memory range
has been allocated for an application, LIBSDC automatically locks all pages in
the range of the new memory so that all future accesses to that memory are
within the scope of protection that LIBSDC provides.
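A minimal sketch of such a malloc hook is shown below, assuming an LD_PRELOAD-style interposition on glibc; sdc_lock_range stands in for LIBSDC's internal page-locking routine and is hypothetical. A real implementation must also handle realloc, memalign, and free, cope with re-entrancy during dlsym, and take care with pages shared with allocator metadata.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Hypothetical LIBSDC-internal routine: record known-good hashes for the pages
// covering [start, start+len) and revoke access so the first touch faults.
static void sdc_lock_range(void* start, std::size_t len) {
    const std::uintptr_t page = 4096;           // assumption: 4 KiB pages
    std::uintptr_t lo = (std::uintptr_t)start & ~(page - 1);
    std::uintptr_t hi = ((std::uintptr_t)start + len + page - 1) & ~(page - 1);
    // ... store known-good hashes for [lo, hi) here ...
    mprotect((void*)lo, hi - lo, PROT_NONE);    // caution: may also cover allocator metadata
}

// Interposed malloc: allocate with the real malloc, then put the new memory
// under LIBSDC protection before handing it to the application.
extern "C" void* malloc(std::size_t size) {
    using malloc_fn = void* (*)(std::size_t);
    static malloc_fn real_malloc =
        reinterpret_cast<malloc_fn>(dlsym(RTLD_NEXT, "malloc"));
    void* p = real_malloc(size);
    if (p) sdc_lock_range(p, size);
    return p;
}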
As the amount of allocated memory per application as well as the working-set
of pages required varies, LIBSDC allows the user to tune the maximum number
of pages to allow in the unlocked state. This tunable parameter, known as max-
unlocked, is set prior to invoking an application and permanently defines the
maximum number of pages to allow unlocked at any given time during execution.
When the max-unlocked limit of unlocked pages is reached, any further accesses
to pages in the locked state will require LIBSDC to lock some other unlocked
page to accommodate for the new page of memory.
Tuning the max-unlocked parameter requires consideration as its value is di-
rectly related to both application performance as well as the effectiveness of SDC
protection. Providing a relatively low max-unlocked value will force LIBSDC to
more frequently lock and unlock pages resulting in unnecessary verifications. In
this case, the overhead of intercepting page accesses combined with frequent
rehashing will quickly diminish application performance. The effect of a max-
unlocked value much less than the application's working-set of pages will result in a
reaction comparable to thrashing. On the other hand, if the max-unlocked value
is set too high (i.e., a value much greater than the application's working set),
then the maximum level of SDC protection afforded by LIBSDC might not be
attained. Too high a max-unlocked value leaves pages that remain unlocked
for long periods without use vulnerable to SDCs, as their contents are only
protected once they are switched back to the locked state. For these reasons it
is important to tune applications using LIBSDC with a reasonable max-unlocked
value that adequately expresses the desired trade-off between protection and
overhead.

2.1 Extensions for Error Correction

Throughout the design section of this paper we have referred to LIBSDC storing a
hash of pages that are under its protection. When a page is hashed, the hash may
be compared against a future hash taken on the same page to determine if any
changes have occurred, but this information alone is not suitable for correcting
errors that a hash may detect. To provide additional SDC correction capabilities
on top of the detection mechanisms, it is possible to additionally compute and
store error correcting codes (ECC) such as hamming codes that may be used to
fix bit flips in memory. For example, 72/64 Hamming codes, which are frequently
used in hardware, may be employed inside of LIBSDC to provide single error cor-
rect, double error detect (SECDED) protection at the expense of the additional
storage required for the ECC codes. Combining LIBSDC with hardware ECC
can provide not only the ability to detect triple bit errors or greater, but can
also provide correction capabilities as the software-layered protection in LIB-
SDC may still retain viable error correcting codes. If LIBSDC is extended with
hashing plus ECC codes then it is possible to enjoy the protection and speed of
hashing while limiting ECC code recalculation only to times when a page has
been modified during execution resulting in a changed hash.
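As a concrete illustration of the kind of software ECC described above, the sketch below encodes a 64-bit word into a 72-bit extended Hamming (SECDED) codeword and decodes it, correcting single-bit errors and flagging double-bit errors. It is an independent example rather than LIBSDC code; a real extension would store one such codeword per 64-bit word of a page (a 12.5% space overhead) and would use a table-driven implementation for speed.

#include <bitset>
#include <cstdint>

enum class EccResult { Ok, Corrected, DoubleError };

// 72-bit codeword: bit 0 holds the overall parity, bits 1..71 are Hamming
// positions with check bits at the power-of-two indices 1,2,4,...,64.
using Codeword = std::bitset<72>;

Codeword secded_encode(uint64_t data) {
    Codeword c;
    int d = 0;
    for (int pos = 1; pos <= 71; ++pos)            // scatter data into non-power-of-two positions
        if ((pos & (pos - 1)) != 0) c[pos] = (data >> d++) & 1;
    for (int p = 1; p <= 64; p <<= 1) {            // set each check bit for even group parity
        bool parity = false;
        for (int pos = 1; pos <= 71; ++pos)
            if ((pos & p) && pos != p) parity ^= c[pos];
        c[p] = parity;
    }
    bool overall = false;                          // overall parity over bits 1..71
    for (int pos = 1; pos <= 71; ++pos) overall ^= c[pos];
    c[0] = overall;
    return c;
}

EccResult secded_decode(Codeword& c, uint64_t& data_out) {
    int syndrome = 0;                              // XOR of the indices of all set bits
    for (int pos = 1; pos <= 71; ++pos)
        if (c[pos]) syndrome ^= pos;
    bool overall = false;                          // parity over the whole 72-bit word
    for (int pos = 0; pos <= 71; ++pos) overall ^= c[pos];

    EccResult r = EccResult::Ok;
    if (syndrome != 0 && !overall) return EccResult::DoubleError;  // two flips: detect only
    if (overall) {                                 // odd number of flips: one correctable error
        c.flip(syndrome != 0 ? syndrome : 0);      // syndrome 0 means the overall parity bit flipped
        r = EccResult::Corrected;
    }
    uint64_t data = 0;                             // gather the (possibly corrected) data bits
    int d = 0;
    for (int pos = 1; pos <= 71; ++pos)
        if ((pos & (pos - 1)) != 0) data |= (uint64_t)c[pos] << d++;
    data_out = data;
    return r;
}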

2.2 Assumptions and Limitations

LIBSDC’s protection extends only to memory and is not designed to protect


against faults that occur in the CPU or other attached devices. Since protection
is provided for data stored in main memory, LIBSDC requires the capability to
detect memory accesses. LIBSDC achieves this by altering process page tables
and removes read/write page permissions in order to receive OS signals that
indicate which memory addresses are being accessed upon a page fault.
For simplicity, our prototype of LIBSDC at present only protects memory that
is dynamically allocated using previously mentioned functions such as malloc.
There is no reason that extensions could not also provide protection to all data
regions including the code, initialized data, and BSS sections.
As LIBSDC verifies page contents upon transitioning from the locked to the
unlocked state, unlocked memory remains vulnerable to any SDCs that occur during
the window in which it is not protected. For this reason it is important to
choose a max-unlocked value that does not needlessly leave more pages than
necessary in an unlocked state when not being utilized.
Any application that depends on DMA with devices such as network inter-
connects must ensure that buffers are in an unlocked state before DMA begins.
This assumption is necessary since DMA avoids the MMU and thus LIBSDC
is never notified of page accesses to buffers. Data written through DMA would
appear as corruption to LIBSDC because the changes were made while the data
pages written were in a locked state.

3 Implementation
LIBSDC protects memory from SDCs by comparing last known good hashes of
virtual memory pages with a hash of their current data upon page access by an
application. Therefore it is critical that LIBSDC be able to receive notification
when a page is being accessed by an application. To achieve this, LIBSDC uses
the mprotect system call to modify page permissions and take away read and
write access. By installing a signal handler for SIGSEGV (segmentation fault),
LIBSDC is notified by the operating system any time a locked page (one without
read/write permissions) is accessed. Upon notification, LIBSDC uses an internal
table to verify that the page being accessed is one that it intends to protect.
If it is, then verification is performed by taking a hash of the current page
and comparing it to the last known good hash which is stored in LIBSDC’s
table. After verification, the page’s read and write permissions are restored using
mprotect before returning control to the user application upon exiting the signal
handler.
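The following C++ sketch shows the heart of that mechanism: an mprotect-revoked page faults, the SIGSEGV handler re-hashes it, compares against the stored known-good hash, and restores access. The hash here is FNV-1a rather than the SHA-1 used by LIBSDC, the table is a plain std::unordered_map, and async-signal-safety concerns are glossed over; all of these are simplifications, not the library's actual code.

#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

static const long g_page = sysconf(_SC_PAGESIZE);
static std::unordered_map<uintptr_t, uint64_t> g_good_hash;   // page base -> known-good hash

// FNV-1a over one page; stands in for the SHA-1 hashing used by LIBSDC.
static uint64_t hash_page(const unsigned char* p) {
    uint64_t h = 1469598103934665603ULL;
    for (long i = 0; i < g_page; ++i) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

// Lock a page: remember its hash and revoke access so the next touch faults.
// (Assumes the page is currently readable, i.e. in the unlocked state.)
static void lock_page(void* page) {
    g_good_hash[(uintptr_t)page] = hash_page((const unsigned char*)page);
    mprotect(page, g_page, PROT_NONE);
}

// SIGSEGV handler: verify the faulting locked page, then unlock it and retry.
// (Touching an STL map here is not async-signal-safe; a real library would use
// a preallocated table.)
static void on_fault(int, siginfo_t* si, void*) {
    uintptr_t base = (uintptr_t)si->si_addr & ~(uintptr_t)(g_page - 1);
    mprotect((void*)base, g_page, PROT_READ);      // allow reads so the page can be hashed
    auto it = g_good_hash.find(base);
    if (it != g_good_hash.end() &&
        hash_page((const unsigned char*)base) != it->second) {
        fprintf(stderr, "SDC detected in page %p\n", (void*)base);
        _exit(1);                                  // or roll back to a previous checkpoint
    }
    mprotect((void*)base, g_page, PROT_READ | PROT_WRITE);   // unlock; faulting access retries
}

static void install_handler() {
    struct sigaction sa = {};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, nullptr);
}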
Internally, the table that LIBSDC uses to store information on pages is com-
posed of several fields:
– A status flag to indicate locked, unlocked, or not managed by LIBSDC
– Storage for the page’s last known good hash
– Pointers to indicate which pages were accessed for use as a first-in-first-out
queue
Of particular interest among LIBSDC's table fields are the FIFO pointers. In
order to maintain a fair policy for evicting unlocked pages when the application
needs to access a page that is not currently available, LIBSDC maintains FIFO
ordering so that the oldest pages in the table are evicted first. Unfortunately
once a page is in the unlocked state it is not possible to track accesses to the
page until it is again locked. For this reason, the FIFO queue is based on the
order of unlocking, and while it may not exactly mirror an application’s data
access patterns, it should be similar.
Each locked page’s hash storage is tunable to accommodate the size of
whichever hashing algorithm is used. Additional fields can also be added to
accommodate storage for other needs such as ECC codes.
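A sketch of one table entry, under the assumption of SHA-1 (20-byte) digests and an intrusive doubly-linked FIFO, might look as follows; the field names are illustrative, not LIBSDC's.

#include <cstdint>

enum class PageState : std::uint8_t { Unmanaged, Locked, Unlocked };

struct PageEntry {
    PageState    state;          // locked, unlocked, or not managed by the library
    std::uint8_t hash[20];       // last known-good hash (20 bytes for SHA-1)
    PageEntry*   fifo_prev;      // intrusive FIFO of unlocked pages:
    PageEntry*   fifo_next;      // the oldest (head) entry is re-locked first
    // optional: extra bytes here for software ECC codes (Section 2.1)
};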

3.1 Handling User Pointers with System Calls


The use of a SIGSEGV handler allows the application’s data it depends on to
automatically transition from the locked state to the unlocked on demand during
execution. Unfortunately, any system calls that are executed in kernel space do
not enjoy this luxury as kernel space does not call the SIGSEGV handler during
a page fault in a system call. System calls that attempt to access user space
pointers will fail unpredictably if proper page permissions are not applied prior
to the system call occurring. Therefore all system calls that accept user space
pointers require hooking in order to unlock memory regions that the kernel is
likely to access during the system call.
While in many cases it is possible to override GLIBC calls at application
link/load time and replace them with wrappers that unlock any pointers
present, the GLIBC implementation may make system calls directly within it-
self instead of using the wrapper. For this reason it is essential that all system
calls are wrapped no matter their source. For simplicity, our LIBSDC prototype
creates a clone of the original process using the clone system call with CLONE_VM
as a parameter to share address spaces, and then uses the ptrace system call to
trace the application as it executes in order to receive notification of all system
calls occurring. The ptrace interface is provided as part of the Linux kernel and
allows a process to intercept all system calls and signals that another process
generates.
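A stripped-down version of that tracing loop is sketched below for x86-64 Linux. It uses fork rather than clone(CLONE_VM) to stay short, each system call produces two stops (entry and exit), and the sketch merely prints the call number where LIBSDC would instead unlock the pages behind pointer arguments; the traced binary ./app is hypothetical.

#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    pid_t child = fork();
    if (child == 0) {                               // tracee
        ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);
        execlp("./app", "./app", (char*)nullptr);   // hypothetical protected application
        return 1;
    }
    int status;
    waitpid(child, &status, 0);                     // initial stop after exec
    while (!WIFEXITED(status)) {
        ptrace(PTRACE_SYSCALL, child, nullptr, nullptr);   // run to next syscall entry/exit
        waitpid(child, &status, 0);
        if (WIFSTOPPED(status)) {
            user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, nullptr, &regs);
            // orig_rax holds the system-call number; pointer arguments live in
            // rdi, rsi, rdx, ... and this is where the pages they reference
            // would be unlocked before letting the kernel proceed.
            printf("syscall stop: nr=%lld\n", (long long)regs.orig_rax);
        }
    }
    return 0;
}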
It should be noted that there are other less portable solutions that may ac-
complish system call hooking, but would require extensive per-platform work
such as binary rewriting to hook system calls or specialized kernel modules that
wrap system calls. Our prototype’s goal was to provide a platform for gauging
the viability and costs of SDC protection through hashing and page protection
while avoiding writing a complex platform specific system call hooking scheme
that would not add to the research contributions.

4 Results
To gauge the overheads and demonstrate the effects of tuning the max-unlocked
value, the HPCCG Mantevo miniapp [8] was run with a matrix size of 768x8x8

[Figure: normalized runtime vs. max-unlocked value (number of pages, 4096-5120); series: runtime without hashing, runtime with hashing, double modular redundancy.]
Fig. 1. Normalized execution run-times of LIBSDC with HPCCG
scaled over 256 processes. The compute nodes used consisted of 2-way SMPs
with AMD Opteron 6128 (Magny-Cours), 32GB of memory per node, and a
40Gb/s Infiniband interconnect.
In Figure 1 we compared normalized execution time vs. the max-unlocked value
to demonstrate the effects of LIBSDC on an application. The baseline execution
time was taken by running HPCCG without LIBSDC performing any mprotect
calls and by default leaving all memory in an unlocked state. As a comparison,
the dashed line with a constant normalized time of 2 demonstrates the overhead
of double modular redundancy. LIBSDC’s overheads are shown with the dashed
line indicating the run-times without hashing and the solid line indicating the
run-times with hashing.
The choice of a range for max-unlocked between 4096 and 5120 is due to the
maximum working-set of pages residing near the middle of that range at around
4672 unlocked pages. As depicted in Figure 1, there is a dramatic drop in the
normalized run-time when we tune LIBSDC to use a max-unlocked value that
corresponds well to the active number of working pages. From the max-unlocked
range of 4672 to 5120, the normalized execution time falls from 1.79 to 1.53
respectively, which shows good improvement over even double modular redun-
dancy. Although not shown, in the poorly tuned ranges below 4096 a normalized
run-time of 21 or greater was observed.
For the results reported above, the average time spent calculating hashes
during execution is 15%.
It is important to note that the performance of LIBSDC’s hashing is highly
dependent on both the hashing algorithm used and on the way it is computed.
Although we chose to use SHA-1 computed on the CPU, research on comput-
ing hashes of pages using GPUs[2] has demonstrated that hashing performance
on GPUs greatly outperforms CPUs. This research indicates that applications
requiring page hashes should not consider the hashing itself to be a bottleneck.
We also find that the reason for the substantial overhead incurred with LIB-
SDC for a max-unlocked value less than the working-set of pages is due to our use
of the ptrace system call. ptrace is known to have performance penalties due to
frequent context switching on each system call and each received signal as well as
generating OS noise. This is worsened because each page unlock is intercepted
by ptrace during execution. While our prototype shows good performance for a
well tuned max-unlocked value, we expect that a production version of LIBSDC
would not use ptrace to intercept system calls. This would also result in better
performance for applications running with a well-tuned max-unlocked value.

5 Related Work
Similar to LIBSDC, another approach [10] that is transparent to the application
achieves software-implemented error detection and correction using background
scrubbing combined with software calculated ECC to periodically validate all
memory and correct errors if possible. While this approach and LIBSDC are
both entirely transparent to the application, LIBSDC differentiates itself by
providing on-demand page-level checking based on the application’s data access
patterns. In an HPC environment, software-based background scrubbing would


likely consume too much of the already limited memory bandwidth and generate
substantial noise during execution.
Other techniques involve modifying either the application source or the com-
piled form of the application to generate redundancy in data, instructions, or
both:
Source-to-source transformation techniques [6] have been investigated that
generate a redundant copy of all variables and code at the source level. Through-
out the transformed source code there are additional conditional checks that ver-
ify agreement in the redundant variables after each set of redundant calculations
are performed. If at any point throughout the execution redundant variables do
not agree then the application aborts. Unfortunately however, this technique
is unable to handle pointers, only supports basic data-types and arrays, and
doubles the required memory. SDCs that occur in the instruction memory may
not be detected thus causing unpredictable results. Due to a high number of
conditional jumps used for consistency checking, the efficiency of pipelining and
speculative execution suffers. LIBSDC differentiates itself from this work by not
requiring source modifications, lowering the memory requirement overheads sub-
stantially, supporting any type of code (pointers, data-types, etc are irrelevant
to LIBSDC), and can be instructed to protect any region of memory at run-time.
Duplicated instructions is another proposed technique to increase SDC re-
silience in software. EDDI [4] duplicates instructions and memory in the com-
piled form of an application in a manner similar to the source-to-source transfor-
mations, but achieves more support for programming constructs at the cost of
platform dependence. Unlike the source-to-source transformations, EDDI com-
piles applications to binary form, redundantly executes all calculations, ensures
separation between calculations by using differing memory addresses and differ-
ing registers, and attempts to order instructions to exploit super-scalar processor
capabilities. During execution the results of calculations are compared between
their redundant variable copies, but as a result, available memory is halved
and register pressure is doubled. LIBSDC differentiates itself from this work
by being platform-independent, not requiring redundant execution or program
modifications, and protecting instruction memory without the need for complex
control-flow checks.
Extensions to EDDI have been proposed [7] that achieve better efficiency
by assuming reliable caches and memory, but still require redundant registers and
instructions. Their experiments showed an average normalized execution time of
1.41, but without protection for system memory. The similarity to EDDI may
indicate that even without protecting memory there is a substantial overhead
due to register pressure, additional instructions, and highly frequent conditionals
that come with duplicating instructions and registers. This work also showed that
compiled executables with the added fault tolerance were 2.40x larger than the
original unaltered executables.
Control-flow checking is another area of research that attempts to detect the
effects of SDCs in applications [5]. Unfortunately control-flow integrity
verification does not necessarily protect against SDCs that only alter data with-
out affecting the execution path of an application.

6 Conclusions and Future Work

In this paper we have presented a prototype implementation of the silent data


corruption detection library, LIBSDC. LIBSDC is a transparent, tunable library
that provides page-level protection against DRAM memory corruption. Initial
results show that this library is capable of providing SDC protection to parallel
HPC applications at a cost less than that of double modular redundancy.
Using LIBSDC, we were able to protect all dynamically allocated memory re-
gions of the HPCCG application with a 53% increase in run-time over a baseline
that lacked any SDC protection. Provided with hints from the application on
which regions of memory to protect, LIBSDC’s coverage can be tuned for an
application, therefore further reducing run-time overheads.
The results of this work are very promising, but further work is needed. One
considerable source of run-time overhead in our prototype implementation is
the ptrace mechanism. Once again, we use ptrace to intercept system calls
and ensure proper memory tracing and tracking is performed before the OS
performs the call. We believe that we can remove ptrace from the library and
provide an optimized system call wrapper to intercept these calls, though special
care must be taken in these wrapper functions as issues such as reentrancy
become critical to performance and correctness. In addition, we are investigating
mechanisms to enable LIBSDC to use a software-based error-correcting code
side-by-side with its current hash-based detection mechanisms. Use of these more
advanced error-correcting codes, for example codes that can correct double-bit
errors, will provide a level of protection beyond what is currently available today
in enterprise-class hardware.

References

1. Chen, C.L., Hsiao, M.Y.: Error-correcting codes for semiconductor memory ap-
plications: A state-of-the-art review. IBM Journal of Research and Develop-
ment 28(2), 124–134 (1984)
2. Ferreira, K.B., Riesen, R., Brightwell, R., Bridges, P., Arnold, D.: libhashckpt:
Hash-Based Incremental Checkpointing Using GPU’s. In: Cotronis, Y., Danalis,
A., Nikolopoulos, D.S., Dongarra, J. (eds.) EuroMPI 2011. LNCS, vol. 6960, pp.
272–281. Springer, Heidelberg (2011)
3. Huang, K.H., Abraham, J.: Algorithm-based fault tolerance for matrix operations.
IEEE Transactions on Computers C-33(6), 518–528 (1984)
4. Oh, N., Shirvani, P., McCluskey, E.J.: Error detection by duplicated instructions
in super-scalar processors. IEEE Transactions on Reliability 51(1), 63–75 (2002)
5. Oh, N., Shirvani, P., McCluskey, E.: Control-flow checking by software signatures.
IEEE Transactions on Reliability 51(1), 111–122 (2002)
6. Rebaudengo, M., Reorda, M., Violante, M., Torchiano, M.: A source-to-source


compiler for generating dependable software. In: Proceedings of First IEEE In-
ternational Workshop on Source Code Analysis and Manipulation 2001, pp. 33–42
(2001)
7. Reis, G.A., Chang, J., Vachharajani, N., Rangan, R., August, D.I.: SWIFT: Software
implemented fault tolerance. In: Proceedings of the International Symposium on
Code Generation and Optimization, CGO 2005, pp. 243–254. IEEE Computer
Society, Washington, DC (2005), http://dx.doi.org/10.1109/CGO.2005.34
8. Sandia National Laboratory: Mantevo project home page (June 2011),
https://software.sandia.gov/mantevo
9. Schroeder, B., Pinheiro, E., Weber, W.D.: DRAM errors in the wild: a large-scale
field study. In: Proceedings of the Eleventh International Joint Conference on Mea-
surement and Modeling of Computer Systems, SIGMETRICS 2009, pp. 193–204.
ACM, New York (2009), http://doi.acm.org/10.1145/1555349.1555372
10. Shirvani, P., Saxena, N., McCluskey, E.: Software-implemented EDAC protection
against SEUs. IEEE Transactions on Reliability 49(3), 273–284 (2000)
Reducing the Impact of Soft Errors
on Fabric-Based Collective Communications

José Carlos Sancho, Ana Jokanovic, and Jesus Labarta

Barcelona Supercomputing Center


Barcelona, Spain

Abstract. Collective operations might have a big impact on the per-


formance of scientific applications, especially at large scale. Recently, Fabric-based
collectives have been proposed to address scalability issues caused by OS
jitter. However, soft errors are becoming the next factor that might significantly
degrade collective performance at scale.
This paper evaluates two approaches to mitigate the negative effect of
soft errors on Fabric-based collectives. These approaches are based on
replicating multiple times the individual packets of the collective. One
of them replicates packets through independent output ports at every
switch (spatial replication), whereas the other only uses one output port
but sending consecutively multiple packets through it (temporal replica-
tion). Results on a 1,728-node cluster showed that temporal replication
achieves up to 50% better performance than spatial replication in the pres-
ence of random soft errors.

1 Introduction
Large-scale HPC clusters containing thousands of nodes are usually intercon-
nected using a commodity interconnect such as InfiniBand [1] and arranged
in a cost-effective slimmed fat-tree topology [2]. One popular example of these
systems is the Roadrunner supercomputer [3] built at Los Alamos National Lab-
oratory, which was the first supercomputer to achieve one petaflop of peak
performance.
On these systems, MPI is the de facto standard for communication. MPI pro-
vides both point-to-point communications as well as collective communications.
Collective communications are group communications between many nodes used
for different purposes such as combining partial results of computations (such as
Gather and Reduce), synchronization of nodes (Barrier), and publication (Broad-
cast). However, because collectives involve the participation of all the members
in the group before it can conclude, any variance in node communication re-
sponsiveness and performance for any member of the group have a big impact
on the completion time. One major cause of variability is produced by the OS
jitter. Recently, Fabric-based collective communications [4] have been proposed
to address this scalability problem by moving the collective’s calculation from
nodes onto switches which do not have OS jitter.
Another important factor that introduces high variability and performance
degradation in collectives is soft errors. These errors correspond to alterations

in the bit stream received over a communication channel. They are caused by
various factors such as channel noise, interference, distortion, bit synchronization
and attenuation problems. The frequency of these errors is measured
by network manufacturers using the Bit Error Rate (BER), the number of bit errors
divided by the total number of transferred bits during a studied time interval.
Typical BER values found on transmission channels range from 10^-12 down to
10^-15 for high-end optical cables. Although the probability of an error happen-
ing on a single channel is small, the large number of communication channels
found in clusters results in a system-wide time between errors that is quite small. Note that next
generation of supercomputers will contain in the order of millions of channels.
Unfortunately, Fabric-based collectives can suffer from soft errors. The gen-
eral approach to dealing with these errors in current interconnection networks is
to detect them at the receiver side with a CRC code, and then ask for a
re-transmission if an error is found. However, these re-transmissions are in-
evitably adding delays to the individual messages involved in each step of the
collective’s calculation resulting in an overall performance loss. Providing a tech-
nique to avoid these re-transmissions or ameliorate their negative impact would
be of great interest. One possible technique would be the use of error-correcting
codes (ECC), so errors can be detected and also corrected. However, since some
of the collective’s messages are not fully stored in switches, ECC is not viable
on Fabric-based collectives.
In this paper, we propose the use of message replication in order to reduce
the degradation caused by soft errors on Fabric-based collectives. Two different
replication techniques have been evaluated: spatial and temporal replication.
Results on a 1,728-node InfiniBand cluster arranged as a slimmed fat-tree show
that temporal replication is the most effective solution to mitigate the negative
effects of soft errors on Fabric-based collectives. Performance improvements of up to
50% can be achieved with respect to spatial replication.
The rest of this paper is organized as follows. Section 2 briefly describes the
operation of Fabric-based collectives on slimmed fat-tree topologies. Section 3
shows how InfiniBand detects and handles soft errors. Section 4 describes two
approaches based on spatial and temporal replications to mitigate the negative
effects of soft errors. Section 5 characterizes the impact on collective performance
when soft errors are present in the network for both proposed techniques. Section
6 summarizes recent approaches to deal with network errors. Conclusions from
this work are given in Section 7.

2 Fabric-Based Collectives
Fabric-based collectives is an approach to accelerating the calculation of col-
lective communications. It uses the switch CPU to perform the collective steps
and required calculations instead of using the host CPU as in the traditional ap-
proach. Recently, it has been integrated with the popular OpenMPI and Platform
MPI message passing libraries and it is fully supported on InfiniBand networks.
Basically, this scheme is composed of a manager that orchestrates the ini-
tialization of the collective communication tree and an SDK that offloads the
computation of the collective onto the switches. Today’s switches have been re-
designed and optimized to support a superscalar FPU hardware engine that
performs single- and double-precision operations in a single cycle. This
technology has been specifically targeted at the MPI_Barrier, MPI_Reduce,
and MPI_Allreduce operations, which are in turn the most frequent operations
found in scientific applications.
In essence, the collective calculation is composed of two phases: a reduction
phase and a broadcast phase. In the first phase, switches aggregate the collective
values from all the computing nodes and switches attached to them, calculate the
resulting value, and forward it to higher level switches. The root of the collective
tree calculates the final reduction and on the second collective phase the result is
broadcast to computing nodes using multicast operations. Notice that calculating
the reduction implies that messages have to be fully received at
the switches before the partial result is sent to upper switches in the reduction
phase. However, on the final broadcast phase, there is no need to wait until the
full message is received to start transmitting it down to computing nodes.

3 Handling Soft Errors on InfiniBand


The InfiniBand’s link layer is where most soft errors are detected when packets
traverse through the network. According to its specification [5] there is a diverse
set of soft errors that can be detected:

– Physical errors. Errors indicative of bit errors at the attached physical link.
These are detected by CRC checks.
– Malformed packet errors. Errors indicative of packets transmitted with in-
consistent content.
– Switch routing errors. Errors indicative of an error in switch routing.
– Buffer overrun. Error indicative of an error in the state of the flow control
machine.

When one of these errors is detected on a packet at a switch, the
immediate action is to discard the packet, record the type of error for further
processing, and notify the sender side of the transmission that the packet was
corrupted. This last action is performed by the hardware-level ACK messages.
Originally, InfiniBand's host channel adapters (HCAs) were the ones sending
these notifications back to the sender HCAs. However, these notifications also
have to be supported at switches for Fabric-based collectives, because switches
now become the originators of packets.
A corrupted packet may simply be dropped if it has not yet been
forwarded to the next switch; if its transmission has already started,
switches append a bad CRC value and the End Bad Packet (EBP) delimiter
as an alternative to dropping the packet.
InfiniBand implements two different CRC checks in every packet: the invariant
CRC (ICRC) and the variant CRC (VCRC). The ICRC is 4 bytes long and covers
only the fields of the packet that are invariant from end to end on the network.
The polynomial used is CRC-32. The VCRC, on the other hand, is a 2-byte
CRC-16 that covers all fields of the packet and is computed
at every switch. Both ICRC and VCRC are appended at the end of the packets.
Additionally, note that there is no support for ECC on InfiniBand.
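For illustration, a bitwise CRC-16 over a packet buffer looks as follows. The polynomial (0x1021, CRC-16-CCITT) and initial value are assumptions made for the example; the exact ICRC/VCRC polynomials and the fields they cover are fixed by the InfiniBand specification, and real switches compute them in hardware.

#include <cstddef>
#include <cstdint>

// Bitwise CRC-16 over len bytes of data; illustrative, not the InfiniBand VCRC.
uint16_t crc16(const uint8_t* data, size_t len) {
    uint16_t crc = 0xFFFF;                         // assumed initial value
    for (size_t i = 0; i < len; ++i) {
        crc ^= static_cast<uint16_t>(data[i]) << 8;
        for (int bit = 0; bit < 8; ++bit)          // process one bit at a time
            crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                                 : static_cast<uint16_t>(crc << 1);
    }
    return crc;                                    // receiver recomputes and compares
}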

4 Collective Replication Approaches


In this section we describe two different approaches to mitigate the impact
of soft errors on Fabric-based collectives. The first one replicates
collective packets over different output ports of the switches; we call this
approach spatial replication. The second approach also replicates
collective packets, but over the same output port; we refer to it as temporal
replication.

4.1 Spatial Replication


This method takes advantage of the multiple output ports that may
be available to connect to upper-level switches on slimmed fat-tree topologies.
In essence, it duplicates the resulting collective packets at each switch and re-
transmits them on each of these output ports. This message duplication provides
some resiliency to soft errors: in case of packet corruption in one of these
channels, that packet would be delayed, but the copy in the other output
channel will still deliver the collective value to upper-level switches without
delay. Therefore, with this scheme, if the number of soft errors is lower than the
number of replications on the output ports, then there is no degradation due to
soft errors.
Figure 1 illustrates this approach showing in red lines the links used to calcu-
late the collective on a slimmed fat-tree topology [2] (XGFT(3;4,4,2;1,4,2)) for a
16-node parallel application spread out in four switches in the network. As can
be seen, S0 duplicates the resulting value R1 to both output ports which connect
to two different upper-level switches, S8 and S9 . In case that there is a soft error
occurring in channel S8 to S12 , R5 can still be delivered to upper level switches
through the alternate channel S8 to S13 . Note that in this approach we will have
multiple roots for calculating the final result. This is a necessary condition in
order to provide also resiliency when broadcasting the final result back to the
computing nodes. For example, if there is another soft error on channel from
root S12 to switch S8 , roots S13 , S14 , and S15 can still transmit down the result
R7 .
Also, notice that in order to provide resiliency at the HCAs, multiple
independent output channels must also be available on the HCAs. This can be achieved
on InfiniBand networks in two ways. The first one involves the use of dual-
port HCAs where each port connects to the same switch or to another switch
in the network. This is the most expensive solution because it requires du-
plicating the number of switches in the network in order to accommodate the
same number of computing nodes. Because of that, this solution is not com-
monly found in current production clusters. The second solution, which is more
[Figure: reduction tree over switches S0-S15 combining inputs A0-A15 into partial results R1-R7, with each partial result duplicated on two upward channels and a soft error shown on one of them.]
Fig. 1. Spatial replication scheme

cost-effective, relies on single-port HCAs but re-configures the port as two or
more independent channels. InfiniBand defines three different port widths, 1X,
4X, and 12X; 4X and 12X are achieved by aggregating four and twelve 1X channels
together, respectively. Since most production clusters deploy 4X-port
HCAs, we can configure them as four independent 1X channels in order
to obtain independent output channels. However, notice that with this solution
soft-error resiliency is provided at the expense of reducing the available
HCA bandwidth.

4.2 Temporal Replication


The temporal replication scheme re-transmits every collective packet multiple
times over the same output port of the collective tree. Basically, unlike
spatial replication, during the reduction phase one output port is enough to
provide resiliency to soft errors because the same packet is re-transmitted
consecutively multiple times. During the collective's broadcast phase, the final
result is also multicast on all output ports down towards the computing
nodes, again sending multiple copies of each packet through these channels.
Figure 2 illustrates this scheme for the same example as before during the col-
lective reduction phase. As shown, only one output port at each switch
and HCA is used, but on this channel the same packet is transmitted
twice. Note that with this scheme there is no need to split the HCA's port into
multiple independent output channels because only one output port is required.
Therefore, there is no reduction in bandwidth at the HCAs.
However, contrary to the spatial replication scheme, there may be some penalty
when a packet gets corrupted. For example, in case that R5 gets corrupted by a
soft error over the channel from S8 to S12, S12 will discard R5 and wait until
the second copy of R5 arrives. The collective's completion would suffer a delay
of just one packet transmission. In the other case, if the corrupted packet was
[Figure: same reduction tree as Fig. 1, but each switch and HCA uses a single upward channel and re-transmits every packet twice; a soft error is shown on one channel.]
Fig. 2. Temporal replication scheme

In the opposite case, if the corrupted packet is the second R5 instead of the first one, there is no delay at all: the first packet still carries a valid value, and the second packet is simply discarded when it arrives at S12.

5 Experiments
In this section we evaluate both schemes, spatial and temporal replication, for Fabric-based collectives in the case of one or multiple soft errors in the network. The evaluation is performed by simulation using the Venus network simulator [6], in which we have implemented the Fabric-based collective technology as well as both fault-tolerant schemes.
A large network containing 1,728 computing nodes is used in the evaluation. It is arranged as a 3-level slimmed fat-tree topology, XGFT(3;24,12,6;1,12,6). An InfiniBand network is considered, using 36-port switches and single-port HCAs, with a 100 ns delay for each HCA and switch. A 4X SDR (10 Gb/s) port configuration is used in the evaluation. In the spatial approach, the HCA's bandwidth is reduced proportionally to the number of output-port replications. Replication degrees of two, four, and six are considered for both the temporal and the spatial technique.
We used the MPI_Allreduce collective operation in our evaluations. This is the most common collective operation found in scientific applications because it is used in Conjugate Gradient solvers. We assume a collective operation with few operands that fit in InfiniBand's minimum transfer unit of 256 bytes.
One and multiple soft errors are injected into the network. We consider the case of one and two soft errors on specific channels, and also multiple soft errors randomly affecting multiple channels. The latter case follows an exponential distribution of errors with mean values of 1, 10, 100, and 1000 µs. In this case, the average time to complete the collective in the presence of soft errors is reported over a thousand collective operations.
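The random-failure scenario can be reproduced by drawing error arrival times with exponentially distributed inter-arrival times; the sketch below (our own illustration in Python, not the Venus simulator code) generates injection times for the four mean values used above.

import random

def soft_error_times(mean_us, horizon_us):
    """Soft-error arrival times (in microseconds) over a simulation window,
    with exponentially distributed inter-arrival times of the given mean."""
    t, times = 0.0, []
    while True:
        t += random.expovariate(1.0 / mean_us)
        if t > horizon_us:
            return times
        times.append(t)

if __name__ == "__main__":
    for mean in (1, 10, 100, 1000):
        errors = soft_error_times(mean_us=mean, horizon_us=10000)
        print(mean, len(errors))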

Fig. 3. Non-failure and all channels failure scenarios
Fig. 4. One and two soft errors on selected network channels

5.1 Results

Figure 3 shows the performance of Fabric-based collectives for spatial and temporal replication in two extreme cases: a failure-free scenario and a scenario in which every network link experiences a failure once. In both techniques the number of replications used is two. As can be seen, temporal significantly outperforms spatial, by 22%, in the non-failure case. The reason is that the HCA's bandwidth has to be divided up to support multiple output channels in the spatial scheme. For the other extreme case, experiencing a soft error on every link, temporal still significantly outperforms spatial because spatial is not able to provide a failure-free path, and thus it is heavily penalized by multiple re-transmissions of collective packets once soft errors have been detected. In particular, there is a difference in performance of almost a 2X factor.
Figure 4 shows various cases with one and two soft errors occurring in specific network channels. The first case, with only one soft error in a channel, significantly degrades temporal, by 22% with respect to the non-failure case shown before, because of the time spent waiting for the second collective packet. Spatial does not suffer additional degradation in this case because another collective packet is still being transmitted through another channel. In this scenario both approaches achieve the same performance. Similarly, the case of two simultaneous soft errors in two different network channels and in two different switches, but at the same tree level, does not further degrade the performance of either technique. However, the interesting case arises when the soft errors occur in the same switch at level 1 for spatial. In this particular case, both output ports experience soft errors, and thus spatial cannot provide a fault-free path, suffering a 55% degradation. This does not happen at higher levels of the fat tree, as can be seen in the next set of results. This is due to the fat-tree topology, where replications on different output ports of a switch at level i send the collective to different switches at the upper level i + 1. Hence, if one of these switches at level i + 1 experiences failures, the other switch at the same level can still deliver the collective to the upper levels.

Fig. 5. Multiple random failures
Fig. 6. Various collective replications on the worst scenario (1 µs)

The last case shows a worse scenario for temporal, where the soft errors occur in two connected switches, each one sitting at a different level. In this failure scenario, temporal suffers a higher degradation than spatial because the first collective packet is consecutively delayed in both switches, thus paying twice the penalty of waiting for the second collective packet to arrive.
Figure 5 shows the scenario of experiencing random soft errors over multiple collectives. As can be seen, temporal significantly outperforms spatial. Specifically, at 1 ms, 100 µs, and 10 µs the collective time is reduced by 18%, 30%, and 14%, respectively. The margins shrink at the least frequent failure rate (1,000 µs) because, when there is a small number of failures, both techniques perform similarly, as shown before. Also, at a very high failure rate (1 µs) both techniques perform the same. The reason is that there are so many soft errors that almost every collective packet suffers one, and thus both techniques pay for multiple re-transmissions. In order to provide more resiliency in this environment we increased the number of collective replications up to six, as shown in Figure 6. As can be seen, the collective time significantly decreases as we increase the number of replications, especially for temporal. In particular, the collective time drops by 30% and 46% when going to 4 and 6 replications. Note that spatial does not decrease the collective time significantly from 2 to 4 replications because it also reduces the available HCA bandwidth proportionally. Overall, temporal still outperforms spatial in these cases: performance improvements of 50% and 22% are seen for 4 and 6 replications.

6 Related Work

Providing fault tolerance to MPI communications is a hot research topic today. Currently, the MPI Forum's Fault Tolerance Working Group is working on this subject. Recently, they proposed a run-through stabilization component in MPI to deal with failures [7]. This component provides an application with the ability to continue running and using MPI even when one or more

processes in the MPI universe fail. In this context, various fault-tolerant algorithms for dealing with hard failures in collectives have been analyzed in [8] and [9]. In [8] these algorithms are based on a new MPI function, MPI_Comm_validate, that can check for process failures at any time. If a process failure is detected, the collective tree is rebuilt accordingly, so that for the next collective the tree is already working and optimized. In [9], the collective tree is only rebuilt when the failure is detected during the collective operation. However, rebuilding the collective tree is too expensive for handling soft errors on Fabric-based collectives.
Additionally, an enhanced resilient protocol for Eager and Rendezvous point-to-point communications has been proposed in [10]; it covers fabric end-to-end hard and soft failures, including the HCAs. Unlike our approach, the basic idea there is to act only as soon as a failure is detected, but this may lead to higher degradation. We believe that a pro-active approach, which also acts before a failure occurs, is better suited to reducing the potentially harmful degradation caused by soft errors.

7 Conclusions
Soft errors can have a big impact on the performance of collective communication operations. For these operations, acting only when soft errors occur is not efficient enough, and thus pro-active solutions are highly recommended. We have evaluated two such pro-active solutions, called spatial and temporal replication.
Evaluations show that temporal replication delivers higher performance than spatial replication. In particular, a 50% lower degradation is observed for temporal with respect to spatial in the presence of soft errors. Therefore, temporal replication effectively diminishes the impact of soft errors. In addition, temporal replication can be seamlessly deployed in current production systems because it does not require special hardware; note that spatial would require at least 4X HCAs.
We understand that the only additional benefit of spatial would come from its potential to also mitigate hard failures. However, this work demonstrates that spatial achieves very poor performance with respect to temporal, which makes it less attractive to deploy as a stand-alone solution.
Acknowledgments. We thankfully acknowledge the support of the Spanish
Ministry of Science and Innovation under grant RYC2009-03989, the European
Commission through the HiPEAC-2 Network of Excellence (FP7/ICT 217068),
the Spanish Ministry of Education (TIN2007-60625 and CSD2007-00050), and
the Generalitat de Catalunya (2009-SGR-980).

References
1. InfiniBand website: Infiniband trade association, official website on,
http://www.infinibandta.org
2. Öhring, S.R., Ibel, M., Das, S.K., Kumar, M.J.: On generalized fat trees. In: Pro-
ceedings of the 9th International Parallel Processing Symposium, p. 37. IEEE Com-
puter Society, Washington, DC (1995)
Reducing the Impact of Soft Errors on Fabric-Based Collectives 271

3. Barker, K.J., Davis, K., Hoisie, A., Kerbyson, D.J., Lang, M., Pakin, S., Sancho,
J.C.: Entering the petaflop era: the architecture and performance of roadrunner.
In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC 2008,
pp. 1:1–1:11 (2008)
4. Mellanox: Fabric collective accelerator (2011),
http://www.mellanox.com/related-docs/
prod acceleration software/fca.pdf
5. InfiniBand specification: Infiniband trade association, infiniband architecture spec-
ification, vol. 1, release 1.0.a (2001)
6. Minkenberg, C., Rodriguez, G.: Trace-driven co-simulation of high-performance
computing systems using OMNeT++. In: Proceedings of the 2nd International
Conference on Simulation Tools and Techniques, Simutools 2009 (2009)
7. Fault Tolerance Working Group: Run-through stabilization interfaces and
semantics,
svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/runthroughstabilization
8. Hursey, J., Graham, R.: Preserving collective performance across process failure
for a fault tolerant. In: 16th International Workshop on High-Level Parallel Pro-
gramming Models and Supportive Environments (HIPS) held in conjunction with
the 25th International Parallel and Distributed Processing Symposium (IPDPS),
Anchorage, Alaska (May 2011)
9. Jaros, J.: Evolutionary Design of Fault Tolerant Collective Communications. In:
Hornby, G.S., Sekanina, L., Haddow, P.C. (eds.) ICES 2008. LNCS, vol. 5216, pp.
261–272. Springer, Heidelberg (2008)
10. Koop, M.J., Shamis, P., Rabinovitz, I., Panda, D.K.: Designing high-performance
and resilient message passing on infiniband. In: Communication Architecture for
Scalable Systems Workshop held in conjunction with the 25th International Parallel
and Distributed Processing Symposium (IPDPS), Atlanta, Georgia USA (April
2010)
Evaluating Application Vulnerability to Soft
Errors in Multi-level Cache Hierarchy

Zhe Ma1,3 , Trevor Carlson2,3, Wim Heirman2,3 , and Lieven Eeckhout2,3


1 Imec, Kapeldreef 75, 3000 Leuven, Belgium, mazhe@imec.be
2 Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium, {tcarlson,wheirman,leeckhou}@elis.ugent.be
3 Intel ExaScience lab, Kapeldreef 75, 3000 Leuven, Belgium

Abstract. As the capacity of caches increases dramatically with new processors, soft errors originating in cache memories have become a major reliability concern for high performance processors. This paper presents an application-specific soft error vulnerability analysis in order to understand an application's responses to soft errors from different levels of caches. Based on a high-performance processor simulator called Graphite, we have implemented a fault injection framework that can selectively inject bit flips into different levels of caches. We simulated a wide range of relevant bit error patterns and measured the applications' vulnerabilities to bit errors. Our experimental results show the differing vulnerabilities of applications to bit errors in different levels of caches (e.g., the application failure rate for one program is more than double that of another program for a given cache); the results also indicate the probabilities of different failure behaviors for the given applications.

Keywords: Soft error, processor simulator, fault injection.

1 Introduction
Two trends are observed in the ongoing development of future generations of high performance computing systems: 1) processors are fabricated with CMOS processing technology that is constantly scaling down, and 2) commodity high-end processors, rather than customized processors, are more widely employed to reduce the total cost. The combination of these two trends makes reliable operation of the hardware much more difficult [16]. Unreliable hardware behaviors can be roughly split into hard errors and soft errors. While a hard error is a persistent hardware failure, a soft error is a transient failure and hence harder to detect and analyze.

This work is funded by Intel and by the Institute for the Promotion of Innovation
through Science and Technology in Flanders (IWT).


Soft errors are a well-known reliability concern. However, it is known that applications have intrinsic masking of soft errors (error rate derating) [12,14]; thus many bit errors are filtered out and are not visible at the application level. To find a cost-effective soft error mitigation strategy, it is necessary for system designers to perform fault injection tests in order to obtain a good estimate of the application-level soft error rate. How to perform error injection is an important topic in computer system reliability analysis. Different approaches have been developed and can be roughly grouped into the following categories:

Hardware Built-In Injection. These approaches [10] depend on specific units built into the hardware, usually in the form of Built-In Self-Test. They are only available in existing hardware.
Accelerated Hardware Emulation. These approaches [13,6] employ a detailed model of the target hardware and simulate this model on an accelerator (usually an FPGA platform). The requirement of a detailed model usually means this can only be performed at a later stage of the design workflow.
Software Based Injection. These methods [5,2] run an altered software application natively on its target processor. The modifications inside the application can inject faults into the application-visible memory addresses.
Simulated Hardware Based Injection. These approaches [17] employ a high-level model of the target hardware and simulate this model on a general-purpose computer. The high-level model is usually less accurate but is easier to modify in order to model innovative features of a yet-to-be-implemented processor. It is usually fast enough to test relevant software applications and hence to obtain a more application-specific estimate of the effect of faults.

We chose simulated hardware based injection because 1) it has relatively low development and deployment cost and 2) it provides flexibility to explore novel processor architectures. Thanks to a high-performance processor simulator called Graphite [11], we can efficiently simulate an application's responses when soft-error-induced bit errors take place in a processor's cache hierarchy.
Previous studies [5,2] investigated the derating ratio of scientific applications to bit errors from the (off-chip) memory; they did not compare the effects of soft errors from a multi-level cache hierarchy. However, as its capacity increases dramatically, the cache has become a major source of soft-error-induced bit errors on high-end processors [1]. To obtain an accurate estimate of an application's response to soft errors in the cache, it is necessary to know the application's access patterns to the different levels of caches. However, application cache access patterns are difficult to determine from a static analysis, especially for multi-threaded applications.
In this paper we present a simulation-based analysis that can better reveal realistic cache access patterns. We then perform fault injection directly into the caches when they are accessed by simulated applications. In this way we can obtain the best approximation of a soft error's effect on an application, and can distinguish this effect across the different levels of caches.

In the rest of this paper we first describe the processor simulator that we used
for fault injection and how we can conduct fault injections in the cache hierarchy
(Section 2); then we present our motivation for various bit error patterns used
in the fault injection (Section 3); next, we describe the experimental process
(Section 4) and present the experimental results (Section 5); we finally discuss
the collected results and draw some conclusions (Section 6).

2 Graphite Simulator and Bit Error Injection

We want to obtain an application's response to cache bit errors by directly observing the simulated results after fault injection. The simulation should faithfully execute the target application's instructions, and the error bits should only be injected into the instructions that access the specific cache we select for injection, during the specified time period. We have chosen a processor architecture simulator called Graphite, developed at MIT [11], and used the extensions made by Ghent University [3].

2.1 Fast Processor Simulation


Graphite is a high performance processor simulator. Based on the dynamic instrumentation tool Pin [9], Graphite can dynamically re-compile and execute instructions directly from the x86 executable binary of a target application. This feature allows us to easily evaluate scientific applications that are already compiled for x86 target machines. Also, Graphite can be configured with different cache parameters, such as the number of levels, capacities, latencies and replacement strategies; this is especially true after applying the extension described in [3]. One thing to keep in mind is that Graphite only simulates application space. All system calls in the simulated applications are intercepted and handled by Graphite. Although these emulated system calls limit the insight into the OS kernel's responses to fault injection, our experiments are not greatly affected because we are mainly interested in evaluating scientific computing applications, where most processor time is spent in application space.

2.2 Cache Error Injection


We use a partly modified Graphite that allows us to configure the cache hierarchy of the modeled processor more flexibly. Based on a user-defined cache model, the Graphite simulator can determine for each memory access whether it hits in a specific cache. If it is a hit, Graphite can also determine which cache line is accessed.
Our Graphite can read a separate configuration file with information about the location and time in the cache hierarchy at which to insert a bit flip. An illustration of such a configuration file is shown below.

...
[fault_injection_model/L3]
start_cycle = 12022450
total_faults_nr = 1
err_bit_nr = 2
multi_byte_upset = false
...
A random configuration generator has been built to generate a large number of fault injection configuration files. While the injection location and time are randomized by the generator, the bit error pattern (see Section 3) in the configuration files is given as an input to the generator. Because soft errors are rare events and are unlikely to hit the same application more than once during its execution with the input size used in our simulation, we only inject one soft error per injection configuration file. One simulation is launched for each individual fault injection configuration file. During every simulation, the Graphite simulator flips the error bits specified in the configuration file, provided that a cache access at the selected cache level takes place during the specified time period. The injected error bits stay as long as they are not overwritten or flushed out of the cache.
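A minimal sketch of such a generator is shown below (our own illustration in Python, not the authors' tool; the field names follow the configuration listing above, while the file name, the cache-level string and the maximum cycle count are placeholder parameters).

import random

def write_injection_config(path, level, max_cycle, err_bit_nr, multi_byte):
    """Write one fault-injection configuration file in the format shown
    above, with a uniformly random injection time and a single fault."""
    with open(path, "w") as f:
        f.write("[fault_injection_model/%s]\n" % level)
        f.write("start_cycle = %d\n" % random.randrange(max_cycle))
        f.write("total_faults_nr = 1\n")          # one soft error per run
        f.write("err_bit_nr = %d\n" % err_bit_nr)
        f.write("multi_byte_upset = %s\n" % ("true" if multi_byte else "false"))

if __name__ == "__main__":
    for i in range(500):                          # 500 samples per cache (Section 4.2)
        write_injection_config("inject_L3_%03d.cfg" % i, "L3",
                               max_cycle=300000000,   # placeholder run length
                               err_bit_nr=2, multi_byte=False)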

3 Multiple Cell Upset and Bit Error Patterns


When processing technology scales down, the probability of a Multiple Cell Upset (MCU) increases dramatically [15,7]. We simulate the 2-, 4- and 6-cell upsets, which are known to increase in SRAMs and are hard to detect and/or correct by a conventional ECC mechanism.
Because caches can use physical bit interleaving, an MCU cannot be directly translated into a Multiple Bit Upset (MBU). Instead, how an MCU is translated into an MBU depends on the physical bit interleaving implemented in the cache array. Due to the high energy overheads associated with physical bit interleaving in large cache arrays [8], we only simulate physical bit interleaving for the L1 and L2 caches, with different degrees of interleaving (a small sketch of this mapping follows the list):
L1(D+I) 4-way interleaving. For both the L1 data and instruction caches we assume 4-way physical bit interleaving. The 2-, 4- and 6-cell MCUs are then translated into a single-bit upset and a 2-bit upset in two consecutive bytes.
L2 2-way interleaving. For the L2 cache the 2-, 4- and 6-cell MCUs are translated into a single-bit upset, and 2- and 3-bit upsets in two consecutive bytes.
L3 no interleaving. When no physical bit interleaving is present, the 2-, 4- and 6-cell MCUs are directly translated into 2-, 4- and 6-bit upsets in a single byte.
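The translation rules above follow directly from distributing the adjacent upset cells round-robin over the interleaved logical words; the short sketch below (our own illustration) reproduces the list for the interleaving degrees used in this study.

import math

INTERLEAVING = {"L1": 4, "L2": 2, "L3": 1}     # degrees assumed in this study

def bits_per_word(mcu_cells, interleaving):
    """Worst-case number of upset bits landing in one logical word when an
    MCU of 'mcu_cells' adjacent cells hits an array whose columns are
    physically interleaved across 'interleaving' words."""
    return math.ceil(mcu_cells / interleaving)

if __name__ == "__main__":
    for level, degree in INTERLEAVING.items():
        # prints L1 [1, 1, 2], L2 [1, 2, 3], L3 [2, 4, 6]
        print(level, [bits_per_word(cells, degree) for cells in (2, 4, 6)])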

4 Simulation Setup
4.1 Applications
We use the SPLASH-2 benchmarks [18] as our applications for the fault injec-
tion simulation. SPLASH-2 benchmarks have a variety of scientific computing

programs that are widely used with processor simulators. Because computational kernels usually account for most of the execution time of scientific computations, we only present results from three computational kernels of the SPLASH-2 suite in this paper. The selected kernels are a sparse matrix factorization (Cholesky), a fast Fourier transform (FFT) and an integer radix sort (Radix). The problem sizes used for each benchmark are listed in Table 1. All benchmarks are compiled by GCC in 64-bit mode, with the -O3 optimization.

Table 1. Simulated SPLASH-2 benchmarks and problem sizes

Benchmark Program size


Cholesky tk25.O
FFT 256K points
Radix 256K integers

4.2 Simulation Parameters


We simulate our applications with a processor model that resembles the Intel Xeon Dunnington processor (X7460). An X7460 has six cores; each core has private L1 data and instruction caches (32KB + 32KB). Every two cores share an L2 cache of 3MB. All cores share a single L3 cache of 16MB.
Because of the large number of cache accesses, it is too expensive to do an exhaustive fault injection at every cache access during simulation. Hence we apply statistical sampling techniques to estimate the responses of the simulated applications. Suppose a cache is accessed X times during the execution; the total population space for this cache is then X. What we need to determine is a sampling size x that is (much) smaller but can still give a reasonable estimate of the probabilities associated with the different application responses. For applications that run for an extended period of time, the cache access numbers are very large (up to several hundreds of millions for the L1 caches). Thus we assume the application responses follow a normal distribution. Based on sampling theory [4] and the baseline profiling results, we calculate that the sampling number we need for each cache is around 500. Such a sampling size obtains an error margin of less than 5% with a statistical confidence level of 95%. Therefore, we repeat the fault injection for each individual cache 500 times (i.e., with 500 different random fault injection files for each cache when simulating an application).
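The sampling size follows from the standard normal-approximation formula for estimating a proportion; the sketch below is our own illustration of that textbook formula [4], not the authors' exact calculation. It shows that roughly 385 samples already bound the error at 5% with 95% confidence in the worst case, so 500 injections per cache comfortably cover it.

def required_samples(margin=0.05, z=1.96, p=0.5):
    """Normal-approximation sample size for estimating a proportion p to
    within +/- margin at the confidence level implied by z (1.96 ~ 95%).
    p = 0.5 is the worst case; with hundreds of millions of cache accesses
    the finite-population correction is negligible."""
    return (z * z * p * (1.0 - p)) / (margin * margin)

if __name__ == "__main__":
    print(round(required_samples()))   # ~385, so 500 injections per cache suffice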

5 Simulation Results
We first profile our target applications by running them on the simulator without fault injection. These profiling results are called baseline results. We use the baseline results 1) to establish the normal execution time for each application and 2) to collect

the correct output, if applicable. We repeat each simulation 10 times and obtain consistent profiling results, as shown in Table 2. Note that, as a Dunnington processor has six L1 I/D caches and three L2 caches, the access numbers in the table are averaged access numbers for each individual L1 and L2 cache. The total execution cycle count is the largest cycle count among the six cores.

Table 2. Baseline profiling information

Benchmark  L1-I cache accesses  L1-D cache accesses  L2 cache accesses  L3 cache accesses  Total exec. cycles
Cholesky   147344731            48832936             1776513            1147177            296651192
FFT        34513627             7581382              368357             572405             54740721
Radix      12408179             1878307              150693             62016              16366277

In the second step we simulate the applications with the randomly generated
fault injection configuration files. We have observed different responses from the
simulations with injections. In the rest of this section, we first describe four
different responses caused by fault injections; then we compare the responses
from different benchmarks for fault injections from each cache level.

5.1 Application Response to Fault Injections


The four types of responses that we observed from the applications are listed below; a sketch of how each run outcome can be classified follows the list. These responses occur with different frequencies in our simulations. We summarize the occurrence percentages of each response in Tables 3-6. Also, we compare the vanished-fault percentages of all applications in Figure 1. As shown in this figure, applications have different levels of vulnerability to injected bit errors in different caches.
Fault Vanished. This is a response in which the target application finishes its execution successfully.
Application Crash. This is a response in which the target application aborts its execution. This usually happens with an error return value from the application or from the libraries (such as glibc) used by the application. In most cases the application gets a segmentation fault.
Application Hang. Because each simulation run for a given application takes the same input data, and the applications have no intrinsic reason to show very different execution times, we consider an application to be hanging if it has been running for three times as long as its baseline execution time.
Silent Data Corruption. This response is defined as the target application finishing its execution successfully, without exceeding three times its baseline execution time, but producing a final output that cannot pass the correctness test. We only perform this test for FFT in this paper.
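For reference, the classification of each injected run can be expressed mechanically as follows (a Python sketch under our own naming; the exit code, measured runtime and optional reference output are assumed to be collected by the experiment driver).

def classify_run(exit_code, runtime, baseline_runtime,
                 output=None, reference_output=None):
    """Map one fault-injection run onto the four responses defined above."""
    if runtime > 3 * baseline_runtime:
        return "application hang"          # ran 3x longer than the baseline
    if exit_code != 0:
        return "application crash"         # aborted, e.g. segmentation fault
    if reference_output is not None and output != reference_output:
        return "silent data corruption"    # finished but output is wrong
    return "fault vanished"                # finished with the correct output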

Table 3. Application response percentages for L1 instruction cache fault injection simulations; as explained in Section 3, two types of bit error patterns are simulated: 1-bit upset in a single byte (SBU1) and 2-bit upset in consecutive bytes (MBU2)

                Cholesky  FFT   Radix
SBU1  Crash     10.5      12.1  11.2
      Hang      1.1       1.9   1.5
      SDC       –         0.1   –
      Vanished  88.4      85.9  87.3
MBU2  Crash     10.8      11.6  11.5
      Hang      1.2       2.2   1.5
      SDC       –         0.3   –
      Vanished  88.0      85.9  87.0

Table 4. Application response percentages for L1 data cache fault injection simulations: 1-bit upset in a single byte (SBU1) and 2-bit upset in consecutive bytes (MBU2)

                Cholesky  FFT   Radix
SBU1  Crash     12.1      7.9   5.0
      Hang      7.3       4.2   2.2
      SDC       –         1.5   –
      Vanished  80.6      86.4  92.8
MBU2  Crash     16.0      9.0   6.7
      Hang      5.2       4.1   2.3
      SDC       –         1.5   –
      Vanished  78.8      85.4  91.0

Table 5. Application response percentages for L2 cache fault injection simulations; three types of bit error patterns are simulated: 1-bit upset (MBU1), 2-bit upset (MBU2) and 3-bit upset (MBU3), all three in consecutive bytes

                Cholesky  FFT   Radix
MBU1  Crash     8.1       6.9   8.0
      Hang      2.8       1.0   3.0
      SDC       –         0.5   –
      Vanished  89.1      91.6  89.0
MBU2  Crash     9.0       7.5   8.9
      Hang      4.2       1.7   4.2
      SDC       –         0.9   –
      Vanished  86.8      89.9  86.9
MBU3  Crash     9.4       7.6   9.5
      Hang      4.9       1.9   4.9
      SDC       –         0.9   –
      Vanished  85.7      89.6  85.6

Table 6. Application response percentages for L3 cache fault injection simulations; three types of bit error patterns are simulated: 2-bit upset (SBU2), 4-bit upset (SBU4) and 6-bit upset (SBU6), all three in a single byte

                Cholesky  FFT   Radix
SBU2  Crash     5.0       8.8   4.8
      Hang      1.0       3.0   1.1
      SDC       –         0.7   –
      Vanished  94.0      87.5  94.1
SBU4  Crash     5.2       11.4  5.6
      Hang      1.3       4.8   1.4
      SDC       –         1.1   –
      Vanished  93.5      82.7  93.0
SBU6  Crash     5.5       13.0  5.7
      Hang      1.2       4.9   1.5
      SDC       –         5.1   –
      Vanished  93.3      77.0  92.8

Fig. 1. Comparison of consolidated percentages of vanished faults for all applications; the vanished-fault percentage for each application is the average of its vanished-fault percentages, assuming each bit error pattern has the same occurring probability

6 Conclusions
We present a cache fault injection framework based on a fast processor simulator.
Running several scientific computing programs on this simulator with injected

cache bit errors, we have observed various responses from the simulated programs, with different probabilities. All programs show that a large percentage of errors are filtered out and hence invisible at the application level. For the errors that do cause an application failure, an application crash is the most likely type of failure (4.8% – 16.0%), while silent data corruption, though relatively rare, is still not negligible (up to 5.1% for FFT). Moreover, our results indicate that different programs have different levels of vulnerability to bit errors injected in different caches (e.g., 6.4% application failures for Cholesky vs. 17.6% for FFT in the L3 cache fault injection simulations). These results suggest that the benefit of protecting an individual cache depends on the application program that is running on the processor.

References
1. Baumann, R.: Soft errors in advanced computer systems. IEEE Design & Test of
Computers 22(3), 258–266 (2005)
2. Bronevetsky, G., de Supinski, B.R.: Soft error vulnerability of iterative linear al-
gebra methods. In: SELSE (2007)
3. Carlson, T.E., Heirman, W., Eeckhout, L.: Exploring the level of abstraction for
scalable and accurate parallel multicore simulation. In: SC (2011)
4. Cochran, W.G.: Sampling Techniques, 3rd edn. John Wiley (1977)
5. da Lu, C., Reed, D.A.: Assessing fault sensitivity in MPI applications. In: SC, p.
37. IEEE Computer Society (2004)
6. Daveau, J.-M., Blampey, A., Gasiot, G., Bulone, J., Roche, P.: An industrial fault
injection platform for soft-error dependability analysis and hardening of complex
system-on-a-chip. In: IRPS, pp. 212–220 (2009)
7. Heidel, D., Marchal, P., et al.: Single-event upsets and multiple-bit upsets on a
45nm SOI SRAM. IEEE Transactions on Nuclear Science 56(6), 3499–3504 (2009)
8. Kim, J., Hardavellas, N., Mai, K., Falsafi, B., Hoe, J.C.: Multi-bit error tolerant
caches using two-dimensional error coding. In: MICRO, pp. 197–209 (2007)
9. Luk, C.-K., Cohn, R.S., Muth, R., Patil, H., Klauser, A., Geoffrey Lowney, P., Wal-
lace, S., Reddi, V.J., Hazelwood, K.M.: Pin: building customized program analysis
tools with dynamic instrumentation. In: PLDI, pp. 190–200 (2005)
10. Mak, T.M., Mitra, S., Zhang, M.: DFT assisted built-in soft error resilience. In:
IOLTS, p. 69 (2005)
11. Miller, J.E., Kasture, H., Kurian, G., Gruenwald III, C., Beckmann, N., Celio, C.,
Eastep, J., Agarwal, A.: Graphite: A distributed parallel simulator for multicores.
In: HPCA, pp. 1–12 (2010)
12. Mukherjee, S.S., Weaver, C.T., Emer, J.S., Reinhardt, S.K., Austin, T.M.: A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In: MICRO, pp. 29–42. ACM/IEEE (2003)
13. Ramachandran, P., Kudva, P., Kellington, J.W., Schumann, J., Sanda, P.: Statistical fault injection. In: DSN, pp. 122–127. IEEE Computer Society (2008)
14. Rao, S., Sanda, P., Ackaret, J., Barrera, A., Yanez, J., Mitra, S.: Examining workload dependence of soft error rates. In: SELSE (2008)

15. Ruckerbauer, F.X., Georgakos, G.: Soft error rates in 65nm SRAMs analysis of
new phenomena. In: IOLTS, pp. 203–204 (2007)
16. Schroeder, B., Gibson, G.A.: A large-scale study of failures in high performance
computing systems. In: DSN, pp. 249–258 (2006)
17. Wang, N.J., Fertig, M., Patel, S.J.: Y-branches: When you come to a fork in the
road, take it. In: IEEE PACT, pp. 56–66 (2003)
18. Woo, S.C., Ohara, M., Torrie, E., Singh, J.P., Gupta, A.: The SPLASH-2 programs:
Characterization and methodological considerations. In: ISCA, pp. 24–36 (1995)
Experimental Framework for Injecting Logic
Errors in a Virtual Machine to Profile
Applications for Soft Error Resilience

Nathan DeBardeleben1, Sean Blanchard1, Qiang Guan1,2, Ziming Zhang1,2, and Song Fu2
1 Los Alamos National Laboratory, Ultrascale Systems Research Center, High Performance Computing Division, Los Alamos NM 87544, USA, {ndebard,seanb}@lanl.gov
2 University of North Texas, Dependable Computing Systems Lab, Department of Computer Science and Engineering, Denton TX 76203, USA, {QiangGuan,ZimingZhang}@my.unt.edu, Song.Fu@unt.edu

Abstract. As the high performance computing (HPC) community con-


tinues to push for ever larger machines, reliability remains a serious ob-
stacle. Further, as feature size and voltages decrease, the rate of transient
soft errors is on the rise. HPC programmers of today have to deal with
these faults to a small degree and it is expected this will only be a larger
problem as systems continue to scale.
In this paper we present SEFI, the Soft Error Fault Injection frame-
work, a tool for profiling software for its susceptibility to soft errors.
In particular, we focus in this paper on logic soft error injection. Using
the open source virtual machine and processor emulator (QEMU), we
demonstrate modifying emulated machine instructions to introduce soft
errors. We conduct experiments by modifying the virtual machine itself
in a way that does not require intimate knowledge of the tested applica-
tion. With this technique, we show that we are able to inject simulated
soft errors in the logic operations of a target application without affecting
other applications or the operating system sharing the VM. We present
some initial results and discuss where we think this work will be useful
in next generation hardware/software co-design.

Keywords: soft errors, resilience, fault tolerance, reliability, fault injec-


tion, virtual machines, high performance computing, supercomputing.

1 Introduction

Reliability is recognized as one of the core challenge areas for extreme-scale


supercomputers by a number of studies including the Defense Advanced Re-
search Projects Agency (DARPA)[8] and the International Exascale Software
Project[7]. Additionally, the US Department of Energy’s Office of Advanced Sci-
entific Computing Research (ASCR) has held several workshops that produced
reports on this subject. Furthermore, several studies specifically on reliability


have found that major undertakings would be required to create resilient next-
generation systems[6,5]. High performance computing (HPC) systems of today
already struggle with reliability and these concerns are expected to only amplify
as systems are pushed to even larger scales.
The high performance computing (HPC) field of resilience aims to find ways to
run applications on often unreliable hardware with emphasis on making timely
progress toward a correct solution. The goal of resilience is to move beyond
merely tolerating faults but coexisting with failure to a point where failure is
recognized as the norm and not the exception.
One of the more daunting areas of resilience research is soft errors - those errors
which are generally transient in nature and difficult or impossible to reproduce.
Often these errors cause incorrect data values to be present in the system. While
soft errors are generally rare, there is evidence to believe that the rate is increas-
ing as feature sizes and voltages decrease [10]. Not only will these increasingly common errors negatively impact performance while hardware corrects some of them, we believe these errors will occur not only in the more familiar memory but also in logic circuits, where traditional techniques will neither detect nor be able to correct the error. This leads us to believe that next generation systems will either have to be hardened to get around these errors, or application programmers will have to learn to design for systems that give incorrect answers with some noticeable probability.
In this work we present SEFI, the Soft Error Fault Injection framework, a tool
aimed at quantifying just how resilient an application is to soft errors. While our
goal is to look at both corrupted data in memory and corrupted logic circuits, we
start our research by examining the latter. We choose to focus on logic errors as
faults in memory have been studied in the past and, to a large extent, hardware
to detect and correct such errors exists. Our software tools inject soft errors in the
logic operations at known locations in an application which allows us to observe
how the application responds to faulty behavior of the simulated hardware.
The rest of this paper is organized as follows: Section 2 presents an overview
of the logic soft error injection framework and then Section 3 outlines an initial
experiment and discusses the results. In Section 4 we discuss the importance of
this work and its intended uses. Section 5 compares our approach with other
work in the field. Finally, Section 6 discusses the future work and we conclude
with our findings in Section 7.

2 Overview of Methodology

SEFI’s logic soft error injection operational flow is roughly depicted in Figure 1.
First, the guest environment is booted and the application to inject faults into
is started. Next, we probe the guest operating system for information related to
the code region of the target application and notify the VM which code regions
to watch. Then the application is released, allowing it to run. The VM observes
the instructions occurring on the machine and augments the ones of interest. A more detailed explanation of these techniques follows.

Fig. 1. Overview of SEFI

2.1 Startup
Initial startup of SEFI begins by simply booting a debug enabled Linux kernel
within a standard QEMU virtual machine. QEMU allows us to start a gdbserver
within the QEMU monitor such that we can attach to the running Linux kernel
with an external gdb instance. This allows us to set breakpoints and extract
kernel data structures from outside the guest operating system as well as from
outside QEMU itself. This is a fairly standard technique used by many Linux
kernel developers. Figure 2 depicts the startup phase.

Fig. 2. SEFI's Startup Phase

2.2 Probe
Once the guest Linux operating system is fully booted and sitting idle, we use the attached external gdb to set a breakpoint at the end of the sys_exec call tree, but before an application is sent to a CPU to be executed. We are currently focused only on ELF binaries and have therefore set our breakpoint at the end of the load_elf_binary routine. This is trivial to generalize to other binary formats in future work. With the breakpoint set we are free to issue a continue via gdb to allow the Linux kernel to operate. The application of interest can now be started and will almost immediately hit our breakpoint and bring the kernel back to a stopped state. By this point in the exec procedure the kernel has already loaded the application's text section into physical memory, in a memory region denoted by the start_code and end_code elements of the task's mm_struct memory structure. We can now extract the location in memory assigned to our application by walking the task list in the kernel. Starting with the symbol init_task, we can find the application of interest either by comparing a binary name to the task_struct's comm field or by searching for a known pid, which is also contained in the task_struct. The physical addresses within the VM of the application's text region can now be fed into our fault

injection code in the modified QEMU virtual machine. Currently this is done
by hand but we have plans to automate this discovery and transfer using scripts
and hypervisor calls.
Figure 3 depicts the probe phase of SEFI.

Fig. 3. SEFI's Probe Phase

2.3 Fault Injection

In Figure 4 we see that once QEMU has the code segment range of the target application, the application is resumed. Next, whenever an opcode that we are interested in injecting faults into is executed by the guest hardware, QEMU checks the current instruction pointer register (EIP). If that instruction pointer address is within the range of the target application (obtained during the probe phase), QEMU knows that the application we are targeting is executing this particular instruction. At this point we are able to inject any number of faults and have confidence that we are affecting only the desired application.

Fig. 4. SEFI's Fault Injection Phase

The opcode fault injection code has several capabilities. Firstly, it can simply flip a bit in the inputs of the operation. Flipping a bit in the input simulates a soft error in the input registers used for this operation. Secondly, it can flip a bit in the output of the operation. This simulates either a soft error in the actual operation of the logic unit (such as a faulty multiplier) or a soft error in

the register after the data value is stored. Currently the bit flipping is random
but can be seeded to produce errors in a specified bit-range. Thirdly, opcode
fault injection can perform complicated changes to the output of operations by
flipping multiple bits in a pattern consistent with an error in part but not all of
an opcodes physical circuitry. For example, consider the difference in the output
of adding two floating point numbers of differing exponents if the a transient
error occurs for one of the numbers while setting up the significant digits so that
they can be added. By carefully considering the elements of such an operation
we can alter the output of such an operation to reflect all the different possible
incorrect outputs that might occur.
The fault injector also has the ability to let some calls to the opcode go
unmodified. It is possible to cause the faults to occur after a certain number of
calls or with some probability. In this way the fault can occur every time which
closely emulates permanently damaged hardware or can be used to emulate
transient soft errors by causing a single call to be faulty.
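The decision logic described above can be summarized in a few lines; the following is a Python model of that logic (our own illustration, not the actual patch to QEMU's translation code), with the code-range bounds, the skip count and the corruption probability as parameters.

import random

class OpcodeFaultInjector:
    """Model of the per-opcode injection decision: corrupt a value only when
    the guest EIP falls inside the target application's text region."""

    def __init__(self, start_code, end_code, skip_calls=0, probability=1.0):
        self.start_code, self.end_code = start_code, end_code
        self.skip_calls = skip_calls       # let this many calls pass untouched
        self.probability = probability     # chance of corrupting a later call
        self.calls = 0

    def maybe_corrupt(self, eip, value, nbits=1, width=64):
        if not (self.start_code <= eip < self.end_code):
            return value                   # another process or the kernel
        self.calls += 1
        if self.calls <= self.skip_calls or random.random() > self.probability:
            return value                   # this call goes unmodified
        for bit in random.sample(range(width), nbits):
            value ^= 1 << bit              # simulated upset in the result register
        return value

if __name__ == "__main__":
    inj = OpcodeFaultInjector(0x400000, 0x401000, skip_calls=9)   # placeholder range
    print(hex(inj.maybe_corrupt(eip=0x400123, value=0x3FF0000000000000)))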

3 Experiments

To demonstrate SEFI’s capability to inject errors in specific instructions we


provide two simple experiments. For each experiment we modified the translation
instructions inside of QEMU for each instruction of interest. Once the instruction
was called, the modified QEMU would check the current instruction pointer
(EIP) to see if the address was within the range of the target application. If
so, then a fault could be injected. We performed two experiments in this way,
injecting faults into the floating point multiply and floating point add operations.

3.1 Floating Point Multiply Fault Injection

For this experiment we instrumented the floating point multiply operation,


“mulsd”, in QEMU. We created a toy application which iteratively performs
Equation 1 40 times. The variable y is initialized to 1.0.

y = y ∗ 0.9 (1)

Then, at iteration 10 we injected a single fault into the multiplication operation by flipping a random bit in the output. Figure 5 plots the results of this experiment. The large solid line represents the output as it is without any faults. The other five lines represent separate executions of the application with different random faults injected. Each fault introduces a numerical error in the results which persists through the lifetime of the program.
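The same experiment can be reproduced outside the VM by flipping a random bit of the IEEE-754 encoding of y at iteration 10; the sketch below is our own stand-alone illustration (using Python's struct module), not the instrumented QEMU.

import random
import struct

def flip_random_bit(x):
    """Flip one random bit in the 64-bit IEEE-754 encoding of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    bits ^= 1 << random.randrange(64)
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits))
    return corrupted

y = 1.0
for i in range(1, 41):
    y *= 0.9
    if i == 10:                 # single simulated soft error in the multiply output
        y = flip_random_bit(y)
print(y)                        # compare against the fault-free value 0.9**40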
In Figure 6 we focus on two areas of interest from the plot in Figure 5. In
Figure 6(a) the plot is zoomed in to focus on the point where the five faults are
injected so as to make it easier to see. Figure 6(b) is focused on the final results
of the application. In this figure it becomes clear that each fault caused an error
to manifest in the application through to the final results.

Fig. 5. The multiplication experiment uses the floating point multiply instruction where a variable initially is set to 1.0 and is repeatedly multiplied by 0.9. For five different experiments a random bit was flipped in the output of the multiply at iteration 10, simulating a soft error in the logic unit or output register.

Fig. 6. Experiment #1 with the focus on the injection point (a) and the effects on the final solution (b). In (a) it can be seen that each of the five separately injected faults all cause the value of y to change - once radically, the other times slightly. In (b) it can be seen that the final output of the algorithm differs due to these injected faults.

Table 1. Results of Addition Tests

A B C D E F G H
30.0 30.0 30.0 30.0 30.0 30.0 30.0 30.0
31.0 31.125 32.0 481.0 23.0 8.5 128849018881.0 1966081.0
32.0 32.125 33.0 482.0 24.0 9.5 128849018882.0 1966082.0

3.2 Floating Point Addition Fault Injection


To demonstrate SEFI’s capability to inject faults into different instructions, we
provide another simple experiment which uses the floating point add operation,
“addsd”. This experiment simply added the value 1.0 repeatedly, as in Equa-
tion 2. At iteration 31 we had SEFI inject an error into the resulting addsd
instruction. As can be seen from Table 1, the error is varied and sometimes
appears in the exponent and other times in the mantissa of the binary repre-
sentation. In the table we focus only on the iterations of importance for brevity.
Column A represents the correct answer while the remaining columns all contain
an error on the second row (31st iteration).

y = y + 1.0 (2)
These experiments were crafted to demonstrate the capability of SEFI to inject
errors into specific instructions and clearly do not represent interesting applica-
tions. The next steps will be to inject faults into benchmark applications (such as
BLAS and LAPACK) to study the soft error vulnerability of those applications.

4 Intended Uses
It is our intention to use SEFI to study the susceptibility of applications to
soft errors (logic initially, and later followed by memory). We expect to be able
to produce reports on the vulnerability of applications at a fine grain level -
at least at the functional level and perhaps at the instruction level. We have
demonstrated that we can inject logic faults at specific assembly instructions but
translating those instructions back to original higher level language instructions
will likely prove complex.
Hardware designers expend a great deal of resources to prevent soft errors from propagating into the software stack. While the current wisdom is that these protections are necessary, there are a variety of applications that could survive with a great deal less protection and would willingly trade resilience for increases in performance or decreases in power or cost. We believe SEFI begins to present a way to experiment with and quantify the level of resilience of an application to soft errors, and might be useful in the co-design of future systems.

5 Related Work
The work presented in this paper builds on years of open source research on QEMU [1], a processor emulator and virtual machine. Bronevetsky et al. [3,4,2]

is probably the closest related work to SEFI in the high performance computing
field. In [2] they create a fault injection tool for MPI that simulates MPI faults
that are often seen on HPC systems, such as stalls and dropped messages. In
[3,4] they performed random bit flips of application memories and observed how
the application responded.
It is important to understand the difference between our approach and that
presented in the memory bit flipping work of Bronevetsky. Bronevetsky’s ap-
proach most likely closely simulates a bit flip caused by a transient soft error in
that the bit flip happens randomly in memory. While they target these bit flips
at a target application, there appears to be no correlation to whether the mem-
ory region will be used by the application. As stated, this closely approximates
a real transient soft error. Our work, on the other hand, directly targets specific
instructions and forces corruption to appear at those lines. This approach is
directly targeted more at hardening a code from soft errors. It is our intention
to add functionality similar to Bronevetsky’s approach as a plug-in to SEFI in
future work.
Naughton, et. al, in [9] developed a fault injection framework that either
uses ptrace or the Linux kernel’s built-in fault injection framework. The kernel
approach allows injection of three different types of errors: slab errors, page
allocation errors, and disk I/O errors. While both approaches in this work are
similar to SEFI, our technique allows us to probe a wider range of possible faults.
TEMU [11] is, like SEFI, a tool built upon QEMU. The TEMU BitBlaze infrastructure is used to analyze applications for "taint" in a security context. This tool does binary analysis using the tracecap software. We have not yet had the time to determine whether this suite of tools is usable for our purposes, but it does appear promising that we can build upon TEMU.
NFTAPE[12] is a tool which is similar to SEFI in that it provides a fault
injection framework for conducting experiments on a variety of types of faults.
NFTAPE is a commercial tool, however, and therefore we have not had the
luxury of experimenting with it to this point.

6 Future Work

In order to validate our simulation of soft errors in logic we plan to test the same
applications we use in the VM on actual hardware subjected to high neutron
fluxes. Neutrons are well known to be the component of cosmic ray showers
that causes the greatest damage to computer circuits[13]. Neutrons are known
to cause both transient errors due to charge deposition and hard failures due to
permanent damage. We will use the neutron beam at the Los Alamos Neutron
Science Center (LANSCE) to approximate the cosmic ray induced events in a
logic circuit over the lifetime of a piece of computational hardware. Previous
work using the LANSCE beam has shown its usefulness in inducing silent data
corruption (SDC) in applications of interest.
Future versions of SEFI will include plugins to simulate more sophisticated
types of faults. Logic errors are unlikely to consist of simple random bit flips.

We believe the combination of SEFI testing and neutron beam validation will
allow us to build realistic models of specific types of logic failures. We also plan
on extending SEFI to model multi-bit memory errors which are undetectable by
current memory correction techniques.

7 Conclusion
In this paper we have demonstrated the capability to inject simulated soft errors
into a virtual machine’s instruction emulation facilities. More importantly, we
have demonstrated how to target these errors so as to be able to reasonably
conduct experiments on the soft error vulnerability of a target application. This
type of experimentation is usually complicated because the faults that are introduced cause errors in other portions of the system, especially the operating system, and often result in outright crashes. This makes getting meaningful data about the injected faults difficult. The approach presented in this paper gets around these limitations and provides quite a bit of control.

Acknowledgements. Ultrascale Systems Research Center (USRC) is a collab-


oration between Los Alamos National Laboratory and the New Mexico Consortium (NMC). NMC provides the environment to foster collaborative research between LANL, universities and industry, allowing long-term interactions in Los Alamos for professors, students and industry visitors.
This work was supported in part by the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC52-06NA25396 with Los Alamos National Security, LLC.

References
1. Bellard, F.: Qemu, a fast and portable dynamic translator. In: Proceedings of the
Annual Conference on USENIX Annual Technical Conference, ATEC 2005, p. 41.
USENIX Association, Berkeley (2005)
2. Bronevetsky, G., Laguna, I., Bagchi, S., de Supinski, B., Schulz, M., Anh, D.: Statistical fault detection for parallel applications with AutomaDeD. In: IEEE Workshop on Silicon Errors in Logic - System Effects, SELSE (March 2010)
3. Bronevetsky, G., de Supinski, B.: Soft error vulnerability of iterative linear algebra
methods. In: Workshop on Silicon Errors in Logic - System Effects, SELSE (April
2007)
4. Bronevetsky, G., de Supinski, B.R., Schulz, M.: A foundation for the accurate prediction of the soft error vulnerability of scientific applications. In: IEEE Workshop on Silicon Errors in Logic - System Effects (March 2009)
5. Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., Snir, M.: Toward exascale
resilience. International Journal of High Performance Computing Applications 23,
374–388 (2009)
6. DeBardeleben, N., Laros, J., Daly, J., Scott, S., Engelmann, C., Harrod, B.: High-
end computing resilience: Analysis of issues facing the hec community and path-
forward for research and development (December 2009),
http://institute.lanl.gov/resilience/docs/HECResilience.pdf
7. Dongarra, J., et al.: The international exascale software project roadmap. Interna-
tional Journal of High Performance Computing Applications 25, 3–60 (2011)
8. Kogge, P., et al.: Exascale computing study: Technology challenges in achieving
exascale systems (2008)
9. Naughton, T., Bland, W., Vallee, G., Engelmann, C., Scott, S.L.: Fault injection
framework for system resilience evaluation: fake faults for finding future failures. In:
Proceedings of the 2009 Workshop on Resiliency in High Performance, Resilience
2009, pp. 23–28. ACM, New York (2009)
10. Quinn, H., Graham, P.: Terrestrial-based radiation upsets: A cautionary tale. In:
Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Cus-
tom Computing Machines, pp. 193–202. IEEE Computer Society, Washington, DC
(2005)
11. Song, D., Brumley, D., Yin, H., Caballero, J., Jager, I., Kang, M.G., Liang, Z., New-
some, J., Poosankam, P., Saxena, P.: A high-level overview covering vine, temu,
and rudder. In: Proceedings of the 4th International Conference on Information
Systems Security (December 2008)
12. Stott, D., Floering, B., Burke, D., Kalbarczyk, Z., Iyer, R.: NFTAPE: a framework
for assessing dependability in distributed systems with lightweight fault injectors.
In: Proceedings of IEEE International Computer Performance and Dependability
Symposium, IPDS 2000, pp. 91–100 (2000)
13. Ziegler, J.F., Lanford, W.A.: The effect of sea level cosmic rays on electronic devices. Journal of Applied Physics 52 (1981)
High Availability on Cloud with HA-OSCAR

Thanadech Thanakornworakij1, Rajan Sharma1, Blaine Scroggs1,
Chokchai (Box) Leangsuksun1, Zeno Dixon Greenwood1,
Pierre Riteau2,3, and Christine Morin3

1 College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA
  {tth010,rsh018,bas031,box}@latech.edu, greenw@phys.latech.edu
2 Université de Rennes 1, IRISA
3 INRIA Rennes – Bretagne Atlantique
  Pierre.Riteau@irisa.fr, Christine.Morin@inria.fr

Abstract. Cloud computing provides virtual resources so that end-users or organizations can buy computing power as a public utility. Cloud service
providers however must strive to ensure good QoS by offering highly available
services with dynamically scalable resources. HA-OSCAR is an open source
High Availability (HA) solution for HPC/cloud that offers component
redundancy, failure detection, and automatic fail-over. In this paper, we
describe HA-OSCAR as a cloud platform and analyze system availability of
two potential cloud computing systems, OSCAR-V cluster and HA-OSCAR-V.
We also explore our case study to improve Nimbus, a popular cloud IaaS
toolkit. The results show that the system that deploys HA-OSCAR has a
significantly higher degree of availability.

Keywords: HA-OSCAR, High Availability, OSCAR-V.

1 Introduction

Cloud computing refers to a service-oriented paradigm in which service providers offer computing resources such as hardware, software, storage and platforms as services, according to the demands of the user. Cloud computing increases the utilization of available computing resources and reduces the burden and responsibilities of end-users, who rent resources rather than own them, thus increasing economic efficiency [1]. Cloud computing pools computing resources and manages them automatically through dynamic provisioning, often using virtualized resources. Users or client companies do not deal with software and hardware administrative issues, as they can buy these virtual resources from cloud service providers according to their needs [2]. The focus of cloud computing is to provide easy, secure, fast, convenient and inexpensive computing and data storage services over the Internet. This, however, transfers the corresponding responsibilities to service providers, who must ensure QoS.
HA systems are increasingly vital due to their ability to sustain critical services for users. HA services are particularly important in clouds because
companies and users depend on these cloud providers for their critical data. In order
for cloud computing to be effective in business, scientific research and other domains, high availability is a must. Thus, we foresee that it is critically important to enable cloud infrastructure with HA.
The HA-OSCAR [7] project originally grew out of the OSCAR (Open Source Cluster Application Resources) project. OSCAR is a cluster software stack that provides a high performance computing runtime and tools for cluster computing [5]. The HA-OSCAR project was formed to leverage existing OSCAR technology and to provide high-availability capabilities in OSCAR clusters. HA-OSCAR introduces several enhancements and new features to OSCAR, mainly in the areas of availability, scalability and security [5],[11]. Initially, HA-OSCAR [8] only supported OSCAR clusters; however, the current version supports most Linux-based IT infrastructures, not just OSCAR clusters. Thus, HA-OSCAR is a capable cloud platform that provides not only scalability, via cluster computing with OSCAR or the like, but also HA.
In this paper we introduce a new system, HA-OSCAR 2.0 [14] (an open-solution HA-enabling framework for mission-critical systems), capable of enhancing HA in cloud platforms by adopting component redundancy to eliminate single points of failure. System-critical resources are replicated so that the failure of any one resource will not take down the entire system, thereby making the cloud infrastructure highly available, especially at the head node. Some of the new and improved features this version incorporates are self-healing mechanisms, failure detection, automatic synchronization, and fail-over and fail-back functionality [7], [14].

2 Related Work

OSCAR-V [6] provides an enhanced set of tools and packages for the creation, deployment and management of virtual machines and host operating systems within a physical cluster. Virtualization in OSCAR clusters is needed to decouple the operating system, customize the execution environment, and provide computing tailored to the needs of the user; this makes it well suited to HPC cloud computation. OSCAR-V uses Xen as the virtualization solution and provides V2M for virtual machine management on physical clusters.
Another interesting and highly scalable cluster solution in cloud computing is Rocks+ [9]. It can be used for running a public cloud or setting up an internal private cloud. Rocks+ is based upon the well-known Rocks software. Rocks contains all the necessary software components required to easily build or maintain a cluster or cloud. Rocks+ can manage an entire data center, running all the computational resources and services necessary to operate a cloud infrastructure from a single management point. Rocks+, with its pre-packaged Rolls software, allows users to build web servers, database servers, and compute servers in the cloud. Rocks+ also provides a framework for user-specific needs on clouds. Rocks+ provides CPU and GPU cluster management in less time and with lower costs through software automation [10].
Availability is a crucial factor in cloud computing services. Many studies attempt to balance HA against cloud system performance and cost. Jung et al. [3] studied a replication technique to guarantee HA while maximizing performance on a given number of resources. In [3], replication of software components is used to provide HA. In the case of a hardware failure, they use component redundancy and regenerate the software components on the remaining resources to achieve HA and optimize performance, based on a queuing model with different "mean time between failures" (MTBF) and "mean time to repair" (MTTR) values.

3 HA-OSCAR 2.0

HA-OSCAR 2.0 is an open source framework that provides HA for mission critical
applications and ease of system provisioning and administration. The main goal of the
new HA-OSCAR Project is to provide an open solution, with improved flexibility,
which seeks to combine the power of HA and High Performance Computing (HPC)
solutions. It is capable of enhancing HA for potential cloud computing infrastructure
such as web services, and HPC clouds by providing the much-needed redundancy for
mission critical applications. To achieve HA, HA-OSCAR 2.0 uses HATCI (High
Availability Tools Configuration and Installation). HATCI is composed of three
components: Node Redundancy, Service Redundancy and Data Replication Services.
The installation process requires just a few steps with minimum user input. HA-
OSCAR 2.0 incorporates a feature to clone the system in the installation step to make
the data and software stacks consistent on the standby head node or a cloud gateway.
If the primary component fails, the cloned node takes over the responsibilities. HA-
OSCAR also features monitoring services with a flexible event-driven rule-based
system. Moreover, it provides data synchronization between the primary and
secondary system. All of these features are enabled in the installation process.

3.1 HA-OSCAR Software Architecture

HA-OSCAR 2.0 combines many existing technology packages to provide an HA solution. The HA-OSCAR software architecture has three components. The first
component is IP monitoring using Heartbeat. Heartbeat is a service designed to detect failures of physical components such as the network and IP services. Heartbeat uses a virtual IP address to make the fail-back mechanism possible. When the primary node is not in a healthy state, Heartbeat handles the IP fail-over and fail-back mechanisms. The second component is the service monitor, MONIT. MONIT is a small, lightweight service used to monitor the health of important services to make them highly available. MONIT will attempt to restart a failing service up to four times by default (this is tunable). If the service cannot be successfully restarted, MONIT sends a message to Heartbeat to trigger fail-over. This behavior can be customized to meet the needs of the user. The third component, data synchronization, is provided by a service called HA-filemon. HA-filemon is a daemon that monitors changes made in given directory trees and provides a replication service via rsync
accordingly. This is a new module that was not available in earlier versions of HA-OSCAR. During a fail-back event, data is synchronized from the standby server back to the primary server. This backwards synchronization propagates changes made to the secondary server while it acted as the head node. By default, HA-filemon invokes rsync 2 minutes after it detects the first change in the monitored files, to allow groups of changes to be transmitted together. Users can change this interval according to the needs of their applications. SystemImager is used for cloning the primary node during the installation process. It creates a standby head node image from the primary server. Finally, HA-OSCAR 2.0 supports virtual machine management via integration with OSCAR-V.
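As an illustration of the restart-then-fail-over policy described above, the following minimal Python sketch monitors a service, restarts it up to four times, and then triggers fail-over; the systemctl commands, the httpd service name, and the trigger_failover.sh helper script are illustrative assumptions and are not part of HA-OSCAR, which uses MONIT and Heartbeat for these tasks.

    import subprocess
    import time

    CHECK_CMD = ["systemctl", "is-active", "--quiet", "httpd"]   # hypothetical service
    RESTART_CMD = ["systemctl", "restart", "httpd"]
    FAILOVER_CMD = ["/usr/local/bin/trigger_failover.sh"]        # assumed helper script
    MAX_RESTARTS = 4           # MONIT's default retry count, as described above
    CHECK_INTERVAL = 30        # seconds between health checks

    def service_healthy():
        return subprocess.run(CHECK_CMD).returncode == 0

    def monitor_loop():
        failures = 0
        while True:
            if service_healthy():
                failures = 0
            elif failures < MAX_RESTARTS:
                failures += 1
                subprocess.run(RESTART_CMD)      # try to heal the service locally
            else:
                subprocess.run(FAILOVER_CMD)     # give up and hand over to the standby node
                return
            time.sleep(CHECK_INTERVAL)

    if __name__ == "__main__":
        monitor_loop()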
The incorporation of all the above services endows HA-OSCAR 2.0 with the
ability to provide true HA and high performance for cloud users, for whom critical
services must be guaranteed. Thus HA-OSCAR 2.0 is a potentially viable open source
solution for achieving HA in cloud computing.
A mission-critical web service is an example where HA-OSCAR 2.0 can be applied to a cloud. Hardware or software failures and routine maintenance are potential causes of service unavailability. Installing HA-OSCAR to provide component redundancy can alleviate this problem. During the installation of HA-OSCAR 2.0 for web services, a clone of the primary web server is made that acts as a standby server, and data synchronization is maintained between the primary and standby servers for consistency. The primary web server receives requests from clients and serves the requests directly or reroutes them to a web farm via LVS [15] or the like. When a failure occurs on the primary web server, the standby web server takes over as the primary web server. It is automatically configured with the same IP address as the primary web server, so that all requests are redirected to the standby server at the same advertised cloud address, making the web service highly available. When the primary web server is available again and the repair is complete, by default this server becomes the standby server. If users need the repaired server to be the primary server, they have to run the fail-back script to make it act as the primary web server.

4 System Architecture

In this section, we first examine the OSCAR-V architecture as a potential HPC cloud, and its anatomy, in order to identify single-point-of-failure components. This provides an opportunity to introduce system-level redundancy that improves availability over the existing OSCAR-V cluster framework. Only a brief description of the proposed architecture is given here; additional HA-OSCAR details may be found in [14].

4.1 OSCAR-V Cluster Architecture

Figure 1 shows the architecture of an OSCAR-V cluster system. Each individual machine within a cluster is referred to as a node. There are two types of nodes: head nodes and compute nodes. The head node serves requests and routes
appropriate tasks to compute nodes. Compute nodes are primarily dedicated to
computation. The present OSCAR-V cluster architecture consists of a single server
node and a number of client nodes, where all the client nodes can be virtualized using the Xen virtualization solution.

Fig. 1. OSCAR-V Architecture

4.2 HA-OSCAR-V Cluster Architecture

As a proof-of-concept for HPC clouds, we employ OSCAR-V as a solution enabling users to deploy virtualized clusters. In order to support HA requirements, clustered
systems must provide ways to eliminate single-point-of-failures. Hardware and
network redundancy are common techniques utilized for improving the availability of
computer systems. To achieve our cloud platform, we must first provide a redundant
cluster head node. The dual master head nodes will replicate the data and
configurations by using Rsync. In the event of a head node outage, all functionalities
provided by the primary head node will fail over to the second, redundant head node.
An additional HA functionality supported in HA-OSCAR is that of providing a high-
availability network via redundant Ethernet ports on every machine in addition to
duplicate switching fabrics (network switches, cables, etc.) for the entire network
configuration.
Figure 2 shows the typical HPC cloud architecture enabled by the HA-OSCAR solution. Each of the redundant server nodes is automatically installed and configured, and is connected to an external network by two or more different links. These redundant links keep the system connected to the external environment if one of them fails. Within the cluster network, each server node is also connected to one or more switches, and each client node is connected to both switches, optionally providing two redundant switching fabrics.
Fig. 2. A typical HA-OSCAR-V cluster system
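The data and configuration replication between the dual head nodes can be pictured with the following minimal Python sketch, which pushes a set of directory trees to the standby head node with rsync; the host name and directory list are illustrative assumptions, and the actual HA-OSCAR installation determines which paths are replicated.

    import subprocess

    STANDBY_HOST = "standby-head"                      # assumed standby head node name
    REPLICATED_DIRS = ["/etc/", "/var/lib/oscar/"]     # illustrative directory trees

    def replicate(dirs, standby):
        # Mirror each directory tree onto the standby node; --delete keeps the
        # replica identical to the primary copy.
        for d in dirs:
            subprocess.run(["rsync", "-az", "--delete", d, standby + ":" + d], check=True)

    if __name__ == "__main__":
        replicate(REPLICATED_DIRS, STANDBY_HOST)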

4.3 Case Study: HA-OSCAR-Enabled Nimbus


The Nimbus IaaS toolkit [16] is software that allows deploying an Infrastructure-as-a-Service cloud, similar to what the Amazon EC2 platform offers. Users interact with a central service in order to request virtual machines (individual ones or clusters), get their status, terminate them, etc. Additionally, they interact with a storage repository to upload and download virtual machine images. This repository is named Cumulus and implements the Amazon S3 API. Hence, fault-tolerance capabilities for these two components are vital requirements for ensuring the high availability of a Nimbus cloud's crucial services. In case of a failure, users are unable to monitor and manage their virtual machines or modify their status.
Using HA-OSCAR's three main services, namely replication, monitoring and synchronization, it would be possible to provide a standby Nimbus service and to synchronize the Nimbus configuration files, internal databases (where information about users and VMs is persisted), and repository content (such as newly uploaded VM images). When a failure occurs, HA-OSCAR would fail over to the standby node in order to ensure the availability of these critical services. For now, we consider only high availability of the Nimbus service and repository; providing transparent high availability of the VMs running in the cloud is a more complex problem, as shown by previous research [17].

5 System Model

In this section, we evaluate the availability improvement obtained when deploying HA-OSCAR on a regular cluster cloud. We first define the state-space reliability model [13] for system availability evaluation. In our case, we consider an OSCAR-V and an HA-OSCAR integrated solution. Our analysis focuses on servers and switches, which dominate cluster availability, since the many more compute nodes in the HPC
cloud do not constitute a single point of failure. We made several assumptions for the state-space model, as follows:
• Time to failure for both virtualization system servers and switches is exponentially distributed, with parameters λv for the servers and λw for the switches, respectively. We treat a virtualization system server, i.e., the virtualization layer together with its physical server, as one component.
• Failed components can be repaired.
• Times to repair for a server and switches are exponentially distributed with
parameters μ and β.
• When the system is down, no further failure can take place. Hence, for the
OSCAR-V cluster, when the server is down, no further failure can take place
on the switch. Similarly, when the switch is down, no further failure can take
place on the server. For HA-OSCAR-V clusters, when both servers are down,
no further failure can take place on the switches. Similarly, when both
switches are down, no further failure can take place on the HA-OSCAR-V
cluster.

5.1 OSCAR-V Cluster System Model


Figure 3 shows the Continuous Time Markov Chain (CTMC) model [13] corresponding to the OSCAR-V cluster system. In state 1, both the server node and the switch are functioning properly. The transition to state 2 occurs if the switch fails, and the transition from state 1 to state 3 occurs when the server fails. The system is available for service in state 1 and unavailable in states 2 and 3. The system goes from state 1 to state 2 when a switch failure occurs at rate λw, and from state 1 to state 3 when a server failure occurs at rate λv. After switch recovery at rate β, the system returns to state 1 from state 2. Likewise, after server recovery at rate μ, the system returns to state 1 from state 3. To have HA, we must try to keep the system in state 1 as long as possible.

Fig. 3. CTMC diagram for OSCAR-V cluster system

5.2 HA-OSCAR-V Cluster System Model

Figure 4 shows the CTMC model [12], [13] corresponding to the HA-OSCAR-V cluster system. Table 1 shows the states, the number of operating components, and the corresponding system status. The system is available for service in states 1, 2, 4 and 5, and is unavailable in states 3, 6, 7 and 8. The system moves from one state to another at the rates shown on the arrows in Figure 4.
Fig. 4. CTMC diagram for HA-OSCAR-V cluster system

Table 1. System status

State Number | Operating Servers | Operating Switches | System Status
      1      |         2         |          2         |      Up
      2      |         2         |          1         |      Up
      3      |         2         |          0         |      Down
      4      |         1         |          2         |      Up
      5      |         1         |          1         |      Up
      6      |         1         |          0         |      Down
      7      |         0         |          2         |      Down
      8      |         0         |          1         |      Down

6 Availability Analysis

Let πi be the steady-state probability of state i of the CTMC. The steady-state probabilities satisfy

    πQ = 0   and   Σi πi = 1,

where Q is the infinitesimal generator matrix [13]. Let U be the set of up states; the availability of the system is then

    A = Σ_{i∈U} πi .
6.1 OSCAR-V Cluster System Analysis


We calculate the steady-state probabilities from the balance equations. For this model, state 1 is the only state available for service, so the steady-state availability is

    A = π1 = 1 / (1 + λv/μ + λw/β) .    (1)
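As a numerical cross-check, the following minimal NumPy sketch builds a generator matrix for the three-state OSCAR-V model of Figure 3, using our reading of the transition rates, and solves the balance equations; since the exact generator and rounding used by the authors are not reproduced here, the value it computes may differ slightly from the figures quoted in Section 6.3.

    import numpy as np

    lam_v, lam_w = 0.001, 0.0005      # server and switch failure rates (1/hr), Section 6.3
    mu, beta = 0.5, 1.0               # server and switch repair rates (1/hr)

    # Infinitesimal generator Q for states (1: all up, 2: switch down, 3: server down);
    # each row sums to zero.
    Q = np.array([
        [-(lam_w + lam_v), lam_w, lam_v],
        [beta,            -beta,  0.0  ],
        [mu,               0.0,  -mu   ],
    ])

    # Solve pi Q = 0 together with sum(pi) = 1 by replacing one balance
    # equation with the normalization condition.
    A = np.vstack([Q.T[:-1], np.ones(3)])
    pi = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))

    availability = pi[0]                               # state 1 is the only up state
    downtime_hours_per_year = (1.0 - availability) * 8760.0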

6.2 HA-OSCAR-V Cluster System Analysis


Considering the operating states, i.e., the states in which the system still responds to service requests, we compute the availability from the steady-state probabilities obtained from the balance equations. We then have the steady-state availability

    A = π1 + π2 + π4 + π5 .    (2)

6.3 HA Comparison
We assume that λv = 0.001 hr^-1, λw = 0.0005 hr^-1, μ = 0.5 hr^-1, and β = 1.0 hr^-1. With these equations [13], we can calculate the availability of each system. The availability of the OSCAR-V server cluster is 0.996, and the availability of the HA-OSCAR-V cluster system is 0.99999. The corresponding downtimes of the two systems over a year are 39.2 hours and 4.45 minutes, respectively. Typically, an HA system is one whose availability is at least "five nines" (99.999%), i.e., whose downtime does not exceed a few minutes per year.

7 Conclusions and Future Work


HA is an important factor for cloud service providers in ensuring QoS and meeting SLAs. HA-OSCAR aims to improve the HA of any Linux-based cloud computing platform. The results of our analysis and comparison of the experimental and theoretical HA of OSCAR-V and HA-OSCAR-V cluster systems show that the availability of HA-OSCAR-V cluster systems is significantly higher than that of OSCAR-V cluster systems. HA-OSCAR 2.0 can now improve the HA of any cluster or enterprise system, not only OSCAR or OSCAR-V clusters but also any Linux system based on Debian distributions. We plan to explore additional ways to extend the availability, robustness and reliability of an HA-OSCAR system to other cloud environments, including storage clouds.

References
1. Zhang, S., Zhang, S., Chen, X., Huo, X.: Cloud Computing Research and Development
Trend. In: Second International Conference on Future Networks, ICFN 2010, January 22-
24, pp. 93–97 (2010)
2. Zhang, S., Zhang, S., Chen, X., Wu, S.: Analysis and Research of Cloud Computing
System Instance. In: 2010 Second International Conference on Future Networks, ICFN
2010, pp. 88–92 (2010)
3. Jung, G., Joshi, K.R., Hiltunen, M.A.: Performance and Availability Aware Regeneration
for Cloud Based Multitier Application. In: Dependable Systems and Networks (DSN), pp.
497–506 (2010)
4. Cully, B., Lefebvre, G., Meyer, D., Feeley, M., Hutchinson, N., Warfield, A.: Remus:
High Availability via Asynchronous Virtual Machine Replication. In: 5th USENIX
Symposium on Networked Systems Design and Implementation (2008)
5. Brim, M.J., Mattson, T.G., Scott, S.L.: OSCAR: Open Source Cluster Application
Resources. In: Ottawa Linux Symposium 2001, Ottawa, Canada (2001)
6. OSCAR-V, http://www.csm.ornl.gov/srt/oscarv/
7. Leangsuksun, C., Liu, T., Scott, S.L., Libby, R., Haddad, I., et al.: HA-OSCAR Release
1.0: Unleashing HA-Beowulf. In: International Symposium on High Performance
Computing Systems (HPCS), Canada (May 2004)
8. Haddad, I., Leangsuksun, C., Scott, S.L.: HA-OSCAR: the birth of highly available
OSCAR. Linux J. 2003(115), 1 (2003)
9. Rocks+, http://www.clustercorp.com/
10. http://www.hpcwire.com/offthewire/
Clustercorp-Brings-Rocks-to-the-Cloud-108706864.html
11. Leangsuksun, C.B., Shen, L., Liu, T., Scott, S.L.: Achieving HA and performance
computing with an HA-OSCAR cluster. Future Generation Computer Systems 21(4), 597–
606 (2005)
12. Leangsuksun, C., Shen, L., Song, H., Scott, S.L., Haddad, I.: The Modeling and
Dependability Analysis of High Availability OSCAR Cluster. In: The 17th Annual
International Symposium on High Performance Computing Systems and Applications,
Quebec, Canada, pp. 11–14 (May 2003)
13. Trivedi, K.S.: Probability and Statistics with Reliability, Queuing, and Computer Science
Applications. John Wiley and Sons, New York (2001)
14. HA-OSCAR 2.0, http://hpci.latech.edu/blog/?page_id=45
15. Linux Virtual Server (LVS), http://www.linuxvirtualserver.org/
16. Nimbus, http://www.nimbusproject.org
17. Nicholas, B., Papaioannou, T.G., Aberer, K.: An Economic Approach for Scalable and
Highly-Available Distributed Applications. In: IEEE International Conference on Cloud
Computing (2010)
On the Viability of Checkpoint Compression
for Extreme Scale Fault Tolerance

Dewan Ibtesham1, Dorian Arnold1, Kurt B. Ferreira2,*, and Patrick G. Bridges1

1 University of New Mexico, Albuquerque, NM, USA
  {dewan,darnold,bridges}@cs.unm.edu
2 Sandia National Laboratories, Albuquerque, NM
  kbferre@sandia.gov

Abstract. The increasing size and complexity of high performance computing systems have led to major concerns over fault frequencies and
the mechanisms necessary to tolerate these faults. Previous studies have
shown that state-of-the-field checkpoint/restart mechanisms will not scale
sufficiently for future generation systems. In this work, we explore the
feasibility of checkpoint data compression to reduce checkpoint commit
latency and storage overheads. Leveraging a simple model for check-
point compression viability, we conclude that checkpoint data compres-
sion should be considered as a part of a scalable checkpoint/restart solu-
tion and discuss additional scenarios and improvements that may make
checkpoint data compression even more viable.

Keywords: Checkpoint data compression, extreme scale fault-tolerance, checkpoint/restart.

1 Introduction
Over the past few decades, high-performance computing (HPC) systems have
increased dramatically in size, and these trends are expected to continue. On the most recent Top 500 list [27], 223 (or 44.6%) of the 500 entries have more than 8,192 cores, compared to 15 (or 3.0%) just five years ago. Also from this most
recent listing, four of the systems are larger than 200K cores; an additional six are
larger than 128K cores, and another six are larger than 64K cores. The Lawrence
Livermore National Laboratory is scheduled to receive its 1.6 million core system,
Sequoia [2], this year. Furthermore, future extreme systems are projected to have
on the order of tens to hundreds of millions of cores by 2020 [14].
It also is expected that future high-end systems will increase in complexity;
for example, heterogeneous systems like CPU/GPU-based systems are expected
to become much more prominent. Increased complexity generally suggests that individual components likely will be more failure prone. Increased system sizes also will contribute to extremely low mean times between failures (MTBF), since MTBF is inversely proportional to system size. Recent studies indeed conclude that system failure rates depend mostly on system size, particularly the number of processor chips in the system. These studies also conclude that if the current HPC system growth trend continues, the expected system MTBF for the biggest systems on the Top 500 list will fall below 10 minutes in the next few years [10,26].

* Sandia National Laboratories is a multiprogram laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
Checkpoint/restart [5] is perhaps the most commonly used HPC fault-tolerance
mechanism. During normal operation, checkpoint/restart protocols periodically
record process (and communication) state to storage devices that survive tol-
erated failures. Process state comprises all the state necessary to run a process
correctly including its memory and register states. When a process fails, a new
incarnation of the failed process is resumed from the intermediate state in the
failed process’ most recent checkpoint – thereby reducing the amount of lost com-
putation. Rollback recovery is a well studied, general fault tolerance mechanism.
However, recent studies [7,10] predict poor utilizations (approaching 0%) for ap-
plications running on imminent systems and the need for resources dedicated to
reliability.
If checkpoint/restart protocols are to be employed for future extreme scale
systems, checkpoint/restart overhead must be reduced. For the checkpoint com-
mit problem, saving an application checkpoint to stable storage, we can consider
two sets of strategies. The first set of strategies hide or reduce commit laten-
cies without actually reducing the amount of data to commit. These strategies
include concurrent checkpointing [17,18], diskless checkpointing [22] and check-
pointing filesystems [3]. The second set of strategies reduce commit latencies
by reducing checkpoint sizes. These strategies include memory exclusion [23],
incremental checkpointing [6] and multi-level checkpointing [19].
This work falls under the second set of strategies. We focus on reducing the
amount of checkpoint data, particularly via checkpoint compression. We have
one fundamental goal: to understand the viability of checkpoint compression
for the types of scientific applications expected to run at large scale on future
generation HPC systems. Using several mini-applications or mini apps from the
Mantevo Project [12] and the Berkeley Lab Checkpoint/Restart (BLCR) frame-
work [11], we explore the feasibility of state-of-the-field compression techniques
for efficiently reducing checkpoint sizes. We use a simple checkpoint compression
viability model to determine when checkpoint compression is a sensible choice,
that is, when the benefits of data reduction outweigh the drawbacks of compres-
sion latency.
In the next section, we present a general background of checkpoint/restart
methods, after which we describe previous work in checkpoint compression and
our checkpoint compression viability model. In Section 3, we describe the ap-
plications, compression algorithms and the checkpoint library that comprise our
evaluation framework as well as our experimental results. We conclude with a
discussion of the implications of our experimental results for future checkpoint
compression research.
2 Checkpoint Compression

In this section, we describe the checkpoint compression viability model that we use to determine when checkpoint compression should be considered. We then
discuss previous research directly and indirectly related to our checkpoint data
compression study.

2.1 A Checkpoint Compression Viability Model


Intuitively, checkpoint compression is a viable technique when the benefits of checkpoint data reduction outweigh the drawbacks of the time it takes to reduce the checkpoint data. Our viability model is very similar to the concept offered by Plank et al. [24]. Fundamentally, checkpoint compression is viable when:

    compression latency + time to commit the compressed checkpoint < time to commit the uncompressed checkpoint

or

    |checkpoint| / compression-speed + ((1 − compression-factor) × |checkpoint|) / commit-speed < |checkpoint| / commit-speed ,

where |checkpoint| is the size of the original checkpoint, compression-factor is the percentage reduction due to data compression, compression-speed is the rate of data compression, and commit-speed is the rate of checkpoint commit (including all associated overheads). The last equation can be reduced to:

    commit-speed / compression-speed < compression-factor .    (1)
In other words, if the ratio of the checkpoint commit speed to checkpoint com-
pression speed is less than the compression factor, checkpoint data compression
provides an overall time (and space) performance reduction. Our model assumes
that checkpoint commit is synchronous; that is, the primary application pro-
cess is paused during the commit operation and is not resumed until checkpoint
commit is complete. In Section 4, we discuss the implications of this assumption.
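For illustration, the viability test of Equation (1) can be written as a one-line check; the example numbers below are hypothetical and assume both speeds are expressed in the same units (e.g., MB/s).

    def compression_viable(commit_speed, compression_speed, compression_factor):
        # Equation (1): compression pays off when commit-speed / compression-speed
        # is smaller than the achieved compression factor.
        return commit_speed / compression_speed < compression_factor

    # Hypothetical example: a 100 MB/s commit path, a 400 MB/s compressor,
    # and a 70% size reduction -> 0.25 < 0.70, so compression is viable.
    print(compression_viable(100.0, 400.0, 0.70))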

2.2 Previous Work


Li and Fuchs implemented a compiler-based checkpointing approach, which ex-
ploited compile time information to compress checkpoints [16]. They concluded
from their results that a compression factor of over 100% was necessary to
achieve any significant benefit due to high compression latencies. Plank and
Li proposed in-memory compression and showed that, for their computational platform, compression was beneficial if a compression factor greater than 19.3%
could be achieved [24]. In a related vein, Plank et al. also proposed differential compression to reduce checkpoint sizes for incremental checkpoints [25]. Moshovos and Kostopoulos used hardware-based compressors to improve checkpoint compression ratios [20]. Finally, in a related but different context, Lee et al. study compression for data migration in scientific applications [15].
Our work (currently) focuses on the use of software-based compressors for
checkpoint compression. Given recent advances in processor technologies, we
demonstrate that since processing speeds have increased at a faster rate than
disk and network bandwidth, data compression can allow us to trade faster CPU
workloads for slower disk and network bandwidth.

3 Evaluating Checkpoint Compression


The goal of this work is to evaluate the use of state-of-the-field algorithms for
compressing checkpoint data from applications that are representative of those
expected to run at large scale on current and future generation HPC systems.

3.1 The Applications


We chose four mini-applications or mini apps 1 from the Mantevo Project [12],
namely HPCCG version 0.5, miniFE version 1.0, pHPCCG version 0.4 and
phdMesh version 0.1. The first three are implicit finite element mini apps and
phdMesh is an explicit finite element mini app. HPCCG is a conjugate gradi-
ent benchmark code for a 3D chimney domain that can run on an arbitrary
number of processors. This code generates a 27-point finite difference matrix
with a user-prescribed sub-block size on each processor. miniFE mimics the fi-
nite element generation assembly and solution for an unstructured grid problem.
pHPCCG is related to HPCCG, but has features for arbitrary scalar and inte-
ger data types, as well as different sparse matrix data structures. PhdMesh is
a full-featured, parallel, heterogeneous, dynamic, unstructured mesh library for
evaluating the performance of operations like dynamic load balancing, geometric
proximity search or parallel synchronization for element-by-element operations.
In general, we chose problem sizes that would allow each application to run
long enough so that we can take at least 5 different checkpoints. Additionally,
at this preliminary stage we were not overly concerned with scaling to large
numbers of MPI processes. Primarily, we wish to observe the compressibility
of checkpoints from singleton MPI tasks. For the three implicit finite element
mini apps, we chose a problem size of 100x100x100. Both HPCCG and pHPCCG
were run with openMPI with 3 processes while miniFE was run with 2 processes.
phdMesh was run without MPI support on a problem size of 5x6x5.

1 Mini apps are small, self-contained programs that embody essential performance
characteristics of key applications.
3.2 The Checkpoint Library

The Berkeley Lab Checkpoint/Restart library (BLCR) [11], a system-level infrastructure for checkpoint/restart, is perhaps the most widely available checkpoint/restart library and is deployed on several HPC systems. For
our experiments, we obtain checkpoints using BLCR. Furthermore, we use the
OpenMPI [9] framework which has the capability to leverage BLCR for fault
tolerance.

3.3 The Compression Algorithms

For this study, we focused on the popular compression algorithms investigated in Morse's comparison of compression tools [13]. We settled on the following subset, which performed well in preliminary tests2:
– zip: ZIP is an implementation of Deflate [4], a lossless data compression al-
gorithm that uses the LZ77 [28] compression algorithm and Huffman coding.
It is highly optimized in terms of both speed and compression efficiency. The
ZIP algorithm treats all types of data as a continuous stream of bytes. Within
this stream, duplicate strings are matched and replaced with pointers fol-
lowed by replacing symbols with new, weighted symbols based on frequency
of use.
We vary zip’s parameter that toggles the tradeoff between compression factor
and compression latency. This integer parameter ranges from zero to nine,
where zero means fastest compression speed and nine means best compres-
sion factor. In our charts we use the label zip(x), where x is the value of
this parameter.
– 7zip [1]: 7zip is based on the Lempel-Ziv-Markov chain algorithm (LZMA) [21].
This algorithm uses a dictionary compression scheme similar to LZ77 and has
a very high compression ratio.
– bzip2: BZIP2 is an implementation of the Burrows-Wheeler transform [8], which utilizes a technique called block-sorting to permute the sequence of bytes into an order that is easier to compress. The algorithm converts frequently-recurring character sequences into strings of identical letters and then applies a move-to-front transform and Huffman coding.
We vary bzip2's compression performance by varying the block size for the Burrows-Wheeler transform. The respective integer parameter ranges in value from zero to nine; a smaller value specifies a smaller block size. In
our charts, we use the label bzip2(x), where x is the value of this parameter.
– pbzip2 [8]: pbzip2 is a parallel implementation of bzip2. pbzip2 is multi-
threaded and, therefore, can leverage multiple processing cores to improve
compression latency. The input file to be compressed is partitioned into
multiple files that can be compressed concurrently.

2 We do not present results for several other algorithms, for example gzip, that did
not perform well.
We vary two pbzip2 parameters. The first parameter is the same block size
parameter as in bzip2. The second parameter defines the file block size into
which the original input file is partitioned. This is labeled as pbzip2(x, y),
where x is the value of the first parameter and y is the value of the second
parameter.
– rzip: Rzip uses a very large buffer to take advantage of redundancies that
span very long distances. It finds and encodes large chunks of duplicate data and then uses bzip2 as a backend to compress the encoding.
We vary rzip’s parameter, which toggles the tradeoff between compression
factor and compression latency. As was the case for zip, this integer parame-
ter ranges from zero to nine, where one means fastest compression speed and
nine means best compression factor. In our charts we use the label rzip(x),
where x is the value of this parameter.

3.4 The Tests

Each test consists of a mini app, a parameterized compression algorithm3, and five successive checkpoints. For HPCCG the checkpoint interval was 5 seconds,
for miniFE and pHPCCG it was 3 seconds and for phdMesh the 5 checkpoints
were taken randomly. There was no particular logic in varying the checkpoint
interval except for making sure to have the checkpoints spread uniformly across
the execution time of the application. The BLCR library is used to collect the
mini app checkpoints, and then we use the selected algorithms to perform check-
point compressions. While checkpoints were being compressed, the system was
not doing any additional work.
For testing, we used a 64-bit four-core Intel Xeon processor with a clock
speed of 2.33 GHz and 2 GB of memory running a Linux 2.6.32 kernel. Our
software stack consists of OpenMPI-1.4.1 configured with BLCR version 0.8.2.
The compression tools used were ZIP 3.0 by Info-ZIP, rzip version 2.1, bzip2
1.0.5, PBZIP2 1.0.5 and p7zip.
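The per-checkpoint metrics reported in the next subsection (compression factor and compression speed) can be gathered with a small script along the following lines; this is a simplified sketch using Python's built-in bzip2 bindings rather than the command-line tools listed above, and the checkpoint file name in the comment is hypothetical.

    import bz2
    import time

    def measure_compression(path, level=9):
        # Compress one checkpoint file in memory with bzip2 and report the
        # compression factor and speed, mirroring the metrics of Section 3.5.
        with open(path, "rb") as f:
            data = f.read()
        start = time.time()
        compressed = bz2.compress(data, compresslevel=level)
        elapsed = time.time() - start
        factor = 1.0 - len(compressed) / len(data)            # fraction of bytes removed
        speed_mb_s = (len(data) / (1024 * 1024)) / elapsed    # uncompressed MB processed per second
        return factor, speed_mb_s

    # Hypothetical BLCR context file name:
    # factor, speed = measure_compression("context.12345")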

3.5 Compression Results

For each application, the average uncompressed checkpoint size ranged from 311
MB to 393 MB. Our first set of results, presented in Figure 1, demonstrates how effective the various algorithms are at compressing checkpoint data. With the exception of rzip(-0), all the algorithms achieve a very high compression factor of about 70% or higher, where the compression factor is computed as 1 − (compressed size / uncompressed size). This means, then, that the primary distinguishing factor becomes the compression speed, that is, how quickly the algorithms can compress the checkpoint data.
Figure 2 shows how long the algorithms take to compress the checkpoints. In general, and not surprisingly, the parallel implementation of bzip2, pbzip2, outperforms all the other algorithms.

Fig. 2. Checkpoint compression times for the various algorithms and applications

3 For each algorithm, a different set of parameter values constitutes a different test.
Fig. 1. Checkpoint compression ratios for the various algorithms and applications

4 Discussion

In the previous section, we presented the empirical results of our checkpoint compression.
compression. We conclude this paper with a discussion of the implications of
these results. We also discuss known limitations and shortcomings of this work
that we plan to address as we continue this project.
This work seeks to answer the question, “Should checkpoint compression be
considered as a potentially feasible optimization for large scale scientific appli-
cations?” Based on our preliminary experiments, we believe the answer to this
question is “Yes.” Based on Equation 1, if the checkpoint commit speed (or throughput) is less than the product of the compression factor and the compression speed, checkpoint compression will provide a time (and space) performance benefit. Figure 3 shows this product as derived from the data in Section 3. Even with many optimizations and high-performance parallel file systems that stripe large writes simultaneously across many disks and file servers, it is difficult to achieve disk commit bandwidths beyond a few Gigabits/second. Figure 3 shows that, with basic compression tools like pbzip2, a file system must achieve a per-process bandwidth on the order of 14 Gigabits/second, and as much as 56 Gigabits/second, to compete with the checkpoint compression strategy. Furthermore, we can explore additional strategies, like using multicore CPUs or even GPUs, to accelerate compression performance.

Fig. 3. Checkpoint compression viability: unless the checkpoint commit rate exceeds the compression speed × compression factor product (y-axis), checkpoint compression is a good solution

4.1 Current Limitations

While the results of this preliminary study are promising, we observe several
shortcomings that we plan to address. These shortcomings include:

– Testing on larger applications: while the Mantevo mini applications are meant to demonstrate the performance characteristics of their larger
counterparts, we plan to evaluate the effectiveness of checkpoint compression for these larger applications.
– Testing at larger scales: Our current tests are limited to very small scale
applications. We plan to extend this study to applications running at much
larger scales, on the order of tens or even hundreds of thousands of tasks.
Qualitatively, we expect similar results since compression effectiveness is primarily a function of the compression performance for individual process
checkpoints.
– Checkpoint intervals: For these tests, in order to keep run times manageable, we used relatively small checkpoint intervals. We plan to evaluate
whether compression effectiveness changes as applications execute for longer
times. We have no reason to expect significant qualitative differences.

Acknowledgments. This work was supported in part by Sandia National Laboratories subcontract 438290. A part of this work was performed at Sandia National Laboratories, a multiprogram laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. The authors are grateful to the members of the Scalable Systems Laboratory at the University of New Mexico and the Scalable System Software Group at Sandia National Laboratories for helpful feedback on portions of this study. We also acknowledge our reviewers for their comments and suggestions for improving this paper.

References

1. 7zip project official home page, http://www.7-zip.org


2. ASC Sequoia, https://asc.llnl.gov/computing_resources/sequoia
(visited May 2011)
3. Bent, J., Gibson, G., Grider, G., McClelland, B., Nowoczynski, P., Nunez, J., Polte,
M., Wingate, M.: Plfs: a checkpoint filesystem for parallel applications. In: Con-
ference on High Performance Computing Networking, Storage and Analysis (SC
2009), pp. 21:1–21:12. ACM, New York (2009)
4. Deutsch, P.: Deflate compressed data format specification
5. Elnozahy, E.N., Alvisi, L., Wang, Y.-M., Johnson, D.B.: A survey of rollback-
recovery protocols in message-passing systems. ACM Computing Surveys 34(3),
375–408 (2002)
6. Elnozahy, E.N., Johnson, D.B., Zwaenpoel, W.: The performance of consistent
checkpointing. In: 11th IEEE Symposium on Reliable Distributed Systems, Hous-
ton, TX (1992)
7. Elnozahy, E.N., Plank, J.S.: Checkpointing for peta-scale systems: A look into the
future of practical rollback-recovery. IEEE Transactions on Dependable and Secure
Computing 1(2), 97–108 (2004)
8. Elytra, J.G.: Parallel data compression with bzip2
9. Gabriel, E., Fagg, G.E., Bosilca, G., Angskun, T., Dongarra, J., Squyres, J.M.,
Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R.H., Daniel, D.J.,
Graham, R.L., Woodall, T.S.: Open MPI: Goals, Concept, and Design of a Next
Generation MPI Implementation. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J.
(eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 97–104. Springer, Heidelberg
(2004), doi:10.1007/978-3-540-30218-6 19
10. Gibson, G., Schroeder, B., Digney, J.: Failure tolerance in petascale computers.
CTWatch Quarterly 3(4) (November 2007)
11. Hargrove, P.H., Duell, J.C.: Berkeley lab checkpoint/restart (blcr) for linux clus-
ters. Journal of Physics: Conference Series 46(1) (2006)
12. Heroux, M.A., Doerfler, D.W., Crozier, P.S., Willenbring, J.M., Edwards, H.C.,
Williams, A., Rajan, M., Keiter, E.R., Thornquist, H.K., Numrich, R.W.: Improv-
ing performance via mini-applications. Technical Report SAND2009-5574, Sandia
National Laboratory (2009)
13. Morse Jr., K.G.: Compression tools compared (137) (September 2005)
14. Kogge, P.: ExaScale Computing Study: Technology Challenges in Achieving Ex-
ascale Systems. Technical report, Defense Advanced Research Projects Agency
Information Processing Techniques Office (DARPA IPTO) (September 2008)
15. Lee, J., Winslett, M., Ma, X., Yu, S.: Enhancing data migration performance
via parallel data compression. In: Proceedings International on Parallel and Dis-
tributed Processing Symposium, IPDPS 2002, Abstracts and CD-ROM, pp. 444–
451 (2002)
16. Li, C.-C., Fuchs, W.: Catch-compiler-assisted techniques for checkpointing. In: 20th
International Symposium on Fault-Tolerant Computing, FTCS-20, Digest of Pa-
pers, pp. 74–81 ( June 1990)
17. Li, K., Naughton, J.F., Plank, J.S.: Real-time, concurrent checkpoint for paral-
lel programs. In: 2nd ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPOPP 1990), pp. 79–88. ACM, Seattle (1990)
18. Li, K., Naughton, J.F., Plank, J.S.: Low-latency, concurrent checkpointing for par-
allel programs. IEEE Transactions on Parallel and Distributed Systems 5(8), 874–
879 (1994)
19. Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling,
and evaluation of a scalable multi-level checkpointing system. In: Proceedings of
the 2010 ACM/IEEE International Conference for High Performance Computing,
Networking, Storage and Analysis, SC 2010, pp. 1–11. IEEE Computer Society,
Washington, DC (2010)
20. Moshovos, A., Kostopoulos, A.: Cost-effective, high-performance giga-scale check-
point/restore. Technical report, University of Toronto (November 2004)
21. Pavlov, I.: Lzma sdk (software development kit) (2007)
22. Plank, J., Li, K., Puening, M.: Diskless checkpointing. IEEE Transactions on Par-
allel and Distributed Systems 9(10), 972–986 (1998)
23. Plank, J.S., Chen, Y., Li, K., Beck, M., Kingsley, G.: Memory exclusion: Optimizing
the performance of checkpointing systems. Software – Practice & Experience 29(2),
125–142 (1999)
24. Plank, J.S., Li, K.: ickp: A consistent checkpointer for multicomputers. IEEE Par-
allel & Distributed Technology: Systems & Applications 2(2), 62–67 (1994)
25. Plank, J.S., Xu, J., Netzer, R.H.B.: Compressed differences: An algorithm for fast
incremental checkpointing. Technical Report CS-95-302, University of Tennessee
(August 1995)
26. Schroeder, B., Gibson, G.A.: A large-scale study of failures in high-performance
computing systems. In: Dependable Systems and Networks (DSN 2006), Philadel-
phia, PA (June 2006)
27. Top 500 Supercomputer Sites, http://www.top500.org/ (visited September 2011)
28. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE
Transactions on Information Theory 23(3), 337–343 (1977)
Can Checkpoint/Restart Mechanisms Benefit
from Hierarchical Data Staging?

Raghunath Rajachandrasekar, Xiangyong Ouyang, Xavier Besseron,
Vilobh Meshram, and Dhabaleswar K. Panda

Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
{rajachan,ouyangx,besseron,meshram,panda}@cse.ohio-state.edu

Abstract. Given the ever-increasing size of supercomputers, fault resilience and the ability to tolerate faults have become more of a necessity than an option. Checkpoint-Restart protocols have been widely adopted as a practical solution to provide reliability. However, traditional checkpointing mechanisms suffer from a heavy I/O bottleneck while dump-
ing process snapshots to a shared filesystem. In this context, we study
the benefits of data staging, using a proposed hierarchical and modular
data staging framework which reduces the burden of checkpointing on
client nodes without penalizing them in terms of performance. During a
checkpointing operation in this framework, the compute nodes transmit
their process snapshots to a set of dedicated staging I/O servers through
a high-throughput RDMA-based data pipeline. Unlike the conventional
checkpointing mechanisms that block an application until the checkpoint
data has been written to a shared filesystem, we allow the application to
resume its execution immediately after the snapshots have been pipelined
to the staging I/O servers, while data is simultaneously being moved from
these servers to a backend shared filesystem. This framework eases the
bottleneck caused by simultaneous writes from multiple clients to the
underlying storage subsystem. The staging framework considered in this
study is able to reduce the time penalty an application pays to save a
checkpoint by 8.3 times.

Keywords: checkpoint-restart, data staging, aggregation, RDMA.

* This research is supported in part by U.S. Department of Energy grants #DE-FC02-06ER25749 and #DE-FC02-06ER25755; National Science Foundation grants #CCF-0621484, #CCF-0833169, #CCF-0916302, #OCI-0926691 and #CCF-0937842; grant from Wright Center for Innovation #WCI04-010-OSU-0.

1 Introduction

Current High-End Computing (HEC) systems operate at petaflop or multi-petaflop level. As we move towards Exaflop systems, it is becoming clear that such systems will be comprised of millions of cores and components. Although
each component has only a very small chance of failure, the combination of all
components has a much higher chance of failure. The Mean Time Between Fail-
ures (MTBF) for typical HEC installations is currently estimated to be between
eight hours and fifteen days [19,7]. In order to continue computing past the
MTBF of the system, fault-tolerance has become a necessity. The most common form of fault-tolerance solution on current generation systems is checkpointing. An application or library periodically generates a checkpoint that encapsulates its state and saves it to stable storage (usually a central parallel filesystem).
Upon a failure, the application can be rolled back to the last checkpoint.
Checkpoint/Restart support is provided by most of the commonly used MPI
stacks [8,12,6]. Checkpointing mechanisms are notorious for the heavy I/O overhead of simultaneously dumping the images of many parallel processes to a shared filesystem. Many studies have been carried out to tackle this I/O bottleneck [16,5]. For example, SCR [15] proposes a multi-level checkpoint system that stores data to local storage on each compute node and relies on redundant data copies to tolerate node failures. It requires a local disk or RAM disk to be present at each compute node to store checkpoint data. However, there are many disk-less clusters, and a memory-intensive application can effectively disable a RAM disk by using up most of the system memory. Hence its applicability is constrained.
With the rapid advances in technology, many clusters are being built with high
performance commercial components such as high-speed low-latency networks
and advanced storage devices such as Solid State Drives (SSDs). These advanced
technologies provide an opportunity to redesign existing solutions to tackle the
I/O challenges imposed by Checkpoint/Restart. In this paper, we propose a
hierarchical data staging architecture to address the I/O bottleneck caused by
Checkpoint/Restart. Specifically we want to answer several questions:
1. How to design a hierarchical data staging architecture that can relieve com-
pute nodes from the relatively slow checkpoint writing, so that applications
can quickly resume execution?
2. How to leverage high speed network and new storage media such as SSD to
accelerate staging I/O performance?
3. How much of a performance penalty will the application have to pay to adopt
such a strategy?
We have designed a hierarchical data staging architecture that uses a dedicated
set of staging server nodes to offload checkpoint writing. Experimental results
show that the checkpoint time, as it appears to the application, can be 8.3 times lower than with the basic approach, in which each application process directly writes its checkpoint to a shared Lustre filesystem.
The rest of the paper is organized as follows. In Section 2, we give background on the key components involved in our design. In Section 3, we propose our hierarchical staging design. In Section 4, we present our experiments and evaluation. Related work is discussed in Section 5, and in Section 6, we present our conclusions and future work.
Fig. 1. Comparison between the direct checkpoint and the checkpoint staging approaches

2 Background
Filesystem in Userspace (FUSE). Filesystem in Userspace (FUSE) [1] is software that allows the creation of a virtual filesystem at the user level. It relies on a kernel module to perform privileged operations at the kernel level, and provides a userspace library to communicate with this kernel module. FUSE is widely used to create filesystems that do not actually store the data themselves but rely on other resources to store it.

InfiniBand and RDMA. InfiniBand is an open standard for high-speed interconnects, which provides send-receive semantics and memory-based semantics called Remote Direct Memory Access (RDMA) [13]. RDMA operations allow a node to directly access a remote node's memory contents without using the CPU at the remote side. These operations are transparent at the remote end since they do not involve the remote CPU in the communication. InfiniBand powers many of today's Top500 supercomputers [3].

3 Detailed Design
The central principle of our Hierarchical Data Staging Framework is to provide
a fast and temporary storage area in order to absorb the I/O load burst induced
by a checkpointing operation. This fast staging area is governed by, what we
call, a Staging server. In addition to what a generic compute-node is configured
with, staging servers are over-provisioned with high-throughput SSDs and high-
bandwidth links. Given the fact that such hardware is expensive, this design
avoids the need to install them on every compute-node.
Figure 1 shows a comparison between the classic direct-checkpointing and our
checkpoint-staging approaches. On the left, with the classic approach, the check-
point files are directly written on the shared filesystem. Due to the heavy I/O
burden imposed on the shared filesystem by the checkpointing operation, the
parallel writes get multiplexed, and the aggregate throughput is reduced. This
increases the time for which the application blocks, waiting for the checkpoint-
ing operation to complete. On the right, with the staging approach, the staging

Fig. 2. Design of Hierarchical Data Staging Framework

nodes are able to quickly absorb the large amount of data thrust upon them
by the client nodes, with the help of the scratch space provided by the staging
servers. Once the checkpoint data has been written to the staging nodes, the
application can resume. The data transfer between the staging servers and
the shared filesystem then takes place in the background and overlaps with the
computation. Hence, this approach reduces the idle time of the application due
to the checkpoint. Regardless of which approach is chosen to write the checkpoint
data, it eventually has to reach the same media.
We have designed and developed an efficient software subsystem which can
handle large, concurrent snapshot writes from typical rollback recovery protocols
and can leverage the fast storage services provided by the staging server. We use
this software subsystem to study the benefits of hierarchical data staging in
Checkpointing mechanisms.
Figure 2 shows a global overview of our Hierarchical Data Staging Framework
which has been designed for use with these staging nodes. A group of clients,
governed by a single staging server, represents a staging group. These staging
groups are building blocks of the entire architecture. Our design imposes no
restriction on the number of blocks that can be used in a system. The internal
interactions between the compute nodes and a staging server are illustrated for
one staging group in the figure.
With the proposed design, neither the application nor the MPI stack needs
to be modified to utilize the staging service. We have developed a virtual filesys-
tem based on FUSE [1] to provide this convenience. The applications that run
on compute nodes can access this staging filesystem just like any other local
filesystem. FUSE provides the ability to intercept standard filesystem calls such
as open(), read(), write(), close() etc., and manipulate the data as needed at
user-level, before forwarding the call and the data to the kernel. This ability is
exploited to transparently send the data to the staging area, rather than writing
to the local or shared filesystem.
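
To make the interception mechanism concrete, the following is a minimal FUSE sketch in C, not the authors' actual filesystem: it registers a write() handler that, instead of touching a local or shared filesystem, hands the data to a hypothetical stage_buffer() helper standing in for the staging path.

  /* Minimal FUSE sketch of user-level write() interception (illustrative only).
   * stage_buffer() is a hypothetical hook standing in for the logic that
   * forwards checkpoint data towards the staging server. */
  #define FUSE_USE_VERSION 26
  #include <fuse.h>
  #include <sys/stat.h>
  #include <string.h>
  #include <errno.h>

  extern int stage_buffer(const char *path, const char *buf,
                          size_t size, off_t offset);   /* assumed helper */

  static int stg_getattr(const char *path, struct stat *st)
  {
      memset(st, 0, sizeof(*st));
      if (strcmp(path, "/") == 0) { st->st_mode = S_IFDIR | 0755; st->st_nlink = 2; }
      else                        { st->st_mode = S_IFREG | 0644; st->st_nlink = 1; }
      return 0;
  }

  static int stg_open(const char *path, struct fuse_file_info *fi)
  {
      (void)path; (void)fi;
      return 0;                               /* accept every open() */
  }

  static int stg_write(const char *path, const char *buf, size_t size,
                       off_t offset, struct fuse_file_info *fi)
  {
      (void)fi;
      /* Redirect the intercepted data to the staging area instead of disk. */
      if (stage_buffer(path, buf, size, offset) < 0)
          return -EIO;
      return (int)size;                       /* report the write as completed */
  }

  static struct fuse_operations stg_ops = {
      .getattr = stg_getattr,
      .open    = stg_open,
      .write   = stg_write,
  };

  int main(int argc, char *argv[])
  {
      return fuse_main(argc, argv, &stg_ops, NULL);
  }

Because the application only sees an ordinary mounted filesystem, this style of interception requires no change to the application or the checkpointing library, which is the property the design relies on.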
One of the major concerns with checkpointing is the high degree of concur-
rency with which multiple client nodes write process snapshots to a shared stable
storage subsystem. These concurrent write streams introduce severe contention

at the Virtual Filesystem Switch (VFS), which impairs the total throughput. To
avoid this contention caused by small and medium-sized writes, which are common
in the case of checkpointing, we use the write-aggregation method proposed
and studied in [17]. It coalesces the write requests from the application/checkpointing
library and groups them into fewer large-sized writes, which
in turn reduces the number of pages allocated to them from the page cache.
After aggregating the data buffers, instead of writing them to the local disk, the
buffers are enqueued in a work queue which is serviced by a separate thread
that handles the network transfers.
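
The sketch below illustrates the general producer/consumer pattern behind such write aggregation; it is not the CRFS code from [17], and names such as aggregate_write(), transfer_thread() and send_to_staging_server() are assumptions made for the example (single writer, individual writes no larger than one chunk).

  /* Illustrative write-aggregation sketch: small writes are coalesced into a
   * large chunk; full chunks are enqueued and drained by a separate
   * network-transfer thread. */
  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>

  #define CHUNK_SIZE (8 * 1024 * 1024)          /* one aggregated write */

  struct chunk { char data[CHUNK_SIZE]; size_t used; struct chunk *next; };

  static struct chunk *cur;                     /* chunk being filled   */
  static struct chunk *head, *tail;             /* queue of full chunks */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

  extern void send_to_staging_server(const char *buf, size_t len);  /* assumed */

  static void enqueue(struct chunk *c)
  {
      pthread_mutex_lock(&lock);
      c->next = NULL;
      if (tail) tail->next = c; else head = c;
      tail = c;
      pthread_cond_signal(&nonempty);
      pthread_mutex_unlock(&lock);
  }

  /* Called from the write path (single writer assumed): coalesce a small write. */
  void aggregate_write(const char *buf, size_t len)
  {
      if (!cur) cur = calloc(1, sizeof(*cur));
      if (cur->used + len > CHUNK_SIZE) {       /* chunk full: hand it over */
          enqueue(cur);
          cur = calloc(1, sizeof(*cur));
      }
      memcpy(cur->data + cur->used, buf, len);
      cur->used += len;
  }

  /* Separate thread: drains the queue and performs the network transfers. */
  void *transfer_thread(void *arg)
  {
      (void)arg;
      for (;;) {
          pthread_mutex_lock(&lock);
          while (!head) pthread_cond_wait(&nonempty, &lock);
          struct chunk *c = head;
          head = c->next;
          if (!head) tail = NULL;
          pthread_mutex_unlock(&lock);
          send_to_staging_server(c->data, c->used);
          free(c);
      }
      return NULL;
  }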
The primary goal of this staging framework is to let the application which is
being checkpointed proceed with its computation as early as possible, without
penalizing it for the shortcomings of the underlying storage system. The Infini-
Band network fabric has RDMA capability which allows for direct reads/writes
to/from host memory without involving the host processor. This capability has
been exploited to directly read the data that is aggregated in the client’s mem-
ory, which then gets transferred to the staging node which governs it. The stag-
ing node writes the data to a high-throughput node-local SSD while it receives
chunks of data from the client node (step A1 in Fig. 2). Once the data has been
persisted in these Staging servers, the application can be certain that the check-
point has been safely stored, and can proceed with its computation phase. The
data from the SSDs on individual servers are then moved to a stable distributed
filesystem in a lazy manner (step A2 in Fig. 2).
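
As a rough illustration of this step, the following libibverbs sketch posts an RDMA READ on the staging server, pulling an aggregated client buffer without involving the client CPU; the queue pair, the registered local buffer and the client's (remote address, rkey) pair are assumed to have been set up and exchanged beforehand, and error handling is omitted.

  /* Sketch: the staging server pulls the client's aggregated buffer with an
   * RDMA READ, so the client CPU is not involved in the transfer. */
  #include <infiniband/verbs.h>
  #include <stdint.h>
  #include <string.h>

  int post_rdma_read(struct ibv_qp *qp,
                     void *local_buf, uint32_t len, uint32_t lkey,
                     uint64_t remote_addr, uint32_t rkey)
  {
      struct ibv_sge sge;
      struct ibv_send_wr wr, *bad_wr = NULL;

      memset(&sge, 0, sizeof(sge));
      sge.addr   = (uintptr_t)local_buf;    /* where the data lands locally  */
      sge.length = len;
      sge.lkey   = lkey;                    /* key of the local registration */

      memset(&wr, 0, sizeof(wr));
      wr.opcode              = IBV_WR_RDMA_READ;
      wr.sg_list             = &sge;
      wr.num_sge             = 1;
      wr.send_flags          = IBV_SEND_SIGNALED;   /* completion on the CQ  */
      wr.wr.rdma.remote_addr = remote_addr;         /* client buffer address */
      wr.wr.rdma.rkey        = rkey;                /* client's remote key   */

      return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
  }

Once the corresponding completion is reaped from the completion queue, the staging server can write the received chunk to its local SSD and acknowledge the client.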
Concerning the reliability of our staging approach, note that, after a
checkpoint, all the checkpoint files are eventually stored in the same shared
filesystem as in the direct-checkpointing approach. Both approaches therefore
provide the same reliability regarding the saved data. However, with the staging
approach, the checkpointing operation is faster, which reduces the odds of losing
the checkpoint data due to a compute node failure. During a checkpoint, the
staging servers introduce additional points of failure. To counter the effects of such
a failure, we ensure that the previous set of checkpoint files is not deleted before
all the new ones are safely transferred to the shared filesystem.

4 Experimental Evaluation
4.1 Experimental Testbed
A 64-node InfiniBand Linux cluster was used for the experiments. Each client node
has eight processor cores on two Intel Xeon 2.33 GHz Quad-core CPUs. Each node
has 6 GB main memory and a 250 GB ext3 disk drive. The nodes are connected
with Mellanox MT25208 DDR InfiniBand HCAs for low-latency communication.
The nodes are also connected with a 1 GigE network for interactive logging and
maintenance purposes. Each node runs Linux 2.6.30 with FUSE library 2.8.5.
The primary shared storage partition is backed by Lustre. Lustre 1.8.3 is con-
figured using 1 MetaData Server (MDS) and 1 Object Storage Server (OSS), and
is set to use InfiniBand transport. The OSS uses a 12-disk RAID-0 configuration
which can provide a 300 MB/s write throughput.

Fig. 3. Throughput (in MB/s) of a single staging server with a varying number of client nodes (1–8) and processes per client (1, 2, 4, and 8) (higher is better)

The cluster also has 8 storage nodes, 4 of which have been configured to be
used as the “staging nodes” (as described in Fig. 2) for these experiments. Each
of these 4 nodes has a PCI-Express based SSD card with 80 GB capacity, two
of them being Fusion-io ioXtreme cards (350 MB/s write throughput) and the two
others being Fusion-io ioDrive cards (600 MB/s write throughput).

4.2 Profiling of a Stand-Alone Staging Server


The purpose of this experiment is to study the performance of a single staging
node with a varying number of clients. The I/O throughput was measured using
the standard IOzone benchmark [2]. Each client writes a file of size 1 GB using
1 MB records. Figure 3 reports the results of this experiment.
We observe a maximal throughput of 550 MB/s when a single client with 1 process
writes data. This throughput approaches the write throughput of the SSD used as
the staging area (i.e. 600 MB/s), which indicates that transferring the files over
the InfiniBand network is not a bottleneck. As the number of
processes per client node (and therefore the total number of processes) increases,
contention at the SSD slightly decreases the throughput. For 8
processes per node and 8 client nodes, i.e. 64 client processes, the throughput is
488 MB/s, which represents only an 11 % decline.

4.3 Scalability Analysis


In this section, we study the scalability of the whole architecture from the ap-
plication’s perspective. In these experiments, we choose to associate 8 compute
nodes with a given staging server.
We measure the total throughput using the IOzone benchmark for 1 and 8
processes per node. Each process writes a total of 1 GB of data using a 1 MB
record size. The results are compared to the classic approach where all processes
directly write to the Lustre shared filesystem.

Fig. 4. Throughput scalability analysis: total throughput (in MB/s) as a function of the number of client nodes (one staging node for 8 client nodes), for the staging and Lustre approaches with 1 and 8 processes per node (higher is better)

Figure 4 shows that the proposed architecture scales as we increase the
number of groups. This is expected because the architecture is designed in such a way that the I/O
resources are added proportionally to the number of computing resources. Conversely,
the Lustre configuration does not offer such a possibility, so the Lustre
throughput stays constant. The maximal aggregate throughput observed for all
the staging nodes is 1,834 MB/s, which is close to the sum of the write throughputs
of the SSDs on these nodes (1,900 MB/s).

4.4 Evaluation with Applications


As explained in Figure 1, the purpose of the staging operation is to allow the
application to resume its execution faster after a checkpoint. In this experiment,
we measure the time required to perform a checkpoint from the application
perspective, i.e. the time during which the computation is suspended because
of the checkpoint. We compare this staging approach with the classic way in
which the application processes directly write their checkpoints to the parallel
Lustre filesystem. As a complement, we also measure the time required by the
staging node to move the checkpointed data to Lustre in the background once the
checkpoints have been staged and the computation has resumed.
The next experiment used two applications (LU and BT) from the NAS Parallel
Benchmarks. The class D input has a large memory footprint and hence large
checkpoint files. These applications were run on 32 nodes with MVAPICH2 [8]
and were checkpointed using the integrated Checkpoint/Restart support based
on BLCR [10]. Table 1 shows the checkpoint sizes of these applications for the
considered test cases.

Table 1. Size of the checkpoint files

             Average size per process   Total size
  LU.D.128   109.3 MB                   13.7 GB
  BT.D.144   212.1 MB                   29.8 GB

Fig. 5. Comparison of the checkpoint times between the proposed staging approach and the classic approach (lower is better). (a) LU.D.128: 99.3 s direct checkpoint vs. 11.9 s checkpoint staging plus 49.5 s background transfer. (b) BT.D.144: 241 s direct checkpoint vs. 28.8 s checkpoint staging plus 105.3 s background transfer.

Figure 5 reports the checkpointing times that we measured for the considered
applications. For the proposed approach, two values are distinctly shown: the
checkpoint staging time (step A1 in Figure 2) and the background transfer time
(step A2 in Figure 2). The staging time is the checkpointing time as seen by the
application, i.e. the time during which the computation is suspended. The back-
ground transfer time is the time to transfer the checkpoint files from the staging
area to the Lustre filesystem, which takes place in parallel to the application
execution once the computation resumes.
For the classic approach, the checkpoint is directly written to the Lustre
filesystem, so we show only the checkpoint time (step B in Figure 2). The application
is blocked on the checkpointing operation for the entire duration shown.
The direct checkpoint and the background transfer both write the
same amount of data to the same Lustre filesystem. The large difference between
these data transfer times (a factor of two or more) arises because, thanks to our
hierarchical architecture, the contention on the shared filesystem is reduced. With
the direct-checkpointing approach, 128 or 144 processes write their checkpoints
simultaneously to the shared filesystem. With our staging approach, only 4 staging
servers write simultaneously to the shared filesystem.
It is meaningful to compare the direct checkpoint time only to the checkpoint
staging time, because these correspond to the time seen by the application
(for the classic approach and the staging approach, respectively); the background
transfer is overlapped by the computation.
Our results show the benefit of the staging approach, which considerably
reduces the time during which the application is suspended. For both our test
cases, the checkpoint time, as seen by the application, is 8.3 times
shorter. The time gained can then be used to make progress in the computation.

5 Related Work
Checkpoint/Restart is supported by several MPI stacks [8,12,6] to achieve fault
tolerance. Many of these stacks use FTB [9] as a back-plane to propagate fault
information in a consistent manner. However, checkpointing is well known for its
heavy I/O overhead when dumping process images to stable storage [18].
Many efforts have been made to tackle this I/O bottleneck. PLFS [5] is
a parallel log-structured filesystem proposed to improve checkpoint writing
throughput. This solution only deals with the N-to-1 scenario, where multiple processes
write to the same shared file; hence it cannot handle MPI system-level checkpoints
where each process is checkpointed to a separate image file.
SCR [15] is a multi-level checkpoint system that stores data on the local storage
of compute nodes to improve the aggregate write throughput. SCR stores
redundant data on neighbor nodes to tolerate failures of a small portion of the
system, and it periodically copies locally cached data to the parallel filesystem to
tolerate cluster-wide catastrophic failures. Our approach differs from SCR in that
a compute node stages its checkpoint data to its associated staging server, so
that the compute node can quickly resume execution while the staging server
asynchronously moves the checkpoint data to a parallel filesystem.
Open MPI [11] provides a feature to store process images in a node-local filesystem
and later copy these files to a parallel filesystem. Dumping a memory-intensive
job to a local filesystem is usually bounded by the local disk speed,
and this scheme is hard to use on disk-less clusters where a RAM disk is not feasible due
to the high application memory footprint. Our approach aggregates node-local
checkpoint data and stages it to a dedicated staging server, which takes advantage
of the high-bandwidth network and advanced storage media such as SSDs to
achieve good throughput.
Isaila et al. [14] designed a two-level staging hierarchy to hide file access
latency from applications. Their design is coupled to Blue Gene's architecture,
where dedicated I/O nodes service a group of compute nodes; not all clusters
have such a hierarchical structure.
DataStager [4] is a generic service for I/O staging which is also based on
InfiniBand RDMA. However, our work is specialized for Checkpoint/Restart,
so we can optimize the I/O scheduling for this scheme. For example, we give
priority to the data movement from the application to the staging nodes to
shorten the checkpoint time from the application's perspective.

6 Conclusion and Future Work

As a part of this work, we explored several design alternatives to develop a
hierarchical data staging framework to alleviate the bottleneck caused by heavy
I/O contention at the shared storage when multiple processes in an application
dump their respective checkpointed data. Using the proposed framework, we
have studied the scalability and throughput of hierarchical data staging and the
merits it offers when it comes to handling large amounts of checkpoint data.
merits it offers when it comes to handling large amounts of Checkpoint data.
We have evaluated the Checkpointing times of different applications, and have
noted that they are able to resume their computation up to 8.3 times faster than
what they would normally, in the absence of data staging. This clearly indicates
that Checkpoint/Restart mechanisms can indeed benefit from hierarchical data

staging. As part of the future work, we would like to extend this framework to
offload several other Fault-Tolerance protocols to the Staging server and relieve
the client of additional overhead.

References
1. Filesystem in userspace, http://fuse.sourceforge.net
2. IOzone filesystem benchmark, http://www.iozone.org
3. Top 500 supercomputers, http://www.top500.org
4. Abbasi, H., Wolf, M., Eisenhauer, G., Klasky, S., Schwan, K., Zheng, F.:
Datastager: Scalable data staging services for petascale applications. In: HPDC
(2009)
5. Bent, J., Gibson, G., Grider, G., McClelland, B., Nowoczynski, P., Nunez, J., Polte,
M., Wingate, M.: PLFS: a checkpoint filesystem for parallel applications. In: SC
(2009)
6. Buntinas, D., Coti, C., Herault, T., Lemarinier, P., Pilard, L., Rezmerita, A.,
Rodriguez, E., Cappello, F.: Blocking vs. non-blocking coordinated checkpointing
for large-scale fault tolerant mpi protocols. Future Generation Computer Systems
(2008)
7. Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., Snir, M.: Toward exascale
resilience. IJHPCA (2009)
8. Gao, Q., Yu, W., Huang, W., Panda, D.K.: Application-transparent check-
point/restart for mpi programs over infiniband. In: ICPP (2006)
9. Gupta, R., Beckman, P., Park, B.H., Lusk, E., Hargrove, P., Geist, A., Panda, D.K.,
Lumsdaine, A., Dongarra, J.: Cifts: A coordinated infrastructure for fault-tolerant
systems. In: ICPP (2009)
10. Hargrove, P.H., Duell, J.C.: Berkeley Lab Checkpoint/Restart (BLCR) for Linux
Clusters. In: SciDAC (2006)
11. Hursey, J., Lumsdaine, A.: A composable runtime recovery policy framework sup-
porting resilient hpc applications. Tech. rep., University of Tennessee (2010)
12. Hursey, J., Squyres, J.M., Mattox, T.I., Lumsdaine, A.: The design and imple-
mentation of checkpoint/restart process fault tolerance for Open MPI. In: IPDPS
(2007)
13. InfiniBand Trade Association: The InfiniBand Architecture,
http://www.infinibandta.org
14. Isaila, F., Garcia Blas, J., Carretero, J., Latham, R., Ross, R.: Design and evalua-
tion of multiple-level data staging for blue gene systems. TPDS (2011)
15. Moody, A., Bronevetsky, G., Mohror, K., de Supinski, B.R.: Design, modeling, and
evaluation of a scalable multi-level checkpointing system. In: SC (2010)
16. Ouyang, X., Gopalakrishnan, K., Gangadharappa, T., Panda, D.K.: Fast Check-
pointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore
Architecture. HiPC (2009)
17. Ouyang, X., Rajachandrasekhar, R., Besseron, X., Wang, H., Huang, J., Panda,
D.K.: CRFS: A lightweight user-level filesystem for generic checkpoint/restart. In:
ICPP (2011) (to appear)
18. Plank, J.S., Chen, Y., Li, K., Beck, M., Kingsley, G.: Memory exclusion: Optimizing
the performance of checkpointing systems. In: Software: Practice and Experience
(1999)
19. Schroeder, B., Gibson, G.A.: Understanding failures in petascale computers. Jour-
nal of Physics: Conference Series (2007)
Impact of Over-Decomposition on Coordinated
Checkpoint/Rollback Protocol

Xavier Besseron1 and Thierry Gautier2


1
Dept. of Computer Science and Engineering, The Ohio State University
besseron@cse.ohio-state.edu
2
MOAIS Project, INRIA
thierry.gautier@inrialpes.fr

Abstract. Failure-free execution will become rare in future exascale
computers. Thus, fault tolerance is now an active field of research.
In this paper, we study the impact of decomposing an application into
much more parallelism than the physical parallelism on the rollback step
of fault-tolerant coordinated protocols. This over-decomposition gives
the runtime a better opportunity to balance the workload after a failure without
the need for spare nodes, while preserving performance. We show
that the overhead on normal execution remains low for relevant over-decomposition
factors. With over-decomposition, restarting execution on the
remaining nodes after failures shows very good performance compared
to the classic decomposition approach: our experiments show that the execution
time after restart can be reduced by 42 %. We also consider a
partial restart protocol to reduce the amount of lost work in case of
failure by tracking the task dependencies inside processes. In some cases,
and thanks to over-decomposition, the partial restart time can represent
only 54 % of the global restart time.

Keywords: parallel computing, checkpoint/rollback, over-decomposition, global restart, partial restart.

1 Introduction

The number of components involved in the execution of High Performance
Computing applications keeps growing. Thus, the probability of failure during an execution
is very high. Fault tolerance is now a well-studied subject and many protocols
have been proposed [8]. The coordinated checkpoint/rollback scheme is widely
used, principally because of its simplicity, in particular in [5,10,13,23]. However,
among other drawbacks, the coordinated checkpoint/rollback approach suffers
from the following issues.

1. Lack of flexibility after restart. Coordinated checkpoint implies that the ap-
plication will be recovered in the same configuration. Three approaches exist
to restart the failed processes. The first one is to wait for free nodes or for

failed nodes to be repaired. To avoid wasting time, the second approach is
based on spare nodes which are pre-allocated before the beginning of the execution.
Third, the application can be restarted only on the remaining nodes (over-subscription).
However, without redistribution of the application workload,
this approach leads to poor performance of the execution after restart.
2. Lost work because of the restart. The global restart technique associated with
the coordinated checkpoint requires all the processes to roll back to their last
checkpoint in case of failure, even those which did not fail. Then, a large
amount of computation has to be executed twice, which constitutes a waste
of time and computing resources.

Addressing these issues is a challenging task. To overcome these limitations,
our proposition is based on the over-decomposition technique [19,4,18], coupled
with a finer representation of the internal computation with a data flow graph.
Thanks to this, the scheduler can better balance the workload. The contributions
of this work are: 1. We leverage over-decomposition to restart an application
on a smaller number of nodes (i.e. without spare nodes) while preserving a well-balanced
workload, in order to achieve a better execution speed. 2. We combine
over-decomposition and the partial restart technique proposed in [2] to reduce
the time required to re-execute the lost work, which speeds up the recovery
of the application after a failure. 3. We propose an experimental evaluation
of these techniques using the Kaapi [11] middleware.
This paper is organized as follows. The next section gives background about
the over-decomposition principle, the Kaapi data flow model and coordinated
checkpoint/restart. Section 3 explains how over-decomposition can benefit
the global and partial restart techniques. Experiments and evaluations are
presented in Section 4. Then, we give our conclusions.

2 Background
This work was motivated by parallel domain decomposition applications. In
the remainder of the paper, we consider an iterative application called Poisson3D
which solves Poisson's partial differential equation with a 7-point stencil over
a 3D domain using the finite difference method. The simulation domain is decomposed
into d sub-domains. Then, the sub-domains are assigned to processes for the
computation (classically one sub-domain per MPI process).
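
For concreteness, a single Jacobi-style sweep of the 7-point stencil over one sub-domain could look like the sketch below; the array layout, ghost layers and names are illustrative, and the halo exchange between neighbouring sub-domains is omitted.

  /* Illustrative 7-point stencil sweep over one sub-domain of the 3D grid
   * (Jacobi-style update for the Poisson equation; halo exchange omitted).
   * The indexing macro assumes an (nx+2) x (ny+2) x (nz+2) array with ghost
   * layers around the interior points. */
  #define IDX(i, j, k) (((i) * (ny + 2) + (j)) * (nz + 2) + (k))

  void sweep_subdomain(const double *u, double *u_new, const double *f,
                       int nx, int ny, int nz, double h)
  {
      for (int i = 1; i <= nx; i++)
          for (int j = 1; j <= ny; j++)
              for (int k = 1; k <= nz; k++)
                  u_new[IDX(i, j, k)] =
                      (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                       u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                       u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)] -
                       h * h * f[IDX(i, j, k)]) / 6.0;
  }

One such sweep per sub-domain and per iteration is the unit of computation that is represented as a task in the data flow model described next.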
Kaapi and Data Flow Model. Kaapi1 [11] is a task-based model for parallel
computing inherited from Athapascan [9]. Using the access mode specifications
(read, write) of the function-task parameters, the runtime is able to dynamically
compute the data flow dependencies between tasks from the sequence of
function-task calls; see [9,11]. These dependencies are used to execute independent
tasks concurrently on idle resources using work-stealing scheduling [11].
Furthermore, this data flow graph is used to capture the application state for
many original checkpoint/rollback protocols [14,2].
1
http://kaapi.gforge.inria.fr

Fig. 1. Example of over-decomposition: the same data flow graph, generated for 6 sub-domains (dom[0]–dom[5]), is scheduled on (a) 2 processors or (b) 3 processors

In the context of this paper, Kaapi schedules an over-decomposed data flow
graph using a static scheduling strategy [15,11]. Once computed for a loop body,
the data flow graph scheduling can be reused through several iterations until
the graph or the computing resources change.
Over-decomposition. The over-decomposition principle [19] is to choose a number
d of sub-domains significantly greater than the number n of processors [19,18],
i.e. d ≫ n. Once d is fixed, it defines the parallelism degree of the application. Thanks
to this, the scheduler has more freedom to balance the workload among the processors.
Figure 1 shows a basic example of how a simple over-decomposed data
flow graph can be partitioned by Kaapi using static scheduling.
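
The small sketch below (illustrative only, not the Kaapi scheduler) shows why choosing d ≫ n helps: a simple block mapping of the d sub-domains can be recomputed for whatever number of processors is currently alive, yielding nearly equal shares before and after a failure.

  /* Illustrative static block mapping of d sub-domains onto n processors.
   * owner[s] receives the processor index that owns sub-domain s. */
  void map_subdomains(int d, int n, int owner[])
  {
      int base = d / n, extra = d % n, s = 0;
      for (int proc = 0; proc < n; proc++) {
          int share = base + (proc < extra ? 1 : 0);   /* nearly equal shares */
          for (int i = 0; i < share; i++)
              owner[s++] = proc;
      }
  }

  /* After a failure the mapping is simply recomputed for the survivors.
   * E.g. with d = 1000: map_subdomains(1000, 100, owner) gives 10 sub-domains
   * per node; map_subdomains(1000, 99, owner) gives 10 or 11 per node (about
   * 10 % imbalance), whereas with d = n = 100 (no over-decomposition) one
   * surviving node would have to carry 2 full-size sub-domains (2x imbalance). */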
Coordinated Checkpoint/Rollback. The coordinated checkpoint/rollback
technique [8,22] is composed of two phases. During the failure-free execution,
the application is periodically saved: the processes are coordinated to build a
consistent global state [7], and then they are checkpointed. In case of failure, the
application is recovered using the last valid set of checkpoints. This last step
requires all the processes to roll back; it is called a global restart.

3 Over-Decomposition for Rollback


Kaapi provides a fault-tolerance protocol called CCK for Coordinated Check-
point in Kaapi [2]. It is based on the blocking coordinated checkpoint/rollback
protocol, originally designed by Tamir et al. in [22]. CCK provides two kinds of
recovery protocols. The global restart is the classic recovery protocol associated
with the coordinated checkpoint approach. The partial restart is an original
recovery protocol presented in [2] which takes advantage of the data flow model.

3.1 Over-Decomposition for Global Restart


The global restart protocol of CCK works like the recovery of the classic coor-
dinated checkpoint/rollback protocol: once a failure happens, all the processes
roll back to their last checkpoint. During this step, and contrary to most of the
other works, we do not assume that spare nodes are available to replace the
failed nodes. In such a case, the standard MPI programming model would impose
restarting many processes on the same core (over-subscription), which would
lead to poor execution performance after restart.

Fig. 2. Lost work and time to re-execute the lost work for (a) global restart and (b) partial restart

With Kaapi, an application checkpoint is made of its data flow graph [11,14].
Then it is possible to balance the workload after restart on the remaining pro-
cesses, without requiring new processes or new nodes. The over-decomposition
allows the scheduler to freely re-map tasks and data among processors in order
to keep a well-balanced workload. Experimental results on actual executions of
the Poisson3D application are presented in Section 4.1.

3.2 Over-Decomposition for Partial Restart

The partial restart for CCK [2] assumes that the application is checkpointed
periodically using a coordinated checkpoint. However, instead of restarting all
the processes from their last checkpoint as for global restart, partial restart only
needs to re-execute a subset of the work executed since the last checkpoint to
recover the application. We call the lost work the computation that has been
executed before the failure but that needs to be re-executed in order to restart
the application properly. W_lost^global is the lost work for global restart in
Figure 2a and W_lost^partial is the lost work for partial restart in Figure 2b.
To allow the execution to resume properly, and similarly to the message logging
protocols [8], the non-failed processes have to replay the messages that have been
sent to the failed processes since their last checkpoint. Since these messages have
not been logged during execution, they will be regenerated by re-executing a subset
of tasks on the non-failed processes. This strictly required set of tasks is extracted
from the last checkpoint by tracking the dependencies inside the data flow graph [2].
This technique is possible because the result of the execution is directed by a
data flow graph, where the reception order of messages does not impact the
computed values [9,11]. As a result, this ensures that the restarted processes will
reach exactly the same state as the failed processes had before the failure [2].
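
The following schematic sketch, which is not the actual CCK algorithm from [2], conveys the idea of this dependency tracking: starting from the tasks whose outputs were sent to a failed process since the last checkpoint, the data flow dependencies are walked backwards to obtain the strictly required set of tasks to replay.

  /* Schematic sketch of the backward closure over data flow dependencies.
   * Tasks with outputs sent to a failed process seed the search; all of
   * their transitive producers since the last checkpoint are marked as
   * required. The fixed-size arrays assume ntasks <= MAX_TASKS. */
  #include <string.h>

  #define MAX_TASKS 1024

  struct task {
      int ndeps;
      int deps[8];            /* indices of producer tasks this task reads from */
      int sent_to_failed;     /* 1 if its output was sent to a failed process   */
  };

  /* Sets required[t] = 1 for every task that must be re-executed. */
  void mark_required(const struct task g[], int ntasks, int required[])
  {
      int stack[MAX_TASKS], top = 0;
      memset(required, 0, ntasks * sizeof(int));

      for (int t = 0; t < ntasks; t++)          /* seed with message producers */
          if (g[t].sent_to_failed && !required[t]) {
              required[t] = 1;
              stack[top++] = t;
          }

      while (top > 0) {                         /* backward closure over deps */
          int t = stack[--top];
          for (int i = 0; i < g[t].ndeps; i++) {
              int p = g[t].deps[i];
              if (!required[p]) {
                  required[p] = 1;
                  stack[top++] = p;
              }
          }
      }
  }

With over-decomposition, this required set is made of many small tasks, which is what allows the runtime to replay it in parallel on the non-failed processes.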
Fig. 3. Iteration time (in seconds) as a function of the number of sub-domains per node, with a constant domain size per node, on 1 node and on 100 nodes (lower is better)

With over-decomposition, the computation of each process is composed of
a large number of small computation tasks in our data flow model. This gives
the scheduler the freedom to re-execute the lost work in parallel. The size of
the lost work has been studied theoretically and using simulations in [2]. The
experiments in Section 4.2 show results on actual executions.

4 Experimental Results
We experimentally evaluate the techniques proposed in this paper with the Poisson3D
application sketched in Section 2. The amount of computation in
each iteration is constant, so the iteration time remains approximately constant
between steps. The following experiments report the average iteration time.
Our experimental testbed is composed of the Griffon and Grelon clusters
located in Nancy, which are part of the Grid'5000 platform2 . The Griffon cluster has
92 nodes with 16 GB of memory and two 4-core Intel Xeon L5420 processors. The Grelon
cluster is composed of 120 nodes with 2 GB of memory and two 4-core Intel Xeon
5110 processors. All nodes of both clusters are connected with a Gigabit Ethernet
network and 2 levels of switches.
Over-decomposition overhead. Over-decomposition may introduce an overhead
at runtime due to the management of the parallelism. The purpose of this
first experiment is to measure this overhead. We use a constant domain size per
core of 10^7 double-precision reals, i.e. 76 MB, and we vary the decomposition d, which is
the number of sub-domains per core. We run this on 1 and 100 nodes. In both cases,
we use only 1 core for computation on each node to simplify the result analysis.
Figure 3 shows the results of the experiment on the Grelon cluster. Each
point is the average value of one hundred iterations and the error bars show the
standard deviation. With one node, the iteration time for a decomposition into 1
or 2 sub-domains is about 0.4 s. For 3 sub-domains, the execution time drops
by 35 % due to better cache use with small blocks. For higher decomposition
factors, the iteration time slowly increases linearly; this is the overhead due to the
management of this higher level of parallelism.
2
http://www.grid5000.fr

Fig. 4. Slow-down after restart on 100 − p nodes compared to the execution before the failures, for p = 1, 10, 20 and 50 failed nodes and for different decomposition levels d per node (no over-decomposition, 2, 5, 10, 20 and 50 sub-domains/node) (lower is better)

Compared to the best value (i.e.
for 3 sub-domains per node), the overhead is around 3 % for 10 sub-domains
per node and 25 % for 100 sub-domains. The curve shape with 100 nodes is
similar but it is shifted up, between 0.05 and 0.1 seconds higher, due to the
communication overhead.

4.1 Global Restart

We measure the gain in iteration time due to the capacity to reschedule the
workload after the failure of p processors. We consider the following scenario:
the application is executed on n nodes with periodic coordinated checkpointing.
Then, p nodes fail3 . The application is restarted on n − p nodes using the global
restart approach and the load-balancing algorithm is applied.

Execution speed after restart. We run the Poisson3D application on n =
100 nodes of the Grelon cluster (using only 1 core per node) with coordinated
checkpointing and a total domain size of 10^9 doubles, i.e. 7.5 GB. Then,
after the failure of p nodes, the application is restarted on 100 − p nodes and
the sub-domains are balanced among all the remaining nodes. We measured the
iteration time of the application before and after the failure and then took the
average of 100 values. Figure 4 reports the slow-down (iteration time after failure
over iteration time before failure) for different decomposition factors d.
First, note that the execution after restart is always slowed down because it
uses a smaller number of nodes. For example, after the failure of 50 nodes,
the execution is almost 2 times slower because it now uses 50 nodes instead of 100.
The loss of half of the nodes slows down the execution after failure by a factor of
less than 2 because fewer messages are communicated among fewer nodes.
When using 10 or more sub-domains per node, the slow-down is considerably
reduced, in particular when the number of failed nodes is not a divisor of the
3
A node failure is simulated by killing all the processes on the node.

Fig. 5. Proportion of lost work (tasks to re-execute, in %) for the partial restart in comparison to the classic global restart approach, as a function of the checkpoint period (in iteration number), for experimental measures and simulation results (lower is better)

initial node number. When only one node fails, our results show that the execution
time after restart with over-decomposition is reduced by 42 %. We want to
emphasize that this improvement applies to all the iterations after the restart;
thus, it benefits all the rest of the execution.
Load-balancing cost. This experiment evaluates the cost of the restart and
load-balancing steps. We measure this cost on the Griffon cluster using 80 nodes
with 8 cores each, i.e. 640 cores, each core holding a domain of 10^6 doubles,
i.e. 7.6 MB. One node fails and the application is restarted on 79 nodes. The
7 s global restart time is decomposed as follows: 2.1 s for the process
coordination and the checkpoint loading; 1.7 s to compute and apply
the new schedule; and finally, 3.2 s for the data redistribution between processes.

4.2 Partial Restart


In this section, we focus on partial restart. The application runs on 100 nodes
and one node fails. It is then restarted on 100 nodes using one spare node.
Lost work. Figure 5 reports the proportion of tasks to re-execute with respect
to the global restart, i.e. W_lost^partial / W_lost^global. These values correspond to the worst
case, i.e. when the failure happens just before the next checkpoint.
The X-axis represents the checkpoint period (in iteration count). A smaller
period implies frequent checkpoints and few dependencies between sub-domains,
and thus a smaller number of tasks to re-execute. A larger period lets the dependencies
propagate among the sub-domains and processors (for a domain decomposition
application, the communication pattern introduces local communications):
more tasks need to be re-executed to recover. The experimental measures were run
on 100 nodes of the Grelon cluster with 16 sub-domains per node. The simulation
results come from [2].
Partial restart time. The results in Figure 6 show the restart times for
global and partial restart. For checkpoint periods of 10 and 100 iterations, the lost
work of partial restart represents respectively 6 % and 51 % of the lost work of
the global restart. We used 100 nodes of Grelon and a total domain size
of 76 MB, and we restarted the application on the same number of nodes. Using

Fig. 6. Comparison of the restart time (in seconds) between global restart and partial restart, for checkpoint periods of 10 and 100 iterations and for different computation grains: (a) sub-domain computation ≈ 2 ms, (b) sub-domain computation ≈ 50 ms (lower is better)

two different sub-domain computation times allows us to see the influence of the data
redistribution (the data size and the communication volume are kept identical).
The restart time for partial restart includes the time to re-execute the lost
work, i.e. the strictly required set of tasks, and also the time to redistribute
the data, which can be costly. It is difficult to measure these two steps indepen-
dently because they are overlapped in the Kaapi implementation. For the global
restart, this time corresponds only to the time to re-execute the lost work: in
this experiment, there is no need to redistribute the data because the workload
remains the same as before the failure.
For a small computation grain, i.e. a sub-domain computation time of 2 ms,
the performance of partial restart is worse than that of global restart because the data
redistribution represents most of the cost of the partial restart, mainly because
the load-balancing algorithm used does not take data locality into account.
For a coarser grain, i.e. a sub-domain computation time of 50 ms, the partial
restart achieves better performance. For a 100-iteration checkpoint period, the
partial restart time represents only 54 % of the global restart time (for a lost
work which corresponds to 51 %).

5 Related Work

The over-decomposition principle is to decompose the work of an application
into a number of tasks significantly greater than the number of computing resources
[19,18]. Over-decomposition is applied in many programming models in
order to hide communication latency behind computation (Charm++ [18]), to
simplify the scheduling of independent tasks (Cilk [3]) or of tasks with data flow dependencies
(PLASMA [21] and SMPSs [1], which do not consider recursive computation; or
Kaapi [9,11], which allows recursive data flow computation). For the MPI programming
model, the parallelism description is tightly linked to the processor number.
Alternatives are AMPI [18], or hybrid approaches like MPI/OpenMP [20]
or MPI/UPC [16], which make use of over-decomposition.

On the fault tolerance side, most works focus on the message-passing
model. Many protocols, such as checkpoint/rollback protocols and message-logging
protocols, have been designed [8] and are widely used [5,10,13,23,6]. In [12],
communication determinism is leveraged to propose optimized approaches for
certain application classes.
Charm++ can use over-decomposition to restart an application only on the
remaining nodes after a failure [17], with both coordinated checkpoint/restart and
message-logging approaches. It relies on a dynamic load-balancing algorithm
which periodically collects load information from all the nodes and redistributes
the Charm++ objects if required.
Similarly to Charm++, our work in Kaapi allows an application to be restarted on
the remaining nodes using over-decomposition. Additionally, we leverage over-decomposition
to reduce the restart time of the application thanks to the original
partial restart approach. In our work, we also consider a data flow model which
allows a finer representation of the application state. Furthermore, our load-balancing
algorithm is based on the data flow graph of the application and is
executed only after the restart.

6 Conclusion and Future Work

We presented the impact of over-decomposition on the execution after a failure using
the classical checkpoint/rollback scheme. First, when an application is restarted
without spare nodes, over-decomposition allows the workload to be balanced. In our
experimental results, the execution time after restart with over-decomposition
is reduced by 42 % compared to a decomposition based on the process number.
Furthermore, the improvement benefits all the rest of the execution.
Secondly, we leverage over-decomposition to improve the restart with the
partial restart technique proposed in [2]. The partial restart reduces
the amount of work required to restart an application after a failure. The over-
decomposition exposes the parallelism residing in this lost work. In our experi-
mental results, we showed that, in some cases, the partial restart time represents
only 54 % of the global restart time. The experiments also highlight that the
data redistribution induced by the load-balancing can have a significant impact
on the partial restart performance.
For future work, we plan to extend the partial restart algorithm to
support the uncoordinated checkpoint approach, which will avoid the I/O
contention due to the coordinated checkpoint. Additionally, load-balancing
algorithms that take data movement into account will be studied to improve the
partial restart.

Acknowledgment. Experiments presented in this paper were carried out using
the Grid'5000 experimental testbed, being developed under the INRIA ALADDIN
development action with support from CNRS, RENATER and several
Universities as well as other funding bodies (see https://www.grid5000.fr).

References
1. Badia, R.M., Herrero, J.R., Labarta, J., Pérez, J.M., Quintana-Ortí, E.S.,
Quintana-Ortí, G.: Parallelizing dense and banded linear algebra libraries using
SMPSs. Concurr. Comput.: Pract. Exper. (2009)
2. Besseron, X., Gautier, T.: Optimised recovery with a coordinated check-
point/rollback protocol for domain decomposition applications. In: MCO 2008
(2008)
3. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou,
Y.: Cilk: An efficient multithreaded runtime system. Parallel and Distributed Com-
puting (1996)
4. Bongo, L.A., Vinter, B., Anshus, O.J., Larsen, T., Bjorndalen, J.M.: Using overde-
composition to overlap communication latencies with computation and take advan-
tage of smt processors. In: ICPP Workshops (2006)
5. Bouteiller, A., Hérault, T., Krawezik, G., Lemarinier, P., Cappello, F.: MPICH-V
project: a multiprotocol automatic fault tolerant MPI. High Performance Comput-
ing Applications (2006)
6. Chakravorty, S., Kale, L.V.: A fault tolerant protocol for massively parallel systems.
In: IPDPS (2004)
7. Chandy, K.M., Lamport, L.: Distributed snapshots: determining global states of
distributed systems. ACM Transactions on Computer Systems (1985)
8. Elnozahy, E.N., Alvisi, L., Wang, Y.M., Johnson, D.B.: A survey of rollback-
recovery protocols in message-passing systems. ACM Computing Surveys (2002)
9. Galilée, F., Roch, J.L., Cavalheiro, G., Doreille, M.: Athapascan-1: On-line building
data flow graph in a parallel language. In: PACT 1998 (1998)
10. Gao, Q., Yu, W., Huang, W., Panda, D.K.: Application-transparent check-
point/restart for mpi programs over infiniband. In: ICPP 2006 (2006)
11. Gautier, T., Besseron, X., Pigeon, L.: Kaapi: a thread scheduling runtime system
for data flow computations on cluster of multi-processors. In: PASCO 2007 (2007)
12. Guermouche, A., Ropars, T., Brunet, E., Snir, M., Cappello, F.: Uncoordinated
checkpointing without domino effect for send-deterministic message passing appli-
cations. In: IPDPS (2011)
13. Hursey, J., Squyres, J.M., Mattox, T.I., Lumsdaine, A.: The design and imple-
mentation of checkpoint/restart process fault tolerance for Open MPI. In: IPDPS
(2007)
14. Jafar, S., Krings, A.W., Gautier, T.: Flexible rollback recovery in dynamic hetero-
geneous grid computing. IEEE Transactions on Dependable and Secure Computing
(2008)
15. Jafar, S., Pigeon, L., Gautier, T., Roch, J.L.: Self-adaptation of parallel applications
in heterogeneous and dynamic architectures. In: ICTTA 2006 (2006)
16. Jose, J., Luo, M., Sur, S., Panda, D.K.: Unifying UPC and MPI Runtimes: Expe-
rience with MVAPICH. In: PGAS 2010 (2010)
17. Kale, L.V., Mendes, C., Meneses, E.: Adaptive runtime support for fault tolerance.
Talk at Los Alamos Computer Science Symposium 2009 (2009)
18. Kale, L.V., Zheng, G.: Charm++ and AMPI: Adaptive runtime strategies via
migratable objects. In: Advanced Computational Infrastructures for Parallel and
Distributed Applications. Wiley-Interscience (2009)
19. Naik, V.K., Setia, S.K., Squillante, M.S.: Processor allocation in multiprogrammed
distributed-memory parallel computer systems. Parallel Distributed Computing
(1997)

20. Rabenseifner, R., Hager, G., Jost, G.: Hybrid MPI/OpenMP parallel programming
on clusters of multi-core SMP nodes. In: PDP 2009 (2009)
21. Song, F., YarKhan, A., Dongarra, J.: Dynamic task scheduling for linear algebra
algorithms on distributed-memory multicore systems. In: SC 2009 (2009)
22. Tamir, Y., Séquin, C.H.: Error recovery in multicomputers using global checkpoints.
In: ICPP 1984 (1984)
23. Zheng, G., Shi, L., Kale, L.V.: FTC-Charm++: an in-memory checkpoint-based
fault tolerant runtime for Charm++ and MPI. Cluster Computing (2004)
UCHPC 2011: Fourth Workshop
on UnConventional
High Performance Computing

Anders Hast1 , Josef Weidendorfer2 , and Jan-Philipp Weiss3


1
University in Gävle, Sweden
2
Technische Universität München, Germany
3
Karlsruhe Institute of Technology, Germany

Foreword
As the word “UnConventional” in the title suggests, the workshop focuses on
hardware or platforms used for HPC, which were not intended for HPC in the
first place. Reasons could be raw computing power, good performance per watt,
or low cost in general. Thus, UCHPC tries to capture solutions for HPC which
are unconventional today but perhaps conventional tomorrow. For example, the
computing power of platforms for games has recently risen rapidly. This has motivated
the use of GPUs for computing (GPGPU), or even building computational grids
from game consoles. The recent trend of integrating GPUs on processor chips
seems to be very beneficial for using both parts for HPC. Other examples of ”un-
conventional” hardware are embedded, low-power processors, upcoming many-
core architectures, FPGAs or DSPs. Thus, interesting devices for research in
unconventional HPC are not only standard server or desktop systems, but also
relatively cheap devices that are mass market products, such as smartphones,
netbooks, tablets and small NAS servers. For example, smartphones seem to
become more performance hungry every day. Only imagination sets the limit for
the use of the mentioned devices for HPC. The goal of the workshop is to present the
latest research in how hardware and software (yet) unconventional for HPC is
or can be used to reach goals such as best performance per watt. UCHPC also
covers corresponding programming models, compiler techniques, and tools.
This was the fourth time the UCHPC workshop took place, with previous workshops
held in 2008 in conjunction with the International Conference on Computational
Science and Its Applications 2008, in 2009 with the ACM International Confer-
ence on Computing Frontiers 2009, and in 2010 with Euro-Par 2010. This year,
the organizers were able to accept five submissions (out of ten). In addition, we
were proud to present speakers for two invited talks. Both the invited talks and
papers were grouped around three topics which also formed the structure of the
workshop sessions, and made up for a very exciting half-day program:
– Heterogeneous Systems, starting with an invited talk by Raymond Namyst
about ”Programming Heterogeneous, Accelerator-based Multicore Machines:
a Runtime System’s Perspective”, followed by two regular talks on efficient
processor allocation and workload balancing on heterogeneous systems,

– Accelerator Usage for Applications, again starting with an invited talk by
Bertil Schmidt about ”Algorithms and Tools for Bioinformatics on GPUs”,
followed by a regular talk on a study porting a solver for electromagnetics
to a multi-GPU system, and
– Upcoming Architectures, with two regular talks on a study using a Network-
on-Chip architecture, and on porting a data mining algorithm to the Intel
Many Integrated Core Architecture.
These post-workshop proceedings include the final versions of the presented
UCHPC papers, taking the feedback from the reviewers and the workshop audience into
account.
The organizers of the UCHPC workshop want to thank the authors of the
papers; without them, the workshop would not have been able to come up with
such interesting topics for discussion. We also sincerely thank the Euro-Par
organization for providing the opportunity to arrange the workshop in conjunction
with the Euro-Par 2011 conference, and for providing a very nice environment.
As in previous years, we especially appreciated the
hard work of the members of our International Program Committee, who did
a perfect job at reviewing the submissions. Last but not least, we thank the
large number of attendees this year. They contributed to a lively day, and we
hope that they found something of interest in the workshop. Based on the very
positive feedback, the organizers and the steering committee plan to continue
the UCHPC workshop in conjunction with Euro-Par 2012.

September 2011
Anders Hast
Josef Weidendorfer
Jan-Philipp Weiss
PACUE: Processor Allocator Considering User
Experience

Tetsuro Horikawa1 , Michio Honda1, Jin Nakazawa2 , Kazunori Takashio2 ,


and Hideyuki Tokuda2,3
1
Graduate School of Media and Governance, Keio University
2
Faculty of Environment and Information Studies, Keio University,
5322, Endo, Fujisawa, Kanagawa 252-8520, Japan
3
JST-CREST, Japan
{techi,jin,kaz,hxt}@ht.sfc.keio.ac.jp,
micchie@sfc.wide.ad.jp

Abstract. GPU-accelerated applications, including GPGPU ones, are commonly
seen on modern PCs. If many applications compete for the same GPU, their performance
decreases significantly. Some applications have a large impact on user
experience; therefore, for such applications, we have to limit GPU utilization
by the other applications. It might seem straightforward to modify applications to
switch compute devices dynamically for intelligent resource allocation. Unfortunately,
we cannot do so due to software distribution policies or other reasons. In
this paper, we propose PACUE, which allows the end system to allocate compute
devices arbitrarily to applications. In addition, PACUE guesses the optimal compute
device for each application according to user preference. We implemented the
dynamic compute device redirector of PACUE, including OpenCL API hooking
and device camouflaging features. We also implemented the frame of the resource
manager of PACUE. We demonstrate that PACUE achieves dynamic compute
device redirection on one out of two real applications and on all of 20 sample codes.

Keywords: Resource management, OpenCL, binary compatibility, GPU, GPGPU, PC, user experience.

1 Introduction
Graphics Processing Unit (GPU) use has been extended to a wider range of computing
purposes on the PC platform. GPU utilization on PCs can be classified
into four purposes. The first is 3D graphics computation, such as 3D games and 3D-graphics-based
GUI shells (e.g., Windows Aero). The second is 2D graphics acceleration,
such as font rendering in modern web browsers. The third is video decoding and
encoding acceleration: video player applications use the video decoding acceleration
function of the GPU to reduce CPU load and to increase video quality, and some
GPUs have video encoding acceleration units on the die. The last purpose
is general-purpose computing, called General-Purpose computing on GPU (GPGPU).
On PCs, GPGPU is often used by video encoding applications and physics simulation
applications, including 3D games.1
1
Some 3D games utilize GPU for general-purpose computing besides 3D graphics rendering.


In today’s PCs, GPUs are utilized efficiently because only a few applications
are accelerated at the same time; these applications do not compete with each other on the
same GPU. Applications thus choose compute devices statically, for example by user
selection in the application's GUI configuration menu.
However, we envisage that more and more applications will utilize GPUs. For example,
the Open Computing Language (OpenCL) [2] allows applications to explicitly select the compute
device on which to execute some parts of the application. Therefore, efficient load balancing
between compute devices consisting of CPUs and GPUs is essential for future
consumer PCs.
There are three technical challenges in achieving efficient compute device assignment
for heterogeneous processors in PCs. First, GPU acceleration is utilized for various purposes,
whereas GPUs are utilized mainly for general-purpose computing in supercomputers.
In addition, some tasks running on PCs strongly require specific processors. For
example, 3D rendering is normally processed by GPUs, and some 3D graphics transactions
cannot be processed by CPUs, whereas some applications can be processed by
both CPUs and GPUs. When the GPU load is high, we could run the latter applications
explicitly on CPUs.
Second, we must not modify applications. Typically, most applications installed
on major OSes such as Windows and Mac OS cannot be modified by a third party
due to their software distribution policies. Application vendors may not be willing to
modify their applications either, because it will not benefit them straightforwardly. For
these reasons, existing runtimes or libraries that distribute tasks between compute
devices [6, 10, 7], proposed for HPC, are not deployable on consumer PCs.
Third, the performance metric for consumer PCs is complicated, because user preference
is one of the most important metrics for assigning compute devices to applications. This is
clearly different from general HPC metrics, whose task distribution policy is usually
static, such as maximizing task transaction speed or maximizing performance per watt.
On PCs, task distribution policies and their merits easily change depending on the use. For
example, when the user would like to play a 3D game smoothly, other GPGPU
tasks should not be assigned to the GPU. On the other hand, sometimes the user might
rather transcode videos quickly than play a trifling game smoothly.
The compute device selection method must recognize user preferences to decide the
proper compute device to assign. However, this is hard, so user preference recognition
cannot be fully automated. Therefore, the resource management has to infer how the PC is being utilized, and
the users have to be able to tell how they are using the PC at that time.
In this paper, we propose PACUE which allocates compute devices to applications
efficiently. PACUE has two features, one is dynamic compute device redirecting feature
and the other is system-wide optimal device selecting feature. We strongly focus on
solving real problems which will occur when we distribute our system over the world
via web. Therefore, we prefer choosing politically safer method rather than technically
better method. Thus, first advantage of PACUE is the possibility of the deployment.
The second advantage of PACUE is designed to maximize PC users’ experience. Thus,
we bring a new metric for using accelerators, and it will be also beneficial for other
computers such as smart phones or game consoles.
Our experimental results show that PACUE can switch compute devices in one out of two real applications and in all of 20 sample codes built with OpenCL. The remainder of this paper is organized as follows. In Sec. 2, we describe the design of PACUE, consisting of the dynamic compute device redirection and the system resource manager. In Sec. 3, we evaluate our prototype implementation. The paper concludes with Sec. 4.

2 Designing PACUE
PACUE consists of two components: the Dynamic Compute Device Redirector and the Resource Manager. We focus on applications built with OpenCL, a widely used framework that supports many types of compute devices such as CPUs and GPUs.

2.1 Dynamic Compute Device Redirection


We design the Dynamic Compute Device Redirection (DCDR) method to meet the "no application modification" requirement. DCDR implements OpenCL API hooking that conceals the actual compute devices from applications and avoids errors caused by inconsistent device information.

OpenCL API Hooking. OpenCL abstracts compute devices and the memory hierarchy to utilize heterogeneous processors within its programming model. To utilize a compute device, applications call OpenCL APIs and specify a compute device. The assignment process is as follows: first, the application obtains the list of available devices; second, it selects possible devices and creates an OpenCL context; third, it selects one device to use and creates a command queue; finally, it puts tasks into the queue created above. In the second and third steps, the application specifies a concrete device, because the OpenCL APIs need a device ID as a parameter, which makes system-wide optimal device selection impossible. For optimal device selection, we remove the restriction that applications choose the device by themselves, because the decision is hard for applications and users, and decisions made by applications or users are rarely optimal (see Sec. 2.2). PACUE hooks the OpenCL APIs concerned with device selection and implements a function that asks the resource manager which device to utilize.
There are several methods for hooking APIs in Windows 7, where PACUE is implemented. The first possibility is creating a thread in the target application by calling the Windows API CreateRemoteThread() [12]. With this method, we would implement an application that creates a thread in other applications and maps an external DLL containing the overridden target APIs. However, such applications and DLLs are hard to implement due to the complicated procedures involved, and they risk being treated as malware by anti-malware software. The second possibility is a global hook, where a user application hooks specific APIs of all applications by calling the Windows API SetWindowsHookEx() [13]. This method is unsafe, because it risks hooking unknown applications and causing unexpected effects on them. The third possibility is a wrapper DLL, i.e., a DLL with the same file name as the original DLL that exports all the APIs of the original DLL. A wrapper DLL is almost a shell around the original DLL: most of its APIs simply call the original DLL's APIs, except those that actually need to behave differently from the original. This method has the best chance of hooking APIs, because a wrapper DLL located in the application directory is always loaded prior to other copies, such as DLLs located in the system directories, by default. In addition, locating the wrapper DLL in the directory of the target EXE affects only applications whose binaries are located in that directory. Therefore, this is a really safe way to hook APIs. The last possibility is the use of API hook libraries such as [14]. These libraries are easy to use; however, they have a lower probability of successfully hooking APIs than a wrapper DLL, and they also risk being treated as malware. From this comparison, we adopt the wrapper DLL method. Fig. 1 illustrates the architecture for hooking OpenCL APIs with this method. Other major PC OSes such as Mac OS or Linux do not provide a function exactly like wrapper DLLs, but we can still implement a similar system using the API hooking mechanisms offered by those OSes.

Fig. 1. Dynamic Compute Device Switching by OpenCL API Hooking

Another method to switch devices is to create a virtual device [5]. With this method, applications are assigned the virtual device, and the resource management system chooses a real device. This method has the significant advantage that it can switch real devices at any time; however, it may conflict with the Installable Client Driver (ICD) system of OpenCL. Installers of OpenCL runtime libraries distributed by hardware vendors sometimes overwrite the "OpenCL.dll" file, so installing a virtual device or showing applications only the virtual device is difficult on PCs.
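To make the wrapper-DLL idea concrete, the following is a minimal sketch (not PACUE's actual code) of how such a wrapper could forward clCreateCommandQueue() to the original OpenCL.dll while substituting the device chosen by the resource manager. The helper ask_resource_manager() and the export mechanism (e.g., a module-definition file) are illustrative assumptions.

// Sketch of a wrapper OpenCL.dll entry point (illustration only).
// The exported symbol name matches the original API; exporting it under that
// name is assumed to be done via a module-definition (.def) file.
#include <windows.h>
#include <CL/cl.h>

typedef cl_command_queue (CL_API_CALL *PFN_clCreateCommandQueue)(
    cl_context, cl_device_id, cl_command_queue_properties, cl_int *);

static PFN_clCreateCommandQueue real_clCreateCommandQueue = nullptr;

// Resolve the original entry point from the system-wide OpenCL.dll once.
static void load_real_opencl() {
    if (real_clCreateCommandQueue == nullptr) {
        HMODULE real = LoadLibraryA("C:\\Windows\\System32\\OpenCL.dll");
        real_clCreateCommandQueue = reinterpret_cast<PFN_clCreateCommandQueue>(
            GetProcAddress(real, "clCreateCommandQueue"));
    }
}

// Hypothetical IPC call to the resource manager: returns the device it wants
// the application to use, or the application's own choice if there is no override.
cl_device_id ask_resource_manager(cl_device_id requested);

// Hooked API: the application calls this transparently, because the wrapper DLL
// in its directory is loaded before the system OpenCL.dll.
extern "C" cl_command_queue CL_API_CALL
clCreateCommandQueue(cl_context context, cl_device_id device,
                     cl_command_queue_properties properties, cl_int *errcode_ret) {
    load_real_opencl();
    cl_device_id chosen = ask_resource_manager(device);   // command-queue level redirection
    return real_clCreateCommandQueue(context, chosen, properties, errcode_ret);
}

Because the wrapper only touches the APIs it needs to override, all other OpenCL calls are forwarded unchanged, which keeps the risk to unrelated functionality low.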

Device Information Camouflaging. When a part of an application's tasks is assigned to the OpenCL device selected by PACUE, some applications show errors. This happens because the device information differs from what the application intended, so some applications treat it as an unusual event. To avoid these errors, PACUE camouflages the OpenCL device details when the desired OpenCL device has been changed dynamically.
However, camouflaging OpenCL device details is risky, because devices have different specifications at the lower level. The first risk is application stability: the memory size of each level of the hierarchy is device dependent, hence an unexpected memory size can result in an application crash or error. The second risk is execution speed: if an application implements per-device optimization, a mismatch between the intended device and the assigned device can result in unexpected performance degradation. For these reasons, we should camouflage device details only when necessary. To minimize the risks, PACUE camouflages devices at the following levels.
1. Device type level camouflage
When an application tries to acquire an OpenCL device list, PACUE will overwrite the cl_device_type value. As far as possible, PACUE changes this value to CL_DEVICE_TYPE_ALL. Showing all devices instead of only devices of the specified type is a reasonable choice, because it avoids forcing the application to use an unknown device. Occasionally, applications cannot execute their OpenCL code on some device types. In this case, PACUE sets the cl_device_type value to the desired type, such as CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU.
2. Context level camouflage
When creating an OpenCL context, PACUE overrides the cl_device_id value and forces the OpenCL framework to build OpenCL binaries for each compute device. If PACUE recognizes that the target application supports only a specific type of compute device, PACUE will overwrite the cl_device_id value and limit the device types for the context. In addition, PACUE overrides the cl_device_id value when the application requests detailed device information. Therefore, the application sees the information of the device PACUE selected. This contributes to application stability, because the acquired device information, such as the memory size, corresponds to that of the device that will actually be used.
3. Command queue level camouflage
When the application calls the clCreateCommandQueue() API, this is the last chance to change the device. Because of the stability issue described above, PACUE tries not to change the device at this point, but if necessary, PACUE changes the cl_device_id in the arguments of this API. In this situation, the device is camouflaged completely, so the application recognizes the camouflaged device as the device it specified. This is a dangerous way to change the device, but it improves application compatibility. It is risky in terms of device-dependent characteristics, such as the memory size; however, we can switch the processor in more applications with this method. Hence, this method is our ace in the hole.

Table 1. Comparison of Device Camouflaging Methods
(Columns: overridden device type when getting the device list | overridden device ID when creating a context | overridden device ID when creating a command queue | crash/error risk | compatibility. "\" means not overridden.)
A. Device type level: CPUs or GPUs | all CPUs or all GPUs | \ | low | most applications
B. Context level: \ | CPUs or GPUs | \ | low | low
C. Command queue level: \ | ALL | one CPU or one GPU | high | most applications
D. A + C: ALL | ALL | one CPU or one GPU | normal | high

As shown in Table 1, there are several ways to override the device assignment by combining these steps. Because they present a trade-off between application compatibility and application stability, we have to establish a rule for applying these methods; some hints are given in Sec. 3.
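As an illustration of the device type level override (level 1), the sketch below shows how a hooked clGetDeviceIDs() could widen the requested device type. The pointer to the original function is assumed to be resolved with GetProcAddress(), as in the earlier wrapper-DLL sketch, and app_supports_all_device_types() is a hypothetical policy check, not part of PACUE's published interface.

#include <CL/cl.h>

// Resolved from the original OpenCL.dll (see the wrapper-DLL sketch above).
extern cl_int (CL_API_CALL *real_clGetDeviceIDs)(
    cl_platform_id, cl_device_type, cl_uint, cl_device_id *, cl_uint *);

// Hypothetical per-application policy: can the hooked application cope with
// devices of any type, or must the originally requested type be preserved?
bool app_supports_all_device_types();

extern "C" cl_int CL_API_CALL
clGetDeviceIDs(cl_platform_id platform, cl_device_type device_type,
               cl_uint num_entries, cl_device_id *devices, cl_uint *num_devices) {
    // Device type level camouflage: report devices of all types whenever the
    // application tolerates it, so that a later redirection remains possible.
    cl_device_type reported = app_supports_all_device_types()
                                  ? CL_DEVICE_TYPE_ALL : device_type;
    return real_clGetDeviceIDs(platform, reported, num_entries,
                               devices, num_devices);
}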

2.2 System Resource Management

We need a system-wide resource manager for heterogeneous processors, because average PC users cannot choose the proper compute device for each application, and it is inconvenient for them to select a compute device every time an application runs. Some advanced PC users can choose the proper compute device manually, but doing so is terribly inconvenient. Besides, many PC users do not know the detailed configuration of the PC they are using. These users cannot choose the proper compute device that accurately satisfies their preferences, even if the application allows the user to select the compute device in its GUI configuration menu. To achieve a high user experience, the resource manager should select a compute device automatically according to the user's preferences.
There are many studies in the HPC area that build resource managers to select compute devices automatically [7, 8]. They present task distribution algorithms for heterogeneous processor environments optimized for specific purposes, such as maximizing performance or maximizing performance per watt. However, they cannot be applied to resource management on PCs, because the requirements of PCs and HPC differ. Another approach to differentiating tasks, such as the device-driver level approach [9], would be a possibility for our goal. However, we still need a system-wide resource manager that considers heterogeneous processors and applications. The following are three requirements of a resource manager especially for PCs.

– Considering user preference
A PC user's preferences often change and are not simple objectives such as maximizing performance. In addition, it is difficult to recognize which application is really important, because users rarely specify process priorities explicitly. Therefore, we have to build a resource manager that infers the user's preference by collecting PC utilization status and that chooses compute devices for each application to satisfy that preference accurately.
– Supporting various hardware configurations
There are plenty of PC hardware components and applications, so the combinations of hardware components and applications are innumerable. In addition, the specifications of components follow technology trends. For instance, some new GPU virtualization technologies for PCs, such as Virtu GPU virtualization [11], seamlessly use the discrete GPU when specific APIs are called. Thus, we have to build a resource manager that supports various hardware configurations.
– Supporting various runtime versions
The runtime libraries installed for parallel computing may vary between PCs. Application execution speed depends not only on hardware but also on runtime libraries such as OpenCL frameworks. Thus, a compute device selection algorithm optimized for a specific runtime version, such as one designed for HPC, may not show good results on newer runtime libraries. We have to build compute device selection algorithms that do not depend on a specific runtime version.

This resource manager has three features to satisfy the requirements explained above. The first feature is information gathering: PACUE collects information about how the PC is utilized, such as whether an AC adapter is connected, the temperatures and voltages of components, and the processor utilization level, such as processor loads and the list of running applications. The second feature is user preference inference: the user describes their requirements by creating several requirement patterns, and PACUE infers which pattern best fits the present situation using the information acquired in the first step. The third feature is compute device selection, which decides the OpenCL device to be assigned to each application. We plan to implement a few compute device selection algorithms for several user preference patterns; PACUE will assign compute devices to each application based on the algorithm that matches the inferred pattern of user preference. The resource manager works as a cycle of these steps (a minimal sketch of this loop follows the list):
1. Collect PC utilization information.
2. Guess which profile is best for the present condition.
3. Wait for an inquiry from an application and answer which device should be used.
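The cycle above can be pictured as the following minimal loop; all helper types and functions (collect_utilization, infer_profile, wait_for_inquiry, answer) are hypothetical placeholders for illustration, not PACUE's actual interfaces.

#include <chrono>

struct Utilization { bool ac_connected; double cpu_load; double gpu_load; };
enum class Profile { Default, GameFirst, EncodeFirst };
struct Inquiry { int application_id; };

// Placeholder helpers standing in for PACUE's information gathering,
// preference inference, and IPC with hooked applications.
Utilization collect_utilization();
Profile infer_profile(const Utilization &u);
bool wait_for_inquiry(Inquiry &q, std::chrono::milliseconds timeout);
void answer(const Inquiry &q, Profile p);   // replies with the device to use

void resource_manager_main_loop() {
    for (;;) {
        Utilization u = collect_utilization();            // step 1: gather PC utilization
        Profile p = infer_profile(u);                     // step 2: guess the best profile
        Inquiry q;
        while (wait_for_inquiry(q, std::chrono::milliseconds(100)))
            answer(q, p);                                 // step 3: answer device inquiries
    }
}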
For evaluation purposes, we built a basic resource manager with a communication function to order applications to utilize a specific compute device. Because user-preference-based compute device selection algorithms are not yet implemented, the current PACUE can only select a compute device through manual selection in the resource manager GUI. Still, it can receive an inquiry about compute device selection and answer with a compute device to utilize.

3 Evaluation
In this section we confirm that PACUE provides compute device redirection for widely used applications without modifying them. We first state the policy of the evaluation, then show and analyze the results.

3.1 Evaluation Policy


We evaluate PACUE on a PC with an Intel Core i7-920 CPU and an AMD RADEON HD 4850 GPU. As the OpenCL framework, we adopt the x86 binary of the ATI Stream SDK 2.2 [4]. This framework supports both CPUs and AMD RADEON GPUs as OpenCL devices.
As test applications, we chose the following. They are publicly released and widely used for benchmarking, and thus suit our purpose.
– DirectCompute & OpenCL Benchmark [1]
– SiSoftware Sandra 2011 [15]
– Sample code of the "OpenCL Introduction" book [3]
We switch the device utilized by these applications and compare the device switching methods for each of them.

3.2 Results
DirectCompute & OpenCL Benchmark. Table 2 shows the results. PACUE can redirect the compute device perfectly for DirectCompute & OpenCL Benchmark, but only with method D.

SiSoftware Sandra 2011. Device switching failed. When PACUE tried to switch the device, Sandra 2011 exhibited strange behavior, such as showing the same device twice in the GUI. Because Sandra 2011 is an information and diagnostics utility for PCs, it gathers device information through various APIs. Thus, the failure may be caused by a lack of consistency between the device information gathered through the PACUE-hooked OpenCL API and the information gathered through other APIs. However, PACUE did not make Sandra crash.
Table 2. Result of DirectCompute & OpenCL Benchmark

Override Method: A-1 | A-2 | B-1 | B-2 | C-1 | C-2 | D-1 | D-2
Specified Device Type: CPU | GPU | \ | \ | \ | \ | ALL | ALL
Specified Device ID for Context: \ | \ | CPUs | GPUs | ALL | ALL | ALL | ALL
Specified Device ID for Command Queue: \ | \ | \ | \ | CPU | GPU | CPU | GPU
Application Recognized Devices: CPU*2 | GPU*2 | CPU*1 | GPU*1 | CPU*1 | CPU*1 | CPU*1+GPU*1 | CPU*1+GPU*1
Dynamic Device Switching: Impossible | Impossible | Static | Static | Static | Static | Dynamic | Dynamic

Sample Codes of “OpenCL Introduction” Book. These codes are a set of 20 sample applications of the OpenCL APIs. The device switching succeeded for all of them. However, one sample uses device memory information to optimize its array size, so the result might depend on the device. The completely camouflaged device information might thus be incompatible with the information the sample expects. This could cause crashes or errors; however, the sample appeared to work correctly during the experiment.

3.3 Analysis
The results show that PACUE can switch the compute devices in real applications. However, it fails for device-dependent applications. These use detailed information about the particular device, such as the device memory size, and thus may crash or behave strangely because of the information camouflaged by PACUE.
Among the combinations of device information overriding, we found a proper order in which to apply them to applications. As shown in Table 1, these methods present a trade-off between application stability and application compatibility. In our evaluation, we found that the complete camouflaging method significantly increases application compatibility for real applications such as DirectCompute & OpenCL Benchmark. However, it is realized by giving applications the information of the device the application specified, instead of the information of the device actually being used. The original application creator is the only one who knows whether the application works correctly under the complete camouflaging method, so we should avoid using this risky method if possible. In general, we suggest applying the methods in the following order:
1. Override the device type to ALL and override the device ID when creating the context. (Table 1, B)
2. Override the device type to ALL and override the device ID when creating the command queue. (Table 1, D)
3. Keep the original device type and override the device ID when creating the command queue. (Table 1, C)
4. Override the device type to CPU or GPU when the application requests the list of available devices. (Table 1, A)
The first to third methods all realize dynamic device selection; the earlier ones are safer, the later ones more compatible. Applications that cannot switch devices with the first method should use the second or the third method. The last method has the highest compatibility, but it only provides static and restrictive device switching. Thus, it should be applied only when all other methods fail.
4 Conclusions and Future Work


In this paper we presented PACUE. First, PACUE switches compute devices dynamically for applications on PCs with heterogeneous processors. Second, PACUE chooses the compute devices assigned to applications to meet the user's requirements. We conducted experiments with our implementation and demonstrated that one out of two real OpenCL applications, and all of 20 sample programs, can change the compute device dynamically with the dynamic compute device redirector. In addition, we showed that a few device information camouflaging methods significantly increase application compatibility. With this work, we demonstrated the potential of dynamic compute device redirection without application modification. However, PACUE has two technical disadvantages. The first is that PACUE can switch devices only when creating a command queue: since there is no support for dynamic device switching in OpenCL, the opportunities for switching devices are limited. We will investigate other methods to expand these opportunities, and we will also investigate how frequently device switching opportunities occur at other APIs. The second disadvantage concerns OpenCL kernel optimization. Because of device information camouflaging, kernels designed for other devices may be executed. This can decrease performance significantly, so such situations should be avoided. One answer is to cache every variant of the kernel source code via API hooking and switch it according to the device actually in use. Another answer is to apply just-in-time OpenCL code optimization techniques to improve performance. However, both of these may conflict with copyright law or the licenses of the applications, so it may be difficult to apply them to PC applications. For this reason, we continue improving the camouflaging methods, and we will avoid showing applications different device information as much as possible.
For our research goals, we have the following ongoing work:

Increase Compatibility for Applications. We will address the problem that PACUE cannot switch compute devices in some applications. We will also conduct application stability tests.

Evaluate in Many Hardware Environments. We will conduct experiments on more hardware configurations, such as Virtu, and improve PACUE's hardware support.

Implement the User Preferences Handler in the Resource Manager. We assume that there are several patterns describing predefined user requirements (e.g., playing an important game on the AC adapter, or hasty file compression combined with unremarkable video encoding). PACUE infers the matching pattern from the user's activity and resource utilization.

Implement the Compute Device Selecting Algorithm. With user requirement recognition, we select compute devices to follow the user's preference accurately. We will implement several algorithms and parameter sets for each user requirement pattern. We will also explore the performance impact of redirecting compute devices in real applications and take measures against heavy performance degradation. Showing applications no OpenCL device at all by overriding the OpenCL APIs can be one of the answers: in this case, applications will use their internal optimized assembly to execute their transactions, which is often much faster than executing OpenCL code on CPUs. However, this has the disadvantage that the compute device cannot change until the application restarts, because the application will never call the OpenCL APIs again. Therefore, we will investigate each application's behavior concretely to decide how to let applications use CPUs.

Support for Other Parallel Computing Frameworks. We plan to implement modules for other APIs such as the Fusion System Architecture Intermediate Layer Language (FSAIL).

References
1. DirectCompute & OpenCL Benchmark, http://www.ngohq.com/graphic-cards/
16920-directcompute-and-opencl-benchmark.html (accessed on August
21, 2011)
2. OpenCL 1.1 Specification,
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
3. Fixtars Corporation: OpenCL Introduction - Parallel Programming for Multicore CPUs and
GPUs. Impress Japan (January 2010) (in Japanese)
4. AMD. ATI Stream Technology,
http://www.amd.com/US/PRODUCTS/TECHNOLOGIES/
STREAM-TECHNOLOGY/Pages/stream-technology.aspx (accessed on Au-
gust 21, 2011)
5. Aoki, R., Oikawa, S., Tsuchiyama, R., Nakamura, T.: Hybrid opencl: Connecting different
opencl implementations over network. In: Proc. IEEE CIT 2010, pp. 2729–2735 (2010)
6. Brodman, J.C., Fraguela, B.B., Garzarán, M.J., Padua, D.: New abstractions for data parallel
programming. In: Proc. USENIX HotPar, p. 16 (2009)
7. Diamos, G.F., Yalamanchili, S.: Harmony: an execution model and runtime for heteroge-
neous many core systems. In: Proc. ACM HPDC, pp. 197–200 (2008)
8. Gupta, V., Schwan, K., Tolia, N., Talwar, V., Ranganathan, P.: Pegasus: Coordinated Schedul-
ing for Virtualized Accelerator-based Systems. In: Proc. USENIX ATC, pp. 31–44 (2011)
9. Kato, S., Lakshmanan, K., Rajkumar, R., Ishikawa, Y.: TimeGraph: GPU Scheduling for
Real-Time Multi-Tasking Environments. In: Proc. USENIX ATC, pp. 17–30 (2011)
10. Liu, W., Lewis, B., Zhou, X., Chen, H., Gao, Y., Yan, S., Luo, S., Saha, B.: A balanced pro-
gramming model for emerging heterogeneous multicore systems. In: Proc. USENIX HotPar,
p. 3 (2010)
11. Lucidlogix. Lucidlogix virtu,
http://www.lucidlogix.com/product-virtu.html (accessed on August 21,
2011)
12. Microsoft. CreateRemoteThread Function (Windows),
http://msdn.microsoft.com/en-us/library/ms682437.aspx (accessed
on August 21, 2011)
13. Microsoft. SetWindowsHookEx Function (Windows),
http://msdn.microsoft.com/en-us/library/ms644990.aspx (accessed
on August 21, 2011)
14. Microsoft Research. Detours - microsoft research,
http://research.microsoft.com/en-us/projects/detours/ (accessed
on August 21, 2011)
15. SiSoftware. Sisoftware zone, http://www.sisoftware.net/ (accessed on August
21, 2011)
Workload Balancing on Heterogeneous Systems:
A Case Study of Sparse Grid Interpolation

Alin Muraraşu, Josef Weidendorfer, and Arndt Bode

Technische Universität München


{murarasu,weidendo,bode}@in.tum.de

Abstract. Multi-core parallelism and accelerators are becoming common features of today's computer systems, as they allow for computa-
tional power without sacrificing energy efficiency. Due to heterogeneity,
tuning for each type of compute unit and adequate load balancing is
essential. This paper proposes static and dynamic solutions for load bal-
ancing in the context of an application for visualizing high-dimensional
simulation data. The application relies on the sparse grid technique for
data compression. Its performance critical part is the interpolation rou-
tine used for decompression. Results show that our load balancing scheme
allows for an efficient acceleration of interpolation on heterogeneous sys-
tems containing multi-core CPUs and GPUs.

1 Introduction
Heterogeneous systems containing CPUs and accelerators allow us to reach
higher computational speeds while keeping power consumption at acceptable
levels. The most common accelerators nowadays, GPUs, are very different com-
pared to state-of-the-art general-purpose CPUs. While CPUs incorporate large
caches and complex logic for out-of-order execution, branch prediction, and spec-
ulation, GPUs contain significantly more floating point units. They have in-order
cores which hide pipeline stalls through interleaved multithreading, e.g. allow-
ing up to 1536 concurrent threads per core1 . Garland et al. [1] refer to CPUs
as latency oriented processors with complex techniques used for extracting In-
struction Level Parallelism (ILP) from sequential programs. In contrast, GPUs
are throughput oriented, containing a large number of cores (e.g. 16) with wide
SIMD units (e.g. 32 lanes), making them ideal architectures for vectorizable
codes. All applications can be run on CPUs but only a subset can be ported to
or deliver good performance on GPUs, making them special purpose processors.
In the following, we refer to GPUs and CPUs as processors, but of different type.
To support all kinds of heterogeneous systems in a portable way, we need to
make sure that even for GPU-friendly code parts, there is a fallback to execute
on CPU, as we also want to best exploit systems with powerful CPU parts. For
that, multiple code versions of the same function have to be provided. For multi-
core CPUs, OpenMP [2] is the de facto programming model. Nvidia GPUs on
1 In Nvidia terminology a core is called Streaming Multi-Processor.


the other hand are best programmed using CUDA [3]. OpenCL [4] targets both
CPUs and GPUs. Still, for optimal performance, multiple versions are essential
to target the different hardware characteristics. Another crucial part for efficient
programming of heterogeneous systems is adequate workload distributing.
The main contribution of this paper consists of proposed solutions for load
balancing in the context of the decompression of high-dimensional data com-
pressed using the sparse grid technique [5]. This technique allows for an efficient
storage of high-dimensional functions. Sparse grid interpolation (or decompres-
sion) is the performance critical part. For realizing load balancing, we employ a
dynamic strategy in which the computation is decomposed at runtime into tasks
of a given size (the grain size) which are grabbed for execution by the CPU
and the GPU. We compare this strategy to a static approach, where the load
distribution is done at the beginning of the computation, according to the com-
putational power of the heterogeneous components. By this, we show that our
interpolation runs efficiently on heterogeneous systems. To the best of our knowl-
edge, this is the first implementation of sparse grid interpolation that optimally
combines code tuned for multi-core CPUs and Nvidia GPUs.

2 Related Work
Our work is complementary to the one described in [6]. There, space and time
efficient algorithms for the sparse grid technique are proposed. We use these
algorithms as basis for our implementation of sparse grid interpolation for CPU
and GPU. It is worth mentioning that in [6] the focus is on porting the sparse grid
technique to GPUs. While the GPU code is executed, the CPUs are idle. Instead
our goal is to avoid having idle processors and to further improve performance.
Similar to our approach, MAGMA [7] exploits heterogeneous systems by pro-
viding efficient routines for linear algebra. StarPU [8] is a framework that simpli-
fies the programming of heterogeneous systems. Programs are decomposed into
StarPU tasks (bundles of multi-version functions for every processor type) with
according task dependencies, and automatically mapped to available processors
(CPU / GPU). StarPU implements a distributed shared memory (DSM) over
the CPU and the GPU memory via software controlled coherence. This allows
for automatic data transfers to / from the GPU memory. Parameters exposed
by StarPU to programmers are e.g. task size, task priority, and schedulers.

3 Optimizing Programs for Heterogeneous Computing


Programming the CPU and the GPU is inherently different. Multi-core CPUs
are programmed using threads through pthreads or OpenMP. For GPU program-
ming, CUDA is also based on threads, but there are differences. For synchroniza-
tion, CUDA only provides barriers within thread groups running on the same
GPU core, and atomic operations. For performance, the architectural details of
GPUs have to be considered. Maximizing the number of threads running con-
currently on the GPU, coalescing accesses to global memory, eliminating bank
conflicts, minimizing the number of branches, and utilizing the various memories
appropriately (global, shared, texture, constant) are important GPU optimiza-
tions. In contrast, CPU optimizations include cache blocking and vectorization.
When programming heterogeneous systems with CPUs and GPUs, we can
use an off-loading approach, as used in systems with co-processors for specific
tasks. We determine a mapping between each function and the type of processor
on which its execution time is minimal. As each function is executed by one
type of processor, there is a risk for idle compute resources2. The solution is to
move from off-loading to full function distribution. For this, we provide multi-
version functions. We design them such that the CPU and the GPU cooperate
for computing each function. Since this approach allows for a full utilization of
a heterogeneous system, we focus on it in the rest of the paper.
Multiple versions of the same function must be orchestrated by an upper layer
responsible for balancing the workload, either statically or dynamically. A static
approach distributes the workload according to the computational speed of the
processors. An initially determined distribution does not change during the ex-
ecution of the function. In contrast, dynamic load balancing allows for changing
the workload distribution after the computation has been started. It can be
triggered by overloaded (sender initiated) or underloaded (receiver initiated) re-
sources, can be executed in centralized or decentralized manner, and results in
direct rebalancing (e.g. work stealing) or in repartitioning the data mapped to
compute resources for the next iteration of the computation on that data. [9]
provides a good overview of dynamic load balancing strategies. A typical dy-
namic strategy is receiver initiated load balancing of pieces of work which are
not pre-mapped to given compute resources, but only distributed shortly before
execution (also known as self-scheduling). This is also found in the OpenMP
dynamic scheduling strategy for parallel for-loops. We call this the dynamic task
based approach. The computation is decomposed into tasks which are inserted
into a global queue. From there, the tasks are extracted by worker threads. Of-
ten, the tasks have dependencies, making the extraction more time-consuming.
Variations use multiple queues or scheduling strategies based on work stealing,
on greedy algorithms or algorithms that predict distribution costs. For hetero-
geneous systems, the worker threads invoke according versions of a function on
the CPU or the GPU.
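For instance, the self-scheduling scheme just described can be realized with a shared atomic counter from which CPU worker threads and one GPU-driving thread grab chunks of the iteration space. The following is only a schematic illustration (the chunk-processing functions are placeholders), not the exact scheduler evaluated later in the paper.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <omp.h>

// Placeholder multi-version functions: a tuned CPU kernel, and a wrapper that
// copies the chunk to the GPU, launches the kernel, and copies the results back.
void process_chunk_cpu(std::size_t first, std::size_t count);
void process_chunk_gpu(std::size_t first, std::size_t count);

void run_dynamic(std::size_t n_items, std::size_t grain) {
    std::atomic<std::size_t> next(0);                  // the "task queue" is just a counter
    int n_workers = omp_get_max_threads();             // one worker per CPU core
    #pragma omp parallel num_threads(n_workers)
    {
        // Dedicate the last worker thread to feeding the GPU.
        bool drives_gpu = (omp_get_thread_num() == n_workers - 1);
        for (;;) {
            std::size_t first = next.fetch_add(grain);  // first-come first-served
            if (first >= n_items) break;
            std::size_t count = std::min(grain, n_items - first);
            if (drives_gpu) process_chunk_gpu(first, count);
            else            process_chunk_cpu(first, count);
        }
    }
}

The grain parameter here is exactly the grain size discussed in the remainder of this section.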
While the dynamic task based approach adapts implicitly to different ma-
chines, different input parameters, and external system load, there is an overhead
for task queue management and distribution. Especially, the task size, called
grain size in the following, influences that overhead. If it is too large, load bal-
ancing may not be achievable. If it is too small, the overhead may dominate and
destroy any speedup. In contrast, the overhead of static balancing is minimal.
Obviously, there is no grain size problem, but it has to adapt to function input
parameters and machine type. If the workload depends not only on parameters
such as data size, but on data values, static balancing is not feasible.

2 Note that our objective is minimal execution time, not minimal energy consumption.
Fig. 1. Grain size impact. D/L/N = 6/12/5 × 10^5 (left), 20/6/3 × 10^6 (right)

We now focus on the importance of the grain size in the dynamic task based
approach. In addition to the previous general remarks, a highly tuned CPU
version of a function performs the best for a task size that matches or is a multiple
of the tile size used for cache blocking. On the GPU, the task size should match
or be a multiple of the maximum number of active threads. This would ensure
full utilization of the GPU cores, of the SIMD units, and of multithreading.
For sparse grid interpolation, we developed a corresponding first-come first-served scheduler strategy using OpenMP and CUDA (OMP + CUDA). Moreover, we
implemented our application with StarPU, using various schedulers available
there. Fig. 1 shows the performance of interpolation for different grain sizes with
different input parameters: number of dimensions (D), refinement level (L), and
number of interpolations (N). The measurements are done using a Quad-core
Nehalem and an Nvidia GTX480. Note that the optimal grain size depends on
these parameters, especially for StarPU eager and our OMP + CUDA scheduler.
The dmda scheduler assigns tasks based on a performance model that considers
execution history and PCIe transfer overheads. For more details we refer to [8].

4 Sparse Grid Interpolation


Our application is the visualization of compressed, high-dimensional data result-
ing from simulations [10]. Decompression is in our case a form of interpolation
based on the sparse grid technique described in [5]. Fig. 2 depicts an example
of 5d data, i.e. velocity field, obtained from simulating the lid driven cavity for
different Reynolds numbers (Re). The velocity of cavity’s upper wall can also be
transformed into a parameter, making this a 6d problem. For a high number of
dimensions, managing the data can pose serious challenges. Therefore, we com-
press the data using the sparse grid technique in order to reduce its size and we
decompress it afterwards for real-time visualization. This technique also enables
us to interpolate at points for which we do not have values from simulation.
Hence, it can provide hints on the simulation outside the initial data.
Fig. 2. 5d (x, y, z, t, Re) data from a CFD simulation

Sparse grid interpolation has 5 input parameters: the number of dimensions (D), the refinement level (L), the number of interpolations (N), the precision
(P) (single or double precision), and the adaptivity (A) (adaptive or regular).
In this paper we concentrate on the first 3, these being the most important as
they can take a wide range of values. Fig. 3 (left) shows that a sparse grid can
be represented as a sequence of regular grids [6]. Using this storage scheme,
we can explain the interpolation and the impact of the inputs on performance.
Interpolating (Fig. 3 (right)) at a given D-dimensional point means traversing
the set of regular grids and computing the contribution of each regular grid on
the result. For each regular grid a D-linear basis function (O(D)) is built and
evaluated at the point. Interpolating at one point uses exactly one value from
each regular grid for scaling the basis function.
D increases the computational intensity, i.e. the ratio between the on-chip
computation time and off-chip communication time. On GPU, a large D causes
an increased consumption of shared memory per thread reducing the benefits of
multithreading. A large L decreases the computational intensity since the size of the regular grids increases exponentially, i.e. from 2^0 to 2^{L-1}. We can see this in Fig. 3 (left) for L = 3 (regular grids of sizes between 2^0 and 2^3). As only one
regular grid value is used per interpolation, only a small percentage of the com-
pressed data transferred over PCIe to the GPU is actually used for computation.
N is proportional to the computational intensity, i.e the more interpolations we
perform, the more worthwhile is the data transfer over PCIe.
Our versions of interpolation are based on the iterative algorithm from [6].
The CPU version is optimized for best use of caches and vector units. Our GPU
implementation includes the following optimizations: coalesced memory accesses,
use of shared memory, no bank conflicts, etc. Having these two versions of inter-
polation, we combine them so that all the processors in a heterogeneous system
simultaneously work on interpolation. In general, on the systems where we mea-
sured the performance of interpolation, the GPU was faster than the CPU. But,
since our goal is performance portability, it makes sense to consider the situa-
tion in which the GPU is not faster than the multi-core CPUs available in the
system. This can be the case for instance with Intel’s Sandy Bridge processors
which have a SIMD unit [11] (256 bit AVX) twice as wide as the previous gen-
eration, Nehalem (128 bit SSE). The parallelization of sparse grid interpolation
is based on distributing the points for interpolation among threads.
Fig. 3. Left: 2d sparse grid decomposed as a sequence of regular grids. Group l (l = 0 . . . 3) contains C^l_{D+l-1} regular grids of size 2^l. D expands the groups horizontally while L expands them vertically. Right: simplified interpolation.

5 Interpolation and Heterogeneous Computing


Having two optimized versions for CPUs and GPUs, we want to interpolate si-
multaneously on all the processors of a heterogeneous system. For this, workload
balancing is essential. This section details our approaches for load balancing.

5.1 Dynamic Task Based Load Balancing


Dynamic load balancing offers a natural way to allow the fastest processor to
grab a number of tasks proportional with its speed. But, failing to determine the
optimal task can seriously reduce the performance. For maximum performance,
we treat the grain size as a tunable parameter. Finding its optimal value can be
difficult when it is influenced by the input parameters of the application (Fig. 1).
This is the case with sparse grid interpolation.
Each combination of values for the inputs can determine a different optimal
value for the grain size. This complicates the process of tuning this parameter.
The 3d space determined by D, L, and N (or 5d if we add P and A) can make the
search for the optimal grain size very time-consuming or even impractical. To
reduce the time spent by the search we use a performance model that returns in
an acceptable amount of time an approximation of the execution time for each
combination of values for the inputs. Our model is based on the following system
of linear equations:

T'_cpu(w) = n_cpu · t_cpu(w)    (1)
T'_gpu(w) = n_gpu · t_gpu(w) + t_pcie    (2)
(c_cpu · n_cpu + c_gpu · n_gpu) · w = N    (3)
T'_cpu(w) = T'_gpu(w)    (4)
Fig. 4. Left: Execution time on the CPU as a function of workload. Dependence is linear. Right: Execution time on the GPU as a function of workload. The steps result from: large number of cores, wide SIMD units, and multithreading.

 
T'_cpu(w), T'_gpu(w), n_cpu, and n_gpu are the unknowns. The first equation builds the approximation T'_cpu of the execution time on the CPU, T_cpu, as the product between the number of tasks grabbed by a worker thread (n_cpu) and the duration of a task as a function of workload (t_cpu(w)); here the workload w of a task is the number of points at which it interpolates. Similarly, the approximation of the execution time on the GPU, T'_gpu, is the sum of the duration of all tasks executed on the GPU (n_gpu · t_gpu(w)) and the one-time overhead (t_pcie) caused by transferring the compressed data over PCIe. The third equation means that the total workload equals the sum of the workload handled by CPUs and the workload handled by GPUs: c_cpu is the number of CPU cores (CPU worker threads) and n_cpu is the number of tasks (each of w interpolations) allocated to a core; c_gpu is the number of GPUs and n_gpu is the number of such tasks per GPU. Finally, the fourth equation expresses that the CPU and the GPU finish at the same time.
We now have to find good approximations (linear or piecewise) for the t_cpu(w) and t_gpu(w) functions depicted in Fig. 4. These can be considered cheap operations, since the definition domain of these functions is relatively small, i.e. from 1 to 35000, compared to the common values for N, i.e. 10^6 or more. The approximations are computed once for each combination of values for D and L. We can subsequently reuse these functions for determining the total execution time, T'_cpu(w) or T'_gpu(w), for any value of N. It is worth mentioning that in the case of the CPU, for D/L/N = 6/12/5 × 10^5, the optimal performance is reached for a grain size of 4096. At the opposite end, a grain size of 1 makes the execution up to 6 times slower. The optimal grain size changes with the input parameters, i.e. for D/L/N = 10/10/5 × 10^5 it is 1024. Now it is trivial to discover the optimal grain size, g, that minimizes T'_cpu(w). Note that without our optimization we would have to search for the grain size that minimizes the execution time for each tuple (D, L, N) we get as input. This means that for every value of the grain size considered in the search we would interpolate at a potentially large set of points (e.g. 3 × 10^6), which can be very time-consuming for a large D or L.
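Under these assumptions, the model (1)-(4) can be evaluated cheaply for each candidate grain size. The sketch below shows one way to do this with the fitted per-task durations; t_cpu_fit, t_gpu_fit, and t_pcie are hypothetical calibration outputs, and the code is an illustration, not the authors' implementation.

#include <cstddef>
#include <functional>
#include <limits>

// Fitted per-task durations (from the calibration runs) and the PCIe overhead.
using Fit = std::function<double(std::size_t)>;

// Evaluate the model (1)-(4) for one grain size w and return the predicted
// common finish time; c_cpu / c_gpu are the numbers of CPU workers / GPUs.
double predicted_time(std::size_t w, std::size_t N, int c_cpu, int c_gpu,
                      const Fit &t_cpu_fit, const Fit &t_gpu_fit, double t_pcie) {
    double a = t_cpu_fit(w), b = t_gpu_fit(w);
    // From (3) and (4): n_gpu = (a*N/w - c_cpu*t_pcie) / (c_cpu*b + c_gpu*a).
    double n_gpu = (a * double(N) / double(w) - c_cpu * t_pcie)
                   / (c_cpu * b + c_gpu * a);
    if (n_gpu <= 0.0)                                   // transfer overhead dominates: CPU only
        return double(N) / (c_cpu * double(w)) * a;
    return n_gpu * b + t_pcie;                          // equals n_cpu * a by (4)
}

// Search the (small) grain-size domain for the value minimizing the model time.
std::size_t best_grain(std::size_t N, int c_cpu, int c_gpu,
                       const Fit &t_cpu_fit, const Fit &t_gpu_fit, double t_pcie,
                       std::size_t w_max = 35000) {
    std::size_t best = 1;
    double best_t = std::numeric_limits<double>::max();
    for (std::size_t w = 1; w <= w_max; ++w) {
        double t = predicted_time(w, N, c_cpu, c_gpu, t_cpu_fit, t_gpu_fit, t_pcie);
        if (t < best_t) { best_t = t; best = w; }
    }
    return best;
}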
5.2 Static Load Balancing


Static workload balancing eliminates the problems of dynamic workload balancing. Our approach now is to decompose the workload into two partitions. The partitions have sizes proportional to the computational speeds of the CPU and the GPU, i.e. inversely proportional to the execution times. As explained above, the inputs of sparse grid interpolation have a great impact on performance. Hence, they cannot be ignored when determining the speed of the processors.
It is easier to present our approach for static balancing if we consider the execution time functions on the CPU and the GPU as functions of 3 parameters: D, L, and N. To simplify the notation, let us consider that D and L are fixed. We thus have the functions T_cpu(w) and T_gpu(w) that approximate the execution times on the CPU and the GPU, and take as parameter the number of interpolations. Fig. 4 depicts these 2 functions for various values of the inputs. Statically solving the workload balancing problem for a given N means finding the value f of w that minimizes max(T_cpu(w), T_gpu(N - w)). If we approximate T_cpu and T_gpu with two linear functions T'_cpu and T'_gpu (Fig. 4), then it is trivial to find f in O(1), since it is equivalent to intersecting two linear functions. Even for more advanced approximations, determining f can be achieved in linear time.
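With linear fits T'_cpu(w) = a_c·w + b_c and T'_gpu(w) = a_g·w + b_g (where b_g contains the one-time PCIe transfer overhead), the intersection can be computed directly, as in the following sketch. The coefficients are assumed to come from the calibration measurements described next; this is an illustration, not the authors' code.

#include <algorithm>
#include <cstddef>

// Static split: give f points to the CPU and N - f to the GPU so that the two
// linear execution-time models T'_cpu(f) = a_c*f + b_c and
// T'_gpu(N - f) = a_g*(N - f) + b_g finish at the same time.
std::size_t static_split(double a_c, double b_c, double a_g, double b_g,
                         std::size_t N) {
    // a_c*f + b_c = a_g*(N - f) + b_g  =>  f = (a_g*N + b_g - b_c) / (a_c + a_g)
    double f = (a_g * double(N) + b_g - b_c) / (a_c + a_g);
    f = std::max(0.0, std::min(f, double(N)));   // clamp: one device may take everything
    return static_cast<std::size_t>(f + 0.5);
}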
To achieve efficient static balancing, our goal is to determine the execution
time functions as accurate and fast as possible. Consequently, the problem must
be reduced to a size that allows us to build the approximations in a minimum
amount of time. To obtain accuracy, the reduced problem has to provide results
that expose a global behavior, i.e. they are applicable to larger problems. Note
that the search for f must be performed for each pair (D, L) so we can consider
it as the nest of two loops iterating over a range for D and a range for L.
On the CPU, approximating the execution time is straight-forward since the
maximum speed is reached for a relatively small number of interpolation points,
leading to the linear behavior visible in Fig. 4 (left). In contrast, on the GPU
the large number of active threads (approximately 23040) creates the stepping
effect from Fig. 4 (right). For an accurate approximation of the execution time
on the GPU, we consider two points: the execution time for N = 1 and the
execution time for N = maximum number of active threads + 1. Both measure-
ments include the initial transfer of the compressed data. This ensures a proper
approximation that covers the main characteristics of the GPU: the overhead
generated by transferring the compressed data to the GPU over PCIe, the high
throughput character of the GPU expressed through a large number of SIMD
units, and multithreading on the GPU that can improve the performance.

6 Evaluation
We now describe our experimental setup and results. The tested hardware is:
– a system containing a Quad-core Intel Nehalem i7-920 (2.67 GHz) and an
Nvidia GTX480 (1.4 GHz, 15 cores, 32-lane SIMD)
– a system with 8 Intel Xeon L5630 cores (2.13 GHz) arranged in two sockets
and an Nvidia Tesla X2500 (1.15 GHz, 14 cores, 32-lane SIMD).
[Figure: GFlops rate versus number of dimensions (2 to 20) for the StarPU dmda, StarPU eager, OMP + CUDA, Static, and Max variants; see the caption below.]

Fig. 5. Left: GFlops rate on 2 × Intel Xeon Quad-core + Nvidia Tesla x2050. Right:
GFlops rate on Nehalem Quad-core + Nvidia GTX480

Our application is compiled using gcc 4.4 and nvcc 3.2.


Regarding the problem size, in each run of our application we perform 3 × 10^6 interpolations. The number of dimensions, D, is in the range from 1 to 20
while the refinement level, L is 6. The dynamic approach is implemented using
a combination of StarPU and CUDA and a mix of OpenMP and CUDA. From
StarPU we only use the fastest 2 schedulers for our application: eager and dmda.
The optimal grain size was determined for each value of D both through brute
search and through our optimized search described in Sec. 5. Both searches re-
turned similar optimal grain sizes decreasing from 44000 to 7500, for D between
1 and 20 respectively. These numbers follow to some extent the maximum num-
ber of active threads (on the GPU) for D in the range from 1 to 20. Remember
that increasing D causes the decrease of the number of active threads. It is worth
mentioning that setting the optimal grain size as the maximum number of active
threads cannot provide performance portability. On our heterogeneous systems,
interpolating on GPU is between 4 and 8 times faster than interpolating on
CPU. It is likely that on other systems where the CPU is faster than the GPU,
the optimal grain size does not match the maximum number of active threads
but instead has a value that permits for the best exploitation of CPU caches.
We can see in Fig. 5 that static workload balancing delivers better performance than the dynamic approach: it is up to 25% faster than the dynamic ver-
sion. We attribute this difference to the latency overhead resulting from invoking
a significantly larger number of copies to / from the GPU and a larger number
of launches of our CUDA program. The amount of transferred data is the same
in both approaches but in the static one, only one transfer is necessary.
The max line is a plot of the sum of the GFlops rates of the CPU, GFlops_cpu, and of the GPU, GFlops_gpu. To obtain GFlops_cpu we run only the CPU version of interpolation. Similarly, we compute GFlops_gpu by executing only on the GPU. Note that the line for the static approach is very close to the max line. More exactly, our static approach reaches up to 98% efficiency, defined as E = GFlops_static / (GFlops_cpu + GFlops_gpu). This confirms that the linear
approximations from the static approach are sufficiently accurate.
7 Conclusion

In this paper we addressed the workload balancing problem on systems with CPUs and GPUs in the context of sparse grid interpolation. We described static
and dynamic task based approaches for load balancing. We showed that input
parameters strongly influence the performance of interpolation and the opti-
mal values for load balancing parameters. One such parameter of the dynamic
approach is the grain size that can severely reduce the performance on hetero-
geneous systems. We presented a performance model that helps us to determine
the optimal value of the grain size in an acceptable amount of time. Our static
approach also enables us to cope with the grain size problem and is built around
linear approximations of the execution times on CPU and GPU as functions of
workload. Results show that for interpolation, static balancing delivers up to
25% more performance than the dynamic task based strategy.

Acknowledgement. This publication is based on work supported by Award No. UK-C0020, made by King Abdullah University of Science and Technology
(KAUST).

References
1. Garland, M., Kirk, D.B.: Understanding Throughput-oriented Architectures. Com-
mun. ACM 53, 58–66 (2010)
2. OpenMP Application Programming Interface (2008)
3. NVIDIA. CUDA Programming Guide 4.0 (2011)
4. Khronos. The OpenCL Specification 1.1 (2010)
5. Bungartz, H.-J., Griebel, M.: Sparse Grids. Acta Numerica 13(-1), 147–269 (2004)
6. Murarasu, A.F., Weidendorfer, J., Buse, G., Butnaru, D., Pflüger, D.: Com-
pact Data Structure and Scalable Algorithms for the Sparse Grid Technique. In:
PPOPP, pp. 25–34 (2011)
7. MAGMA, Matrix Algebra on GPU and Multicore Architectures,
http://icl.cs.utk.edu/magma/index.html
8. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.-A.: StarPU: A Unified
Platform for Task Scheduling on Heterogeneous Multicore Architectures. In: Sips,
H., Epema, D., Lin, H.-X. (eds.) Euro-Par 2009. LNCS, vol. 5704, pp. 863–874.
Springer, Heidelberg (2009)
9. Osman, A., Ammar, H.: Dynamic Load Balancing Strategies for Parallel Comput-
ers. In: ISPDC, Romania (July 2002)
10. Butnaru, D., Pflüger, D., Bungartz, H.-J.: Towards High-Dimensional Computa-
tional Steering of Precomputed Simulation Data using Sparse Grids. Procedia CS 4,
56–65 (2011)
11. Intel. Intel Advanced Vector Extensions Programming Reference (2011)
Performance Evaluation of a Multi-GPU
Enabled Finite Element Method for
Computational Electromagnetics

Tristan Cabel, Joseph Charles, and Stéphane Lanteri

INRIA Sophia Antipolis-Méditerranée Research Center, Nachos project-team


06902 Sophia Antipolis Cedex, France
Stephane.Lanteri@inria.fr

Abstract. We study the performance of a multi-GPU enabled numerical methodology for the simulation of electromagnetic wave propaga-
tion in complex domains and heterogeneous media. For this purpose,
the system of time-domain Maxwell equations is discretized by a discon-
tinuous finite element method which is formulated on an unstructured
tetrahedral mesh and which relies on a high order interpolation of the
electromagnetic field components within a mesh element. The resulting
numerical methodology is adapted to parallel computing on a cluster of
GPU acceleration cards by adopting a hybrid strategy which combines
a coarse grain SPMD programming model for inter-GPU parallelization
and a fine grain SIMD programming model for intra-GPU paralleliza-
tion. The performance improvement resulting from this multiple-GPU
algorithmic adaptation is demonstrated through three-dimensional sim-
ulations of the propagation of an electromagnetic wave in the head of a
mobile phone user.

1 Introduction

Efforts to exploit GPUs for non-graphical applications have been underway since 2003, and GPUs have evolved into programmable and massively parallel computational units with very high memory bandwidth. From that time to the present day, a review of research works aiming at harnessing GPUs for the acceleration of sci-
entific computing applications would hardly fit into one page. In particular, the
development of GPU enabled high order numerical methods for the solution of
partial differential equations is a rapidly growing field. Focusing on contributions
that are dealing with wave propagation problems, GPUs have been considered
for the first time for computational electromagnetics and computational geoseis-
mics applications respectively by Klöckner et al. [3] and by Komatitsch et al.
[5]-[4]. The present work shares several concerns with [3] which describes the de-
velopment of a GPU enabled discontinuous Galerkin (DG) method formulated
on an unstructured tetrahedral mesh for the discretization of hyperbolic systems
of conservation laws. As it is the case with the DG method considered in [3], the
approximation of the unknown field in a tetrahedron relies on a high order nodal


interpolation method which is a key feature in view of exploiting the processing capabilities of a GPU architecture. A recent evolution of the work described in
[3] is presented in Gödel et al. [2] where the authors discuss the adaptation of a
multirate time stepping based DG method for solving the time-domain Maxwell
equations on a multiple GPU system. Here, we study the performance of a multi-
GPU enabled numerical methodology for the simulation of electromagnetic wave
propagation.

2 The Physical Problem and Its Numerical Treatment


We consider the Maxwell equations in three space dimensions for heterogeneous
linear isotropic media. The electric field E(x, t) = ^t(E_x, E_y, E_z) and the magnetic field H(x, t) = ^t(H_x, H_y, H_z) verify:

ε ∂_t E − curl H = −J,   μ ∂_t H + curl E = 0,   (1)

where the symbol ∂t denotes a time derivative and J (x, t) is a current source
term. These equations are set on a bounded polyhedral domain Ω of R3 . The
electric permittivity ε(x) and the magnetic permeability μ(x) coefficients are varying in space, time-invariant and both positive functions. The current source term J is the sum of the conductive current J_σ = σE (where σ(x) denotes the
electric conductivity of the media) and of an applied current J s associated to a
localized source for the incident electromagnetic field. Our goal is to solve system
(1) in a domain Ω with boundary ∂Ω = Γ_a ∪ Γ_m, where we impose the following boundary conditions: n × E = 0 on Γ_m, and L(E, H) = L(E^inc, H^inc) on Γ_a, where L(E, H) = n × E − √(μ/ε) n × (H × n). Here n denotes the unit outward normal to ∂Ω and (E^inc, H^inc) is a given incident field. The first boundary
condition is called metallic (referring to a perfectly conducting surface) while
the second condition is called absorbing and takes here the form of the Silver-
Müller condition which is a first order approximation of the exact absorbing
boundary condition. This absorbing condition is applied on Γa which represents
an artificial truncation of the computational domain.
For the numerical treatment of system (1), the domain Ω is triangulated into
a set Th of tetrahedra τi . We denote by Vi the set of indices of the elements which
are neighbors of τi (i.e. sharing a face). In the following, to simplify the presenta-
tion, we set J = 0. For a given partition Th , we seek approximate solutions to (1)
in the finite element space Vpi (Th ) = {v ∈ L2 (Ω)3 : v |τi ∈ (Ppi [τi ])3 , ∀τi ∈ Th }
where Ppi [τi ] denotes the space of nodal polynomial functions of degree at
most pi inside τi . Following the discontinuous Galerkin approach, the electric
and magnetic fields (Ei , Hi ) are locally approximated as combinations of lin-
early independent basis vector fields ϕij . Let Pi = span(ϕij , 1 ≤ j ≤ di )
where di denotes the number of degrees of freedom inside τi . The approximate
fields (Eh , Hh ), defined by (∀i, Eh|τi = Ei , Hh|τi = Hi ), are thus allowed to be
completely discontinuous across element boundaries. For such a discontinuous

field Uh , we define its average {Uh }ik through any internal interface aik , as
{Uh }ik = (Ui|aik + Uk|aik )/2. Because of this discontinuity, a global variational
formulation cannot be obtained. However, dot-multiplying (1) by ϕ ∈ Pi , inte-
grating over each single element τi and integrating by parts, yields a local weak
formulation involving volume integrals over τi and surface integrals over ∂τi .
While the numerical treatment of volume integrals is rather straightforward, a
specific procedure must be introduced for the surface integrals, leading to the
definition of a numerical flux. In this study, we choose to use a fully centered
numerical flux, i.e., ∀i, ∀k ∈ Vi: E|aik ≈ {Eh}ik, H|aik ≈ {Hh}ik. The local
weak formulation can be written as:

\int_{\tau_i} \varphi \cdot \varepsilon_i \,\partial_t E_i
  = \frac{1}{2} \int_{\tau_i} \left( \operatorname{curl}\varphi \cdot H_i + \operatorname{curl} H_i \cdot \varphi \right)
  - \frac{1}{2} \sum_{k \in V_i} \int_{a_{ik}} \varphi \cdot (H_k \times n_{ik}),

\int_{\tau_i} \varphi \cdot \mu_i \,\partial_t H_i
  = -\frac{1}{2} \int_{\tau_i} \left( \operatorname{curl}\varphi \cdot E_i + \operatorname{curl} E_i \cdot \varphi \right)
  + \frac{1}{2} \sum_{k \in V_i} \int_{a_{ik}} \varphi \cdot (E_k \times n_{ik}).   (2)

Eq. (2) can be rewritten in terms of scalar unknowns. Inside each element, the fields are re-composed according to E_i = \sum_{1 \le j \le d_i} E_{ij}\varphi_{ij} and H_i = \sum_{1 \le j \le d_i} H_{ij}\varphi_{ij}, and let us now denote by E_i and H_i respectively the column vectors (E_{il})_{1 \le l \le d_i} and (H_{il})_{1 \le l \le d_i}. Then, (2) is equivalent to:

M_i^{\varepsilon} \frac{d E_i}{dt} = K_i H_i - \sum_{k \in V_i} S_{ik} H_k, \qquad
M_i^{\mu} \frac{d H_i}{dt} = -K_i E_i + \sum_{k \in V_i} S_{ik} E_k,   (3)

where the symmetric positive definite mass matrices M_i^{\eta} (η stands for ε or μ), the
symmetric stiffness matrix K_i (both of size d_i × d_i) and the symmetric interface
matrix S_{ik} (of size d_i × d_k) are given by:

(M_i^{\eta})_{jl} = \eta_i \int_{\tau_i} {}^t\varphi_{ij} \cdot \varphi_{il}, \qquad
(S_{ik})_{jl} = \frac{1}{2} \int_{a_{ik}} {}^t\varphi_{ij} \cdot (\varphi_{kl} \times n_{ik}),

(K_i)_{jl} = \frac{1}{2} \int_{\tau_i} \left( {}^t\varphi_{ij} \cdot \operatorname{curl}\varphi_{il} + {}^t\varphi_{il} \cdot \operatorname{curl}\varphi_{ij} \right).
The set of local systems of ordinary differential equations for each τi (3) can be
formally transformed in a global system. To this end, we suppose that all electric
(resp. magnetic) unknowns are gathered in a column vector E (resp. H) of size
d_g = \sum_{i=1}^{N_t} d_i, where N_t stands for the number of elements in T_h. Then system (3)
can be rewritten as:

M^{\varepsilon} \frac{dE}{dt} = KH - AH - BH + C_E E, \qquad
M^{\mu} \frac{dH}{dt} = -KE + AE - BE + C_H H,   (4)

where we emphasize that M^{\varepsilon} and M^{\mu} are d_g × d_g block diagonal matrices. If we
set S = K − A − B, then system (4) rewrites as:

M^{\varepsilon} \frac{dE}{dt} = SH + C_E E, \qquad
M^{\mu} \frac{dH}{dt} = -{}^tS\,E + C_H H.   (5)

Finally, system (5) is time integrated using a second-order leap-frog scheme as:

M^{\varepsilon}\,\frac{E^{n+1} - E^{n}}{\Delta t} = S H^{n+\frac{1}{2}} + C_E E^{n},
\qquad
M^{\mu}\,\frac{H^{n+\frac{3}{2}} - H^{n+\frac{1}{2}}}{\Delta t} = -{}^tS\,E^{n+1} + C_H H^{n+\frac{1}{2}}.   (6)

The resulting discontinuous Galerkin time domain method (DGTD-Ppi in the sequel) is analyzed in [1] where it is shown that, when Γa = ∅, the method is stable under a CFL-like condition.
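To make the update concrete, the following minimal C++ sketch (not the authors' implementation; all operator names are illustrative and stand for assumed matrix-free routines applying S, {}^tS, C_E, C_H and the inverses of the block-diagonal mass matrices) drives the leap-frog iteration exactly as in (6).

#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<Vec(const Vec&)>;   // y = A * x, matrix-free operator (assumed)

// Advances E (stored at integer time levels) and H (stored at half-integer levels)
// over nSteps leap-frog iterations, following the update rule (6).
void leapFrog(Vec& E, Vec& H, double dt, int nSteps,
              const Op& applyS, const Op& applyStrans,
              const Op& applyCE, const Op& applyCH,
              const Op& invMassEps, const Op& invMassMu) {
  for (int n = 0; n < nSteps; ++n) {
    // E^{n+1} = E^n + dt * (M^eps)^{-1} ( S H^{n+1/2} + C_E E^n )
    Vec rhsE = applyS(H), ce = applyCE(E);
    for (std::size_t i = 0; i < E.size(); ++i) rhsE[i] += ce[i];
    Vec dE = invMassEps(rhsE);
    for (std::size_t i = 0; i < E.size(); ++i) E[i] += dt * dE[i];
    // H^{n+3/2} = H^{n+1/2} + dt * (M^mu)^{-1} ( -tS E^{n+1} + C_H H^{n+1/2} )
    Vec rhsH = applyStrans(E), ch = applyCH(H);
    for (std::size_t i = 0; i < H.size(); ++i) rhsH[i] = ch[i] - rhsH[i];
    Vec dH = invMassMu(rhsH);
    for (std::size_t i = 0; i < H.size(); ++i) H[i] += dt * dH[i];
  }
}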

3 Implementation Aspects
3.1 DGTD CUDA Kernels
We describe here the implementation strategy adopted for the GT200 gener-
ation of NVIDIA GPUs and for calculations in single precision floating point
arithmetic. We first note that the main computational kernels of the DGTD-Ppi
method considered in this study are the volume and surface integrals over τi and
∂τi appearing in (2). Moreover, we limit ourselves to a uniform order method
i.e. p ≡ pi is the same for all the elements of the mesh, and we present ex-
perimental results for the values p = 1, 2, 3, 4. At the discrete level, these local
computations translate into the matrix-vector products appearing in (3). The
discrete equations for updating the electric and magnetic fields are composed
of the same steps and only differ by the fields they are applied to. They both
involve the same kernels that we will refer to in the sequel as intVolume (com-
putation of volume integrals), intSurface (computation of surface integrals)
and updateField (update of field components). All these kernels stick to the
following paradigm: (1) load data from device memory to shared memory, (2)
synchronize with all the other threads of the block so that each thread can safely
read shared memory locations that were populated by different threads, (3) pro-
cess the data in shared memory, (4) synchronize again to make sure that shared
memory has been updated with the results, (5) write the results back to device
memory. This paradigm ensures that almost all the operations on data allocated
in global memory are performed in a coalesced way.
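As an illustration only, the following CUDA skeleton (hypothetical names; it is not one of the intVolume, intSurface or updateField kernels) shows this five-step pattern with coalesced global-memory accesses. It assumes a launch with blockDim.x == nDofPerBlock and blockDim.x * sizeof(float) bytes of dynamic shared memory.

// (1) load to shared memory, (2) sync, (3) compute, (4) sync, (5) write back.
__global__ void fieldKernelSkeleton(const float* __restrict__ in,
                                    float* __restrict__ out, int nDofPerBlock) {
  extern __shared__ float buf[];                    // staging buffer in shared memory
  int g = blockIdx.x * nDofPerBlock + threadIdx.x;  // global d.o.f handled by this thread

  buf[threadIdx.x] = in[g];                         // (1) coalesced load from device memory
  __syncthreads();                                  // (2) data loaded by other threads is visible

  // (3) placeholder computation mixing values loaded by neighboring threads
  float r = 0.5f * (buf[threadIdx.x] + buf[(threadIdx.x + 1) % nDofPerBlock]);
  __syncthreads();                                  // (4) shared memory may now be overwritten

  buf[threadIdx.x] = r;
  __syncthreads();
  out[g] = buf[threadIdx.x];                        // (5) coalesced write back to device memory
}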
We outline below the main characteristics of these kernels and refer to [6]
for a more detailed description. In our implementation, some useful elementary
matrices, such as the mass matrix computed on the reference element, are stored
in constant memory because they are small and are accessed following constant
memory patterns. For the sequel, we introduce the following notations: NBTET
is the number of tetrahedra that are treated by a block of threads. It depends
of the chosen interpolation order and it is taken to be a multiple of 16 because
of the way one load and write data to and from device memory; NDL is the
number of degrees of freedom (d.o.f) in an element τi for each field component,
for a given interpolation order; finally, NDF is the number of d.o.f on a face aik
for each field component, for a given interpolation order.

Volume integral kernel : intVolume. This kernel operates on each d.o.f of a tetrahedron. Since the number of d.o.f increases with the interpolation order,
resources needed by this kernel (registers and shared memory) also raise. Con-
sequently, we wrote two versions of this kernel: one kernel for p = 1 and 2, and
the other one for p = 3 and 4. However, these two versions have some common
features. First, each thread computes one d.o.f of one tetrahedron. The second
common feature is the data stored in shared memory, which are some geomet-
rical quantities associated to a tetrahedron, and the field and the flux balance
components. The last common feature is the number of tetrahedra operated by
a block (i.e. NBTET). The main difference is that while in the low order version
a block computes all the d.o.f (NDL) of the NBTET tetrahedra, the high order
volume kernel only computes a subset of the d.o.f of the NBTET tetrahe-
dra. Consequently, in the latter case, two or three instances of the kernel are
necessary to compute all the d.o.f of all the tetrahedra. This approach induces
a drawback because we have to load field data in two or three kernels instead
of one. Indeed, the dimension of a block is NBTET*NDL which leads to blocks
of more than 512 threads for high interpolation orders which is not possible in
CUDA. However, there is also a benefit because computing a lower number of
d.o.f in a kernel allows us to use less shared memory in the buffer storing field
data and less registers in a kernel thus increasing the occupancy of the GPU.
Surface Integral kernel : intSurface. For this kernel, one thread works on one
surface d.o.f of one tetrahedron. Similarly to the intVolume kernel, two versions
of this kernel have been implemented. For the low order version, a thread applies
the influence of its d.o.f to the four faces of its tetrahedron whereas for the
high order version, a thread only works on one face of its tetrahedron. So, for
the low order version, a block computes the numerical flux for four faces of
NBTET tetrahedra instead of one face of NBTET tetrahedra for the high order
version. Therefore, the high order version has to launch four kernels instead of
one for the low order version. Here, we work on the surface d.o.f (NDF) but
field components are stored using the volume d.o.f (NDL), so we need to use
a permutation matrix to link these different local numberings of the d.o.f.
Moreover, a face of a tetrahedron is also shared by another tetrahedron and
the corresponding field values are needed in the computation of the elementary
flux. Consequently, we cannot load field data in a coalesced way and we have
to use texture memory. Field values are loaded before each face computation.
Nevertheless, the high order version has a memory drawback compared to the
lower one. Indeed, because there are four launches of the function, data are
written four times to the flux table instead of once in the low order version.
Update kernel : updateField. There are four update kernels. First of all, update
kernels are a bit different according to the field they are working on (electric
or magnetic). Since in this case a thread works on one d.o.f of a tetrahedron,
the dimension of a block is NBTET*NDL. Consequently, as for the intVolume
kernel, we need a special version for the higher interpolation orders in order to
avoid exceeding the maximum number of threads per block. In the high order
version, we adopt an approach where a thread deals with two different d.o.f of a

tetrahedron which allows a block to compute all the d.o.f for NBTET tetrahedra.
This approach is less efficient for the lower interpolation orders. The two versions
of the electric field update kernels need only one shared memory table. Indeed,
in the first step, the flux computed by the previous kernels is loaded in this
table, used to do some computations and then stored in a register. Therefore,
the shared memory table is no longer used at the end of this part. In the second
step, we load the previous values of the electric field in it in a coalesced way. In
a third step, we update the value of the field in the shared memory, and in the
last step, we write the new value of the field in the global memory. The update
of the magnetic field follows the same pattern as the update of the electric field.

3.2 Multi-GPU Strategy


The multi-GPU parallelization strategy adopted in this study combines a coarse
grain SPMD model based on a partitioning of the underlying tetrahedral mesh,
with a fine grain SIMD model through the development of CUDA enabled DGTD
kernels. A non-overlapping partitioning of the mesh is obtained using a graph
partitioning tool such as MeTiS or Scotch and results in the definition of a set of
sub-meshes. The interface between neighboring sub-meshes is a triangular sur-
face. In the current implementation of this strategy, there is a one to one mapping
between a sub-mesh and a GPU. Then the CUDA kernels described previously
are applied at the sub-mesh level. The operations of the DGTD method are
purely local except for the computation of the numerical flux for the approxima-
tion of the boundary integral over ∂τi in (2) which requires, for a given element,
the values of the electromagnetic field components in the face-wise neighboring
elements. For those faces which are located on an interface between neighbor-
ing sub-meshes, the availability of the electromagnetic field components on the
attached elements is obtained thanks to point-to-point communications imple-
mented using non-blocking MPI send and receive operations in order to overlap
communication operations as much as possible with local computations of the
volume integrals in (2). Moreover, we also overlap most of the PCI-Express communications by using buffers allocated with cudaHostAlloc, which lets the driver
manage this CPU-GPU communication.
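A hedged sketch of this overlap, with illustrative names only (the stub kernel stands in for the real volume-integral kernel and the buffer layout is assumed): interface field values travel through non-blocking MPI while the purely local volume integrals run, and the pinned buffers obtained from cudaHostAlloc let the driver stage the PCI-Express copies asynchronously.

#include <mpi.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

__global__ void intVolumeStub(float* field, int n) {      // placeholder for the volume kernel
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) field[i] += 0.0f;
}

// sendBuf/recvBuf are host buffers previously allocated with cudaHostAlloc (pinned);
// d_ghost[k] receives the interface values coming from neighboring sub-mesh nbr[k].
void exchangeAndComputeVolume(float* d_field, int nDof,
                              float** sendBuf, float** recvBuf, float** d_ghost,
                              const std::vector<int>& nbr,
                              const std::vector<int>& faceVals,
                              MPI_Comm comm, cudaStream_t stream) {
  std::vector<MPI_Request> req(2 * nbr.size());
  for (std::size_t k = 0; k < nbr.size(); ++k) {
    MPI_Irecv(recvBuf[k], faceVals[k], MPI_FLOAT, nbr[k], 0, comm, &req[2 * k]);
    MPI_Isend(sendBuf[k], faceVals[k], MPI_FLOAT, nbr[k], 0, comm, &req[2 * k + 1]);
  }
  // Overlap: the volume integrals need no neighbor data, so they run during the exchange.
  intVolumeStub<<<(nDof + 255) / 256, 256, 0, stream>>>(d_field, nDof);
  MPI_Waitall((int)req.size(), req.data(), MPI_STATUSES_IGNORE);
  // Pinned memory allows these host-to-device copies to proceed asynchronously;
  // the surface-integral kernel is launched afterwards (not shown).
  for (std::size_t k = 0; k < nbr.size(); ++k)
    cudaMemcpyAsync(d_ghost[k], recvBuf[k], faceVals[k] * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);
}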

4 Performance Results
We first note that GPU timings (for all the performance results presented here
and in the following subsections) are for single precision arithmetic computations
and include the data structures copy operations from the CPU memory to the
GPU device memory prior to the time stepping loop, and vice versa at the
end of the time stepping loop. Numerical experiments have been performed on
a hybrid CPU-GPU cluster with 1068 Intel CPU nodes and 48 Tesla S1070
GPU systems. Each Tesla S1070 has four GT200 GPUs and two PCI Express-
2 buses. The Tesla systems are connected to BULL Novascale R422 E1 nodes
with two quad-core Intel Xeon X5570 Nehalem processors operating at 2.93 GHz
themselves connected by an InfiniBand network.

4.1 Weak Scalability


We first present results for the assessment of the weak scalability properties of
the GPU enabled DGTD-Pp method. For that purpose, we consider a model
test problem which consists in the propagation of a standing wave in a perfectly
conducting unitary cubic cavity. For this simple geometry, we make use of reg-
ular uniform tetrahedral meshes respectively containing 3,072,000 elements for
the DGTD-P1 and DGTD-P2 methods, 1,296,000 elements for the DGTD-P3
method and 750,000 elements for the DGTD-P4 method for the experiments
involving one GPU. As usual in the context of a weak scalability analysis, the
size of each mesh is increased proportionally to the number of computational
entities. Moreover, since these meshes are regular discretizations of the cube, it
is possible to construct perfectly balanced partitions and this is achieved here
by constructing the tetrahedral meshes in parallel (i.e. on a subdomain basis)
given a box-wise decomposition of the domain. Table 1 summarizes the timings
measured for 1000 iterations of the leap-frog time scheme (6), and the corresponding GFlops rates for 1 and 128 GPUs. These results illustrate an almost
perfect weak scalability of the GPU enabled DGTD-Pp method with p = 3 and
4 for up to 128 GPUs. It also appears from these results that, for the proposed
GPU implementation of the DGTD-Pp method and the hardware configuration
considered in the above numerical experiments, the third-order scheme yields
the best performance while, when increasing further the interpolation order, the
sustained performance decrease due to bandwidth-bound effects.

4.2 Strong Scalability


We now consider a more realistic physical problem which corresponds to the
simulation of the propagation of an electromagnetic wave in the head of a mobile phone user. For this problem, compatible geometrical models of the head
tissues have been constructed from magnetic resonance images. First, head tis-
sues are segmented and surface triangulations of a selected number of tissues
are obtained. In a second step, these triangulated surfaces together with a tri-
angulation of the artificial boundary (absorbing boundary) of the overall com-
putational domain are used as inputs for the generation of volume meshes. The
exterior of the head must also be meshed, up to a certain distance and the

Table 1. Weak scalability assessment: timings and sustained performance figures

# GPU   DGTD-P1                     DGTD-P2
1       104.7 sec / 63 GFlops       325.1 sec / 92 GFlops
128     104.9 sec / 8072 GFlops     323.1 sec / 11844 GFlops

# GPU   DGTD-P3                     DGTD-P4
1       410.3 sec / 106 GFlops      759.8 sec / 94 GFlops
128     408.4 sec / 13676 GFlops    763.6 sec / 12009 GFlops

Fig. 1. Geometrical model of head tissues and computed contour lines of the amplitude
of the electric field on the skin

Table 2. Characteristics of the fully unstructured tetrahedral meshes of head tissues

Mesh   # elements   Lmin (mm)   Lmax (mm)   Lavg (mm)
M1     815,405      1.00        28.14       10.69
M2     1,862,136    0.65        23.81       6.89
M3     7,894,172    0.77        22.75       3.21

Table 3. Head tissues exposure to an electromagnetic wave emitted by a mobile phone.
Strong scalability assessment: mesh M1. Elapsed time on 16 CPUs: 715 sec (DGTD-P1
method) and 3824 sec (DGTD-P2 method).

        DGTD-P1                       DGTD-P2
# GPU   Time      GFlops   Speedup    Time       GFlops   Speedup
1       620 sec   32       -          2683 sec   60       -
16      35 sec    566      17.8       145 sec    1110     18.5

computational domain is artificially bounded by a sphere surface corresponding to the boundary Γa on which the Silver-Müller absorbing boundary condi-
tion is imposed. Moreover, a simplified mobile phone model (metallic box with
a quarter-wavelength antenna mounted on the top surface) is included and placed in
a vertical position close to the right ear. The surface of this metallic box de-
fines the boundary Γm . Overall, the geometrical models considered here consist
of four tissues (skin, skull, CSF - Cerebro Spinal Fluid and brain). For the
numerical experiments, we consider a sequence of three unstructured tetrahe-
dral meshes whose characteristics are summarized in Table 2. The tetrahedral
meshes are globally non-uniform and the quantities Lmin , Lmax and Lavg in Ta-
ble 2 respectively denote the minimum, maximum and average lengths of mesh
edges. Performance results are presented in Tables 3 to 5. For the coarsest mesh
(i.e. mesh M1), the parallel speedup is evaluated for 16 GPUs relatively to the

Table 4. Head tissues exposure to an electromagnetic wave emitted by a mobile phone.
Strong scalability assessment: mesh M2. Elapsed time on 64 CPUs: 519 sec (DGTD-P1
method) and 2869 sec (DGTD-P2 method).

        DGTD-P1                       DGTD-P2
# GPU   Time      GFlops   Speedup    Time       GFlops   Speedup
16      82 sec    699      -          407 sec    1137     -
32      46 sec    1239     1.8        201 sec    2299     2.0
64      33 sec    1747     2.5        116 sec    4007     3.5

Table 5. Head tissues exposure to an electromagnetic wave emitted by a mobile phone.
Strong scalability assessment: mesh M3. Elapsed time on 64 CPUs: 2786 sec (DGTD-P1
method) and 6057 sec (DGTD-P2 method).

        DGTD-P1                       DGTD-P2
# GPU   Time      GFlops   Speedup    Time       GFlops   Speedup
32      162 sec   146      -          816 sec    2370     -
64      97 sec    2470     1.7        416 sec    4657     2.0
128     69 sec    3469     2.4        257 sec    7522     3.2

simulation time using one GPU. Although the number of elements of this mesh is
well below the size of the mesh considered for the weak scalability analysis
(i.e. 3,072,000 elements for the DGTD-P1 and DGTD-P2 methods), superlinear
speedups are obtained. However, not surprisingly, the single GPU GFlops rates
are lower than the corresponding ones reported in Table 1 (32 instead of 63 for
the DGTD-P1 method, and 60 instead of 92 for the DGTD-P2 method). For the
two other meshes (i.e. M2 and M3), as expected the DGTD-P2 method is always
more scalable than the DGTD-P1 method because of a more favorable computa-
tion to communication ratio. Overall, acceleration factors ranging from 15 to 25
are observed between the multiple CPU and multiple GPU simulations. We note
however that this comparison is made with a CPU version whose parallel imple-
mentation relies on MPI only. In particular, we have not considered a possible op-
timization to hybrid shared-memory multi-core systems combining the OpenMP
and MPI programming models. Besides, an optimized CPU version in terms of
simulation times can be obtained by computing the surface integrals over ∂τi in
(2) through a loop over element faces and updating the flux balance of both ele-
ments τi and τj, since the numerical flux from τj to τi is just the opposite of
that from τi to τj. Such an optimization would lower the simulation times of the
CPU version by approximately 30%. In the present implementation, each elemen-
tary numerical flux is computed twice (respectively for flux balances of τi and τj )
for maximizing the floating point performance in the CUDA SIMD framework.
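For illustration, a minimal C++ sketch of that CPU-side variant (data layout and names are assumed, and the flux is reduced to one value per face for brevity): a single loop over interior faces computes each elementary flux once and scatters it with opposite signs to the two adjacent elements.

#include <cstddef>
#include <vector>

struct Face { int elemLeft, elemRight; };   // the two tetrahedra sharing an interior face

// faceFlux[f] holds the elementary flux already evaluated for face f;
// fluxBalance accumulates one entry per element.
void accumulateFluxes(const std::vector<Face>& faces,
                      const std::vector<double>& faceFlux,
                      std::vector<double>& fluxBalance) {
  for (std::size_t f = 0; f < faces.size(); ++f) {
    double flux = faceFlux[f];
    fluxBalance[faces[f].elemLeft]  += flux;   // contribution to tau_i
    fluxBalance[faces[f].elemRight] -= flux;   // the opposite contribution to tau_j
  }
}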

5 Conclusion

We have presented a high performance numerical methodology to simulate electromagnetic wave propagation in complex domains and heterogeneous media.
This methodology is based on a high order discontinuous Galerkin time domain
method formulated on unstructured tetrahedral meshes for solving the system
of Maxwell equations. Due to its intrinsically local nature, this DGTD method
is particularly well suited to distributed memory parallel computing. Besides,
from the algorithmic point of view, the method mixes sparse linear algebra op-
erations (as usual with classical finite element or finite volume methods) with
dense linear algebra operations due to the use of a high order nodal interpola-
tion method at the element level. Therefore, the method is an ideal candidate for
exploiting the processing capabilities of GPU systems. In this work, this DGTD
method has been adapted to multi-GPU parallel computing by combining a
coarse grain SPMD programming model for inter-GPU parallelization and a fine
grain SIMD programming model for intra-GPU parallelization. Numerical exper-
iments presented in this paper clearly demonstrate the viability of the proposed
parallelization strategy and open the route for further investigation especially
in view of improving the GPU utilization as well as the overall scalability on
systems consisting of several hundreds of GPU nodes.

Acknowledgments. This work was granted access to the HPC resources of CCRT under the allocation 2010-t2010065004 made by GENCI (Grand Equipe-
ment National de Calcul Intensif).

References
1. Fezoui, L., Lanteri, S., Lohrengel, S., Piperno, S.: Convergence and stability of
a discontinuous Galerkin time-domain method for the 3D heterogeneous Maxwell
equations on unstructured meshes. ESAIM: Math. Model. Num. Anal. 39(6),
1149–1176 (2005)
2. Gödel, N., Nunn, N., Warburton, T., Clemens, M.: Scalability of higher-order discon-
tinuous Galerkin FEM computations for solving electromagnetic wave propagation
problems on GPU clusters. IEEE. Trans. Magn. 46(8), 3469–3472 (2010)
3. Klöckner, A., Warburton, T., Bridge, J., Hesthaven, J.: Nodal discontinuous
Galerkin methods on graphic processors. J. Comput. Phys. 228, 7863–7882 (2009)
4. Komatitsch, D., Erlebacher, G., Göddeke, D., Michéa, D.: High-order finite-element
seismic wave propagation modeling with MPI on a large GPU cluster. J. Comput.
Phys. 229(20), 7692–7714 (2010)
5. Komatitsch, D., Göddeke, D., Erlebacher, G., Michéa, D.: Modeling the propagation
of elastic waves using spectral elements on a cluster of 192 GPUs. Comput. Sci. Res.
Dev. 25, 75–82 (2010)
6. Cabel, T., Charles, J., Lanteri, S.: Multi-GPU acceleration of a DGTD method for
modeling human exposure to electromagnetic waves. Tech. rep., INRIA Research
Report RR-7592 (2011), http://hal.inria.fr/inria-00583617
Study of Hierarchical N-Body Methods
for Network-on-Chip Architectures

Thomas Canhao Xu, Pasi Liljeberg, and Hannu Tenhunen

Turku Center for Computer Science, Joukahaisenkatu 3-5 B, 20520, Turku, Finland
Department of Information Technology, University of Turku, 20014, Turku, Finland
{canxu,pasi.liljeberg,hannu.tenhunen}@utu.fi

Abstract. In this paper, we study two hierarchical N-Body methods
for Network-on-Chip (NoC) architectures. Modern Chip Multiprocessor
(CMP) designs are mainly based on shared-bus communication architectures,
which suffer from high communication delays as the number of cores
increases; therefore, NoC-based architectures have been proposed. The
N-Body problem is a classical problem of approximating the motion of
bodies. Two methods, namely Barnes-Hut (Barnes) and Fast Multipole
(FMM), have been developed for fast simulation. The two algorithms
have been implemented and studied in conventional computer systems
and Graphics Processing Units (GPUs). However, as a promising uncon-
ventional multicore architecture, the evaluation of N-Body methods in
a NoC platform has not been well addressed. We define a NoC model
based on state-of-the-art systems. Evaluation results are presented using
a cycle accurate full system simulator. Experiments show that, Barnes
scales better (53.7x/Barnes and 36.6x/FMM for 64 processing elements)
and requires less cache than FMM. However, we observe hot-spot traffic
in Barnes. Our analysis and experiment results provide a guideline for
studying N-Body methods in a NoC platform.

1 Introduction

It is predictable that in the near future, hundreds or even more cores on a chip
will appear on the market. The number of circuits integrated on a chip has
been increasing continuously, which leads to an exponential rise in the complexity of their interaction. Traditional digital system design methods, e.g. bus-based
architectures, will suffer from high communication delay and low scalability. To
address these problems, NoC communication backbone was proposed for future
multicore systems [1]. Network communication methodologies are brought into
on-chip communication. More transactions can occur simultaneously and thus
the delay of the packets is reduced and the throughput of the system is in-
creased. Moreover, as the links in NoC are based on point-to-point mechanism,
the communication among cores can be pipelined to further improve the system

This work is supported by Academy of Finland and Nokia Foundation. The authors
would like to thank the anonymous reviewers for their feedback and suggestions.


Fig. 1. An example of 4×4 NoC using mesh topology

performance. Figure 1 shows a NoC with 4×4 mesh (16 nodes). The underly-
ing network is comprised of network links and routers (R), each of which is
connected to a processing element (PE) via a network interface (NI). The ba-
sic architectural unit of a NoC is the tile/node (N) which consists of a router,
its attached NI and PE, and the corresponding links. Communication among
PEs is achieved via network packets. Intel 1 has demonstrated an 80 tile, 100M
transistor, 275mm2 2D NoC under 65nm technology [2]. An experimental mi-
croprocessor containing 48 x86 cores on a chip has been created, using 4×6 2D
mesh topology with 2 cores per tile [2]. The TILE-Gx processor from Tilera,
containing 16 to 100 general-purpose processors in a single chip, is available for
commercial use [3].
The N-Body problem is a classical problem of approximating the motion of
bodies that interact with each other continuously. The bodies are usually galax-
ies and stars in an astrophysical system. The gravitational force of bodies is
calculated according to Newton’s Principia [4]. The N-Body problem is used in
other computations and simulations as well, e.g. the interference of wireless cells
and protein folding [5]. Several algorithms have been developed for N-Body sim-
ulation. In principle, to be precise, the simulation requires the calculation of all
pairs, since the gravitational force is a long range force. However the computa-
tion complexity of this method is O(n²) [6]. J. Barnes et al. and L. Greengard
introduced two fast hierarchical methods [7,8]. A tree is first built according to
the position of the bodies in the physical space. The interactions are calculated
by traversing this tree. The computation complexity of these algorithms is
reduced to O(n log n), or even O(n) in some cases.
The performance of these two algorithms has been studied in traditional cache-
coherent shared address space multiprocessors, e.g. Stanford DASH, KSR-1
and SGI-Challenge [9]. A simulator is used for examining the implications of
the two algorithms in a multiprocessor architecture [10]. However, the previous
works are based on conventional architectures, e.g. bus-based multiprocessors,
1 Intel is a trademark or registered trademark of Intel or its subsidiaries. Other names and brands may be claimed as the property of others.

physically distributed main memory or cache-only memory architecture. NVIDIA has presented a CUDA-based N-Body simulation by calculating the gravitational
attractions of all body-pairs [11]. Hierarchical methods for GPGPU-based sys-
tems have been implemented and compared in [12] and [13]. As a promising
unconventional multicore architecture in the future, the implementation of these
algorithms in a NoC platform has not been well studied. To design efficient
NoCs, designers need to understand the characteristics of the applications, e.g.
the amount of communication among cores, caches and memory controllers, as
well as the scaling of the application with the designated architecture. In our
paper, we study and discuss two hierarchical N-Body algorithms for the NoC
architecture. To validate our study, we model and analyze a 64-core NoC with
8×8 mesh, present the performance and network traces of the two algorithms
using a full system simulator.

2 Modeling of the Network-on-Chip

The Tilera TILE processor family includes TILE64, TILEPro and TILE-Gx
members. The basic architecture of these processors is the same: an array of 16 to
100 general-purpose RISC processor cores (tiles) in an on-chip mesh interconnect.
Each tile consists of a core with related L1 and L2 caches. The memory controllers
are integrated on the chip as well.
Figure 2 shows the architecture diagram of TILE-Gx processor [3]. Each tile
consists of a 64-bit VLIW core with private L1 cache (32KB instruction and
32KB data) and shared L2 cache (256KB per tile). Four 64-bit DDR3 memory
controllers, duplexed to multiple ports, connect the tiles to the main memory.

Fig. 2. The Tilera TILE multicore processor with 100 cores



(Figure: tiles containing a core with L1$/L2$, an NI and a router R; memory controllers MC0–MC3 on the top and bottom edges.)

Fig. 3. An 8×8 mesh-based NoC with memory controllers attached to up and down sides

The L2 caches and the memory are shared by all processors. The processor
operates at 1.0 to 1.5GHz, with typical power consumption of 10 to 55W. The
I/O controllers are integrated on chip to save costs of north and south bridges.
The mesh network provides bandwidth up to 200Tbps.
To analyze the low-level behavior of an application, we model a NoC similar
to the Tilera TILE architecture. The processing core of the NoC is a Sun SPARC
RISC core [14], the area is 14mm2 with 65nm fabrication technology. Scaled to
32nm technology, each core has an area of 3.4mm2 . We simulate the character-
istics of a 16MB, 64 banks, 64-bit line size, 4-way associative, 32nm cache by
CACTI [15]. Results show that the total area of cache banks is 64.61mm2 . Each
cache bank, including data and tag, occupies 1mm2 . Routers are quite small
compared with processors and caches, e.g. we calculate a 5-port router to be
only 0.054mm2 under 32nm. The number of transistors required for a memory
controller is quite small compared with a whole chip (usually billions). It has been reported
that a DDR3 memory controller requires about 2,000 logic cells (LCs) on a Xilinx Virtex-5 Field-
Programmable Gate Array (FPGA) [16]. The total area of the chip is estimated
to be around 300mm2 , comparable to the TILE-Gx. Figure 3 illustrates the
architecture of the aforementioned NoC.

3 The Hierarchical N-Body Methods

In this section, we describe the two most important hierarchical N-Body al-
gorithms that we used for analysis: the Barnes-Hut method [7] and the Fast
Multipole Method (FMM) [8]. Both hierarchical methods first build a structured
tree by subdividing space cells until a certain condition is met,
e.g. reaching the maximum number of particles in a leaf cell. The physical space
is represented by a hierarchical tree. The computation of interactions is done by

traversing this tree. The two algorithms differ in the steps they use to calculate
the interactions of particles.
In the Barnes-Hut method, for each particle, the tree is traversed to compute
the forces. It starts at the root of the tree, and traverses every cell. To reduce
the computation complexity of long-range interactions, the subtree is approxi-
mated by the mass of the center cell, if the cell is far away from the particle.
The accuracy of this method is thus dependent on the approximation metric.
The Barnes-Hut method only computes the interactions for particle-particle and
particle-cell.
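A minimal C++ sketch of such a force walk (illustrative only, not the benchmark code; the softening constant and the opening threshold theta are assumed parameters): a cell whose size-to-distance ratio is small enough is treated as a single pseudo-particle at its center of mass, otherwise its children are opened.

#include <cmath>
#include <vector>

struct Cell {
  double mass, cx, cy, cz;        // total mass and center of mass of the cell
  double size;                    // edge length of the cube covered by the cell
  std::vector<Cell*> children;    // empty for a leaf holding a single body
};

// Accumulates into f[3] the acceleration exerted on the particle at (px, py, pz).
void addForce(const Cell& c, double px, double py, double pz,
              double theta, double f[3]) {
  double dx = c.cx - px, dy = c.cy - py, dz = c.cz - pz;
  double r2 = dx * dx + dy * dy + dz * dz + 1e-12;   // softened to avoid division by zero
  double r  = std::sqrt(r2);
  if (c.children.empty() || c.size / r < theta) {    // far enough: use the approximation
    double s = c.mass / (r2 * r);                    // gravitational constant omitted
    f[0] += s * dx; f[1] += s * dy; f[2] += s * dz;
  } else {
    for (const Cell* child : c.children)             // otherwise open the cell
      addForce(*child, px, py, pz, theta, f);
  }
}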
The FMM computes the interactions for cell-cell as well, compared with
Barnes-Hut. If two cells are far away from each other, the interaction between
them is computed by the multipole expansion of the cells. The computation
complexity is thus reduced. For uniform distributions, the complexity of FMM
is O(n), compared with O(n log n) in Barnes-Hut. To develop a multithreaded
program for both algorithms, the space is divided into several regions, where
each core is assigned a region. A tree for the regions is built for the respon-
sible core, and each core calculates its local tree. Most of the calculation time
is spent in traversals of the tree to compute the forces. In a NoC platform, the
performance of the algorithms will be affected by (a) long distance communica-
tion of nodes; (b) the initial distribution of particles; (c) the dynamic changing
of position of particles; (d) hot-spot traffic.

4 Experimental Evaluation
4.1 Experiment Setup
The simulation platform is based on a cycle-accurate NoC simulator which is
able to produce detailed evaluation results. The platform models the routers
and links accurately. State-of-the-art router in our platform includes a routing
computation unit, a virtual channel allocator, a switch allocator, a crossbar
switch and four input buffers. Deterministic XY routing algorithm has been
selected to avoid deadlocks.
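For reference, the routing decision itself is simple enough to sketch in a few lines of C++ (illustrative, not the simulator's router model): a packet first travels along the X dimension until the destination column is reached, then along Y, which yields deterministic, deadlock-free routes on a mesh.

enum class Port { East, West, North, South, Local };

// Returns the output port for a packet currently at (curX, curY) heading to (dstX, dstY).
Port xyRoute(int curX, int curY, int dstX, int dstY) {
  if (dstX > curX) return Port::East;    // resolve the X dimension first
  if (dstX < curX) return Port::West;
  if (dstY > curY) return Port::North;   // X matches: resolve the Y dimension
  if (dstY < curY) return Port::South;
  return Port::Local;                    // arrived: deliver to the attached PE
}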
We use a 64-core network which models a single-chip NoC for our experiments.
A full system simulation environment with 64 nodes, each with a core and related
cache, has been implemented. The simulations are run on the Solaris 9 operating
system, based on the UltraSPARC III+ instruction set with an in-order issue structure.
Each processor core is running at 2GHz, attached to a wormhole router and has
a private write-back L1 cache (split I+D, each 32KB, 4-way, 64-bit line, 3-cycle).
The 16MB L2 cache shared by all processors is split into banks (64 banks, each
256KB, 64-bit line, 6-cycle). The simulated memory/cache architecture mimics
SNUCA [17]. A two-level distributed directory cache coherence protocol called
MOESI based on MESI [18] has been implemented in our memory hierarchy in
which each L2 bank has its own directory. The protocol has five types of cache
line status: Modified (M), Owned (O), Exclusive (E), Shared (S) and Invalid
(I). We use Simics [19] full system simulator as our simulation platform. For
both methods, we use the Plummer model [20] for particle generation, instead

of uniform distribution. The multithreaded part of the programs utilizes the C/pthread model.

4.2 Result Analysis

We start by evaluating the computation time distribution and scalability of the two algorithms. Both algorithms use the same parameters. The results are listed
in Table 1 and 2. The first five rows show the computation time from 4K to 64K
particles, with 64 processors. In Barnes-Hut, around 90% of the total time is
spent on force calculation (84.2% for 4K to 91.1% for 64K particles), while the time spent
on other tasks (e.g. tree building) is relatively small. The Barnes-Hut method
scales very well from 1 to 64 processors. The speedups for 64 processors are 53.7x
and 61.8x for total execution time and force calculation time, respectively.
In Figure 4, the network request rates of 64 cores are illustrated. We simulate
64K particles in 5 time steps. The horizontal axis is time, segmented in 12.1M-
cycle percentage fragments. The traffic trace has 165.2M packets. It is observed
that, several nodes, especially N0 and N34, generate more data traffic than
others (e.g. N0 14.18%, N34 12.19%, N12 5.3% and N20 2.76%). This introduces
heavy hot-spot traffic in certain regions of the NoC. Notice that, the traffic
patterns of other nodes are quite similar, they have a high traffic in the starting
phase, and drop to a lower traffic after that. There are several time slices, for
example 16% to 21%, when all processors are sending packets simultaneously.
The reason is that the simulation executes 5 time steps and the positions of
the particles change at the end of each time step. In terms of point-to-point
traffic, several source-destination pairs, specifically originated from N0 and N34,
generated a considerable amount of the traffic. We observe the top 5 pairs are:
34-62 (0.88%), 0-14 (0.63%), 0-58 (0.62%), 0-8 (0.60%) and 34-10 (0.51%). This
hot-spot traffic can be alleviated by, e.g., adding long links between nodes or increasing
the link bandwidth for hot-spot nodes.

Table 1. Time distribution and scalability of the Barnes-Hut Method

Configuration Total time Treebuild Forcecalc Others


64p/4K 19 1 16 2
64p/8K 41 2 35 4
64p/16K 87 5 76 6
64p/32K 184 8 168 8
64p/64K 385 15 351 19
4K/1p 1020 28 988 4
4K/2p 511 15 495 1
4K/4p 258 8 246 4
4K/8p 129 4 124 1
4K/16p 65 3 61 1
4K/32p 34 2 31 1
4K/64p 19 1 16 2

(Figure: 3D plot of Packets over Time and Node ID.)

Fig. 4. Network request rate for 64-core NoC running Barnes

Table 2. Time distribution and scalability of the Fast Multipole Method

Configuration Total time Treebuild Forcecalc Barrier Listbuild Others


64p/4K 17 1 10 3 2 1
64p/8K 27 2 16 6 0 3
64p/16K 54 7 30 14 2 1
64p/32K 102 11 73 13 1 4
64p/64K 209 21 147 30 4 7
4K/1p 622 75 533 0 10 4
4K/2p 316 38 270 1 3 4
4K/4p 162 20 136 1 3 2
4K/8p 83 9 71 0 1 2
4K/16p 44 4 35 2 1 2
4K/32p 26 3 16 4 0 3
4K/64p 17 1 10 3 2 1

The time spent on force calculation in the Fast Multipole method is lower
than in Barnes-Hut (Table 2), e.g. 58.8% for 4K to 70.3% for 64K particles. Nearly 10% of the
time is spent on tree building, and about 15% on barriers. The Fast Multipole
method scales worse than Barnes. The speedups for 64 processors are 36.6x and
53.3x for total execution time and force calculation time, respectively. This is
primarily due to the higher number of barriers in Fast Multipole method. It is
noteworthy that, in spite of the poorer scaling, the Fast Multipole method spends less
time on calculation; for example, for 64p/64K it needs only 54.3% of the total execution time
of Barnes. Given its better scalability, however, the
Barnes-Hut method could require less time on a system with thousands of cores.
Figure 5 shows the network request rate of each processing core when running
FMM in a 64-core NoC. The horizontal axis is time, segmented in 1.69M-cycle
percentage fragments. The traffic trace has 57.4M packets. It is revealed that,

(Figure: 3D plot of Packets over Time and Node ID.)

Fig. 5. Network request rate for 64-core NoC running FMM

several nodes (e.g. N0 7.6%, N46 4.15%, N13 2.72% and N7 2.71%) generate more
data traffic than others. The network traffic is relatively low in the starting phase
(before 30% of the time slice). After that time point, FMM shows similar traffic
patterns as in Barnes. However, the hot-spot traffic in FMM is not as significant
as in Barnes. We note that, in terms of point-to-point traffic, a small portion of
source-destination pairs generated a sizable portion of the traffic. For example,
only 4 of the pairs (19-60, 13-44, 60-19 and 0-29, out of 64² = 4,096 pairs in total)
generated 1.42% of the traffic.
We evaluate other performance metrics of the two algorithms in terms of L2
cache miss rate (L2MR), misses per thousand instructions (MissPKI), Average
Link Utilization (ALU) and Average Network Latency (ANL). ALU is calculated

(Figure: normalized values of L2MR, MissPKI, ALU and ANL for Barnes-Hut and Fast-Multipole.)

Fig. 6. Performance for Barnes and FMM



with the number of packets transferred between NoC resources per cycle. ANL
represents the average number of cycles required for the transmission of all mes-
sages. The number of required cycles for each message is calculated from the injec-
tion of the message header into the network at the source node, to the reception of
the tail flit at the destination node. Under the same configuration and workload,
lower values of these metrics are favorable. The results are shown in Figure 6. We
note that, in terms of L2MR and MissPKI, Barnes is lower than FMM (1.21% for
L2MR and 15.77% for MissPKI, respectively). This reflects that FMM requires more
cache than Barnes; a system with limited cache could be unsuitable for FMM.
The ALU of Barnes is only 43.83% of that of FMM, which means an alleviated network
load. It is noteworthy that, although the values on the Z axis in Figure 4 are about
twice as large as in Figure 5, each time slice in Figure 4 represents 12.1M cycles,
compared with 1.69M cycles in Figure 5. Finally, the ANL of Barnes is 96.31%
of that of FMM, indicating that the network performance of Barnes is better and
hence its communication overhead lower.

5 Conclusion
The implementation of two hierarchical N-Body methods (Barnes-Hut and Fast
Multipole) in a NoC platform was studied in this paper. Both scalability and
network traffic for the two methods were analyzed. We studied an 8×8 NoC
model based on state-of-the-art systems. The time distribution of the two methods,
with 1 to 64 processing cores, was explored. We investigated the advantages
and disadvantages of the two algorithms. The network request rates of the 64 processing
cores were illustrated for both methods. Our experiments have shown
that the Barnes-Hut method generates more hot-spot traffic than Fast Multipole.
However, it scales better, and puts lower overall pressure on the on-chip
network and caches, compared with Fast Multipole. The results of this paper
gave guidance for analyzing hierarchical N-Body methods in a NoC platform.

References
1. Dally, W.J., Towles, B.: Route packets, not wires: on-chip inteconnection networks.
In: Proceedings of the 38th Conference on Design Automation, pp. 684–689 (June
2001)
2. Intel: Intel research areas on microarchitecture (May 2011),
http://techresearch.intel.com/projecthome.aspx?ResearchAreaId=11
3. Tilera: Tile-gx processor family (May 2011),
http://www.tilera.com/products/processors/TILE-Gx_Family
4. Aarseth, S.J., Henon, M., Wielen, R.: A comparison of numerical methods for the
study of star cluster dynamics. Astronomy and Astrophysics 37, 183–187 (1974)
5. Perrone, L., Nicol, D.: Using n-body algorithms for interference computation in
wireless cellular simulations. In: Proc. of 8th Int. Symp. on Modeling, Analysis
and Simulation of Computer and Telecommunication Systems, pp. 49–56 (2000)
6. Salmon, J.: Parallel n log n n-body algorithms and applications to astrophysics.
In: Compcon Spring 1991, Digest of Papers, February-1 March, pp. 73–78 (1991)

7. Barnes, J., Hut, P.: A hierarchical o(n log n) force-calculation algorithm. Nature
(1988)
8. Greengard, L.F.: The rapid evaluation of potential fields in particle systems. PhD
thesis, New Haven, CT, USA (1987) AAI8727216
9. Holt, C., Singh, J.P.: Hierarchical n-body methods on shared address space multi-
processors. In: Proc. of 7th SIAM Conf. on PPSC (1995)
10. Singh, J.P., Hennessy, J.L., Gupta, A.: Implications of hierarchical n-body methods
for multiprocessor architectures. ACM Tran. Comp. Sys. 13, 141–202 (1995)
11. Nyland, L., Harris, M., Prins, J.: Fast N-Body Simulation with CUDA. In: Nguyen,
H. (ed.) GPU Gems 3. Addison Wesley Professional (August 2007)
12. Jetley, P., Wesolowski, L., Gioachin, F., Kalé, L., Quinn, T.: Scaling hierarchical
n-body simulations on gpu clusters. In: SC 2010, pp. 1–11 (November 2010)
13. Hamada, T., Nitadori, K.: 190 tflops astrophysical n-body simulation on a cluster
of gpus. In: SC 2010, pp. 1–9 (November 2010)
14. Tremblay, M., Chaudhry, S.: A third-generation 65nm 16-core 32-thread plus 32-
scout-thread cmt sparc processor. In: ISSCC 2008, pp. 82–83 (February 2008)
15. Thoziyoor, S., Muralimanohar, N., Ahn, J.H., Jouppi, N.P.: Cacti 5.1. Technical
Report HPL-2008-20, HP Labs
16. Global, H.: Ddr 3 sdram memory controller ip core (May 2011),
http://www.hitechglobal.com/IPCores/DDR3Controller.htm
17. Kim, C., Burger, D., Keckler, S.W.: An adaptive, non-uniform cache structure for
wire-delay dominated on-chip caches. In: ACM SIGPLAN, pp. 211–222 (October
2002)
18. Patel, A., Ghose, K.: Energy-efficient mesi cache coherence with pro-active snoop
filtering for multicore microprocessors. In: Proceeding of the Thirteenth Interna-
tional Symposium on Low Power Electronics and Design, pp. 247–252 (August
2008)
19. Magnusson, P., Christensson, M., Eskilson, J., Forsgren, D., Hallberg, G., Hogberg,
J., Larsson, F., Moestedt, A., Werner, B.: Simics: A full system simulation platform.
Computer 35(2), 50–58 (2002)
20. Dejonghe, H.: A completely analytical family of anisotropic Plummer models. Royal
Astronomical Society, Monthly Notices 224, 13–39 (1987)
Extending a Highly Parallel Data Mining Algorithm to the Intel® Many Integrated Core Architecture

Alexander Heinecke¹, Michael Klemm³, Dirk Pflüger¹, Arndt Bode², and Hans-Joachim Bungartz²

¹ Technische Universität München, Boltzmannstr. 3, D-85748 Garching, Germany
² Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, Boltzmannstr. 1, D-85748 Garching, Germany
³ Intel GmbH, Dornacher Str. 1, D-85622 Feldkirchen, Germany

Abstract. Extracting knowledge from vast datasets is a major challenge in data-driven applications, such as classification and regression, which are mostly compute bound. In this paper, we extend our SG++ algorithm to the Intel® Many Integrated Core Architecture (Intel® MIC Architecture). The ease of porting an application to Intel MIC Architecture is shown: porting existing SSE code is very easy and straightforward. We evaluate the current prototype pre-release coprocessor board code-named Intel® “Knights Ferry”. We utilize the pragma-based offloading programming model offered by the Intel® Composer XE for Intel MIC Architecture, generating both the host and the coprocessor code. We compare the achieved performance with an NVIDIA C2050 accelerator and show that the pre-release Knights Ferry coprocessor delivers better performance than the C2050 and exceeds the C2050 when comparing the productivity aspect of implementing algorithms for the coprocessors.

Keywords: Intel® Many Integrated Core Architecture, Intel® MIC Architecture, Intel® Knights Ferry, NVIDIA Fermi*, GPGPU, accelerators, coprocessors, data mining, sparse grids.

1 Introduction

Experts expect that future exascale supercomputers will likely be based on het-
erogeneous architectures that consist of a moderate amount of “fat” cores and
use a large number of accelerators or coprocessors to deliver a high ratio of
GFLOPS/Watt [21]. Today, Graphic Processing Units (GPU) are very popu-
lar for accelerating highly parallel kernels like dense linear algebra or Monte
Carlo simulations [20,8]. However, the performance increase is not for free and
requires the ability to rewrite compute kernels in GPU-specific languages such
as CUDA [13] or OpenCL [10]. This implies serious porting and tuning effort
for legacy compute-intensive applications (CPU-optimized codes), which are ex-
ecuted in thousands of compute centers every day.


(Figure: left, MIC cores with L1/L2 caches connected by a ring network and a memory & I/O interface; right, the NVIDIA Fermi block diagram from [12].)

Fig. 1. High-level view on the Intel MIC Architecture (left) and NVIDIA Fermi (right) taken from [12]

The Intel R
Many Integrated Core Architecture (Intel R
MIC Architecture) is
a massively parallel coprocessor based on Intel Architecture (IA). The existing
tool chain for software development on IA can be used to implement applications
for the Intel MIC Architecture. All traditional HPC programming models such as
OpenMP* and MPI* on top of C/C++ and Fortran will be available. Developers
do not need to accept the high learning curve and implementation effort to
(partially) rewrite their source code to retrofit it for a GPU-based accelerator.
In this paper, we compare a pre-release Intel coprocessor (“Knights Ferry”) of
the Intel MIC Architecture with a recent NVIDIA Tesla* C2050 GPU (Sect. 2).
We focus on the performance of an existing highly parallel workload and assess
the programming productivity during implementation. We use the SG++ data-
mining algorithm (Sect. 3) as the workload for the evaluation. As with most
HPC applications, SG++ is already available as highly optimized code for pro-
cessors compatible to Intel R
Xeon R
. Hence, we use this as our starting point
for the evaluation. The paper carries on with comparing the implementations
and performance in Sect. 4. For the comparison, we restrict ourselves to genuine
compilers and toolkits to ensure that the optimal software stack for the compute
platforms is evaluated.

2 Intel MIC Architecture in Comparison to NVIDIA


Fermi Architecture

In this section, we will investigate the differences and similarities between the
Intel MIC Architecture [7] and the NVIDIA Tesla 2050 accelerator [12]. The
Intel MIC Architecture has been announced at the International Supercomputing
Conference [18] as a massively parallel coprocessor based on IA. It is currently
available as pre-release hardware code-named Knights Ferry (based on Intel’s
previous Larrabee design [9]).
Fig. 1 gives an overview of the respective architectures. Knights Ferry offers 32
general-purpose cores with a fixed frequency of 1200 MHz. The cores are based

on a refreshed Intel R
Pentium R
(P54C) processor design [3] and have been
extended with 64-bit instructions and a 512-bit wide Vector Processing Unit
(VPU). Each of the cores offers four-way round robin scheduling of hardware
threads, i. e., in each clock cycle a core switches to the next instruction stream.
The cores of the Knights Ferry coprocessor own a local L1 and L2 cache with
32 KB and 256 KB, respectively. With a total of 32 cores, this coprocessor offers
a total of 8 MB shared L2 cache. The cores are connected through a high-speed
ring-bus that interconnects the L2 caches for fast on-chip communication. An
L3 cache does not exist in this design because of the high-bandwidth GDDR5
memory (1800 MHz). In total, the memory subsystem delivers a peak memory
bandwidth of 115 GB/sec.
Since the Intel MIC Architecture is based on IA, it can support the pro-
gramming models that are available for traditional IA-based processors. The
compilers for the Intel MIC Architecture support Fortran (including Co-Array
Fortran) and C/C++. OpenMP [15] and Intel R
Threading Building Blocks [17]
may be used for parallelization as well as emerging parallel languages such as
Intel
R
CilkTM Plus [6] or IntelR
Array Building Blocks [5]. The VPU can be ac-
cessed through the auto-vectorization capabilities of the Intel compiler as well as
low-level programming through intrinsic functions. The Intel MIC Architecture
greatly simplifies programming, as well-known traditional programming models
can be utilized to implement codes for it.
In contrast to the Intel MIC Architecture, the Tesla 2050 architecture [12]
does not contain general purpose compute cores. Instead it consists of 14 multi-
processors with 32 processing elements each. The processing elements run at a
clock speed of 1.15 GHz and a memory-bandwidth of 144 GB/sec. A 768 KB L2
cache is shared across the 14 multiprocessors.
Because of its special architecture the 2050, it only supports a limited set of
programming models. The most important ones are CUDA and OpenCL, which
are data-parallel programming languages that do not support arbitrary (task-
)parallel programming patterns. Some production compilers, such as the PGI
compiler suite [19] or HMPP [2] support offloading of Fortran code to the GPU,
but restrict the language features in order to fit the GPU programming model.

3 Data Mining Using Sparse Grids

Regression and classification, considered as scattered data approximation prob-


lems, both start from m known observations, S = {(xi , yi ) ∈ Rd × R}i=1,...,m ,
with the aim to recover (learn) the functional dependency f (xi ) ≈ yi as accu-
rately as possible. Reconstructing a smooth function f then allows an estimate
f (x) for new properties x.
We aim for representations f = \sum_{j=1}^{N} \alpha_j \varphi_j(x) as a linear combination of N
basis functions ϕj (x) with coefficients αj . To obtain an algorithm that scales
only linearly in m, we associate the basis functions to grid points on some grid,
rather than fitting their centers to the data. Here, we are considering piecewise
d-linear functions. Unfortunately, regular grid structures suffer from the curse of

Fig. 2. Classical basis functions for the first three levels of discretization in 1d (left) and modified ones with different types of basis functions (right). For both, a selection of 2d basis functions and a 2d sparse grid of level 3 is shown.

dimensionality: a regular grid with equidistant meshes and k grid points in each
dimension contains k^d grid points in d dimensions. The exponential growth typically
prevents considering more than 4 dimensions for reasonable discretizations.
We rely on adaptive sparse grids (see [1,16] for details) to mitigate the curse
of dimensionality. They are based on a hierarchical grid structure with basis
functions defined on several levels of discretization in 1d, a hierarchical basis,
and d-dimensional basis functions as products of one-dimensional ones. We em-
ploy two kinds of basis functions: uniform and modified non-uniform. Uniform
basis functions lead to grids with a large number of grid points on the domain’s
boundary, whereas modified non-uniform ones extrapolate towards the domain’s
boundary, which leads to a smaller grid structure; see Fig. 2.
The hierarchical tensor product approach allows a function to be represented on
several scales. Trying to find out which scales contribute most to the overall
solution, it can be seen that plenty of grid points can be omitted in the hierarchical
representation as they have only little contribution—at least for sufficiently
smooth functions. The costs are reduced from O(k^d) to O(k log(k)^{d-1}),
maintaining a similar accuracy as for full grids.
The function f should be as close to the data S as possible, minimizing the
mean squared error. At the same time, close data points should very likely have
similar function values to generalize from the data. We minimize the trade-off
between both, with a regularization parameter λ balancing the two terms (Eq. 1, left)
and the hierarchical basis allowing for a simple generalization functional. This leads
to a system of linear equations (Eq. 1, right), with matrix B, B_{i,j} = \varphi_i(x_j), and identity matrix I.

\arg\min_{f} \; \frac{1}{m} \sum_{i=1}^{m} \left( f(x_i) - y_i \right)^2 + \lambda \sum_{j=1}^{N} \alpha_j^2
\quad \Rightarrow \quad
\left( \frac{1}{m} B B^{T} + \lambda I \right) \alpha = \frac{1}{m} B y.   (1)

Because of the storage required for the large matrices, the linear system is solved
iteratively, with repeated recomputation of B and B^T. Both correspond to function
evaluations, as (B^T α)_i = f(x_i). Unfortunately, from a parallelization point

(Figure: grid containers Level, Index, α and data containers Dataset, y, combined by a sum.)

Fig. 3. Data containers to manage adaptive sparse grids and datasets for streaming access

of view, efficient algorithms for function evaluations on sparse grids are inherently
multi-recursive in both level and dimensionality. This imposes severe restrictions
on parallelization and vectorization, especially on accelerators.
A straightforward alternative approach evaluates all basis functions (even
those resulting to zero) for all data points and sums up the results as shown
in Fig. 3. This is less computationally efficient, but streaming access of the data
and the avoidance of recursive structures and branching easily pays back the
additional computation: it is arbitrarily parallelizable and can be vectorized.
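A minimal C++ sketch of this streaming evaluation (illustrative only, not SG++ code; it assumes the standard hierarchical hat functions and mirrors the containers of Fig. 3): every grid point's d-linear product basis function is evaluated for every data point, so the loops are regular, branch-free in the hot path, and easy to parallelize over data points and to vectorize.

#include <algorithm>
#include <cmath>
#include <vector>

// 1d hierarchical hat function phi_{l,i}(x) = max(1 - |2^l * x - i|, 0)   (assumed form)
inline double phi1d(int level, int index, double x) {
  return std::max(1.0 - std::fabs(std::ldexp(x, level) - (double)index), 0.0);
}

// y[j] = f(x_j) = sum_g alpha[g] * prod_d phi_{l_{g,d}, i_{g,d}}(x_{j,d})
void evalAll(const std::vector<int>& level, const std::vector<int>& index,
             const std::vector<double>& alpha, const std::vector<double>& data,
             std::vector<double>& y, int noGrid, int noInst, int dims) {
  for (int j = 0; j < noInst; ++j) {            // stream over all data points
    double res = 0.0;
    for (int g = 0; g < noGrid; ++g) {          // evaluate every basis function ...
      double b = alpha[g];
      for (int d = 0; d < dims; ++d)            // ... as a product of 1d hat functions
        b *= phi1d(level[g * dims + d], index[g * dims + d], data[j * dims + d]);
      res += b;                                 // zero contributions simply add nothing
    }
    y[j] = res;
  }
}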
In the following, we use two test scenarios, both with a moderate dimension-
ality of d = 5 and distinct challenges. The first dataset, with 2^18 data points,
classifies a regular 3 × · · · × 3 checkerboard pattern. The second one is a real-
world dataset from astrophysics, predicting spectroscopic redshifts of galaxies
based on more than 430,000 photometric measurements. For both, excellent nu-
merical results are obtained with our method; see [11] for details.

4 Implementation and Performance Measurements
We start from an optimized implementation of SG++ for the Intel Xeon Pro-
cessor as described in [11]. Two steps are needed to run this code on the Intel
MIC Architecture: (1) offload the data from Fig. 3 to the coprocessor, and (2)
transcribe the compute kernels from SSE to code for the MIC VPU. In contrast
to NVIDIA CUDA and OpenCL, the Intel MIC Architecture uses a simplified
offloading model without buffers, device contexts, and kernels that are executed
via command queues.

Intel MIC Architecture. Code to be offloaded to the coprocessor is quali-
fied by a single pragma (#pragma offload target(mic)) to mark a region of
code (listing 1.1). Data transfers are automatically handled through the offload
statement. Transfer options can be selected by adding in (host to coprocessor)
and out (coprocessor to host) clauses. Listing 1.1 shows the offloading of the
grid data (Level,Index,α), the training dataset (Data), and the result (Y ). For
arrays, in and out receive a pointer as the array’s address and a length specifier.
Please note that compilers without support for the Intel MIC Architecture (like
GCC) safely ignore the offload pragma.
Listing 1.1. Offloading computation and data for execution on the Intel MIC Archi-
tecture coprocessor by preprocessor directives and calling the compute kernel.

#pragma offload target(mic) in(ptrLevel : length(dims*noGrid)) \
    in(ptrIndex : length(dims*noGrid)) in(ptrAlpha : length(noGrid)) \
    in(ptrData : length(dims*noInst)) in(noGrid, noInst, dims) \
    out(ptrY : length(noInst))
{
  multBT(ptrLevel, ptrIndex, ptrAlpha, ptrData, ptrY, noGrid, noInst, dims);
}

In the case of SG++, the Xeon Processor implementation of multBT uses SSE
intrinsics and relies on OpenMP for parallelization. We can semi-automatically
(by searching and replacing) transcribe the kernel from SSE to MIC VPU intrin-
sics (e.g., substitute _mm_mul_ps with _mm512_mul_ps) and adjust loop counters
to the new vector length. With these small changes, we reach the performance
in Table 1 (column simple). By applying minor tweaks such as inserting prefetch
instructions, the performance can be increased further (column optimized).
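As a rough illustration of this kind of transcription (the loop below is a deliberately simplified stand-in, not the actual multBT kernel; aligned arrays and a length that is a multiple of the vector width are assumed), the same scaling loop is shown with SSE and with MIC VPU intrinsics:

#include <immintrin.h>

/* SSE variant: 4 single-precision elements per iteration (host CPU). */
void scale_sse(const float *in, float *out, float factor, int n) {
    __m128 f = _mm_set1_ps(factor);
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(&in[i]);            /* aligned load       */
        _mm_store_ps(&out[i], _mm_mul_ps(v, f));   /* multiply and store */
    }
}

/* MIC VPU variant: same pattern, 16 elements per iteration (compiled for
 * the coprocessor). Only the intrinsic prefix and the stride change. */
void scale_mic(const float *in, float *out, float factor, int n) {
    __m512 f = _mm512_set1_ps(factor);
    for (int i = 0; i < n; i += 16) {
        __m512 v = _mm512_load_ps(&in[i]);
        _mm512_store_ps(&out[i], _mm512_mul_ps(v, f));
    }
}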

NVIDIA Fermi C2050. We focus on the NVIDIA Fermi Architecture as the
most general-purpose GPU available due to its architecture featuring multiple
cache levels. Since every instance can be evaluated independently of each other,
the operators’ iterative formulation constitutes a highly parallel workload. We
choose OpenCL as the vehicle to implement SG++ , as it is an open standard for
both GPUs and CPUs and it optimally fits our algorithm. Note that OpenCL
is closely related to NVIDIA CUDA: kernels execute on the accelerator, and
the programming paradigm resembles the notion of shader-style, data-parallel
languages. Buffers and messages are used to communicate with the GPU, and the
memory model distinguishes global and local sections. The handling of offloaded
code and data requires some additional programming effort compared to the
offload model based on pragmas.
CUDA and OpenCL do not support dynamic memory allocation. We need
such capabilities to implement optimal prefetching of data. However, we solve
this issue by exploiting the JIT compiler of OpenCL. When the OpenCL run-
time invokes a kernel, all runtime parameters are known and the JIT compiler
can tailor the kernel to optimally fit the input. Our experiments with CUDA
and OpenCL have shown that the JIT-compiled OpenCL code outperforms the
CUDA version by about 2x. This dramatic difference is due to too small di-
mensional loops (five dimensions in our case). Because of the relatively high
loop overhead, these loops are very detrimental to performance. Because of the
OpenCL JIT compiler, the SG++ implementation can generate a fully unrolled
loop at runtime. To confirm this, we have tested a specialized CUDA kernel on a
specific data set and manually unrolled the loops over dimensions. In this case,
the CUDA performance was on par with OpenCL. In summary, OpenCL is the
better choice for SG++ , due to its competitive performance and higher flexibility.

Uniform basis functions. Due to the shader-style code (see [14]), the implementation
of B^T α evaluates an instance of the dataset for each work item.
Table 1. Performance of both simple and optimized software port to Intel Knights
Ferry and to the NVIDIA C2050. Performance is measured in GFLOPS using single
precision floating point numbers.

Dataset                            Intel Knights Ferry**       NVIDIA C2050**
(GFLOPS)                           simple      optimized       simple      optimized
DR5 (std. grid)                    423         441             276         345
5d checkerboard (std. grid)        435         442             325         418
DR5 (mod. grid)                    –           276             –           65
5d checkerboard (mod. grid)        –           297             –           70
DR5 (std. grid), dual              –           854             –           582
5d chk.brd. (std. grid), dual      –           842             –           741

NVIDIA suggests the local size (number of work items in a work group) to be a
multiple of 32. Our tests have shown that a local size of 64 gives the best per-
formance on the Fermi GPU. Although the Fermi architecture offers standard
L1 and L2 caches, the performance of a simple straightforward port is clearly
behind the Intel MIC Architecture performance as shown in Table 1.
Several kernel optimization techniques can be applied to improve performance.
First, the local storage of a workgroup can be used to prefetch data into the
caches. Second, the runtime compilation of OpenCL can be instructed to perform
runtime code generation. At runtime, the compiler knows about the loop length
and thus the loop over the dimensions can be completely unrolled to reduce the
amount of control flow in the kernel.
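As a hedged sketch of this runtime code generation (kernel, function, and variable names are illustrative and not taken from SG++), the host can inject the dimensionality as a preprocessor constant when building the OpenCL program, so the dimension loop has a compile-time bound that the JIT compiler can unroll completely:

#include <stdio.h>
#include <CL/cl.h>

/* Device code as a string; DIMS is supplied as a build option. */
static const char *kernel_src =
    "__kernel void eval(__global const float *data, __global float *res) {\n"
    "    size_t i = get_global_id(0);\n"
    "    float acc = 1.0f;\n"
    "    for (int d = 0; d < DIMS; ++d)   /* bound known to the JIT */\n"
    "        acc *= data[i * DIMS + d];\n"
    "    res[i] = acc;\n"
    "}\n";

/* Build a program specialized for the runtime value of 'dims'. */
cl_program build_specialized(cl_context ctx, cl_device_id dev, int dims) {
    char options[64];
    snprintf(options, sizeof(options), "-DDIMS=%d", dims);

    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    if (err == CL_SUCCESS)
        err = clBuildProgram(prog, 1, &dev, options, NULL, NULL);
    return (err == CL_SUCCESS) ? prog : NULL;
}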
Because the multiplication of B is parallelized along grid points, an imple-
mentation difficulty arises. The grid may contain an arbitrary number of grid
points, but we have to map the grid to workgroups with a discrete distribution
of points. There are two options to mitigate this issue: First, we could use a
workgroup size of one (i. e., a workgroup is mapped to a work item). Second,
we may split the operator into a GPU and a CPU part. The GPU then han-
dles all multiples of 64 that are smaller than the number of grid points and the
CPU takes on the remainder. We make use of the second approach as it exhibits
better performance. However, besides an optimized GPU implementation, an
optimized Xeon Processor-based implementation is also needed. The Intel MIC
Architecture does not require such padding. Its cores can handle odd numbers
of iterations efficiently because of their standard IA-based instruction set.
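A minimal sketch of this host-side split, assuming a workgroup size of 64 (the variable names are illustrative, not the actual SG++ code):

#include <stdio.h>

/* Partition the grid points: the GPU gets the largest multiple of the
 * workgroup size, the CPU handles the (0..63) leftover points. */
void split_work(int num_grid_points, int *gpu_points, int *cpu_points) {
    const int workgroup_size = 64;
    *gpu_points = (num_grid_points / workgroup_size) * workgroup_size;
    *cpu_points = num_grid_points - *gpu_points;
}

int main(void) {
    int gpu, cpu;
    split_work(1000, &gpu, &cpu);
    printf("GPU: %d grid points, CPU: %d grid points\n", gpu, cpu);  /* 960 / 40 */
    return 0;
}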

Modified non-uniform basis functions. Modified non-uniform basis func-
tions (Fig. 2) need a four-way if statement for each data point to check if
an extrapolation towards the boundary is needed. The ifs must be kept in the
inner-most loop since the kernel computes a non-linear function, which prohibits
hoisting the if statements out of the loop nest. Such code structures significantly
impact the performance of the algorithm in terms of GFLOPS (see Table 1).
However, modified non-linear basis functions reduce grid sizes and memory con-
sumption. On Knights Ferry, this halves runtime, whereas the C2050 suffers from
a 63% higher runtime.
Since the Intel MIC Architecture relies on a mixture of traditional threading
and vectorization, a suitable vectorization for the modified linear basis functions
is as follows. As the if branches are independent of the evaluation point, several
instances can be loaded into a vector register and one grid point is broadcast into
vector registers. Depending on a grid point's property in a certain dimension,
the if condition can be evaluated once for all data points that are currently stored
in the vector registers, so there is no need to evaluate the if statement separately
for each data point. Hence, the GFLOPS rate only drops by about 40%.
The root-cause analysis for the NVIDIA C2050 exhibits two reasons for the
increase in runtime. First, noticeably more time is spent executing on the accel-
erator due to the frequent evaluation of the if statements. The if statements
slow down the code, as the GPU’s streaming processor executes them through
predicates and parts of the processor may execute no-op instructions. Second,
the grid sizes are significantly smaller for the non-uniform basis functions. Since
the operator B is parallelized over the number of grid points, a smaller grid leads
to a smaller degree of parallelism that can no longer satisfy the high number of
processing elements of the NVIDIA C2050.

Multi-device configurations. The offload model of the compiler for the Intel MIC
Architecture directly supports multiple coprocessors. All Intel MIC Architecture
devices in the system are uniquely identified by an integer number and can be
selected by their ID in the offload pragma. For streaming applications, only the
length of the offloaded arrays has to be adjusted according to the number of
available devices. This boils down to simple mathematical expressions involv-
ing the array length, number of devices, and device ID. OpenCL multi-device
support is based on a replication of API objects such as buffers, kernels, and
command queues. Instead of simple handles for arrays, a second level of handles
must be introduced to keep track of arrays on different devices. This complicates
the implementation as it requires additional boilerplate code.
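For the MIC side, a hedged sketch of such a split over two coprocessors could look as follows; it reuses the variable names of Listing 1.1 (single-precision data assumed), but the chunking logic is our illustration and, for brevity, the offloads are shown synchronously (in practice they would be made asynchronous, e.g. with the signal/wait clauses, so that both devices compute concurrently):

/* Assumed to run in the same context as Listing 1.1 (ptrLevel, ptrIndex,
 * ptrAlpha, ptrData, ptrY, noGrid, noInst, dims already set up). */
int num_devices = 2;
int chunk = noInst / num_devices;

for (int dev = 0; dev < num_devices; dev++) {
    int offset = dev * chunk;
    int len    = (dev == num_devices - 1) ? (noInst - offset) : chunk;

    /* Each device receives the full grid but only its slice of the dataset. */
    float *dataChunk = ptrData + (size_t)dims * offset;
    float *yChunk    = ptrY + offset;

    #pragma offload target(mic:dev) \
        in(ptrLevel : length(dims*noGrid)) in(ptrIndex : length(dims*noGrid)) \
        in(ptrAlpha : length(noGrid)) in(noGrid, len, dims) \
        in(dataChunk : length(dims*len)) out(yChunk : length(len))
    {
        multBT(ptrLevel, ptrIndex, ptrAlpha, dataChunk, yChunk, noGrid, len, dims);
    }
}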
As the grids with modified linear basis functions need more tuning and rewrit-
ing of the algorithm to fully exploit the GPU, we restrict ourselves to grids with
standard basis functions when evaluating the performance of the dual-device
configuration. Table 1 lists the measured results on both platforms in the last
two rows. It is obvious that the additional padding needed for the GPU has
negative effects on the dual-GPU version especially for the small grids in early
stages of the learning process. Since the Intel MIC Architecture implementation
does not need host-CPU padding, both coprocessors can unfold their full power
when dealing with small grids. For all input data, the Knights Ferry coprocessor
achieves a speed-up of at least 1.9x when adding a second device, whereas a
second NVIDIA C2050 yields a speed-up of about 1.7x.

Performance summary. The workload covered in this paper is neither com-
pute bound nor memory bound (it behaves similarly to a band matrix multiplication
but with a more compute-intensive kernel). The code can fully exploit the 16 times
bigger general-purpose L2 cache of Knights Ferry, which explains the better base-
line performance of Knights Ferry over the Tesla C2050. Table 1 shows that prefetch-
ing for MIC only slightly speeds up the compute kernel, whereas adding manual
local storage loads boosts performance of the C2050 kernel. For the smaller DR5
input data, the Fermi GPU is not able to utilize its full power, while the Intel MIC
Architecture is less sensitive to the size of the input data.

Productivity. In total, only two workdays were spent to enable SG++ for the
Intel MIC Architecture, since the tool chain of Intel Composer XE, Intel Debugger,
and Intel VTune Amplifier XE helped root-cause and fix perfor-
mance issues in a well-known workflow. Additional implementation complexity
arose from the workgroup padding needed for the C2050. We used the Visual
Compute Profiler to optimize the C2050 kernel. In total, five workdays were
required to implement the C2050 kernel. To keep the development time for the
devices comparable, all code variants have been developed by the same person
who is also one of the main developers of SG++ and has deep insight into the
mathematical structure of SG++ . Hence, we exclude the time needed to analyze
SG++ and acquaint the developer with the existing host implementation. The
developer had access to the documentation for Knights Ferry and the develop-
ment tools. For NVIDIA, the developer had access to both the official OpenCL
documents as well as best-practice guides that can be found on the Internet.
On both platforms standard dense linear algebra benchmarks are clearly above
0.5 TFLOPS, which highlights the excellent performance of the implementations
presented.

5 Conclusion
We demonstrated that Intel MIC Architecture devices can easily be used to bring
highly parallel applications into, or even beyond, GPU performance regions. Us-
ing well-known programming models like OpenMP and vectorization, the Intel
MIC Architecture minimizes the porting effort for existing high-efficiency pro-
cessor implementations. Moreover, programming for the Intel MIC Architecture
does not require any special tools since its support is integrated into the com-
plete Intel tool chain ranging from compilers over math libraries to performance
analysis tools. As future HPC systems will most likely be hybrid machines with
fat cores and coprocessors, programming for the Intel MIC Architecture eases
the burden for developers; codes developed for the CPU portion of the system
can be re-used on the coprocessor without too much of a porting effort, while
achieving a better level of performance than with GPU-based accelerators.

References
1. Bungartz, H.-J., Griebel, M.: Sparse Grids. Acta Numerica 13, 147–269 (2004)
2. CAPS Enterprise. Rapidly Develop GPU Accelerated Applications (2011)
3. Intel Corporation. Pentium Processor 75/90/100/120/133/150/166/200, Order
Number 241997-010 (1997)
4. Intel Corporation. Intel Xeon Processor X5680 (2010), http://ark.intel.com
(last accessed August 18, 2011)
5. Intel Corporation. Intel Array Building Blocks (2011),
http://software.intel.com/en-us/articles/intel-array-building-blocks/
(accessed June 15, 2011)
6. Intel Corporation. Intel CilkTM Plus Language Specification, Document Number
324396-001US (2011)
7. Intel Corporation. Introducing Intel Many Integrated Core Architecture (2011),
http://www.intel.com/technology/architecture-silicon/mic/index.htm
(accessed June 15, 2011)
8. Lee, A., et al.: On the Utility of Graphics Cards to Perform Massively Parallel
Simulation of Advanced Monte Carlo Methods. Journal of Computational and
Graphical Statistics 19(4), 769–789 (2010)
9. Seiler, L., et al.: Larrabee: a Many-core x86 Architecture for Visual Computing.
ACM Trans. Graph. 27(3), 18:1–18:15 (2008)
10. Khronos OpenCL Working Group. The OpenCL Specification, Version 1.1 (2010)
11. Heinecke, A., Pflüger, D.: Multi- and many-core data mining with adaptive sparse
grids. In: Proc. of the 2011 ACM Intl. Conf. on Computing Frontiers (2011)
12. NVIDIA. Next Generation CUDATM Compute Architecture: FermiTM (2010)
13. NVIDIA. NVIDIA CUDATM C Programming Guide (2011)
14. NVIDIA. OpenCLTM Best Practices Guide (2011)
15. OpenMP Architecture Review Board. OpenMP Application Program Interface,
Version 3.0 (2008)
16. Pflüger, D.: Spatially Adaptive Sparse Grids for High-Dimensional Problems. Dis-
sertation, Institut für Informatik, TUM, München (2010)
17. Reinders, J.: Intel Threading Building Blocks. O’Reilly, Sebastopol (2007)
18. Skaugen, K.: Petascale to Exascale. Keynote speech at the Intl. Supercomputing
Conf. 2010 (2010)
19. The Portland Group. PGI Accelerator Compilers (2011),
http://www.pgroup.com/resources/accel.htm (accessed June 15, 2011)
20. Volkov, V., Demmel, J.W.: Benchmarking GPUs to Tune Dense Linear Algebra.
In: Proc. of the 2008 ACM/IEEE Conf. on Supercomputing, pp. 31:1–31:11 (2008)
21. Yelick, K.: Exascale Computing: More and Moore? Keynote speech at the
2011 ACM Intl. Conf. on Computing Frontiers (2011)

Intel, Pentium, and Xeon are trademarks or registered trademarks of Intel Corporation
or its subsidiaries in the United States and other countries.
* Other brands and names are the property of their respective owners.
** Performance tests are measured using specific computer systems, components, soft-
ware, operations, and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and performance tests to as-
sist you in fully evaluating your contemplated purchases, including the performance of
that product when combined with other products. System configuration: Intel Shady
Cove Platform with 2S Intel Xeon processor X5680 [4] (24GB DDR3 with 1333 MHz,
SLES11.1) and single Intel 5520 IOH, Intel Knights Ferry with D0 ED silicon (GDDR5
with 3.6 GT/sec, driver v1.6.501, flash image/micro OS v1.0.0.1140/1.0.0.1140-EXT-
HPC, Intel Composer XE for MIC v048), and NVIDIA C2050 (GDDR5 with 3.0
GT/sec, driver v270.41.19, CUDA 4.0).
VHPC 2011: 6th Workshop on Virtualization
in High-Performance Cloud Computing

Michael Alexander¹ and Gianluigi Zanetti²
¹ IBM, Austria
² CRS4, Italy

Virtualization has become a common abstraction layer in modern data centers,
enabling resource owners to manage complex infrastructure independently of
their applications. Conjointly virtualization is becoming a driving technology
for a manifold of industry grade IT services. The cloud concept includes the no-
tion of a separation between resource owners and users, adding services such as
hosted application frameworks and queuing. Utilizing the same infrastructure,
clouds carry significant potential for use in high-performance scientific comput-
ing. The ability of clouds to provide for requests and releases of vast computing
resource dynamically and close to the marginal cost of providing the services is
unprecedented in the history of scientific and commercial computing. Distributed
computing concepts that leverage federated resource access are popular within
the grid community, but have not seen previously desired deployed levels so far.
Also, many of the scientific datacenters have not adopted virtualization or cloud
concepts yet. This workshop aims to bring together industrial providers with the
scientific community in order to foster discussion, collaboration and mutual ex-
change of knowledge and experience. This year’s workshop featured 9 papers on
diverse topics in HPC virtualization. Papers of note include Kim et al. propos-
ing group-based cloud memory deduplication along with Nanos et al. presenting
results from a high-performance cluster interconnect prototype for VMs with a
user-level RDMA protocol over standard 10Gbps Ethernet. The chairs would like
to thank the Euro-Par organizers and the members of the program committee
along with the speakers and attendees, whose interaction contributed to a stim-
ulating environment. VHPC is planning to continue the successful co-location
with Euro-Par in 2012.
Group-Based Memory Deduplication
for Virtualized Clouds

Sangwook Kim¹, Hwanju Kim², and Joonwon Lee¹
¹ Sungkyunkwan University, Suwon, Gyeonggi-do, Korea
² Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
swkim@csl.skku.edu, joonwon@skku.edu, hjukim@calab.kaist.ac.kr

Abstract. In virtualized clouds, machine memory is known as a re-
source that primarily limits consolidation level due to the expensive cost
of hardware extension and power consumption. To address this limita-
tion, various memory deduplication techniques have been proposed to
increase available machine memory by eliminating memory redundancy.
Existing memory deduplication techniques, however, lack isolation sup-
port, which is a crucial factor of cloud quality of service and trustworthi-
ness. This paper presents a group-based memory deduplication scheme
that ensures isolation between customer groups colocated in a physical
machine. In addition to isolation support, our scheme enables per-group
customization of memory deduplication according to each group’s mem-
ory demand and workload characteristic.

Keywords: Memory deduplication, Isolation, Cloud computing.

1 Introduction
An intrinsic trade-off between efficient resource utilization and performance isola-
tion arises in cloud computing environments where various services are provided
based on a shared pool of computing resources. For high resource utilization,
cloud providers typically service a virtual machine (VM) as an isolated compo-
nent and enable multiple VMs to share underlying physical resources. Although
aggressive resource sharing among customers gives a provider more profit, perfor-
mance interference from the sharing could degrade quality of service customers
expect. Performance isolation that ensures quality of service makes it difficult for
providers to increase VM consolidation level. Many researchers have addressed
this conflicting goal focusing on several sharable resources [2,4,5,7,12].
Among those sharable resources, machine memory is known as a resource
that primarily inhibits high degree of consolidation due to the expensive cost
of hardware extension and power consumption [6]. In order to deal with the
memory space restriction, memory deduplication has drawn traction as a way
of increasing available memory by eliminating redundant memory. Since the
memory deduplication was introduced by the VMware ESX server [15], it has

This work was supported by the National Research Foundation of Korea(NRF) grant
funded by the Korea government(MEST) (No. 2011-0000371).

been extensively studied how to effectively find redundant memory and how to take
advantage of the saved memory [6,11]. Due to its effectiveness in reducing memory
footprint for hosting requested VM instances, memory deduplication has been
appealing to cloud providers who aim to save the total cost of ownership.
Existing memory deduplication techniques, however, lack the functionality of
performance isolation in spite of their efficiency. The problem stems from the
system-wide operation of memory deduplication across all VMs that reside in
a physical machine. In virtualized clouds, a physical machine can accommodate
several VMs that belong to different customers who do not want their sensitive
memory contents to be shared with other customers' VMs. Existing schemes
do not provide a knob to confine the deduplication process to a group of VMs
that want to share their memory with one another (e.g., VMs of the same customer
or of cooperative customers). In addition, the resource usage for system-wide dedupli-
cation cannot be properly accounted to corresponding VMs that are involved in
sharing. Since resource usage for memory deduplication itself is nontrivial [11,8],
appropriate accounting for the expense of deduplication is a requisite support
for cloud computing, which typically employs pay-per-use model.
This paper proposes a group-based memory deduplication scheme that allows
the hypervisor to run multiple deduplication threads, each of which is in charge
of its designated group. Our scheme provides an interface for a group of VMs,
which want to share their memory, to be managed by a dedicated deduplication
thread. The group-based memory deduplication has the following advantages.
Firstly, memory contents of one group are securely protected from another group.
This feature prevents security breaches that exploit memory deduplication [14].
Secondly, the resource usage of deduplication is properly accounted to a cor-
responding group. Thirdly, a deduplication thread can be customized based on
the characteristics and demands of its group. For example, deduplication rates
(i.e., scanning rates) can be differently set for each group based on workloads.
Finally, memory pages that are reclaimed by a per-group deduplication thread
can be readily redistributed to their corresponding group.
The rest of this paper is organized as follows: Section 2 describes the back-
ground and motivation behind this work. Section 3 explains the design and
implementation of the group-based memory deduplication. Then, Sect. 4 shows
experimental results and Sect. 5 discusses issues arising in our scheme and fur-
ther improvement. Finally, Sect. 6 presents our conclusion and future work.

2 Background and Motivation
2.1 Memory Deduplication
Memory deduplication is a well-known technique that condenses physical mem-
ory space by eliminating redundant data loaded in memory. In VM-based consol-
idated environments, a considerable amount of memory can be duplicated across
VMs especially when they have homogeneous software stacks such as OSes and
applications or work on common working set on a shared storage. By reclaim-
ing redundant memory, the hypervisor can give more memory to a VM whose
working set exceeds its memory in order to improve performance. In addition,
an increase in available memory allows more VMs to run in a physical machine,
thereby increasing consolidation level. One representative scheme of memory
deduplication is a transparent content-based page sharing [15], which was firstly
introduced by the VMware ESX server. This scheme periodically scans physical
memory, compares scanned pages based on their contents, merges them if they
are identical, and reclaims redundant memory. In order to ensure transparency, a
shared page is marked as copy-on-write, by which the shared page will be broken
to private copies in response to a write attempt to it.

2.2 Performance Isolation in Clouds
Cloud computing is an emerging technology trend from the perspective of elastic
and utility computing on a large shared pool of resources. Among various types
of cloud computing, Infrastructure-as-a-Service (IaaS) platform provides a cus-
tomer with the entire control of software stack in the form of a VM. Provisioned
VMs could share the resources of a physical machine according to their service
level agreement (SLA). Transparently enabling multiple VMs to share physical
resources, cloud providers reduce the number of machines that host requested
VM instances, thereby saving the total cost of ownership.
Despite the cost saving, sharing cloud resources intrinsically causes perfor-
mance interference between individual VMs. Since cloud computing typically
complies with a pay-per-use model, the performance a customer expects should
not be interfered by other customers’ instances. Many researchers have empha-
sized that sharable hardware resources such as last-level CPU caches [12], ma-
chine memory [4], and even entire components of hardware [7], should be properly
isolated from each VM.

2.3 Limitations of Memory Deduplication in Clouds
Although memory deduplication improves the performance and consolidation
level by exploiting saved memory, existing schemes lack the functionality that
ensures isolation among customer instances. The memory deduplication process
of the current techniques is globally conducted by the hypervisor. This system-
wide memory deduplication poses several issues on performance isolation and
trustworthiness.
Firstly, memory contents that come from different customer VMs can be shared.
This type of sharing across customer boundary may be unwanted because
memory contents could contain sensitive information. In fact, attacks that exploit
security breaches of memory deduplication were addressed [11,14]. Secondly, com-
putational overheads for memory deduplication are not properly accounted to each
corresponding customer. Memory deduplication entails computational tasks in-
cluding scanning, hashing, byte-by-byte comparison, and copy-on-write breaking.
Since these tasks are done in a single execution context, their CPU usages can-
not be billed to an appropriate VM whose memory is involved in deduplication.
Thirdly, a deduplication rate should be globally set without considering
customers’ demands or workload characteristics. The pace at which identical pages
are shared determines a reprovisioning rate of reclaimed memory, which contributes
to performance improvement.

3 Group-Based Memory Deduplication
3.1 Design
Our group-based memory deduplication scheme provides a mechanism that sup-
ports multiple deduplication threads, each of which is dedicated to each group de-
fined by administrators. The interface for grouping VMs is exposed to user space so
that administrators can readily adjust grouping policies and per-group customiza-
tion on the fly. Figure 1 illustrates the architecture overview of our mechanism.
In order to ensure isolation between groups, each deduplication thread only handles
the virtual address spaces that are registered to its group. Accordingly, all dedu-
plication operations are solely carried out on a designated memory space of each
group. Since a deduplication thread is bound to its group, the resource usage
for deduplication can simply be accounted to its group. Besides deduplication,
redistributing reclaimed memory is done within each group. As shown in Fig. 1,
a per-group deduplication thread notifies its corresponding memory redistrib-
utor of how many pages are reclaimed by its group. Each per-group memory
redistributor supplies the reclaimed memory to its group so that the VMs of the
group take advantage of increased memory.
In addition, an administrator can differently set scan rates according to each
group’s demand or workload characteristics. For example, a high scan rate can
be set to a group if its customer wants aggressive scanning in favor of addi-
tional memory, by which performance benefits outweigh scanning overheads.
Conversely, a group that has CPU-intensive workloads with enough memory
may desire a low scan rate. This per-group scan rate gives more flexibility by
allowing group-specific customization.

[Fig. 1. Architecture overview of the group-based memory deduplication: the administrator
registers users A and B (with scan rates S1 and S2) through the cgroup interface; a
dedicated deduplication thread reclaims machine memory for each group of VMs, and a
per-group memory redistributor returns the reclaimed memory to that group]
3.2 Implementation
We implemented the prototype of our scheme by extending the Linux KSM [1].
The current KSM conducts system-wide memory deduplication over virtual ad-
dress spaces that are registered via the madvise system call. When Kernel Virtual
Machine (KVM) [9] creates a VM instance, it automatically registers the VM’s
entire memory regions to KSM. Once KSM is initiated, a global deduplication
thread, named ksmd, performs deduplication with respect to all VM’s memory
regions. For group-based memory deduplication, we modified this system-wide
deduplication algorithm by splitting the global ksmd into per-group ksmds. Each
per-group ksmd operates with its own data structures that are completely iso-
lated from other ksmds.
For a grouping interface, we used the cgroup [10], which is a general component
to group threads via the Linux VFS. We added the KSM cgroup subsystem for
administrators or user applications to easily define deduplication groups. Each
group directory includes several logical files, which indicate a scan rate and the
number of shared pages to interact with its per-group ksmd.
Taking advantage of the cgroup interface, the memory redistributor is simply
implemented as a user-level script. This script periodically checks the number
of reclaimed pages for each group and reprovisions them to a corresponding
group by interacting with a guest-side balloon driver. Regarding intra-group
reprovisioning, our current version evenly supplies given memory to VMs within
a group. However, more sophisticated policies can be applied by using working
set estimation techniques.
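As a rough sketch of this periodic check-and-reprovision loop (the prototype uses a user-level script; the C version below, the cgroup file path, the counter name, and the even two-VM split are purely illustrative assumptions):

#include <stdio.h>
#include <unistd.h>

/* Read a single integer counter from a (hypothetical) KSM cgroup file. */
static long read_counter(const char *path) {
    long v = 0;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &v) != 1) v = 0;
        fclose(f);
    }
    return v;
}

int main(void) {
    const long page_kib = 4;   /* 4 KiB pages */
    for (;;) {
        long shared = read_counter("/cgroup/ksm/groupA/ksm.pages_shared");
        long per_vm = (shared * page_kib) / 2;   /* even split over 2 VMs */
        printf("reprovision +%ld KiB to each VM in groupA\n", per_vm);
        /* A real redistributor would now raise each VM's balloon target
         * (through the guest-side balloon driver) instead of printing. */
        sleep(5);
    }
    return 0;
}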
4 Evaluation
In this section, we present preliminary evaluation results to show how the group-
based memory deduplication scheme impacts memory sharing and redistribution
behavior.
4.1 Experimental Environments
Our prototype is installed on a machine with Intel i5 quad core CPU 760
2.80GHz, 4GB of RAM, and two 1TB HDDs. This host machine runs Ubuntu
10.10 with the qemu-kvm 0.14.0 and our modified Linux kernel 2.6.36.2. We com-
pared our scheme, called GRP, with two baseline schemes: NOGRP-equal and
NOGRP-SE. While the two baselines have non-group memory deduplication in
common, they have different reprovisioning policies. NOGRP-equal reprovisions
reclaimed memory evenly to existing VMs, whereas NOGRP-SE gives a VM
reclaimed memory in proportion to its sharing entitlement, which means how
much contribution a VM makes to save memory; this reprovisioning scheme was
proposed by Milós et al. [11]. For example, if two VMs contribute all of the reclaimed
pages, they deserve to receive all of the additional memory they contributed.
spective of isolation, we believe that this scheme is more suitable than the equal
reprovisioning for cloud environments.
We evaluated a two-group scenario where each group has two VMs configured
as follows:
– MR group includes two VMs that run a distributed wordcount on the
Hadoop MapReduce framework. Hadoop slave instances concurrently com-
pute with a 200MB input file in the two VMs, one of which is also in charge
of the master for controlling the slaves. This group uses Ubuntu 10.10 as a
guest OS.
– FIO group includes two VMs, each of which runs a random read workload on
a 700MB common data set. We used sysbench and measured average through-
put for 400 seconds. This group uses Fedora 14 as a guest OS.

To minimize interference between groups, we used the cpu, cpuset, and blkio cgroup
subsystems for both NOGRP and GRP. The NOGRP baselines let the global ksmd
belong to its own group, while our scheme makes each per-group ksmd belong to
its corresponding group so that deduplication cost is accounted to that group. The
groups of the main workloads, including the ksmd group (in the NOGRP case), have
sufficiently higher CPU shares than other system threads in order to minimize the
effect of system daemon activities.

4.2 Effects of Group-Based Memory Deduplication

We evaluated the performance and memory changes with sharing trends for two
configurations, in which one group has enough memory to cover its working set
while the other does not. FIO_low indicates that the FIO group does not have
enough memory to cover its working set (MR-VM:FIO-VM=640MB:640MB),
whereas MR_low indicates that the MR group lacks memory for its working set
(MR-VM:FIO-VM=384MB:896MB). With respect to our scheme, we varied scan
rates for each group; GRP-x:y means the ratio of scan rates for MR and FIO.
To compare the performance across all policies, we make the sum of scan rates
for each policy equal (10,000 pages/sec).
Figure 2 shows the normalized throughput of each group for different policies.
The first thing to note is that the two NOGRP schemes show different performance. In
the case of FIO_low, the FIO group of NOGRP-equal shows much higher performance than
that of NOGRP-SE. To investigate this difference, Fig. 3 shows the changes in
memory for each VM with the amount of reclaimed memory as time progresses.
For both cases, the MR group emits a large amount of reclaimed memory for
25–60 seconds. Although the MR group contributes the reclaimed pages during
this period, NOGRP-equal reprovisions them evenly to the two groups. Since the
FIO group lacks memory in FIO_low, this aid of additional memory boosts its
performance. Furthermore, the increased memory helps the
FIO group make more reclaimed memory by sharing more pages. On the other
hand, NOGRP-SE reprovisions the initial reclaimed memory to only the MR
group based on its sharing entitlement, so that the FIO group cannot benefit
from any additional memory during the initial period.
Conversely, in the case of MR_low, the MR group of NOGRP-SE achieves
higher performance than that of NOGRP-equal. As shown in Fig. 4, NOGRP-
SE makes the MR group quickly receive more memory contributed by its own
sharing during the initial period, thereby boosting the performance of the MR
group.

[Fig. 2. Normalized performance for NOGRP-equal, NOGRP-SE, and GRP with various
scan rates (x:y is the ratio of scan rates for MR:FIO); panels: (a) FIO_low, (b) MR_low]

[Fig. 3. Memory changes in the NOGRP cases with reclaimed memory (FIO_low);
panels: (a) NOGRP-equal, (b) NOGRP-SE]

[Fig. 4. Memory changes in the NOGRP cases with reclaimed memory (MR_low);
panels: (a) NOGRP-equal, (b) NOGRP-SE]

The results of FIO_low and MR_low imply that neither of the non-group
schemes (NOGRP-equal and NOGRP-SE) always achieves the best performance,
since each group’s memory demands are different.
Figure 2 also shows the results of the group-based memory deduplication with
various scan rate settings. As shown in the figure, the best performance results
are achieved at certain scan rate ratios: 1:9 for FIO_low and 9:1 for MR_low. It is
intuitive that a higher scan rate makes a group that lacks memory quickly reap
additional memory, thereby improving its performance.

[Fig. 5. Memory changes in the best performance cases of GRP with reclaimed memory;
panels: (a) GRP-1:9 (FIO_low), (b) GRP-9:1 (MR_low)]

Figure 5 shows the two
cases of the best performance. As expected, a high scan rate quickly produces
reclaimed memory, which is then reprovisioned to a group that desires more
memory. Although a low scan rate slowly emits a small amount of reclaimed
memory, the performance of a group that has enough memory is not affected.
As a result, the group-based memory deduplication can achieve the best per-
formance if a scan rate for each group is appropriately chosen. Although
NOGRP-SE is currently the most suitable approach for clouds, owing to its
contribution-proportional reprovisioning, it has no room for customization on the
basis of each group's memory demand and workload characteristics. In Sect. 5.3,
we discuss our plan
to devise the dynamic adjustment of per-group scan rates.

5 Discussion
In this section, we discuss the promising applicability of group-based dedupli-
cation, focusing on VM colocation, various grouping policies, and feasible cus-
tomization of per-group deduplication.

5.1 VM Colocation
For the group-based memory deduplication to be effective, multiple VMs within
the same group should be colocated in a physical machine. Assuming that a
group is established based on a customer, there are several cases to colocate
VMs from the same customer. Firstly, as novel hardware (e.g., many-core proces-
sors and SR-IOV network cards) has been increasingly supporting consolidation
scalability [7], a physical machine becomes capable of colocating an increasing
number of VMs. This trend increases the likelihood that VMs from the same
customer are colocated. Secondly, VM colocation policies that favor cloud-wide
resource efficiency (e.g., memory footprint [16] and network bandwidth [13])
would encourage a cloud provider to colocate VMs from the same customer.
For example, if a cloud customer leases VMs for distributed computing on the
MapReduce framework, the VMs have homogeneous software stack, common
working set, and much communication traffic among them. In this case, a cloud
provider seeks to colocate such VMs in a physical machine for efficiency as long
as their SLAs are satisfied.
Even when the same customer's VMs are not colocated, there are still chances
to take advantage of the group-based memory deduplication. As cloud computing
has been embracing various services, there are growing opportunities to share
data among related services. CloudViews [3] presents a blueprint of rich data
sharing among cloud-based Web services. We expect that such direction allows
our scheme to group cooperative customers who agree with data sharing. In
addition, intra-VM memory deduplication may not be negligible depending on
workloads when a VM is solely located in a group. Some scientific workloads
have a considerable amount of duplicate pages in native environments [1].

5.2 Grouping Policies
We are currently considering various grouping policies other than the customer-
based isolation policy. Intuitively, VMs can be grouped based on their sharing
opportunities likely attained by the common software stack and working set [15].
To this end, we can statically group the same virtual appliances or distributed
computing nodes. For dynamic grouping, a cloud provider can figure out sharing
opportunities by keeping track of memory fingerprint on the fly. In the case of
clouds, which do not allow arbitrary grouping across independent customers,
providers can offer their customers a grouping option that benefits from more
available memory by sharing in a symbiotic manner.
Similarly, a cloud provider can offer a pricing model that provides best-effort
available memory with a lower-bound guarantee. Note that the additional
memory reprovisioned via deduplication can be returned by copy-on-write break-
ing at any time. For this reason, such additional memory is provided to customers
in a best-effort manner. The group-based memory deduplication can group VMs
that participate in this type of memory provisioning. Nathuji et al. [12] proposed
this type of pricing model with respect to CPU capacity offering.

5.3 Per-group Deduplication Customization
The group-based memory deduplication enables per-group customization for
deduplication process. As shown in Sect. 4, the performance of applications that
require more memory for covering their working set relies on memory repro-
visioning rates. Based on the results, we are extending our scheme to support
dynamic deduplication rates by monitoring workloads for each group. Currently,
we take two metrics into account for scanning rate adjustment.
Firstly, the hypervisor can monitor how many pages are being reclaimed for
each group during a certain time window. When VMs in a group abruptly start
loading a large amount of identical pages, the number of pages shared will rapidly
increase. In this case, a higher scanning rate boosts the reprovisioning rate of ad-
ditional memory, thereby improving the performance. Secondly, when workloads
in a group become CPU-intensive, a high deduplication rate may degrade their
performance due to deduplication overheads. Since the deduplication process
may pollute CPU caches and consume memory bandwidth, these overheads may
offset or outweigh the benefits of deduplication with regard to CPU-intensive
workloads. In this case, it is important to determine an appropriate rate for
overall performance by considering CPU usage and memory demands.
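Purely as an illustration of how these two metrics might drive the adjustment (the thresholds, bounds, and names below are our assumptions and not part of the prototype):

/* Adjust a group's scan rate (pages/sec) from the number of pages newly
 * shared in the last monitoring window and a CPU-intensity flag. */
unsigned int adjust_scan_rate(unsigned int rate,
                              long newly_shared_pages,
                              int cpu_intensive)
{
    const unsigned int min_rate = 1000, max_rate = 10000;

    if (newly_shared_pages > 1000 && !cpu_intensive)
        rate *= 2;        /* sharing is ramping up: reap memory faster      */
    else if (cpu_intensive)
        rate /= 2;        /* back off to limit cache/bandwidth interference */

    if (rate < min_rate) rate = min_rate;
    if (rate > max_rate) rate = max_rate;
    return rate;
}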

6 Conclusions and Future Work
In this paper, we devise a knob to group VMs that allows their memory to be
shared with one another. The proposed scheme enables the memory deduplication
process to be isolated between groups and customized based on each group’s
demand and characteristic. We believe that the group-based isolation is an es-
sential feature of memory deduplication in cloud computing environments, which
regard performance isolation and trustworthiness as crucial factors.
As discussed, we plan to explore various grouping policies and dynamic ad-
justment of deduplication rates on the basis of workload characteristics. Fur-
thermore, we are investigating a flexible reprovisioning scheme that effectively
exploits reclaimed memory to improve overall performance in the same group.

References
1. Arcangeli, A., Eidus, I., Wright, C.: Increasing memory density by using ksm. In:
Proc. OLS (2009)
2. Cucinotta, T., Giani, D., Faggioli, D., Checconi, F.: Providing Performance
Guarantees to Virtual Machines Using Real-Time Scheduling. In: Guarracino,
M.R., Vivien, F., Träff, J.L., Cannatoro, M., Danelutto, M., Hast, A., Perla,
F., Knüpfer, A., Di Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010.
LNCS, vol. 6586, pp. 657–664. Springer, Heidelberg (2011)
3. Geambasu, R., Gribble, S.D., Levy, H.M.: Cloudviews: Communal data sharing in
public clouds. In: Proc. HotCloud (2009)
4. Gordon, A., Hines, M.R., da Silva, D., Ben-Yehuda, M., Silva, M., Lizarraga, G.:
Ginkgo: Automated, application-driven memory overcommitment for cloud com-
puting. In: Proc. RESoLVE (2011)
5. Gupta, D., Cherkasova, L., Gardner, R., Vahdat, A.: Enforcing Performance Iso-
lation Across Virtual Machines in Xen. In: van Steen, M., Henning, M. (eds.)
Middleware 2006. LNCS, vol. 4290, pp. 342–362. Springer, Heidelberg (2006)
6. Gupta, D., Lee, S., Vrable, M., Savage, S., Snoeren, A.C., Varghese, G., Voelker,
G.M., Vahdat, A.: Difference engine: Harnessing memory redundancy in virtual
machines. In: Proc. OSDI (2008)
7. Keller, E., Szefer, J., Rexford, J., Lee, R.B.: Nohype: Virtualized cloud infrastruc-
ture without the virtualization. In: Proc. ISCA (2010)
8. Kim, H., Jo, H., Lee, J.: XHive: Efficient cooperative caching for virtual machines.
IEEE Transactions on Computers 60(1), 106–119 (2011)
9. Kivity, A., Kamay, Y., Laor, D., Lublin, U., Liguori, A.: KVM: The Linux virtual
machine monitor. In: Proc. OLS (2007)
10. Menage, P.B.: Adding generic process containers to the Linux kernel. In: Proc.
OLS (2007)
11. Milós, G., Murray, D.G., Hand, S., Fetterman, M.A.: Satori: Enlightened page
sharing. In: Proc. USENIX ATC (2009)
12. Nathuji, R., Kansal, A., Ghaffarkhah, A.: Q-clouds: Managing performance inter-
ference effects for qos-aware clouds. In: Proc. EuroSys (2010)
13. Sonnek, J., Greensky, J., Reutiman, R., Chandra, A.: Starling: Minimizing commu-
nication overhead in virtualized computing platforms using decentralized affinity-
aware migration. In: Proc. ICPP (2010)
14. Suzaki, K., Iijima, K., Yagi, T., Artho, C.: Memory deduplication as a threat to
the guest OS. In: Proc. EuroSec (2011)
15. Waldspurger, C.A.: Memory resource management in VMware ESX server. In:
Proc. OSDI (2002)
16. Wood, T., Tarasuk-Levin, G., Shenoy, P., Desnoyers, P., Cecchet, E., Corner, M.D.:
Memory buddies: Exploiting page sharing for smart colocation in virtulized data
centers. In: Proc. VEE (2009)
A Smart HPC Interconnect for Clusters
of Virtual Machines

Anastassios Nanos¹, Nikos Nikoleris², Stratos Psomadakis¹, Elisavet Kozyri³, and Nectarios Koziris¹
¹ Computing Systems Laboratory, National Technical University of Athens
{ananos,psomas,nkoziris}@cslab.ntua.gr
² Uppsala Architecture Research Team, Uppsala University
nikos.nikoleris@it.uu.se
³ Cornell University
ekozyri@cs.cornell.edu

Abstract. In this paper, we present the design of a VM-aware, high-
performance cluster interconnect architecture over 10Gbps Ethernet. Our
framework provides a direct data path to the NIC for applications that
run on VMs, leaving non-critical paths (such as control) to be handled
by intermediate virtualization layers. As a result, we are able to multi-
plex and prioritize network access per VM. We evaluate our design via
a prototype implementation that integrates RDMA semantics into the
privileged guest of the Xen virtualization platform. Our framework al-
lows VMs to communicate with the network using a simple user-level
RDMA protocol. Preliminary results show that our prototype achieves
681MiB/sec over generic 10GbE hardware and relieves the guest from
CPU overheads, while limiting the guest’s CPU utilisation to 34%.

1 Introduction

Nowadays, Cloud Computing infrastructures provide flexibility, dedicated exe-
cution, and isolation to a vast number of services. These infrastructures, built on
clusters of multicores, offer huge processing power; this feature makes them ideal
for mass deployment of compute-intensive applications. However, I/O operations
in virtualized environments are usually handled by software layers within the hy-
pervisor. These mechanisms multiply the numerous data paths and complicate
the way data flow from applications to the network.
In the HPC world, applications utilize adaptive layers to overcome limita-
tions that operating systems impose in order to ensure security, isolation, as
well as fairness in resource allocation and usage. To avoid the overhead asso-
ciated with user-to-kernel–space communication, cluster interconnects adopt a
user-level networking approach. However, when applications access I/O devices
without regulation techniques, security issues arise and hardware requirements
increase. Currently, only a subset of the aforementioned layers is implemented
in virtualization platforms.

In this paper, we propose a framework capable of providing VMs with HPC
interconnect semantics. We examine the implications of bypassing common net-
work stacks and explore direct data paths to the NIC. Our approach takes advan-
tage of features found in cluster interconnects in order to decouple unnecessary
protocol processing overheads from guest VMs and driver domains. To evaluate
our design, we develop a lightweight RDMA protocol over 10G Ethernet and in-
tegrate it in the Xen virtualization platform. Using network microbenchmarks,
we quantify the performance of our prototype. Preliminary results indicate that
our implementation achieves 681MiB/sec with negligible CPU involvement on
the guest side, while limiting CPU utilization on the privileged guest to 34%.

2 Background and Related Work
In virtualization environments, the basic building blocks of the system (i.e. CPUs
and memory) are multiplexed by the Virtual Machine Monitor (VMM). In Par-
aVirtualized (PV) [1] VMs, only privileged instructions are trapped into the
VMM; unprivileged operations are carried out directly on hardware. Since this
is the common case for HPC applications, nearly all overheads from intermedi-
ate virtualization layers in an HPC context are associated with I/O and memory
management. Data access is handled by privileged guests called driver domains
that help VMs interact with the hardware via a split driver model. Driver do-
mains, host a backend driver, while guest VM kernels host frontend drivers,
exposing a per-device class API to guest user– or kernel–space.
With SR/MR-IOV [2] VMs exchange data with the network using a direct
VM-to-NIC data path provided by a combination of hardware and software tech-
niques: thus, device access by multiple VMs is multiplexed in firmware running
on the hardware itself, bypassing the VMM on the critical path.
Overview of the Xen Architecture. Xen [3] is a popular VMM that uses PV. It
consists of a small hypervisor, driver domains, and the VMs (guest domains).
Xen Memory Management: In Xen, memory is virtualized in order to provide
contiguous regions to OSes running on guest domains. This is achieved by adding
a per-domain memory abstraction called pseudo-physical memory. So, in Xen,
machine memory refers to the physical memory of the entire system, whereas
pseudo-physical memory refers to the physical memory that the OS in any guest
domain is aware of.
Xen PV Network I/O: Xen's PV network architecture is based on a split driver
model. Guest VMs host the netfront driver, which exports a generic Ethernet API
to kernel-space. The driver domain hosts the hardware-specific driver and the
netback driver, which communicates with the frontend via an event channel
mechanism and injects frames to the NIC via a software bridge.
Xen Communication Mechanisms: As communication between the frontend and
the backend is a major part of PV, we briefly describe Xen's doorbell mechanisms.
Grant Mechanism: To efficiently share pages across guest domains, Xen exports a
grant mechanism. Xen's grants are stored in grant tables and provide a generic
mechanism for memory sharing between domains. Event Channels: Two guests can
initialize an event channel between them and then exchange events that trigger
the execution of the corresponding handlers.
High-performance Interconnects: Typical HPC applications utilize mechanisms
to overcome limitations imposed by general purpose operating systems. These lay-
ers are usually: (a) communication libraries (MPI), (b) mechanisms that bypass
OS kernels to optimize process scheduling and device access (user-level network-
ing, zero-copy, page-cache bypass, etc.). High-performance communication pro-
tocols comprise the backend layers of popular parallel programming frameworks
(e.g. MPI). These protocols run on adapters that export part of the network in-
terface to user–space via endpoints. 10G Ethernet: While VMs can communicate
with the network over TCP, UDP, or even IP protocol layers, this choice entails
unwanted protocol processing. In VM environments, HPC protocol messages are
encapsulated into TCP/IP datagrams, so significant latency ensues.
10G Ethernet and its extensive use in cluster interconnects has given rise
to a large body of literature on optimizing upper-level protocols, specifically,
protocol handling and processing overheads [4,5,6]. Recent advances in virtual-
ization technology have minimized overheads associated with CPU or memory
sharing. However, I/O is a completely different story: intermediate virtualiza-
tion layers impose significant overheads when multiple VMs share network or
storage devices [7,8]. Previous work on this limitation has mainly focused on
PV. Menon et al. [9] propose optimizations of the Xen network architecture
by adding new capabilities to the virtualized network interface (scatter/gather
I/O, TCP/IP checksum offload, TCP segmentation offload). [10] enhances the
grant mechanism, while [11] proposes the extension of the VMM scheduler for
real-time response support. The authors in [12] and [13] present memory-wise
optimizations to the Xen networking architecture. While all the aforementioned
optimizations appear ideal for server-oriented workloads, the TCP/IP stack im-
poses a significant overhead when used for a message passing library, which is
standard practice in HPC applications. Contrary to the previous approaches, Liu
et al. [14] describe VMM-bypass I/O over Infiniband. Their approach is novel
and based on Xen’s split driver model. In [15], the authors present the design of
a similar framework using Myrinet interfaces. We build on this idea, but instead
of providing a virtualized device driver for a cluster interconnect architecture,
we develop a framework that forwards requests from the VM’s application space
to the native device driver.

3 Design and Implementation
Our approach is built on the following pillars: (a) a library which provides an application interface to the guest's user-space; (b) a frontend that forwards guest applications' requests to lower-level virtualization layers; (c) a backend that multiplexes requests to access the network.
Main Components. The user-space library exports the basic API which defines
the primitive operations of our protocol. Processes issue commands via their
endpoints (see section 2), monitor the endpoints’ status and so on.
The API defines functions to handle control messages for opening / closing an
endpoint, memory registration and RDMA read / write. These primitive oper-
ations can be used to implement higher level communication substacks, such as
MPI or shared memory libraries. Our approach exports basic RDMA semantics
to VM’s user-space using the following operations:
Initialization: The guest side of our framework is responsible for setting up an
initial communication path between the application and the backend. Frontend-
Backend communication: This is achieved by utilizing the messaging mechanism
between the VM and the backend. This serves as a means for applications to
instruct the backend to transmit or wait for communication, and for the back-
end to inform the guest and the applications of error conditions or completion
events. We implemented this mechanism using event channels and grant refer-
ences. Export interface instance to user-space: To support this type of mechanism
we utilize endpoint semantics. The guest side provides operations to open and
close endpoints, in terms of allocating or deallocating and memory mapping
control structures residing on the backend.
Memory registration: In order to perform RDMA operations from user-space buffers, applications have to inform the kernel to exclude these buffers from memory handling / relocation operations. To transfer data from application buffers to the network, the backend needs to access memory areas. This happens as follows: the frontend pins the memory pages, grants them to the backend and the latter accepts the grant in order to gain access to these pages.
Fig. 1.
An I/O Translation Look-aside Buffer (IOTLB) is used to cache the translations
of pages that will take part in communication. This approach ensures the valid-
ity of source and destination buffers, while enabling secure and isolated access
multiplexing. Guest-to-Network: The backend performs a look-up in the IOTLB,
finds the relevant machine address and informs the NIC to program its DMA
engines to start the transfer from the guest’s memory. The DMA transfer is
performed directly to the NIC and as a result, packets are encapsulated into
Ethernet frames, before being transmitted to the network. We use a zero-copy
technique on the send path in order to avoid extra, unnecessary copies. Packet
headers are filled in the backend and the relevant (granted) pages are attached
to the socket buffer. Network-to-Guest: When an Ethernet frame is received from
the network, the backend invokes the associated packet handler. The destination
virtual address and endpoint are defined in the header so the backend performs
a look-up on its IOTLB and performs the necessary operations. Data are
then copied (or DMA’d) to the relevant (already registered) destination pages.
Wire protocol: Our protocol’s packets are encapsulated into Ethernet frames con-
taining the type of the protocol (a unique type), source and destination MAC
addresses.
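As an illustration of the user-space side, the following C interface sketches how the operations described above (endpoint open/close, memory registration, RDMA write) could be exposed to guest applications. The names and signatures are hypothetical stand-ins, not the authors' actual library interface.

#include <stddef.h>
#include <stdint.h>

typedef struct endpoint endpoint_t;   /* opaque handle backed by memory-mapped
                                         control structures on the backend     */

endpoint_t *ep_open(int endpoint_id); /* allocate/map backend control structures */
void        ep_close(endpoint_t *ep); /* unmap and release them                  */

/* Pin and grant an application buffer so that the backend (and the NIC's DMA
 * engine) may access it; returns a registration handle, or -1 on failure. */
int ep_register_memory(endpoint_t *ep, void *buf, size_t len);

/* One-sided RDMA write: the backend resolves the registered pages through its
 * IOTLB and programs the NIC to DMA them directly out of guest memory. */
int ep_rdma_write(endpoint_t *ep, int local_registration,
                  const uint8_t dest_mac[6], int remote_endpoint,
                  uint64_t remote_offset, size_t len);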
Data Movement: Figure 1 shows the data paths either for control or data
movement: Proposed approach: Applications issue requests for RDMA operations
through endpoints. The frontend passes requests to the backend using the event
channel mechanism (dashed arrow, b1 ). The backend performs the necessary
operation, either registering memory buffers (filling up the IOTLB), or issuing
transmit requests to the Ethernet driver (dashed arrow, c1 ). The driver, then,
informs the NIC to DMA data from application to the on-chip buffers (dashed
arrow, d1). Ideal approach: Although the proposed approach relieves the system of processing and context-switch overheads, ideally, VMs could communicate
directly with the hardware, lowering the multiplexing authority to the NIC’s
firmware (solid arrows).
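The sketch below illustrates the backend's Guest-to-Network step in simplified C: the (endpoint, guest virtual address) pair carried by a request is translated to a machine address through the IOTLB before the page is handed to the NIC. The structure layout and names are assumptions made for illustration, not the actual implementation.

#include <stdint.h>
#include <stddef.h>

#define IOTLB_ENTRIES 1024
#define PAGE_MASK 0xFFFULL

struct iotlb_entry {
    int      endpoint;        /* owning endpoint                          */
    uint64_t guest_vaddr;     /* registered guest-virtual page address    */
    uint64_t machine_addr;    /* translation cached at registration time  */
    int      valid;
};

/* Return the machine address of a registered buffer, or 0 if the buffer was
 * never registered; unregistered requests are rejected, which preserves
 * isolation between guests. */
uint64_t iotlb_lookup(const struct iotlb_entry tlb[IOTLB_ENTRIES],
                      int endpoint, uint64_t guest_vaddr)
{
    for (size_t i = 0; i < IOTLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].endpoint == endpoint &&
            tlb[i].guest_vaddr == (guest_vaddr & ~PAGE_MASK))
            return tlb[i].machine_addr | (guest_vaddr & PAGE_MASK);
    }
    return 0;
}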

4 Performance Evaluation
We use a custom synthetic microbenchmark to evaluate our approach over our
interconnect sending unidirectional RDMA write requests. To obtain a baseline
measurement, we implement our microbenchmark using TCP sockets. TCP/IP
results were verified using netperf [16] in TCP STREAM mode and varying message
sizes. As a testbed, we used two quad-core Intel Xeon 2.4GHz machines with an Intel 5500 chipset and 4GB main memory. The network adapters used are two PCIe-4x
Myricom 10G-PCIE-8A 10GbE NICs in Ethernet mode, connected back-to-back.
We used Xen version 4.1-unstable and Linux kernel version 2.6.32.24-pvops both
for the privileged guest and the VMs. The MTU was set to 9000 for all tests. We
use 1GB memory for each VM and 2GB for the privileged guest. CPU utilization
results are obtained from /proc/stat. To eliminate Linux and Xen scheduler
effects we pinned all vCPUs to physical CPUs and assigned 1 core per VM and
2 cores for the privileged guest, distributing interrupt affinity to each physical
core for event channels and the Myrinet NICs. In the following, TCP SOCK refers
to the TCP/IP network stack and ZERO COPY refers to our proposed framework.
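For reference, CPU time can be sampled from /proc/stat as in the minimal C sketch below; the field order on the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal) is the standard procfs layout. Taking one snapshot before and one after a run and subtracting them yields the per-category CPU time.

#include <stdio.h>

struct cpu_times {
    unsigned long long user, nice, system, idle, iowait, irq, softirq, steal;
};

/* Read the aggregate "cpu" line of /proc/stat; returns 0 on success. */
int read_cpu_times(struct cpu_times *t)
{
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;
    int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &t->user, &t->nice, &t->system, &t->idle,
                   &t->iowait, &t->irq, &t->softirq, &t->steal);
    fclose(f);
    return (n == 8) ? 0 : -1;
}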

4.1 Results
To obtain a baseline for our experiments, we run the pktgen utility of the Linux
kernel. This benchmark uses raw Ethernet and, thus, this is the upper bound
of all approaches. Figure 2(a) plots the maximum achievable socket buffer pro-
duction rate when executed in vanilla Linux (first bar), inside the Privileged
Guest (second bar) and in the VM (third bar). Clearly, the PVops Linux kernel
encounters some issues with Ethernet performance, since the privileged guest
can achieve only 59% of the vanilla Linux case. As mentioned in Section 2, Xen
VMs are offered a virtual Ethernet device via the netfront driver. Unfortunately,
in the default configuration, this device does not feature specific optimizations
or accelerators and, as a result, its performance is limited to 416MiB/sec (56%
of the PVops case)1.
1 For details on raw Ethernet performance in the Xen PVops kernel see http://lists.xensource.com/archives/html/xen-users/2010-04/msg00577.html
Fig. 2. (a) Maximum achievable socket buffer production rate; (b) aggregate bandwidth; (c) one-way latency; (d) aggregate CPU time vs. RDMA message size (send and receive path)

Bandwidth and Latency:


Figure 2(b) plots the aggregate throughput of the system over TCP/IP (filled
circles) and over our framework (filled squares) versus the message size. We also
plot the Driver domain’s pktgen performance as a reference. For small messages
(<4KB) our framework outperforms TCP by a factor of 4.3 whereas for medium-
sized messages (i.e. 128KB) by a factor of 3. For large messages (>512K) our
framework achieves nearly 92% of the pktgen case (for 2MB messages) and is
nearly 3 times better than the TCP approach. The suboptimal performance of
the microbenchmark over TCP is due mainly to: (a) the complicated proto-
col stack (TCP/IP) (see Section 4.1) and (b) the unoptimized virtual Ethernet
interface of Xen.
From a latency point of view (Figure 2(c)), an RDMA message over TCP
sockets takes 77μsec to cross the network, whereas over our framework it takes
28μsec. To set a baseline latency-wise, we performed a DUMMY RDMA write: 1
byte originating from an application inside the VM gets copied to the privileged
guest, but instead of transmitting it to the network, we copy it to another VM
on the same VM container. Results from this test show that 14μsecs are needed
for 1 byte to traverse the intermediate virtualization layers.
Fig. 3. CPU time breakdown for both the driver domain and the guests: (a) CPU time breakdown for the driver domain; (b) CPU time breakdown for the VM (send and receive path)

CPU time for RDMA writes: In the HPC world, nodes participating in clusters require computational power in addition to low-latency and high-bandwidth communication.
Our approach bypasses the TCP/IP stack; we assume that, in this case, the CPU utilization of the system is reduced. In order to validate this assumption
we examine the CPU time spent in both approaches. We measure the total CPU
time when two VMs perform RDMA writes of varying message sizes over the
network (TCP and ZERO COPY approach). In Figure 2(d), we plot the CPU
time both for the driver domain and the VM. It is clearly shown that for 4K
to 32K messages the CPU time of our framework is constant, as opposed to the
TCP case where CPU time increases proportionally to the message size. When
the 64K boundary is crossed, TCP CPU time increases by an exponential factor
due to intermediate switches and copies both on the VM and the driver domain.
Our framework is able to sustain low CPU time on the Privileged Guest and
almost negligible CPU time on the VM. To further investigate the sources of
CPU time consumption, we plot the CPU time breakdown for the Privileged
Guest and the VM in Figures 3(a) and 3(b), respectively.
In the driver domain (Figure 3(a)): (a) Our framework consumes more CPU
time than the TCP case for 4KB and 8KB messages. This is due to the fact
that we use zero-copy only on the send side; on the receive side, we have to copy
data from the socket buffer provided by the NIC to pages originating from the
VM. (b) For messages larger than 32KB, our approach consumes at most 30%
CPU time of the TCP case, reaching 15% (56 vs. 386) for 32K messages. (c)
In our approach, system time is non-negligible, varying from 20% to 50%
of the total CPU time spent in the Privileged Guest. This is due to the fact
that we haven’t yet implemented page swapping on the receive path. In the VM
(Figure 3(b)): (d) Our approach consumes constant CPU time for almost all
message sizes (varying from 30μsecs to 60μsecs). This constant time is due to
the way the application communicates with the frontend (IOCTLs). However, in
the TCP case, for messages larger than 64K, CPU time increases significantly.
This is expected, as all the protocol processing (TCP/IP) is done inside the
VM. Clearly, system time is almost 60% of the total VM CPU time for 256K
messages reaching 75% for 128K. (e) Our approach exhibits negligible softirq
time (apparent mostly in the receive path). This is due to the fact that the
privileged guest is responsible for placing data originating from the network to
pages we have already pinned and granted. On the other hand, the TCP case
consumes softirq time as data travel up the TCP/IP network stack to reach the
application’s socket.

5 Conclusions

We have described the design and implementation of a VM-aware high-performance cluster interconnect architecture. Our approach integrates HPC
interconnect semantics in PV VMs using the split driver model. Specifically, we
build a framework that consists of a low-level backend driver running in the
driver domain, a frontend running in the VMs, and a user-space library that
provides applications with our protocol semantics. We implement these RDMA
semantics using a lightweight protocol and deploy network microbenchmarks to
evaluate its performance.
Our work extends the concept of user-level networking to VM-level network-
ing. Allowing VMs to interact with the network without the intervention of
unoptimized virtual Ethernet interfaces or the TCP/IP stack yields significant
performance improvements in terms of CPU utilization and throughput.
Our prototype implementation supports generic 10GbE adapters in the Xen
virtualization platform. Experimental evaluation leads to the following two re-
markable results: our framework sustains 92% (681MiB/sec over 737MiB/sec) of
the maximum Ethernet rate achieved in our system; at this maximum attainable
performance, the driver domain’s CPU utilization is limited to 34%, while the
guest’s CPU is idle.
We are confident that our approach is generic enough to be applicable to
various virtualization platforms. Although our work is focused on PV systems,
it can be easily extended by decoupling the proposed lightweight network stack
from the driver domain to dedicated guests or hardware. This way, virtualization
can gain considerable leverage in HPC application deployment from a networking
perspective. We plan to enrich our protocol semantics in order to implement low-
level backends for higher-level parallel frameworks such as MPI or MapReduce.

References
1. Whitaker, A., Shaw, M., Gribble, S.D.: Denali: Lightweight virtual machines for
distributed and networked applications. In: Proc. of the USENIX Annual Technical
Conference (2002)
2. PCI SIG: SR-IOV (2007),
http://www.pcisig.com/specifications/iov/single_root/
3. Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer,
R., Pratt, I.A., Warfield, A.: Xen and the Art of Virtualization. In: SOSP 2003:
Proc. of the 19th ACM Symposium on Operating Systems Principles, pp. 164–177.
ACM, NY (2003)
4. Recio, R., Culley, P., Garcia, D., Hilland, J.: An RDMA Protocol Specification (Version 1.0). Release Specification of the RDMA Consortium
5. Goglin, B.: Design and Implementation of Open-MX: High-Performance Message
Passing over generic Ethernet hardware. In: CAC 2008: Workshop on Commu-
nication Architecture for Clusters, held in conjunction with IPDPS 2008. IEEE
Computer Society Press, Miami (2008)
6. Shalev, L., Satran, J., Borovik, E., Ben-Yehuda, M.: IsoStack—Highly Efficient
Network Processing on Dedicated Cores. In: USENIX ATC 2010: USENIX Annual
Technical Conference (2010)
7. Youseff, L., Wolski, R., Gorda, B., Krintz, C.: Evaluating the Performance Impact
of Xen on MPI and Process Execution For HPC Systems. In: 1st Intern. Workshop
on Virtualization Techn. in Dstrb. Computing. VTDC 2006 (2006)
8. Nanos, A., Goumas, G., Koziris, N.: Exploring I/O Virtualization Data Paths for
MPI Applications in a Cluster of VMs: A Networking Perspective. In: Guarra-
cino, M.R., Vivien, F., Träff, J.L., Cannatoro, M., Danelutto, M., Hast, A., Perla,
F., Knüpfer, A., Di Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010.
LNCS, vol. 6586, pp. 665–671. Springer, Heidelberg (2011)
9. Menon, A., Cox, A.L., Zwaenepoel, W.: Optimizing network virtualization in Xen.
In: ATEC 2006: Proceedings of the Annual Conference on USENIX 2006 Annual
Technical Conference, p. 2. USENIX, Berkeley (2006)
10. Ram, K.K., Santos, J.R., Turner, Y.: Redesigning xen’s memory sharing mechanism
for safe and efficient I/O virtualization. In: WIOV 2010: Proceedings of the 2nd
Conference on I/O Virtualization, p. 1. USENIX, Berkeley (2010)
11. Dong, Y., Dai, J., Huang, Z., Guan, H., Tian, K., Jiang, Y.: Towards high-quality
I/O virtualization. In: SYSTOR 2009: Proceedings of SYSTOR 2009: The Israeli
Experimental Systems Conference, pp. 1–8. ACM, NY (2009)
12. Santos, J.R., Turner, Y., Janakiraman, G., Pratt, I.: Bridging the gap between
software and hardware techniques for I/O virtualization. In: ATC 2008: USENIX
2008 Annual Technical Conference on Annual Technical Conference, pp. 29–42.
USENIX, Berkeley (2008)
13. Ram, K.K., Santos, J.R., Turner, Y., Cox, A.L., Rixner, S.: Achieving 10 Gb/s us-
ing safe and transparent network interface virtualization. In: VEE 2009: Proceed-
ings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual
Execution Environments, pp. 61–70. ACM, NY (2009)
14. Liu, J., Huang, W., Abali, B., Panda, D.K.: High performance VMM-bypass I/O
in virtual machines. In: ATEC 2006: Proceedings of the Annual Conference on
USENIX 2006 Annual Technical Conference, p. 3. USENIX, Berkeley (2006)
15. Nanos, A., Koziris, N.: MyriXen: Message Passing in Xen Virtual Machines over
Myrinet and Ethernet. In: 4th Workshop on Virtualization in High-Performance
Cloud Computing, The Netherlands (2009)
16. Jones, R.: Netperf, http://www.netperf.org
Coexisting Scheduling Policies Boosting I/O
Virtual Machines

Dimitris Aragiorgis, Anastassios Nanos, and Nectarios Koziris

Computing Systems Laboratory,


National Technical University of Athens
{dimara,ananos,nkoziris}@cslab.ece.ntua.gr

Abstract. Deploying multiple Virtual Machines (VMs) running various types of workloads on current many-core cloud computing infrastructures
raises an important issue: The Virtual Machine Monitor (VMM) has to
efficiently multiplex VM accesses to the hardware. We argue that altering
the scheduling concept can optimize the system’s overall performance.
Currently, the Xen VMM achieves near native performance multiplex-
ing VMs with homogeneous workloads. Yet, when a mixture of VMs with different types of workloads runs concurrently, I/O performance suffers. Taking into account the complexity of designing and implementing a universal scheduler, let alone the probability that such an effort would prove fruitless, we focus on a system with multiple scheduling policies that co-
exist and service VMs according to their workload characteristics. Thus,
VMs can benefit from various schedulers, either existing or new, that are
optimal for each specific case.
In this paper, we design a framework that provides three basic coex-
isting scheduling policies and implement it in the Xen paravirtualized
environment. Evaluating our prototype we experience 2.3 times faster
I/O service and link saturation, while the CPU-intensive VMs achieve
more than 80% of current performance.

1 Introduction
Currently, cloud computing infrastructures feature powerful VM containers, that
host numerous VMs running applications that range from CPU– / memory–
intensive to streaming I/O, random I/O, real-time, low-latency and so on. VM
containers are obliged to multiplex these workloads and maintain the desirable
Quality of Service (QoS), while VMs compete for a time-slice. However, running
VMs with contradicting workloads within the same VM container leads to sub-
optimal resource utilization and, as a result, to degraded system performance.
For instance, the Xen VMM [1], under a moderate degree of overcommitment (4
vCPUs per core), favors CPU–intensive VMs, while network I/O throughput is
capped at 40%.
In this work, we argue that by altering the scheduling concept on a busy VM
container, we optimize the system’s overall performance. We propose a frame-
work that provides multiple coexisting scheduling policies tailored to the work-
loads’ needs. Specifically, we realize the following scenario: the driver domain

is decoupled from the physical CPU sets on which the VMs are executed and does
not get preempted. Additionally, VMs are deployed on CPU groups according
to their workloads, providing isolation and effective resource utilization despite
their competing demands.
We implement this framework in the Xen paravirtualized environment. Based
on an 8-core platform, our approach achieves 2.3 times faster I/O service, while
sustaining no less than 80% of the default overall CPU-performance.

2 Background
To comprehend how scheduling is related to I/O performance, in this section we briefly describe the system components that participate in an I/O operation.
Hypervisor. The Xen VMM is a lightweight hypervisor that allows multiple
VM instances to co-exist in a single platform using ParaVirtualization (PV). In
the PV concept, OS kernels are aware of the underlying virtualization platform.
Additionally, I/O is handled by the driver domain, a privileged domain having
direct access to the hardware.
Breaking down the I/O path. Assuming for instance that a VM application
transmits data to the network, the following actions will occur: i) Descending
the whole network stack (TCP/IP, Ethernet) the netfront driver (residing in the
VM) acquires a socket buffer with the appropriate headers containing the data.
ii) The netfront pushes a request on the ring (preallocated shared memory) and
notifies the netback driver (residing in driver domain) with an event (a virtual
IRQ) that there is a pending send request that it must service. iii) The netback
pushes a response to the ring and enqueues the request to the actual driver. iv) The native device driver, which is authorized to access the hardware, eventually
transmits the packet to the network.
In PV, multiple components, residing in different domains, take part in an
I/O operation (frontend: VM, backend–native driver: driver domain). The whole
transaction stalls until pending tasks (events) are serviced; therefore the targeted
vCPU has to be running. This is where the scheduler interferes.
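The hand-off between netfront and netback can be pictured with the following simplified C sketch; the ring layout and the helper names are stand-ins for Xen's shared-ring and event-channel primitives (memory barriers and error handling are omitted).

#define RING_SIZE 256

struct tx_request { unsigned int gref; unsigned int len; };  /* grant ref + size */

struct shared_ring {
    volatile unsigned int req_prod;     /* written by the netfront (VM)   */
    volatile unsigned int req_cons;     /* written by the netback (dom0)  */
    struct tx_request ring[RING_SIZE];
};

/* Hypothetical helpers standing in for the event-channel notification and the
 * hand-off to the native NIC driver. */
void notify_backend(void);
void hand_to_native_driver(const struct tx_request *req);

/* netfront, step ii): publish a request and notify the backend. */
void frontend_send(struct shared_ring *r, unsigned int gref, unsigned int len)
{
    unsigned int i = r->req_prod % RING_SIZE;
    r->ring[i].gref = gref;
    r->ring[i].len = len;
    r->req_prod++;
    notify_backend();                   /* virtual IRQ towards the driver domain */
}

/* netback, steps iii)-iv): only runs when the driver domain's vCPU is scheduled. */
void backend_poll(struct shared_ring *r)
{
    while (r->req_cons != r->req_prod) {
        hand_to_native_driver(&r->ring[r->req_cons % RING_SIZE]);
        r->req_cons++;
    }
}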
The Credit Scheduler. Currently, Xen’s default scheduler is the Credit sched-
uler and is based on the following algorithm: (a) Every physical core has a local
run-queue of vCPUs eligible to run. (b) The scheduler picks the head of the
run-queue to execute for a time-slice of 30ms at maximum. (c) The vCPU is
able to block and yield the processor before its time-slice expires. (d) Every
10ms, accounting occurs, which debits credits from the running domain. (e) New allocation of credits occurs when all domains have consumed their own. (f) A vCPU is inserted into the run-queue after all vCPUs with greater or equal priority. (g) vCPUs can be in one of 4 different priorities (ascending): IDLE, OVER,
UNDER, BOOST. A vCPU is in the OVER state when it has all its credits
consumed. BOOST is the state when one vCPU gets woken up. (h) When a
run-queue is empty or full with OVER / IDLE vCPUs, Credit migrates neigh-
boring UNDER / BOOST vCPUs to the specific physical core (load-balancing).
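The following C fragment restates points (b), (d), (f) and (g) of the algorithm as code; it is a schematic reading of the description above, not Xen's actual Credit implementation.

enum prio { PRIO_IDLE, PRIO_OVER, PRIO_UNDER, PRIO_BOOST };   /* ascending */

struct vcpu {
    enum prio prio;
    int credits;
    struct vcpu *next;                  /* link in the per-pCPU run-queue */
};

/* (b) the head of the local run-queue runs next, for at most 30ms. */
struct vcpu *pick_next(struct vcpu *runq_head) { return runq_head; }

/* (d) every 10ms, accounting debits credits from the running domain. */
void account(struct vcpu *running, int debit)
{
    running->credits -= debit;
    if (running->credits < 0)
        running->prio = PRIO_OVER;      /* all credits consumed */
}

/* (g) a vCPU that gets woken up is boosted, unless it is already OVER. */
void on_wakeup(struct vcpu *v)
{
    if (v->prio != PRIO_OVER)
        v->prio = PRIO_BOOST;
}

/* (f) insert after all vCPUs of greater or equal priority. */
void enqueue(struct vcpu **runq, struct vcpu *v)
{
    while (*runq && (*runq)->prio >= v->prio)
        runq = &(*runq)->next;
    v->next = *runq;
    *runq = v;
}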
Credit's Shortcomings: As a general-purpose scheduler, Credit, as expected, falls short in some cases. If a VM yields the processor before accounting occurs, no
credits are debited [7]. This gives the running VM an advantage over others
that run for a bit longer. BOOST vCPUs are favored unless they have their
credits consumed. As a result, in the case of fast I/O, CPU-bound domains get
neglected. Finally CPU-bound domains exhaust their time-slice and I/O-bound
domains get stalled even if data is available to transmit or receive.

3 Motivation
3.1 Related Work
Recent advances in virtualization technology have minimized overheads associ-
ated with CPU sharing when every vCPU is assigned to a physical core. As a
result, CPU–bound applications achieve near-native performance when deployed
in VM environments. However, I/O is a completely different story: intermediate
virtualization layers impose significant overheads when multiple VMs share net-
work or storage devices [6]. Numerous studies present significant optimizations
on the network I/O stack using software [5,8] or hardware approaches [3].
These studies attack the HPC case, where no CPU over-commitment occurs.
However, in service-oriented setups, vCPUs that belong to a vast number of
VMs and run different types of workloads, need to be multiplexed. In such a
case, scheduling plays an important role.
Ongaro et al. [7] examine Xen's Credit Scheduler and expose its vulner-
abilities from an I/O performance perspective. The authors evaluate two basic
existing features of Credit and propose run-queue sorting according to the credits
each VM has consumed. Contrary to our approach, based on multiple, co-existing
scheduling policies, the authors in [7] optimize an existing, unified scheduler to
favor I/O VMs.
Cucinotta et al. [2], in the IRMOS1 project, propose a real-time scheduler to favor interactive services. Such a scheduler could be one of those coexisting in our concept.
Finally, Hu et al. [4] propose a dynamic partitioning scheme using VM mon-
itoring. Based on run–time I/O analysis, a VM is temporarily migrated to an
isolated core set, optimized for I/O. The authors evaluate their framework using
one I/O–intensive VM running concurrently with several CPU–intensive ones.
Their findings suggest that more insight should be obtained on the implications
of co-existing CPU– and I/O– intensive workloads. Based on this approach, we
build an SMP-aware, static CPU partitioning framework taking advantage of
contemporary hardware. As opposed to [4], we choose to bypass the run-time
profiling mechanism, which introduces overhead and whose accuracy cannot be guar-
anteed.
Specifically, we use a monitoring tool to examine the bottlenecks that arise
when multiple I/O–intensive VMs co-exist with multiple CPU–intensive ones.
1 More information is available at: http://www.irmosproject.eu
We then deploy VMs to CPU-sets (pools) with their own scheduler algorithm,
based on their workload characteristics. In order to put pressure on the I/O
infrastructure, we perform our experiments in a modern multi-core platform,
using multi-GigaBit network adapters. Additionally, we increase the degree of
overcommitment to apply for a real-world scenario. Overall, we evaluate the
benefits of coexisting scheduling policies in a busy VM container with VMs run-
ning various types of workloads. Our goal is to fully saturate existing hardware
resources and get the most out of the system’s performance.

3.2 Default Setup


In this section we show that, in a busy VM container, running mixed types of
workloads leads to poor I/O performance and under-utilization of resources.
We measure the network I/O and CPU throughput, as a function of the
number of VMs. In the default setup, we run the vanilla Xen VMM, using its
default scheduler (Credit) and assign one vCPU to the driver domain and to
each of the VMs. We choose to keep the default CPU affinity (any). All VMs
share a single GigaBit NIC (bridged setup).

To this end, we examine two separate cases:


Fig. 1. Overall Performance of the Xen Default Case: (a) CPU or I/O VMs (exclusive); (b) CPU and I/O VMs (concurrently)

Exclusive CPU– or I/O–intensive VMs. Figure 1(a) shows that the overall
CPU operations per second are increasing until the number of vCPUs becomes
equal to the number of physical CPUs. This is expected as the Credit scheduler
provides fair time-sharing for CPU intensive VMs. Additionally, we observe that
the link gets saturated but presents minor performance degradation at the maximum degree of overcommitment, as a result of bridging all network interfaces
together while the driver domain is being scheduled in and out repeatedly.
Concurrent CPU– and I/O–intensive VMs. Figure 1(b) points out that when
CPU and I/O VMs run concurrently we experience a significant negative effect
on the link utilization (less than 40%).

4 Co-existing Scheduling Policies


In this section we describe the implementation of our framework. We take the
first step towards distinctive pools, running multiple schedulers, tailored to the

needs of VMs’ workloads and evaluate our approach of coexisting scheduling


policies in the Xen virtualization platform.
In the following experiments we emulate streaming network traffic (e.g.
stream/ftp server) and CPU/Memory-bound applications for I/O– and CPU–
intensive VMs respectively using generic tools (dd, netcat and bzip2). We mea-
sure the execution time of every action and calculate the aggregate I/O and
CPU throughput. To explore the platform’s capabilities we run the same exper-
iments on native Linux and evaluate the utilization of resources. Our results are
normalized to the maximum throughput achieved in the native case.
Testbed. Our testbed consists of an 8-core Intel Xeon X5365 @ 3.00 GHz
platform as the VM container, running Xen 4.1-unstable with linux-2.6.32.24
pvops kernel, connected back-to-back with a 4-core AMD Phenom @ 2.3 GHz
via 4 Intel 82571EB GigaBit Ethernet controllers.

4.1 Monitoring Tool


To investigate the apparent suboptimal performance discussed in Section 3.2, we build a monitoring tool on top of Xen's event channel mechanism that measures the time lost between event handling (Section 2). Figure 2 plots the delay between domU event notification and dom0 event handling (dark area) and vice versa (light area). The former includes the outgoing traffic, and the latter the acknowledgements of the driver domain and the incoming traffic (e.g. TCP ACK packets). We observe a big difference between the two directions; this is attributed to the fact that the driver domain is woken up more often due to I/O operations of other domains, so it is able to batch work. Most importantly, the overall time spent increases proportionally to the degree of over-commitment. This is an artifact of vCPU scheduling: the CPU-bound vCPUs exhaust their time-slice and I/O VMs get stalled even if data is available to receive or transmit. Moreover, I/O VMs, including the driver domain, which is responsible for the I/O multiplexing, get scheduled in and out, eventually leading to poor I/O performance.
Fig. 2. Monitoring tool: msec lost per MB transmitted: (a) default setup; (b) 2 pools setup

4.2 The Driver Domain Pool


To eliminate the effect discussed in Section 4.1, we decouple the driver domain from all VMs. We build a primitive scheduler that binds every newly created vCPU to an available physical core; this vCPU does not sleep and, as a result, does not suffer from unwanted context switches. Taking advantage of the pool concept of Xen, we launch this no-op scheduler on a separate pool running the driver domain. VMs are deployed on a different pool and are subject to the Credit scheduler policy.
Fig. 3. Overall Performance using Pools (default; 2 pools; 3 pools): (a) CPU Overall Performance; (b) I/O Overall Performance

Taking a look back at Figure 2, we observe that the latency between domU and
dom0 (dark area) is eliminated. That is because dom0 never gets preempted and
achieves maximum responsiveness. Moreover the time lost in the other direction
(light area) is apparently reduced; more data rate is available and I/O domains
can batch more work.
Figure 3 plots the overall performance (normalized to the maximum observed),
as a function of concurrent CPU and I/O VMs. The first bar (dark area) plots
the default setup (Section 3.2), whereas the second one (light area) plots the ap-
proach discussed in this Section. Figure 3(b) shows that even though the degree
of over-commitment is maximum (4 vCPUs per physical core) our framework
achieves link saturation. On the other hand, CPU performance drops propor-
tionally to the degree of over-commitment (Figure 3(a)).
The effect on CPU VMs is attributed to the driver domain’s ability to process
I/O transactions in a more effective way; more data rate is available and I/O
VMs get notified more frequently; according to Credit’s algorithm I/O VMs get
boosted and eventually steal time-slices from the CPU VMs.
Trying to eliminate the negative effect on the CPU-intensive VMs, we experiment with the distribution of physical resources. Specifically, we evaluate the system's overall performance when allocating a different number of physical CPUs to the aforementioned second pool (Fig. 4). We observe that with one CPU the GigaBit link is under-utilized, whereas with two CPUs link saturation is achieved. On the other hand, cutting down resources to the CPU-intensive VMs does not have a negligible effect; in fact it can shrink by up to 20%.
Fig. 4. Overall Performance vs. Physical Resources Distribution to VM pool
4.3 Decoupling vCPUs Based on Workload Characteristics


Taking all this into consideration we obtain a platform with 3 pools: pool0 with
only one CPU dedicated to the driver domain with the no-op scheduler; pool1
with 2 CPUs servicing I/O intensive VMs (running potentially an I/O–optimized
scheduler); and pool2 for the CPU-intensive VMs, which are subject to the existing Credit scheduling policy. Running concurrently a large number of VMs with two types of workloads, we experience GigaBit saturation and 62% CPU utilization, as
opposed to 38% and 78% respectively in the default case (Fig. 3, third bar).
In addition to that, we point out that there is no overall benefit if a VM finds itself in the "wrong" pool, albeit a slight improvement of this VM's I/O performance is experienced (Table 1). This is an artifact of Credit's fairness discussed in previous sections (Sections 4.2 and 3.2).

Table 1. VM misplacement effect on individual performance

        Misplaced VM   All other
CPU     -17%           -1.3%
I/O     +4%            -0.4%

5 Discussion
5.1 Credit Vulnerabilities to I/O Service
The design so far has decoupled I/O– and CPU–intensive VMs achieving iso-
lation and independence, as well as near-optimal utilization of resources. But is the
Credit scheduler ideal for multiplexing only I/O VMs? We argue that slight
changes can benefit I/O service.
Time-slice allocation: Having achieved isolation between different workloads, we now focus on the I/O pool (pool1). We deploy this pool on the second CPU package and reduce the time-slice from 30ms to 3ms (accounting occurs every 1ms). We observe that I/O throughput outperforms the previous case, despite the decreasing packet size (Fig. 5). Such a case differs from the streaming I/O workload scenario (e.g. stream/ftp server) discussed so far (Section 4), and can apply to a random I/O workload (such as a busy web server).
Fig. 5. Time-slice: 30ms vs 3ms
Anticipatory Concept: Moreover, we propose the introduction of an anticipatory concept to the existing scheduling algorithm; for the implementation, multi-hierarchical priority sets are to be used, while the scheduler, depending on the previous priority of the vCPU, adjusts it when the vCPU gets woken up, sleeps, or gets credits debited. Thus, the vCPU will sustain the boost state a bit longer and take advantage of the probability of transmitting or receiving data in the near future.
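One possible reading of this anticipatory idea, sketched in C purely for illustration (the concrete design is left open by the text), is a small hierarchy of boost levels in which a vCPU's new priority depends on its previous one:

enum aprio { A_OVER, A_UNDER, A_BOOST1, A_BOOST2 };   /* ascending priority */

struct avcpu { enum aprio prio; int credits; };

/* On wake-up: a vCPU that was recently boosted climbs one level higher, so it
 * keeps an elevated priority a bit longer than with a single BOOST state. */
void anticipatory_wakeup(struct avcpu *v)
{
    if (v->credits < 0)
        v->prio = A_OVER;               /* exhausted: no boost at all */
    else if (v->prio >= A_BOOST1)
        v->prio = A_BOOST2;
    else
        v->prio = A_BOOST1;
}

/* On sleep or credit debit: decay one level at a time instead of dropping
 * straight back to UNDER. */
void anticipatory_decay(struct avcpu *v)
{
    if (v->prio > A_UNDER)
        v->prio = (enum aprio)(v->prio - 1);
}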
5.2 Exotic Scenarios


In this section we argue that in the case of multiple GigaBit NICs, a uni–core
driver domain is insufficient. As in Section 5.1, we focus on pool1 (I/O). This
time we compare the link utilization of 1-4 x Gbps, when the driver domain is
deployed on 1,2,3 or 4 physical cores (Fig. 6).
To exploit the SMP characteristics of our multi-core platform, we assign each NIC's interrupt handler to a physical core, by setting the smp_affinity of the corresponding irq. Thus the NIC's driver does not suffer from interrupt processing contention. However, we observe that after 2Gbps the links do not get saturated. Preliminary findings suggest that this unexpected behavior is due to Xen's network path. Nevertheless, this approach is applicable to cases where the driver domain or other stub-domains have demanding responsibilities, such as multiplexing accesses to shared devices.
Fig. 6. Multiple GigaBit NICs (1 VCPU vs. #VCPU=#NICs)
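In practice, the per-NIC interrupt affinity described above is configured by writing a CPU bitmask to /proc/irq/<irq>/smp_affinity. The short C sketch below shows the idea; the IRQ numbers are placeholders for the ones assigned to the four NICs on a given system.

#include <stdio.h>

/* Pin one IRQ to one CPU by writing a single-bit hexadecimal mask. */
static int pin_irq_to_cpu(int irq, int cpu)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%x\n", 1u << cpu);
    fclose(f);
    return 0;
}

int main(void)
{
    int nic_irq[4] = {40, 41, 42, 43};   /* hypothetical IRQs of the 4 NICs */
    for (int i = 0; i < 4; i++)
        pin_irq_to_cpu(nic_irq[i], i);
    return 0;
}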

5.3 Dynamic Instead of Static


After having shown that coexisting scheduling policies can benefit I/O performance and resource utilization, we have to examine how such a scenario can be automated or made adaptive. How should the VM classification and the resource partitioning be implemented? Here we consider the following design dilemma: should the profiling tool reside in the driver domain or in the hypervisor? The former is aware of the I/O characteristics of each VM, while the latter can keep track of their time-slice utilization. Either way, such a mechanism should be lightweight and its actions should respond to the average load of the VM and not to random spikes.

6 Conclusions
In this paper we examine the impact of VMM scheduling in a service orient-
ed VM container and argue that co-existing scheduling policies can benefit the
overall resource utilization when numerous VMs run contradicting types of work-
loads. VMs are grouped into sets based on their workload characteristics, and are subject to scheduling policies tailored to the needs of each group. We implement our
approach in the Xen virtualization platform. In a moderate overcommitment
scenario (4 vCPUs/ physical core), our framework is able to achieve link satura-
tion compared to less than 40% link utilization, while CPU-intensive workloads
sustain 80% of the default case.
Our future agenda consists of exploring exotic scenarios using different types
of devices shared across VMs (multi-queue and VM-enabled multi-Gbps NICs,
hardware accelerators etc.), as well as experiment with scheduler algorithms


designed for specific cases (e.g. low latency applications, random I/O, disk I/O
etc.). Finally, our immediate plans are to implement the anticipatory concept
and the profiling mechanism discussed in the previous section.

References
1. Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer,
R., Pratt, I.A., Warfield, A.: Xen and the Art of Virtualization. In: SOSP 2003:
Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles,
pp. 164–177. ACM, New York (2003)
2. Cucinotta, T., Giani, D., Faggioli, D., Checconi, F.: Providing Performance Guaran-
tees to Virtual Machines Using Real-Time Scheduling. In: Guarracino, M.R., Vivien,
F., Träff, J.L., Cannatoro, M., Danelutto, M., Hast, A., Perla, F., Knüpfer, A.,
Di Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010. LNCS, vol. 6586,
pp. 657–664. Springer, Heidelberg (2011)
3. Dong, Y., Yu, Z., Rose, G.: SR-IOV networking in Xen: architecture, design and
implementation. In: WIOV 2008: Proceedings of the First Conference on I/O Vir-
tualization, p. 10. USENIX Association, Berkeley (2008)
4. Hu, Y., Long, X., Zhang, J., He, J., Xia, L.: I/o scheduling model of virtual machine
based on multi-core dynamic partitioning. In: IEEE International Symposium on
High Performance Distributed Computing, pp. 142–154 (2010)
5. Menon, A., Cox, A.L., Zwaenepoel, W.: Optimizing network virtualization in Xen.
In: ATEC 2006: Proceedings of the Annual Conference on USENIX 2006 Annual
Technical Conference, p. 2. USENIX Association, Berkeley (2006)
6. Nanos, A., Goumas, G., Koziris, N.: Exploring I/O Virtualization Data Paths for
MPI Applications in a Cluster of VMs: A Networking Perspective. In: Guarracino,
M.R., Vivien, F., Träff, J.L., Cannatoro, M., Danelutto, M., Hast, A., Perla, F.,
Knüpfer, A., Di Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010. LNCS,
vol. 6586, pp. 665–671. Springer, Heidelberg (2011)
7. Ongaro, D., Cox, A.L., Rixner, S.: Scheduling i/o in virtual machine monitors. In:
Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on
Virtual Execution Environments, VEE 2008, pp. 1–10. ACM, New York (2008)
8. Ram, K.K., Santos, J.R., Turner, Y.: Redesigning xen’s memory sharing mechanism
for safe and efficient I/O virtualization. In: WIOV 2010: Proceedings of the 2nd
Conference on I/O Virtualization, p. 1. USENIX Association, Berkeley (2010)
PIGA-Virt: An Advanced Distributed MAC
Protection of Virtual Systems

J. Briffaut, E. Lefebvre, J. Rouzaud-Cornabas, and C. Toinard

ENSI de Bourges – LIFO, 88 bd Lahitolle, 18020 Bourges cedex, France


{jeremy.briffaut,jonathan.rouzaud-cornabas,
christian.toinard}@ensi-bourges.fr

Abstract. Efficient Mandatory Access Control of Virtual Machines remains an open problem for efficiently protecting Cloud Systems. For ex-
ample, the MAC protection must allow some information flows between
two virtual machines while preventing other information flows between
those two machines. For solving these problems, the virtual environment
must guarantee an in-depth protection in order to control the information
flows that starts in a Virtual Machine (vm) and finishes in another one.
In contrast with existing MAC approaches, PIGA-Virt is a MAC protec-
tion controlling the different levels of a virtual system. It eases the man-
agement of the required security objectives. The PIGA-Virt approach
guarantees the required security objectives while controlling efficiently
the information flows. PIGA-Virt supports a large range of predefined
protection canvas whose efficiency has been demonstrated during the
ANR Sec&Si 1 security challenge. The paper shows how the PIGA-Virt
approach guarantees advanced confidentiality and integrity properties by
controlling complex combinations of transitive information flows passing
through intermediate resources. As far as we know, PIGA-Virt is the first
operational solution providing in-depth MAC protection, addressing ad-
vanced security requirements and controlling efficiently information flows
inside and between virtual machines. Moreover, the solution is indepen-
dent of the underlying hypervisor. Performances and protection scenarios
are given for protecting KVM virtual machines.

1 Introduction
A virtualization layer, i.e. an hypervisor, brings isolation between multiple sys-
tems, i.e. Virtual Machines, hosted on the same hardware. The hypervisor re-
duces the interferences between the vms. But the virtualization is not a security
guarantee. It increases the attack surface and adds new attack vectors. As a
consequence, the virtualization must not be the sole technology for providing
isolation within a Cloud. For example, in [14], the isolation is broken through
drivers that allow the access of the underlying hardware from inside a vm. In-
deed, these drivers can access the physical memory without passing through the
1 http://www.agence-nationale-recherche.fr/magazine/actualites/detail/resultats-du-defi-sec-si-systeme-d-exploitation-cloisonne-securise-pour-l-internaute/

kernel or the hypervisor, thus bypassing the protection layers. Preventing such
attacks requires to guarantee the integrity of 1) the vm, 2) the hypervisor and
3) the underlying Operating System. Moreover, a vm can produce information
flows that come to another vm in order to break some of the requested secu-
rity objectives. Accordingly, a vm can attack the integrity, confidentiality and
availability of other vms that run on the same hardware.
With the cloud paradigm, the data and the entire computing infrastructure are outside the scope of the final users. Thus, security is one of the top concerns
in clouds [9]. Indeed, with a cloud infrastructure relying on virtualization, the
hardware is shared between multiple users (multi-tenancy) and these users can
be adversaries. Moreover, as explained in [13,4], various security and functionality levels are needed to address the criticality of missions within a cloud. Thus, the
major goal of this work is to increase the security assurance by 1) hardening the
isolation of the virtualization layer and 2) providing a mission-aware security
component for the virtualization layer and 3) balancing the security with the
performance.
The first section defines the precise objectives of our solution. Second, the
paper describes the different protection modes supported by PIGA-Virt. Third,
it describes how PIGA-Virt enforces the protection. Fourth, it gives the efficiency
for the different modes of the versatile PIGA-Virt solution. Fifth, it describes
the related work. Finally, the paper concludes by defining future work.

2 Motivation
In-depth end-to-end Mandatory Protection Inside a vm
The mandatory control minimizes the privileges that a process (a subject) has
regarding the various objects. But existing mac approaches mainly deal with
direct flows. A first set of objectives consider the control of the flows inside a
given virtual machine.
The first purpose is thus to control indirect information flows transiting through
intermediate resources or processes (covert channels). The second purpose is to
ease the definition of a large set of security objectives, such as separation of priv-
ileges or indirect accesses to the information through covert channels. The third
purpose is to provide a mandatory protection controlling all the levels (in-depth
protection) of a virtual machine, such as processes, graphic interface, network,
etc. Our fourth purpose is to provide an efficient mandatory protection that
guarantees all the supported security properties with satisfying performance in
the context of the virtual machines sharing the same host. In contrast with our
previous works [5], the fourfth objective is addressed in that paper since perfor-
mances need to be improved for hosting multiple vms on the same machine.
In-depth end-to-end Protection between vms
With multiple vms sharing the same host, the flows between the vms
must also be efficiently controlled. For example, the protection must prevent a
malicious information flow, coming from (vm1), from going to (vm2) through
a nfs share. But, some NFS flows between vm1 and vm2 must be allowed
while others must be denied. A second set of objectives address specifically that
control of the flows between the vms. The corresponding objectives have not
been addressed during our previous works.
A fifth purpose is that the in-depth end-to-end protections must control the
flows between the different vms. Such a protection consists in controlling 1) all
the indirect flows that are visible to the vms and 2) all the indirect flows using,
intermediate entities of the target host, that are invisible to the vms. A sixth
purpose is to have a protection independent from the target hypervisor. It is
an important issue since several kinds of hypervisor technologies can cohabit
in the context of a cloud. A seventh purpose is that the proposed protection
must be easy to configure. A eigth purpose is to ease the tuning of the protec-
tion efficiency. Thus, the administrator can tune the protection to balance the
performance with the security.

3 Architecture of PIGA-Virt

PIGA-Virt provides mechanisms to reach our 8 objectives.


piga-virt is a protection system that controls interactions inside and between
vms. It has two layers:

– Local layer : each vm runs a local PIGA-Decision [5] engine associated


with a SELinux/PIGA-Kernel. The combination of PIGA-Kernel and PIGA-
Decision provides an efficient reference monitor guaranteeing a large set of
security objectives. Thanks to our Security Protection Language (spl), it
eases the definition of security objectives and controls the information flows
inside each vm.

Fig. 1. Architecture of PIGA-Virt


– Shared layer : a dedicated secure vm runs a shared PIGA Decision engine


that 1) improves the performance for the intra flows and 2) controls the
information flows between the vms. Thanks to the proposed extension of
spl, the shared layer controls the information flows starting in a vm and
finishing in another vm. The extension consists 1) in adding a virtual machine
identifier to the security contexts and 2) in processing a flow graph for each
virtual machine.
The administrator can combine the local and shared layers according to the
required security objectives. So, the administrator can choose between three
modes: local, shared/local and shared modes. The shared mode simplifies the
task of the security administrator since it controls both the inter and the in-
tra flows through a central management. The sequel describes the unified mac
approach provided by the local and the shared mode and compares the perfor-
mances between the local and the shared modes.

4 In-Depth End-to-End Mandatory Protection


PIGA-Virt provides a unified mac approach for the local and the shared mode:
A] selinux and xselinux control the direct information flows of the appli-
cations and XWindow; those controls are always performed locally, since they consist in reusing the SELinux approach.
B] piga includes two major components:
- piga-protect (piga-kernel and piga-decision) that controls the tran-
sitive information flows. PIGA-Protect protects against millions of vulnerabilities inside the SELinux policies. In practice, illegal activities allowed by the SELinux policies are precomputed and piga-decision compares in real-time the precomputed set of illegal activities with the real activities occurring on the system. When a real activity matches a precomputed one, piga-decision
denies the corresponding system call.
- piga-firewall targets mandatory network access hardening. It guarantees
an end-to-end control of network flows. For example, a firefox process associated
with the security context firefox_taxes_t is guaranteed to transmit the taxes' data only to the network site providing the e-taxes service.
The piga policies are the same in the local and shared mode. In contrast with
the local mode, the shared mode processes the controls within the dedicated
virtual machine and is able to control the inter flows. Since piga-protect is
the heart of our advanced MAC protection, let us give a few examples of canvas
written using our SPL.
Canvas of Mandatory Protection Supporting Security Objectives
The following property aims at guaranteeing the confidentiality of a set sc2 of
objects regarding a set sc1 of subjects. That property prevents reading flows, which can be direct (>) or indirect (>>).
420 J. Briffaut et al.

define confidentiality( sc1 in SCS, sc2 in SCO ) [ ¬(sc2 > sc1) AND ¬(sc2 >> sc1) ];

The following property prevents a process from creating a file, executing an interpreter (e.g. bash) that then attempts to execute the created file.

define dutiesseparationbash( sc1 IN SC ) [ Foreach sc2 IN SCO, Foreach sc3 IN SC, ¬ ( (sc1 >write sc2) −−then→ (sc1 >execute sc3) −−then→ (sc3 >read sc2) ) ];

Usage of the Proposed Canvas for Intra and Inter vm Protections


The administrator defines a small number of protection rules with the relevant
parameters (SELinux contexts) for the canvas. The following rule protects a vm
against the attacks relying on a shell interpretation of downloaded scripts. It
prevents all the user processes, associated with the regular expression user_u:user_r:user.*_t, from downloading a script in order to read and execute that
script. That rule can be set-up within each virtual machine (local mode) or
within the dedicated virtual machine (shared mode).

dutiesseparationbash( "user_u:user_r:user.*_t" );

In contrast with our previous works [5], the major improvement is the way the
dedicated virtual machine computes the controls in the shared mode. The PIGA
shared decision engine computes independent data for each virtual machine. The PIGA shared decision engine communicates with the different piga-kernels available in the different virtual machines. When the PIGA shared decision engine finds a real activity of a virtual machine matching the precomputed set of illegal
activities associated with that virtual machine, it sends a deny to the corre-
sponding piga-kernel in order to cancel the corresponding system call.
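The matching logic can be pictured with the following schematic C sketch: each precomputed illegal activity is a sequence of (vm, interaction) steps, a cursor advances as matching interactions are reported by the per-VM PIGA-Kernels, and the system call that would complete the sequence is denied. The data structures and names are illustrative assumptions, not PIGA's actual implementation.

#include <string.h>

#define MAX_STEPS 8

struct step { int vm_id; const char *interaction; };  /* e.g. "user_t read etc_t" */

struct illegal_activity {
    struct step steps[MAX_STEPS];
    int nr_steps;
    int matched;                /* number of leading steps already observed */
};

/* Called for every interaction reported by a PIGA-Kernel; returns 1 if this
 * interaction completes the activity and the system call must be denied. */
int piga_on_interaction(struct illegal_activity *a, int vm_id,
                        const char *interaction)
{
    const struct step *next = &a->steps[a->matched];
    if (next->vm_id == vm_id && strcmp(next->interaction, interaction) == 0) {
        a->matched++;
        if (a->matched == a->nr_steps) {
            a->matched = 0;     /* reset for future occurrences */
            return 1;           /* deny: latest system call of the activity */
        }
    }
    return 0;                   /* allow: activity not (yet) complete */
}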
The following rule guarantees the confidentiality of the /etc files in vm1 re-
garding the users of vm2. In contrast with previous works, a virtual machine
identifier is added to the SELinux context. Thus, the administrator easily expresses the control of the flows between two different virtual machines. The corresponding control is only available in the shared mode. The PIGA shared decision engine breaks an illegal activity into different subactivities for the corresponding virtual machines. When all the subactivities are detected, the PIGA shared decision engine sends a deny for the latest system call.
confidentiality(user_u:user_r:user.*_t:vm2, system_u:object_r:etc_t:vm1);

5 Experimentations
PIGA-Virt is integrated into a Scientific Linux 6 host with kvm as hypervisor.
The PIGA-Kernel consists of a small patch that captures the SELinux hooks. PIGA-Decision is available as a Java process. Experiments run on an AMD Phenom(tm) II X4 965 processor with 8 GB of RAM.
Performances
Figure 2 presents several types of PIGA-Virt instances for protecting two Linux virtual machines vm1 and vm2. PIGA-Virt runs a) without SELinux, b) with
the targeted SELinux policy, c) with the strict SELinux policy, d) in local mode
to detect the violations of the requested properties and e) in shared mode to
detect the violations, f) in local mode to prevent the intra violations and g) in
shared mode to prevent the inter violations. The local mode controls the flows
inside a vm whereas the shared mode controls the flows between vms. In contrast
with the PIGA-Virt detection, the PIGA-Virt protection enables to evaluate the
overhead introduced to prevent the violations of the required security properties.
Several benchmarks (open/close of files, executing the ls -lR command to
parse the whole file system, fork and file access latency) show the performances
of PIGA-Virt. As shown in Figure 2, the overhead due to the in-depth end-to-end protection of PIGA-Virt is very low. The performance of the environment
without any mac protection corresponds to the a) column i.e. SELinux OFF.
The performance of the controls inside the vm is given by the f) column i.e. local
protection. The performance of the controls between the vms, is given by the
g) column i.e. shared protection. Sometimes the mac protections improve the
performances e.g. the ls command takes more time without any mac protection
since the mac protections minimize the accesses to the file system. In contrast
with SELinux, the PIGA-Virt protection either reduces or equals the overhead.
The only exception is the fork result but this benchmark is very stressful since
it corresponds to an unusual load of millions of simultaneous fork operations.
Globally, the PIGA-Virt protection brings a very low overhead. In contrast
with no mac protection, PIGA-Virt improves the performances. In contrast with
the local mode associated with our previous works, the shared mode factorizes PIGA-Decision into a single instance. So, our new shared approach minimizes
the overhead due to the security mechanisms. Moreover, the shared mode uses
TCP connections between the vms and PIGA-Decision. So, PIGA-Decision can
be run on a dedicated machine with high performance capabilities, further reducing the CPU consumption.
Protection Efficiency
In contrast with the local mode, i.e., our previous work, the shared mode is of
major importance in terms of security assurance. Indeed, it is the only way to
control the flows between the vms sharing the same host.
Let us give a small example of the protection carried out by the confidentiality
property. For example, the following global illegal activity, with a subactivity
on vm1 (user_t reading /etc before writing into nfs_t) and a subactivity on
vm2 (user_t reading nfs_t), is a violation of confidentiality($sc1 := "user_u:
user_r:user.*_t:vm2", $sc2 := "system_u:object_r:etc_t:vm1"). In such a
case, the shared decision engine cancels the reading of nfs_t on vm2 since it is
the latest system call of the global activity. Such an activity corresponds, for
example, to a malware, executed from the user environment of vm1, that trans-
mits the /etc/shadow password to a distant virtual machine vm2. Thus, the
shared mode eases the protection against generic malicious activities such as
NFS threats. It makes it possible to prevent illegal flows while authorizing safe flows
since it allows, for example, sysadm_t to transmit data through NFS to a distant user.

Fig. 2. Performance of PIGA-Virt

PIGA-Virt is very practical since defining a safe SELinux policy is a tricky
task: writing a couple of SPL rules is simpler than writing a SELinux policy
that 1) includes millions of rules and 2) does not control the transitive flows.
Mission Efficiency
PIGA-Virt is a mission-aware environment. First, it takes into account the se-
curity objectives, i.e., the requested security properties. Second, it provides the
efficiency of each security objective.
Table 1 provides the efficiency of the different properties used during our
experiments. For example, the efficiency of the confidentiality property
is 108 045. That value means that 108 045 illegal activities can lead to a violation
of the confidentiality property within the considered SELinux policy. Such a value has
two meanings: 1) it gives the security enforcement of a property, i.e., the higher
the value, the stronger the property enforces the security, and 2) it evaluates
the cost of the property, i.e., the higher the value, the more processing
time PIGA-Virt needs.
Mission Tuning
As demonstrated, the stronger a security property is, the higher the overhead.
This is the well-known trade-off between security and performance. How-
ever, the important point here is that the administrator has a precise evaluation
of each security property. Thus, he can tune the security objectives to fit the
performance needs. For example, the dutiesseparationbash property is a
large overestimation of the separation of duties that protects against malicious
scripts, since it prevents millions of potential vulnerabilities. In contrast with
dutiesseparationbash, the dutiesseparation property is narrower since it protects
only against binary executions, thus preventing only 208 240 illegal activities.
However, dutiesseparationbash and dutiesseparation do not tackle the same
security objective. In order to tune a property such as dutiesseparationbash,
several facilities are available.
PIGA-Virt eases the tuning of the security missions in different ways. The
administrator can adjust a security objective by 1) providing different secu-
rity contexts for the security canvas, 2) modifying the definition of the canvas
and 3) modifying the SELinux policy. The latter solution is usually the tricki-
est. However, PIGA-Virt facilitates this task. Let us consider the confidentiality
property preventing illegal activities including:
user_u:user_r:user_t -(dbus{send_msg})-> user_u:user_r:user_dbusd_t;
user_u:user_r:user_dbusd_t -(file{write})-> user_u:object_r:user_home_t;
user_u:user_r:gpg_agent_t -(file{read})-> user_u:object_r:user_home_t;
user_u:user_r:gpg_agent_t -(file{write})-> system_u:object_r:nfs_t

Thus, the administrator sees that dbus and gpg are involved in that threat.
PIGA-Virt shows that the problem can be corrected with a separation of duties
for dbus or gpg. Thus, the tuning consists of a new SELinux policy including, for
example, a separation of duties for dbus (e.g., removing the permission for the
dbus domain to write into user_home_t).

Table 1. Efficiency of the requested security mission

Property                       Efficiency
transitionsequence                101 533
notrereadconfigfile                     2
ourreadconfigfile                       4
dutiesseparation                  208 240
dutiesseparationbash          194 629 680
confidentiality                   108 045
integrity                              30
trustedpathexecution                8 715
trustedpathexecutionuser              204
trustedpathexecutionuser               26
consistentaccess                   50 470

6 Related Work
A frequent approach is to use integrity verification technologies. [1] uses a dedi-
cated hypervisor to encrypt the data and the network transmission. GuardHype [2]
and [10] verify the integrity of the hypervisor itself or the integrity of the kernel
and critical applications. But these approaches are limited to statically verifying the
integrity of an image, a binary or a part of the memory; they
do not control the access to the resources. The approach followed here is to
put Mandatory Access Control outside of the vms. Thus, the multiple virtual ma-
chines can be controlled consistently and safely using a single security monitor.
mac [6] is the only way to guarantee security objectives. In [3], that approach
is limited to the control inside an untrusted virtual machine and cannot guar-
antee the isolation between the virtual machines. For example, sHype [12] brings
Type Enforcement to control the inter-vm communications. But sHype only
controls overt channels, thus missing implicit covert channels. Moreover, it does
not propose a way to express security properties. The mac enforcement of the
hypervisor can be extended to the mac enforcement inside the virtual machines.
Thus, [8] divides the overall policy into specialized policies (one per vm and one
for the interactions between vms). For example, Shamon [7] is a prototype based
on Xen/XSM (inter-vm mac) and SELinux (os-level mac) to control applica-
tions running on different vms. As explained in [11], the common way to analyze
mac policies is to search for illegal information flows inside them. In order to re-
duce the complexity, [11] analyzes each layer (hypervisor then os). However, the analysis
remains too complex and the illegal flows cannot be blocked in real time. So, exist-
ing solutions cannot control in real time advanced security properties associated
with multiple information flows between the different virtual machines.

7 Conclusion
This paper presents the first mission-aware security approach for vms that sup-
ports a large range of security objectives and provides a precise evaluation of the
security efficiency. In contrast with existing approaches, it provides a real-time
protection of advanced security objectives with a very low overhead. Moreover,
PIGA-Virt eases the work of the administrator since around ten security rules
are generally sufficient to control efficiently the flows between the different vms
sharing the same host. Finally, PIGA-Virt is an extensible approach: it
requires only security contexts associated with the different system resources.
For example, a Windows 7 module is available, providing consistent security la-
bels that can be processed through PIGA-Virt. It is an excellent way to improve
the security of heterogeneous vms such as those required in Cloud infrastructures. Fu-
ture work deals with the distributed scheduling of vms as a security mission-aware
service providing Security as a Service ([Sec]aaS) in the context of anything-as-a-
Service approaches (XaaS Clouds).

References
1. BitVisor 1.1 Reference Manual (2010), http://www.bitvisor.org/
2. Carbone, M., Zamboni, D., Lee, W.: Taming virtualization. IEEE Security and
Privacy 6(1), 65–67 (2008)
3. Chen, X., Garfinkel, T., Christopher Lewis, E., Subrahmanyam, P., Waldspurger,
C.A., Boneh, D., Dwoskin, J., Ports, D.R.K.: Overshadow: a virtualization-based
approach to retrofitting protection in commodity operating systems. SIGOPS
Oper. Syst. Rev. 42, 2–13 (2008)
4. Jaeger, T., Schiffman, J.: Outlook: Cloudy with a chance of security challenges and
improvements. IEEE Security and Privacy 8, 77–80 (2010)
5. Briffaut, C.T.J., Peres, M.: A dynamic end-to-end security for coordinating multi-
ple protections within a linux desktop. In: Proceedings of the 2010 IEEE Workshop
on Collaboration and Security (COLSEC 2010), pp. 509–515. IEEE Computer So-
ciety, Chicago (2010)
6. Loscocco, P.A., Smalley, S.D., Muckelbauer, P.A., Taylor, R.C., Turner, S.J., Far-
rell, J.F.: The Inevitability of Failure: The Flawed Assumption of Security in Mod-
ern Computing Environments. In: Proceedings of the 21st National Information
Systems Security Conference, Arlington, Virginia, USA, pp. 303–314 (October
1998)
7. McCune, J.M., Jaeger, T., Berger, S., Caceres, R., Sailer, R.: Shamon: A sys-
tem for distributed mandatory access control. In: Proceedings of the 22nd Annual
Computer Security Applications Conference, pp. 23–32. IEEE Computer Society,
Washington, DC (2006)
8. Payne, B.D., Sailer, R., Cáceres, R., Perez, R., Lee, W.: A layered approach to
simplified access control in virtualized systems. SIGOPS Oper. Syst. Rev. 41,
12–19 (2007)
9. Pearson, S., Benameur, A.: Privacy, security and trust issues arising from cloud
computing. In: Proceedings of the 2010 IEEE Second International Conference on
Cloud Computing Technology and Science, CLOUDCOM 2010, pp. 693–702. IEEE
Computer Society, Washington, DC (2010)
10. Quynh, N.A., Takefuji, Y.: A real-time integrity monitor for xen virtual machine.
In: ICNS 2006: Proceedings of the International Conference on Networking and
Services, p. 90. IEEE Computer Society, Washington, DC (2006)
11. Rueda, S., Vijayakumar, H., Jaeger, T.: Analysis of virtual machine system poli-
cies. In: Proceedings of the 14th ACM Symposium on Access Control Models and
Technologies, SACMAT 2009, pp. 227–236. ACM, New York (2009)
12. Sailer, R., Jaeger, T., Valdez, E., Caceres, R., Perez, R., Berger, S., Griffin, J.L.,
Van Doorn, L.: Building a MAC-based
security architecture for the Xen open-source hypervisor. In: 21st Annual Computer
Security Applications Conference, p. 10 (2005)
13. Sandhu, R., Boppana, R., Krishnan, R., Reich, J., Wolff, T., Zachry, J.: Towards
a discipline of mission-aware cloud computing. In: Proceedings of the 2010 ACM
Workshop on Cloud Computing Security Workshop, CCSW 2010, pp. 13–18. ACM,
New York (2010)
14. Wojtczuk, R.: Subverting the Xen hypervisor. BlackHat USA (2008)
An Economic Approach for Application QoS
Management in Clouds

Stefania Costache 1,2, Nikos Parlavantzas 2,3, Christine Morin 2, and Samuel Kortas 1

1 EDF R&D, France
2 INRIA Centre Rennes - Bretagne Atlantique, France
3 INSA Rennes, France
{Stefania.Costache,Nikos.Parlavantzas,Christine.Morin}@inria.fr,
Samuel.Kortas@edf.fr

Abstract. Virtualization provides increased control and flexibility in


how resources are allocated to applications. However, common resource
provisioning mechanisms do not fully use these advantages; either they
provide limited support for applications demanding quality of service,
or the resource allocation complexity is high. To address this problem
we propose a novel resource management architecture for virtualized
infrastructures based on a virtual economy. By limiting the coupling
between the applications and the resource management, this architecture
can support diverse types of applications and performance goals while
ensuring an efficient resource usage. We validate its use through simple
policies that scale the resource allocations of the applications vertically
and horizontally to meet application performance goals.

1 Introduction

Managing resources of private clouds while providing application QoS guaran-


tees is a key challenge. A cloud computing platform needs to host on its limited
capacity a variety of applications (e.g., web applications, scientific workloads)
that possibly require different QoS guarantees (e.g., throughput, response time).
Thus, the resource management system is required to be flexible enough to meet
all user demands while ensuring an efficient resource utilization. The flexibility of
the resource management can be achieved by decoupling the application perfor-
mance management from the infrastructure resource management and passing
information about applications to the infrastructure in a generic way. An ef-
ficient resource management is possible by using virtualization technologies to
dynamically provision the resources in a fine-grained manner and to transpar-
ently balance the load between physical machines. However, common resource
management systems either fail to address these requirements or they achieve
them through algorithms that have a high computational complexity and would
not scale well with the size of the infrastructure [6].

This work is supported by ANRT through the CIFRE sponsorship No. 0332/2010.


In this paper we present a resource management architecture for cloud plat-


forms that addresses the flexibility and efficiency issues through a market-based
approach. Each application is managed by a local agent that determines the
resource demand that meets the application’s performance goal, while a global
controller performs the infrastructure resource management based on the agent’s
communicated application preferences. The agent communicates its application
preferences by submitting bids expressing its willingness to pay for resources.
The global controller uses a proportional-share rule [5] to allocate resources to
applications according to their bid. The resource price variation provides ser-
vice differentiation between applications while the proportional share ensures
a maximum utilization of infrastructure resources. While this model does not
necessarily lead to a global optimal resource allocation, it allows applications
to closely meet their performance goals while keeping a simple resource man-
agement. We illustrate how this model supports application performance goals
through agents that scale the allocation of their applications using feedback-
based control policies. We simulated our architecture and validated the policies
in contention scenarios using the CloudSim toolkit [2].
This paper is organized as follows. In Section 2 we give an overview of our
solution and describe the main architecture elements and in Section 3 we describe
how the architecture can be used to execute different application types. Section 4
describes the related work. Finally, we conclude and present future steps in
Section 5.

2 Architecture

In this section we describe the architecture of our solution. We detail the main
components and the interaction between them. We then describe the current im-
plementation of the proportional-share allocation algorithm and the assumptions
that we make regarding the infrastructure’s virtual currency management.

Overview. Figure 1 shows the main architecture components. Our architecture


consists of distributed application managers that receive a budget of credits from
a budget manager and execute applications submitted by users. To request re-
sources for their applications, the managers communicate with a resource con-
troller that provisions them virtual clusters (i.e., groups of virtual machines)
from a virtual infrastructure manager and charges them for their used resources.
This virtual infrastructure manager (e.g. OpenNebula [9]) supports operations
related to creation, destruction, and dynamic placement of virtual machines. We
also consider that it is capable of providing monitoring information about the
physical hosts and virtual machines to the resource controller.
The application managers are started when the applications are submitted to
the infrastructure and manage the application’s life-cycle. A manager requests
resources for its application by submitting bids of the form b(n, rmin , s) to a
resource controller. This bid specifies the size of the virtual cluster, n, a mini-
mum resource allocation, rmin, that a resource controller should ensure for any
instance of the virtual cluster, and the manager's willingness to pay for the
allocated resources, s (spending rate).

Fig. 1. Architecture overview

After its virtual cluster is allocated, the manager starts its application. During
the application execution the manager
monitors the application and uses application performance metrics (e.g., number
of processed tasks/time unit), or system information (e.g., resource utilization
metrics) to adapt its resource request to its application performance goal. This
can be done in two different ways: (i) by changing the virtual cluster size; (ii)
by changing the spending rate for the virtual cluster.
The resource controller allocates a resource fraction (e.g., 10% CPU or 1MB
memory) on a physical node for each virtual machine instance of a virtual clus-
ter. This allocation is enforced by a Virtual Machine Monitor (e.g., Xen [1])
and is proportional with the manager’s spending rate and inversely proportional
with the current resource price. If the allocation becomes lower than the min-
imum resource allocation requested by the manager then the virtual cluster is
preempted.

Resource Allocation. The resource controller recomputes the allocations for all
running virtual machines periodically. At the beginning of each time period, the
resource controller aggregates all newly received and existing requests and dis-
tributes the total infrastructure capacity between them through a proportional-
share allocation rule. This rule is applied as follows.
We consider that the infrastructure has a total capacity C that needs to be shared
between M virtual machine instances. Each virtual machine i receives a resource
amount defined as a_i = (s_i / P) · C, where s_i is the spending rate per virtual machine
and P = Σ_{i=1}^{M} s_i is the total resource price. However, because the capacity of the
infrastructure is partitioned between different physical nodes, after computing
the allocations we may reach a situation in which we cannot accommodate all the
virtual machines on the physical nodes. Thus, instead of computing the allocation
from the total infrastructure capacity, we compute the allocation considering
the node capacity and we try to minimize the resulting error. For simplicity we
assume that the physical infrastructure is homogeneous and we treat only the
CPU allocation case.
An Economic Approach for Application QoS Management in Clouds 429

The algorithm applied by the resource controller has the following steps. To
ensure that the allocation of the virtual machine instances belonging to the
same group is uniform, the spending rate of the group is distributed between
the virtual machine instances in an equal way. Then, the instances are sorted
in descending order by their spending rates s. Afterwards, each virtual machine
instance from each virtual cluster is assigned to the node with the smallest price
p = Σ_{k=1}^{m} s_k, given that there are m instances already assigned to it. This ensures
that the virtual machine gets the highest allocation for the current iteration,
fully utilizing the resources and minimizing the allocation error. The resource
allocations for the current period are computed by iterating through all nodes
and applying the proportional-share rule locally.
Finally, the application managers are charged with the cost of using resources
for the previous period, c = (s/M) · Σ_{i=1}^{M} u_i, where u_i represents the total amount of
resource used by a virtual machine instance i belonging to the virtual cluster of size M.
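A minimal sketch of this allocation procedure, in Python with assumed names (not the authors' implementation), for homogeneous nodes and CPU only:

    # Sketch of the proportional-share placement described above.
    def proportional_share(clusters, num_nodes, node_capacity):
        # clusters: list of (cluster_id, n_instances, spending_rate)
        instances = []
        for cid, n, s in clusters:
            per_vm = s / n                       # uniform rate per instance
            instances += [(cid, per_vm)] * n
        instances.sort(key=lambda x: x[1], reverse=True)  # descending rates

        node_price = [0.0] * num_nodes           # sum of rates per node
        node_vms = [[] for _ in range(num_nodes)]
        for cid, rate in instances:              # pick the cheapest node
            k = min(range(num_nodes), key=lambda i: node_price[i])
            node_price[k] += rate
            node_vms[k].append((cid, rate))

        allocations = {}                         # CPU share per instance
        for k in range(num_nodes):
            for cid, rate in node_vms[k]:
                share = node_capacity * rate / node_price[k]
                allocations.setdefault(cid, []).append(share)
        return allocations

The greedy choice of the cheapest node mirrors the sorting and assignment steps described in the text; the final loop applies the proportional-share rule locally on each node.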
Budget Management. The logic of distributing amounts of credits to application
managers is abstracted by the budget manager component of our architecture.
For now we consider that this entity applies a credit distribution policy that
follows the principle "use it or lose it". That is, each manager receives an
amount of credits at a large time interval. To prevent hoarding of credits, the
manager is not allowed to save any credits from one time interval (i.e., renew
period) to another. We also consider that this amount of credits can come from
a user's account, at a rate established by the user; we do not deal with the
management of the user's credits in the rest of this paper.
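As an illustration only (assumed names, not the authors' code), the renewal policy amounts to:

    # "Use it or lose it": unspent credits are not carried over.
    def renew_budgets(managers, grant, renew_period, now):
        for m in managers:
            if now - m.last_renew >= renew_period:
                m.budget = grant          # reset instead of accumulating
                m.last_renew = now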

3 Use Cases
We illustrate how the agents can adapt either their spending rates or their virtual
cluster size to take advantage of the resource availability and to meet specific
application goals. We consider two examples: (i) a rigid application (e.g., MPI
job) that needs to execute before a deadline; (ii) an elastic application (i.e., bag-
of-tasks application) composed of a large number of tasks that can be executed
as soon as resources become available on the infrastructure; we assume that
a master component keeps the tasks in a queue and submits them to worker
components to be processed. For the first case the manager requests a virtual
cluster of fixed size from the resource controller and then controls the virtual
cluster's allocation by scaling its spending rate. For the second case the manager
requests a virtual cluster with an initial size which is then scaled according to
the infrastructure's utilization level. Both application models are well known in the
scientific community and are representative of scientific clouds.
We analyzed the behavior of our designed managers by implementing and
evaluating our architecture in CloudSim [2]. We do not consider the overheads
of virtual machine operations, as we only want to show the managers' behavior
and not the architecture's performance. As we focus on the proportional-share
of CPU resources, we consider that the memory capacity of the node is enough
to accommodate all submitted applications. We describe next the design and
behavior of each manager.
430 S. Costache et al.

3.1 Adapting the Agent’s Spending Rate


In this case we design a manager that uses application progress information
to finish the application before a given deadline while being cost-effective. We
describe the manager logic and we analyze its behavior under varying load.

Application Management Logic. To provision resources for its application, the


manager uses a policy that adapts its spending rate based on a reference progress.
This reference progress represents how much of the application needs to be pro-
cessed per scheduling period to meet its deadline:

p_reference = min( total_length / execution_time , length / (deadline − now) ),  if now < deadline
p_reference = total_length / execution_time,                                     otherwise        (1)

The length is a parameter specific to the application: it can be the number of files
that the application needs to process to finish its execution, the number of iterations,
or the number of instructions. The execution time represents the time in which the application
finishes if it runs alone on the infrastructure. If the current time is smaller than
the application deadline, the reference progress is computed as the remaining
application length distributed over the remaining execution time. Otherwise,
the application is already delayed, so it is desirable to make a maximum amount
of progress in its computation.
The manager monitors its application and receives information about the
progress made in the last scheduling period. To save its budget for future use, if
the application made enough progress then the manager decreases its bid. When
the application cannot meet its reference progress the manager uses all its saved
credits. To adapt the bid, the manager uses a subtractive decrease/multiplicative
increase rule:

b = max(p_r, b − α · p_r),   if p_current ≥ p_reference
b = min(b_max, β · b),       otherwise                                            (2)

where α and β are configurable parameters that establish the scaling rate of the
bid and p_r is the minimum price of using resources. To avoid depleting its budget
before the application completion, the manager limits its maximum submitted
bid to an amount bmax . For a more efficient use of the budget, we choose the
smallest time period between the remaining time to the budget renew and the
estimated remaining execution time of the application and we distribute the cur-
rent budget over it. The remaining execution time is estimated as the remaining
time to completion if the application continues to make pcurrent progress each
scheduling period. Given a budget B, the manager computes bmax as follows:

b_max = B_current / ( min(renew − now, remaining_execution_time) · C_node )      (3)
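The policy of equations (1)-(3) can be summarized by the following sketch (Python, assumed names; not the simulated manager's actual code):

    def reference_progress(total_length, length_left, exec_time, deadline, now):
        if now < deadline:                                          # Eq. (1)
            return min(total_length / exec_time, length_left / (deadline - now))
        return total_length / exec_time

    def adapt_bid(b, p_current, p_ref, p_r, b_max, alpha, beta):
        if p_current >= p_ref:                                      # Eq. (2)
            return max(p_r, b - alpha * p_r)        # subtractive decrease
        return min(b_max, beta * b)                 # multiplicative increase

    def max_bid(budget, renew, now, remaining_exec_time, node_capacity):
        # Eq. (3): spread the remaining budget over the shorter of the two horizons
        return budget / (min(renew - now, remaining_exec_time) * node_capacity)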

Evaluation. To illustrate the advantage of using a feedback-based control man-


ager, we simulate the execution of a deadline-driven application under varying
workload. We consider that the infrastructure is used to run best-effort and
An Economic Approach for Application QoS Management in Clouds 431

Application progress (MIPS/scheduling period)


150
140
130 progress (best-effort)
120 real progress
110 reference progress
100
90
80
70
60
50
40
30
20
10
0
0 1000 2000 3000 4000 5000 6000 7000
Time (seconds)

Fig. 2. Application progress variation in time

deadline-driven applications. For the best-effort application we define a man-


ager that distributes its budget equally over the renew period.
For our experiment we consider the following settings. The managers are given
an amount of 450000 credits that is renewed every 3600 seconds. The reserve price
is set to 1 credit/second. Three applications, each with a single task of length
360000 MIPS, are submitted to a single physical node with 100 MIPS. The
first application is submitted with a deadline of 5400 seconds while the other two
are best-effort. These best-effort applications are submitted after 1800 seconds,
at intervals of 5 minutes. The scheduling period is set to 5 minutes. To
scale its bid, the manager uses the feedback control rule parameters: α = 0.5
and β = 2.
Figure 2 shows the results of adapting the bid to follow the application’s ref-
erence progress. During the first 1800 seconds the application executes alone
on the node so it makes a maximum amount of progress. Thus, its reference
progress also drops. After the first 1800 seconds the other two applications start
executing one by one so the manager needs to adapt its bid to follow the refer-
ence progress. The fluctuations in the real progress represent the result of this
adaptation. We compare this case with the best-effort manager. In our case, the
application completes before its deadline. However, in the case of the best-effort
manager, the application completes much later (1600 seconds past the deadline),
because the manager is not aware of the competition for resources.

3.2 Adapting the Virtual Cluster Size


We design a manager that uses its past virtual cluster resource allocation as a
feedback and scales its application to minimize its completion time. The manager
is willing to spend all its budget at a constant rate. We describe its logic and
behavior next.

Application Manager Logic. To scale its application, the manager applies an


additive increase/multiplicative decrease rule and uses its virtual cluster past
average CPU allocation as a congestion signal. To compute the past average CPU
allocation, the manager uses an EWMA filter. As long as the application master
has tasks in its queue, the manager expands the virtual cluster. To ensure that
the application’s tasks already submitted to virtual machines are processed as
fast as possible, the manager shrinks the virtual cluster when the existing virtual
machines do not have enough CPU. The virtual cluster size (i.e., the number of
virtual machines), n, is updated as follows:

n = n + α,   if a_avg ≥ T_a and remaining tasks to process > 0
n = β · n,   otherwise                                                            (4)
where α and β are configurable parameters that establish the scaling rate of the
virtual cluster size and T_a is a threshold on the virtual cluster allocation.
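A compact sketch of this scaling rule (Python, assumed names; the EWMA weight is an illustrative choice):

    def ewma(previous, sample, weight=0.5):
        # smoothed past average CPU allocation, used as the congestion signal
        return weight * sample + (1.0 - weight) * previous

    def scale_cluster(n, a_avg, threshold, tasks_left, alpha=1, beta=0.5):
        if a_avg >= threshold and tasks_left > 0:                   # Eq. (4)
            return n + alpha                        # additive increase
        return max(1, int(beta * n))                # multiplicative decrease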

Evaluation. To illustrate the benefits of the elastic scaling on the application


execution time, we analyze the behavior of the elastic application manager un-
der varying load. For our experiment we consider the following settings. The
elastic manager is given a budget of 1,800,000 credits and the other managers
120,000 credits; their budgets are renewed every 3600 seconds. The infrastructure
has 10 nodes each with 100 MIPS and the scheduling period is set to 5 minutes.
An application with 200 tasks, with an average execution time of 10 minutes
each, starts executing. After 200 seconds 15 applications with a length of 360000
MIPS are submitted with an exponential inter-arrival time distribution, with
an average inter-arrival time of 160 seconds. The virtual cluster average alloca-
tion threshold is set to 85% of C_node. The manager is conservative in scaling the
virtual cluster and uses the feedback control rule parameters: α = 1 and β = 0.5.
Figure 3 shows the resource allocation variation in terms of CPU (a) and
number of virtual machines (b). The manager starts its application with an
initial number of 5 virtual machines at full capacity. When the demand is low,
the manager gets more resources for its existing virtual machines and expands
its virtual cluster. This is noticed after the application is submitted and after the
other applications finish their execution. When all the submitted applications are
running, the allocation for the existing virtual machines drops and the manager
shrinks its virtual cluster to 4 virtual machines. Because the average allocation
is greater than the given threshold, when the infrastructure is free the manager
actually creates more virtual machines than the infrastructure’s capacity. Setting
a higher threshold would avoid this behavior.
We compare our proportional-share mechanism to a static allocation mech-
anism. With the static allocation mechanism the manager does not receive any
feedback from the infrastructure and is not able to scale its application. When
the application is executed with our proportional-share mechanism it finishes
in 300 minutes while in the static allocation case it finishes in 417 minutes.
The elastic behavior of the manager leads to a better resource usage, as seen in
Figure 3 (c), and to a smaller execution time of the application.

Fig. 3. Application allocation in terms of CPU (a), number of virtual machines (b)
and datacenter utilization (c) in time

4 Related Work
Many recent research efforts have focused on designing algorithms for dynamic
resource provisioning in shared platforms. However, few of them decouple the
application performance management from the resource management. This de-


coupling can be achieved with two mechanisms: i) using utility functions with
which applications express their valuation for resources to the resource man-
ager; ii) using an economic model with which both applications and the resource
manager act selfishly to maximize their own benefit.
Utility functions were used to dynamically control the resource allocation
for applications in a virtualized [6] and non-virtualized [11] datacenter. The
users specify their valuation for certain levels of performance, which is then
expressed as a function of the application’s resource allocation (i.e., resource-
level utility). By knowing the resource-level utilities of all applications, a resource
manager computes the resource configuration according to a global objective,
i.e., maximize the sum of all resource-level utilities [11], ensure a (max-min) fair
allocation [3]. As the resource controller needs to determine the most efficient
allocation by considering any fraction of resource the application would get, the
computational complexity is high. Scaling with the size of the infrastructure
and the number of hosted applications clearly demands resource management
algorithms with a low run-time complexity.
In contrast to this approach, we use an economic model to dynamically pro-
vision resources to applications. Through an economic model [12] the resource
control becomes decentralized. Each entity from the system acts selfishly: each
application tries to meet its own performance goal while the resource provider
tries to maximize its own revenue. Applying this model to dynamically allocate
resources between competitive applications is not new. Both Stratford et al. [10]
and Norris et al. [7] proposed to use the dynamic pricing of resources as a mech-
anism to regulate the resource allocation between competitive applications. In
both cases resources were traded using a commodity market model. However,
this model would have a high communication overhead and it would be difficult
to use in a large scale system.
A popular approach to regulate access to resources in distributed systems is
to use an auction-based market. In auctions the price of the resource is given
by the bids of the participants. However, when considering divisible resources,
most auction models suffer from the same computational complexity as the
434 S. Costache et al.

utility functions, as the resource manager must compute an efficient allocation.


From this perspective, the simplest auction mechanism for resource allocation is
the proportional-share introduced by Lai et al. [5]. This mechanism has a low
complexity as it applies a simple computational rule to distribute the resource
between competitive users and thus can scale with the size of the infrastructure
and the number of applications. We propose in our work to use this mechanism
for virtual machine provisioning to allow applications to adapt their resource
allocations according to their performance goals.
Several market-based systems [5, 4, 8] propose a proportional-share approach
but they do not specifically target cloud infrastructures. From this perspec-
tive, the most similar to our work is Tycoon [5]. In Tycoon, resources are allo-
cated through a proportional-share rule on each physical node while agents select
the nodes according to users' preferences and budgets. In our architecture, the
proportional-share rule is applied for the entire infrastructure capacity instead
of one physical machine, decoupling the resource provisioning from the physical
placement. Our agents are concerned with meeting application goals through
intelligently managing their budgets and adapting to the fluctuating resource
availability.

5 Conclusions
In this paper we presented a new architecture for managing applications and
resources in a cloud infrastructure. To allocate resources between multiple com-
petitive applications, this architecture uses a proportional-share economic model.
The main advantage of this model is the decentralization of the resource control.
Each application is managed by an independent agent that requests resources
by submitting bids to a resource controller. The manager’s bid is limited by its
given budget. To meet its application performance goals the manager can apply
different strategies to vary its bid in time. Through this approach, our archi-
tecture supports different types of applications and allows them to meet their
performance goals while having a simple resource management mechanism.
We validated our architecture by designing and simulating application man-
agers for rigid and elastic applications. We showed how managers can use simple
feedback-based policies to scale the allocation of their applications according
to a given goal. This opens the path towards designing more efficient managers
that optimize their budget management to meet several application performance
goals. For example, in the elastic application case, the manager would take de-
cisions to manage its budget and scale its virtual cluster based on an estimated
finish time of the tasks and a possible deadline. A further step would be then
to consider applications with time-varying resource demands. Optimizing the
resource allocation mechanism and adding support for multiple resource types
will also be our next focus. To improve the support of many application types,
we plan to add the possibility for applications to express placement preferences.
Finally, we plan to implement and validate our architecture in a real system.
An Economic Approach for Application QoS Management in Clouds 435

References
[1] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer,
R., Pratt, I., Warfield, A.: Xen and the art of virtualization. In: Proceedings of
the Nineteenth ACM Symposium on Operating Systems Principles (SOSP 2003),
pp. 164–177. ACM Press, New York (2003)
[2] Calheiros, R.N., Ranjan, R., Beloglazov, A., De Rose, C.A.F., Buyya, R.:
Cloudsim: a toolkit for modeling and simulation of cloud computing environments
and evaluation of resource provisioning algorithms. Software: Practice and Expe-
rience 41(1), 23–50 (2011)
[3] Carrera, D., Steinder, M., Whalley, I., Torres, J., Ayguade, E.: Utility-based place-
ment of dynamic web applications with fairness goals. In: IEEE Network Opera-
tions and Management Symposium, pp. 9–16 (2008)
[4] Chun, B.N., Culler, D.E.: REXEC: A Decentralized, Secure Remote Execution
Environment for Clusters. In: Falsafi, B., Lauria, M. (eds.) CANPC 2000. LNCS,
vol. 1797, pp. 1–14. Springer, Heidelberg (2000)
[5] Lai, K., Rasmusson, L., Adar, E., Zhang, L., Huberman, B.: Tycoon: An imple-
mentation of a distributed, market-based resource allocation system. Multiagent
and Grid Systems 1(3), 169–182 (2005)
[6] Nguyen Van, H., Dang Tran, F., Menaud, J.-M.: SLA-aware virtual resource man-
agement for cloud infrastructures. In: 9th IEEE International Conference on Com-
puter and Information Technology (CIT 2009), pp. 1–8 (2009)
[7] Norris, J., Coleman, K., Fox, A., Candea, G.: Oncall: Defeating spikes with a free-
market application cluster. In: Proceedings of the First International Conference
on Autonomic Computing (2004)
[8] Sandholm, T., Lai, K.: Dynamic Proportional Share Scheduling in Hadoop.
In: Frachtenberg, E., Schwiegelshohn, U. (eds.) JSSPP 2010. LNCS, vol. 6253,
pp. 110–131. Springer, Heidelberg (2010)
[9] Sotomayor, B., Montero, R., Llorente, I., Foster, I.: An Open Source Solution for
Virtual Infrastructure Management in Private and Hybrid Clouds. IEEE Internet
Computing 13(5), 14–22 (2009)
[10] Stratford, N., Mortier, R.: An economic approach to adaptive resource manage-
ment. In: Proceedings of the The Seventh Workshop on Hot Topics in Operating
Systems, HOTOS 1999. IEEE Computer Society (1999)
[11] Tesauro, G., Kephart, J.O., Das, R.: Utility functions in autonomic systems. In:
ICAC 2004: Proceedings of the First International Conference on Autonomic Com-
puting, pp. 70–77. IEEE Computer Society (2004)
[12] Yeo, C.S., Buyya, R.: A taxonomy of market-based resource management systems
for utility-driven cluster computing. Softw. Pract. Exper. 36, 1381–1419 (2006)
Evaluation of the HPC Challenge Benchmarks
in Virtualized Environments

Piotr Luszczek, Eric Meek, Shirley Moore, Dan Terpstra, Vincent M. Weaver,
and Jack Dongarra

Innovative Computing Laboratory


University of Tennessee Knoxville
{luszczek,shirley,terpstra,vweaver1,dongarra}@eecs.utk.edu

Abstract. This paper evaluates the performance of the HPC Challenge


benchmarks in several virtual environments, including VMware, KVM
and VirtualBox. The HPC Challenge benchmarks consist of a suite of
tests that examine the performance of HPC architectures using ker-
nels with memory access patterns more challenging than those of the
High Performance LINPACK (HPL) benchmark used in the TOP500 list.
The tests include four local (matrix-matrix multiply, STREAM, Rando-
mAccess and FFT) and four global (High Performance Linpack – HPL,
parallel matrix transpose – PTRANS, RandomAccess and FFT) kernel
benchmarks.
The purpose of our experiments is to evaluate the overheads of the
different virtual environments and investigate how different aspects of
the system are affected by virtualization. We ran the benchmarks on an
8-core system with Core i7 processors using Open MPI. We did runs on
the bare hardware and in each of the virtual environments for a range of
problem sizes. As expected, the HPL results had some overhead in all the
virtual environments, with the overhead becoming less significant with
larger problem sizes. The RandomAccess results show drastically differ-
ent behavior and we attempt to explain it with pertinent experiments.
We show the cause of variability of performance results as well as major
causes of measurement error.

1 Introduction

With the advent of cloud computing, more and more workloads are being moved
to virtual environments. High Performance Computing (HPC) workloads have
been slow to migrate, as it has been unclear what kinds of trade-offs will occur

This material is based upon work supported in part by the National Science Foun-
dation under Grant No. 0910812 to Indiana University for “FutureGrid: An Ex-
perimental, High-Performance Grid Test-bed.” Partners in the FutureGrid project
include U. Chicago, U. Florida, San Diego Supercomputer Center - UC San Diego, U.
Southern California, U. Texas at Austin, U. Tennessee at Knoxville, U. of Virginia,
Purdue I., and T-U. Dresden.


when running these workloads in such a setup [10,13]. We evaluated the over-
heads of several different virtual environments and investigated how different
aspects of the system are affected by virtualization.
The virtualized environments we investigated were VMware Player, KVM and
VirtualBox. We used the HPC Challenge (HPCC) benchmarks [6] to evaluate
these environments. HPC Challenge examines performance of HPC architectures
using kernels with memory access patterns more challenging than those of the
High Performance LINPACK (HPL) benchmark used in the TOP500 list. The
tests include four local (matrix-matrix multiply, STREAM, RandomAccess and
FFT) and four global (High Performance Linpack – HPL, parallel matrix trans-
pose – PTRANS, RandomAccess and FFT) kernel benchmarks.
We ran the benchmarks on an 8-core system with Core i7 processors using
Open MPI. We ran on bare hardware and inside each of the virtual environments
for a range of problem sizes.
As expected, the HPL results had some overhead in all the virtual environ-
ments, with the overhead becoming more significant with larger problem sizes
and VMware Player having the least overhead. The latency results showed higher
latency in the virtual environments, with KVM being the highest.
We do not intend for this paper to provide a definitive answer as to which
virtualization technology achieves the highest performance results. Rather, we
seek to provide guidance on more generic behavioral features of various virtual-
ization packages and to further the understanding of VM technology paradigms
and their implications for performance-conscious users.

2 Related Work
There have been previous works that looked at measuring the overhead of HPC
workloads in a virtualized environment. Often the works measure timing external
to the guest, or, when they use the guest, they do not explain in great detail what
problems they encountered when trying to extrapolate meaningful performance
measurements: the very gap we attempt to bridge with this paper.
Youseff et al. [14] measured HPC Challenge and ASCI Purple benchmarks.
They found that Xen has better memory performance than real hardware, and
not much overhead.
Walters et al. [12] compared the overheads of VMWare Server (not ESX), Xen
and OpenVZ with Fedora Core 5, Kernel 2.6.16. They used NetPerf and Iozone
to measure I/O and the NAS Parallel benchmarks (both serial, OpenMP and
MPI) for HPC. They found Xen best in networking, OpenVZ best for filesystems.
On serial NAS, most are close to native, some even ran faster. For OpenMP runs,
Xen and OpenVZ are close to real hardware, but VMware has large overhead.
Danciu et al. [1] measured both high-performance and high-throughput work-
loads on Xen, OpenVZ, and Hyper-V. They used LINPACK and Iometer. For
timing, they used UDP packets sent out of the guest to avoid timer scaling is-
sues. They found that Xen ran faster than native on many workloads, and that
I/O did not scale well when running multiple VMs on the same CPU.

Han et al. [2] ran Parsec and MPI versions of the NAS Parallel benchmarks
on Xen and kernel 2.6.31. They found that the overhead becomes higher when
more cores are added.
Huang et al. [4] ran the MPI NAS benchmarks and HPL inside of Xen. They
measured performance using the Xenoprof infrastructure and found most of the
overhead to be I/O related.
Li et al. [5] ran SPECjvm2008 on a variety of commercial cloud providers.
Their metrics include cost as well as performance.
Mei et al. [7] measured performance of webservers using Xenmon and Xentop.
Performance of OpenMP benchmarks was studied in detail and showed a
wide range of overheads that depended on the work load and parallelization
strategies [9].

3 Setup
3.1 Self-monitoring Results
When conducting large HPC experiments on a virtualized cluster, it would be
ideal if performance results could be gathered from inside the guest. Most HPC
workloads are designed to be measured that way, and doing so requires no change
to existing code.
Unfortunately measuring time from within the guest has its own difficulties.
These are spelled out in detail by VMware [11]. Time that occurs inside a guest
may not correspond at all to outside wallclock time. The virtualization software
will try its best to keep things relatively well synchronized, but, especially if
multiple guests are running, there are no guarantees.
On modern Linux, either gettimeofday() or clock_gettime() is used by
most applications to gather timing information. PAPI, for example, uses
clock_gettime() for its timing measurements. The C library translates these calls
into kernel calls and executes them either by system call, or by the faster VDSO
mechanism that has lower overhead. Linux has a timer layer that supports these
calls. There are various underlying timers that can be used to generate the tim-
ing information, and an appropriate one is picked at boot time. The virtualized
host emulates the underlying hardware and that is the value passed back to
the guest. Whether the returned value is “real” time or some sort of massaged
virtual time is up to the host.
A list of timers that are available can be found by looking at the file
/proc/timer_list.
There are other methods of obtaining timing information. The rdtsc instruction
reads a 64-bit time-stamp counter on all modern x86 chips. Usually this can be
read from user space. VMs like VMware can be configured to pass through the
actual system TSC value, allowing access to actual time. Some processor imple-
mentations stop or change the frequency of the TSC during power management
situations, which can limit the usefulness of this resource.
The RTC (real time clock) can also generate time values and can be accessed
directly from user space. However, this timer is typically virtualized.

Others have attempted to obtain real wall clock time measurements by sending
messages outside the guest and measuring time there. For example, Danciu et
al. [1] send a UDP packet to a remote guest at the start and stop of timing,
which allows outside wallclock measurements. We prefer not to do this, as it
requires network connectivity to the outside world that might not be available
on all HPC virtualized setups.
For our measurements we use the values returned by the HPC Challenge
programs, which just call the gettimeofday() interface invoked by MPI_Wtime().
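For illustration, the two families of timers discussed above can be read from inside a Linux guest as follows (a Python sketch; whether the returned values track real wall-clock time remains up to the host):

    import time

    # Wall-clock time, backed by gettimeofday()/CLOCK_REALTIME.
    wall_start = time.time()
    # Per-process CPU time, comparable to what clock() reports.
    cpu_start = time.process_time()

    _ = sum(i * i for i in range(10**6))      # some CPU-bound work

    wall_elapsed = time.time() - wall_start
    cpu_elapsed = time.process_time() - cpu_start
    print("wall clock: %.6f s, CPU: %.6f s" % (wall_elapsed, cpu_elapsed))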

3.2 Statistical Methods


As the VM environments are one step removed from the hardware and, conse-
quently, introduce additional sources of measurement errors, we make an effort to
counteract this uncertainty with a number of statistical techniques. Each of the
results we report is a statistical combination of up to five measurements, each
of which was taken under the same (or nearly the same) circumstances. One
exception is the accuracy-drift experiments from Section 5 that were explicitly
meant to show variability of performance measurement caused by inconsistent
state of the tested VMs. When applicable, we also indicate the standard devi-
ation on our graph charts to indicate the variability and a visual feedback on
trustworthiness of the particular measurement.
To combine multiple data points, we use the minimum function for time and
the maximum for performance. In our view, these two best filter out randomly
injected overheads that could potentially mask the inherent behavior of the
virtual environments we tested.
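The reduction of repeated runs can be expressed as follows (a sketch with assumed names):

    import statistics

    def combine_runs(times, rates):
        # minimum for times, maximum for performance rates; the standard
        # deviation is reported alongside to show the variability of the runs
        return {"time": min(times), "perf": max(rates),
                "time_std": statistics.pstdev(times),
                "perf_std": statistics.pstdev(rates)}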

3.3 Hardware and Software Used in Tests


For our tests we used an Intel Core i7 platform. The featured processor was a
four-core Intel Xeon X5570 clocked at 2.93 GHz. The VMware Player was version
3.1.4, VirtualBox – 4.0.8, and KVM was compatible with kernel version 2.6.35.
All the VMMs were hosted on Linux with kernel 2.6.35. We aim this analysis
more towards consumer-grade solutions. Our intention is to focus on server-
and enterprise-level solutions in follow-up virtualization work, which would
close the gap in the test environments and extend our investigation to the Xen and
VMware ESXi products. The decision is also driven by our ability to perform
our tests in parallel on multiple instances of the same hardware to accelerate the
testing process while retaining quality by keeping the setup and running
environment consistent throughout all runs.

4 Disparity between CPU and Wall Clock Timers


Among other things, HPCC attempts to underscore the difference between CPU
time and wall clock time in performance measurements. The former may sim-
ply be measured with a C library call clock() and the latter with either BSD’s
gettimeofday() or clock_gettime() from the real-time extension of POSIX. A common
complaint is the low resolution of the primitives for measuring the CPU time.
Under greater scrutiny, the complaint stems from the workings of CPU time
accounting inside the kernel – the accuracy is closely related to the length of the
scheduler tick for computationally intensive processes. As a result, CPU and wall
clock time tend to be in large disagreement for such workloads across shorter
timing periods. This observation is readily confirmed with the results from Fig-
ure 1. The figure plots the relative difference between readings from both timers
across a large set of input problem sizes:

| T_CPU / T_wall − 1 | × 100% .                                                   (1)

Fig. 1. Variation in percentage difference between the measured CPU and wall clock
times for the MPIRandomAccess test of HPC Challenge. The vertical axis has been
split to offer a better resolution for the majority of data points.

We may readily observe nearly a 50% difference between the reported CPU and
wall clock times for small problem sizes on bare metal. The discrepancy dimin-
ishes for larger problem sizes that require longer processing times and render
low timer resolution much less of an issue. In fact, the difference drops below a
single percentage point for most problems larger than 2^22. Virtual environments
do not enjoy such consistent behavior though. For small problem sizes, both
timers diverge only by about 5%. Our understanding attributes this behavior to
a much greater overhead imposed by virtual environments on system calls such
as the timing primitives that require hardware access. More information on the
sources of this overhead may be found in Section 2. In summary, for a wide range
of problem sizes we observed nearly an order of magnitude difference between
the observed behavior of CPU and wall clock time.

Fig. 2. Variation in percentage difference between the measured wall clock times for
HPL (a computationally intensive problem) for ascending and descending orders of
problem sizes during execution
Another common problem is a timer inversion whereby the system reports
that the process' CPU time exceeds the wall clock time: T_CPU > T_wall. On bare
metal, the timer inversion occurs due to a large difference in relative accuracy
of both timers and is most likely to occur when measuring short periods of time
that usually result in large sampling error [3]. This is not the case inside all of the
virtual machines we tested. The observed likelihood of timer inversion for virtual
machines is many-fold greater than the bare metal behavior. In addition, the
inversions occur even for some of the largest problem sizes we used: a testament
to a much diminished accuracy of the wall clock timer that we attribute to the
virtualization overhead.

5 Accuracy-Drift of Timers with VM State Accumulation

Another important feature of virtual environments that we investigated was the


accuracy drift of measurements with respect to the amount of accumulated state
within the VMM. This directly relates to a common workflow within benchmark-
ing and performance tuning communities whereby a given portion of the code
is run repeatedly until a satisfactory speedup is reached. Our findings indicate

that this may lead to inconsistent results due to, in our understanding, a change
of state within software underlying the experiment. We understand that this
may be related to the fact that VMs maintain internal data structures that
evolve over time and change the virtualization overheads. To illustrate this phe-
nomenon, we ran HPL, part of the HPC Challenge suite, twice with the same
set of input problem sizes. In the first instance, we made the runs in ascending
order: starting with the smallest problem size and ending with the largest. Then
after a reboot, we used the descending order: the largest problem size came first.
On bare metal, the resulting performance shows no noticeable impact. This is
not the case inside the VMs as shown in Figure 2. We plot the percentage dif-
ference of times measured for the same input problem size for the ascending and
descending orders of execution:
 
| T_descending / T_ascending − 1 | × 100% .                                       (2)

The most dramatic difference was observed for the smallest problem size of
1000. In fact, it was well over 50% for both VirtualBox and VMware Player. We
attribute this to the short running time of this experiment and, consequently,
a large influence of the aforementioned anomaly. Even if we were to dismiss
this particular data point, the effects of the accuracy drift are visible across the
remaining problem sizes but the pattern of influence is appreciably different.
For VMware Player, the effect gradually abates for large problem sizes after
attaining a local maximum of about 15% at 4000. On the contrary, KVM shows
comparatively little change for small problem sizes and then drastically increases
half way through to stay over 20% for nearly the rest of the large problem sizes.
And finally, the behavior of VirtualBox initially resembles that of VMware Player
and later the accuracy drift diminishes to fluctuate under 10%.
From a performance measurement standpoint this resembles the problem
faced when benchmarking file systems. The factors influencing the results include
the state of the memory file cache and the level of file and directory fragmenta-
tion [8]. In virtual environments, we also observe this persistence of state that
in the end influences the performance of VMs and the results observed inside its
guest operating systems. Clean boot results proved to be the most consistent in
our experiments. However, we recognize that, for most users, rebooting the VM
after each experiment might not be a feasible deployment requirement.

6 Results
In previous sections we have outlined the potential perils of accurate performance
evaluation of virtual environments. With these in mind, we attempt to show in
this section the performance results we obtained by running the HPC Challenge
suite across the tested VMs. We consider the ultimate goal for a VM to match
the performance of the bare metal run. In our performance plots, we therefore use
a relative metric – the percentage fraction of bare metal performance that is
achieved inside a given VM:
[Figure 3: three bar charts (VMware Player, VirtualBox, KVM) plotting the fraction of bare metal performance (%) against HPL problem size.]

Fig. 3. Percentage of bare metal performance achieved inside VMware Player, Virtu-
alBox, and KVM for HPC Challenge’s HPL test. Each data bar shows the standard
deviation bar to indicate the variability of the measurement.

\frac{\mathrm{performance}_{\mathrm{VM}}}{\mathrm{performance}_{\mathrm{bare\ metal}}} \times 100\% .   (3)
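As an illustration of how this relative metric and its variability could be aggregated, a small sketch follows; it is ours, and the measurement values are hypothetical placeholders, not data from the paper.

import statistics

# Sketch: relative performance metric of Eq. (3) with its standard deviation.
# The lists below stand in for repeated benchmark measurements (Gflop/s).
bare_metal_runs = [41.2, 41.5, 41.3]
vm_runs = {"VMware Player": [40.1, 39.8, 40.4],
           "VirtualBox":    [38.7, 37.9, 39.2],
           "KVM":           [39.5, 39.9, 39.1]}

bare_metal_mean = statistics.mean(bare_metal_runs)
for vm, runs in vm_runs.items():
    fractions = [perf / bare_metal_mean * 100.0 for perf in runs]
    print(f"{vm}: {statistics.mean(fractions):.1f}% of bare metal "
          f"(std dev {statistics.stdev(fractions):.1f})")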

Due to space constraints we cannot present a detailed view of all of our results.
Instead, we focus on two tests from the HPC Challenge suite: HPL and MPI-
RandomAccess. By selecting these two tests we intend to contrast the behavior
of two drastically different workloads. The former represents codes that spend
most of their running time inside highly optimized library kernels that nearly op-
timally utilize the cache hierarchy and exhibit very low OS-level activity which
could include servicing TLB misses and network card interrupts required for
inter-process communication. Such workloads are expected to suffer little from
the introduction of a virtualization layer and our results confirm this as shown
in Figure 3. In fact, we observed that virtualization adds very little overhead for
such codes and the variability of the results caused by the common overheads is
relatively small across a wide range of input problem sizes. In contrast, MPI-
RandomAccess represents workloads that exhibit high demand on the memory
subsystem including TLBs and require handling of very large counts of short
messages exchanged between processors. Each of these characteristics stresses
the bare metal setup and is expected to do so inside a virtualized environment.
Our results from Figure 4 fully confirm this prediction. The virtualization over-
head is very high and could reach 70% performance loss. Furthermore, a large
[Figure 4: three bar charts (VMware Player, VirtualBox, KVM) plotting the fraction of bare metal performance (%) against log(problem size).]

Fig. 4. Percentage of bare metal performance achieved inside VMware Player, Virtu-
alBox, and KVM for HPC Challenge’s MPIRandomAccess test. Each data bar shows
the standard deviation bar to indicate the variability of the measurement.

standard deviation of the measurements indicates to us that long-running codes
of this type can accumulate state information inside the VM that adversely affects
the accuracy of the measurements. Such accumulated accuracy drift may
persist across a multitude of input problem sizes.

7 Conclusions and Future Work


In this paper we have shown how virtualization exacerbates the common prob-
lems of accurate performance measurement. Furthermore, we have observed new
obstacles to reliable measurements which we related to the accumulation of state
information that is internal to all the tested VMs. This leads us to present our
results with clear indicators of their statistical quality and a detailed description
of settings and circumstances of the runs to render them repeatable. This, we be-
lieve, should be strongly stressed for measuring performance with virtualization
enabled – even more so than is customary for bare metal runs.
We also showed two drastically different workloads in terms of how they stress
the virtualization layer. In the future, we will focus on a detailed examination
of these workloads and on devising new ones that will help us better understand
the resulting overheads and performance variability, and how hardware features
such as nested paging could help. As mentioned earlier, we would like to extend
our tests beyond desktop-oriented solutions. In particular, we are looking into
testing Xen and VMware ESXi to see how our observations carry over to these
technologies. They are much closer to the hardware, and we believe this will
give them an advantage over the virtualization platforms presented in this paper.

References
1. Danciu, V.A., gentschen Felde, N., Kranzlmüller, D., Lindinger, T.: High-
performance aspects in virtualized infrastructures. In: 4th International DMTF
Academic Alliance Workshop on Systems and Virtualization Management,
pp. 25–32 (October 2010)
2. Han, J., Ahn, J., Kim, C., Kwon, Y., Choi, Y.-r., Huh, J.: The Effect of Multi-
core on HPC Applications in Virtualized Systems. In: Guarracino, M.R., Vivien,
F., Träff, J.L., Cannatoro, M., Danelutto, M., Hast, A., Perla, F., Knüpfer, A.,
Di Martino, B., Alexander, M. (eds.) Euro-Par-Workshop 2010. LNCS, vol. 6586,
pp. 615–623. Springer, Heidelberg (2011)
3. Hines, S., Wyatt, B., Chang, J.M.: Increasing timing resolution for processes and
threads in Linux (2000) (unpublished)
4. Huang, W., Liu, J., Abali, B., Panda, D.: A case for high performance computing
with virtual machines. In: Proceedings of the 20th Annual International Conference
on Supercomputing (2006)
5. Li, A., Yang, X., Kandula, S., Zhang, M.: CloudCmp: comparing public cloud
providers. In: 10th Annual Conference on Internet Measurement (2010)
6. Luszczek, P., Bailey, D., Dongarra, J., Kepner, J., Lucas, R., Rabenseifner, R.,
Takahashi, D.: The HPC Challenge (HPCC) benchmark suite. In: SuperComputing
2006 Conference Tutorial (2006)
7. Mei, Y., Liu, L., Pu, X., Sivathanu, S.: Performance measurements and analy-
sis of network I/O applications in virtualized cloud. In: IEEE 3rd International
Conference on Cloud Computing, pp. 59–66 (August 2010)
8. Smith, K.A., Seltzer, M.: File layout and file system performance. Computer Sci-
ence Technical Report TR-35-94, Harvard University (1994)
9. Tao, J., Fürlinger, K., Marten, H.: Performance Evaluation of OpenMP Appli-
cations on Virtualized Multicore Machines. In: Chapman, B.M., Gropp, W.D.,
Kumaran, K., Müller, M.S. (eds.) IWOMP 2011. LNCS, vol. 6665, pp. 138–150.
Springer, Heidelberg (2011)
10. Tsugawa, M., Fortes, J.A.B.: Characterizing user-level network virtualization: per-
formance, overheads and limits. International Journal of Network Management
(2009), doi:10.1002/nem.733
11. Timekeeping in VMware Virtual Machines: VMware ESX 4.0/ESXi 4.0, VMware
workstation 7.0 information guide
12. Walters, J., Chaudhary, V., Cha, M., Guercio, S.J., Gallo, S.: A comparison of vir-
tualization technologies for HPC. In: 22nd International Conference on Advanced
Information Networking and Applications, pp. 861–868 (March 2008)
13. Younge, A.J., Henschel, R., Brown, J.T., von Laszewski, G., Qiu, J., Fox, G.C.:
Analysis of virtualization technologies for High Performance Computing environ-
ments. In: Proceedings of The Fourth IEEE International Conference on Cloud
Computing (CLOUD 2011), Washington Marriott, Washington DC, USA, July
4-9 (2011); technical Report (February 15, 2011), updated (April 2011)
14. Youseff, L., Wolski, R., Gorda, B., Krintz, C.: Paravirtualization for HPC Sys-
tems. In: Min, G., Di Martino, B., Yang, L.T., Guo, M., Rünger, G. (eds.) ISPA
Workshops 2006. LNCS, vol. 4331, pp. 474–486. Springer, Heidelberg (2006)
DISCOVERY, Beyond the Clouds
DIStributed and COoperative Framework to Manage
Virtual EnviRonments autonomicallY:
A Prospective Study

Adrien Lèbre1, Paolo Anedda2, Massimo Gaggero2, and Flavien Quesnel1


1
ASCOLA Research Group, Ecole des Mines de Nantes, Nantes, France
{firstname.lastname}@mines-nantes.fr
2
CRS4 Distributed Computing Group, Edificio 1, Polaris, Pula, Italy
{firstname.lastname}@crs4.it

Abstract. Although the use of virtual environments provided by cloud


computing infrastructures is gaining consensus from the scientific com-
munity, running applications in these environments is still far from reach-
ing the maturity of more usual computing facilities such as clusters or
grids. Indeed, current solutions for managing virtual environments are
mostly based on centralized approaches that barter large-scale concerns
such as scalability, reliability and reactivity for simplicity. However, con-
sidering current trends in cloud infrastructures in terms of size (larger
and larger) and usage (cross-federation), every large-scale concern must
be addressed as soon as possible to efficiently manage the next
generation of cloud computing platforms.
In this work, we propose to investigate an alternative approach lever-
aging DIStributed and COoperative mechanisms to manage Virtual En-
viRonments autonomicallY (DISCOVERY). This initiative aims at over-
coming the main limitations of the traditional server-centric solutions
while integrating all mandatory mechanisms into a unified distributed
framework. The system we propose to implement relies on a peer-to-peer
model where each agent can efficiently deploy, dynamically schedule
and periodically checkpoint the virtual environments it manages. The
article introduces the global design of the DISCOVERY proposal and
gives a preliminary description of its internals.

1 Introduction

Since the first proposals almost ten years ago [15,20], the use of virtual technolo-
gies has radically changed the perception of distributed infrastructures. Through
an encapsulation of software layers into a new abstraction – the virtual machine
(VM) –, users can run their own runtime environment without considering, in
most cases, software and hardware restrictions which were formerly imposed by
computing centers. Relying on specific APIs, users can create, configure and up-
load their VMs to cloud computing providers, which in turn are in charge of
deploying and running the requested virtual environment (VE) on their physical


infrastructure. In some ways, users may consider the distributed infrastructure
as a single, large piece of hardware on which they can launch as many VMs as they
want to compose and recompose their environment on demand.
Because of its flexibility and its indubitable economic advantage, this ap-
proach, known now as Infrastructure-as-a-Service (IaaS), is becoming more and
more popular. However, running applications in those virtualized environments
and upon those infrastructures is still far from reaching the maturity of more
usual computing facilities such as clusters or grids. Indeed, most IaaS frame-
works, such as Nimbus [1], OpenStack [3] and OpenNebula [32], have been de-
signed with the ultimate goal of deploying VEs upon physical resources, setting
aside large-scale infrastructure challenges or addressing them only as secondary
concerns.
Considering that cloud computing providers permanently invest in new phys-
ical resources to satisfy the increasing demand of VEs, all issues related to the
management of large-scale infrastructures should be considered as major con-
cerns of IaaS frameworks. This is reinforced by recent proposals promoting
the federation of IaaS infrastructures, leading to larger and more complex sys-
tems [21]. From our point of view, both the design of IaaS frameworks and the
management of VEs should be driven by:

– Scalability, targeting the management of hundreds of thousands of VMs upon
thousands of physical machines (PMs), potentially spread across multiple
sites;
– Reliability, considering “hardware failures as the norm rather than the excep-
tion” [8];
– Reactivity, handling each reconfiguration event as swiftly as possible to main-
tain VEs’ Quality of Service (QoS).

While the first point is a well-known challenge, some clarifications should be made
regarding the expectations about the latter two. Concerning reliability,
IaaS frameworks should be robust enough to face failures. Besides remaining
operational – i.e. users can continue to interact with them –, they must provide
mechanisms to resume any faulty VEs in a consistent state, while limiting the
impact on the sound ones. Regarding reactivity, IaaS frameworks should swiftly
handle events that require performing particular operations either on virtual or
on physical resources. These events can be related to submissions or completions
of VEs, to physical resource changes, or to administrator’s interventions. The
main objective is to maximize system utilization while ensuring that QoS
expectations are met.
Although the management and use of VMs in distributed architectures is
a hot topic leading to a significant number of publications, most current
works focus on only one particular concern. To the best of our knowledge, no work
currently investigates whether all these concerns can be tackled together within
a unified system.
Yet, we assume that the maturity of system virtualization capabilities and recent
improvements in their usage [7, 14] make it possible to design and implement such a
system. To overcome the issues of traditional server-centric solutions, its design
should benefit from the lessons learnt from distributed operating systems and
Single System Image proposals [29]. Furthermore, to address the different objec-
tives and reduce the management complexity, we advocate the use of autonomic
mechanisms [22]. In other words, we argue for the design and the implementa-
tion of a distributed OS, sometimes referred to as a “Cloud OS”, manipulating VEs
instead of processes. We strongly support the use of micro-kernel concepts to
deliver a platform-agnostic framework where a physical node, as well as a complete
IaaS platform, can be seen as simple bare hardware. As a consequence, the
framework can manage each VM of a VE throughout a federation of bare hard-
ware, using the capabilities provided by each of them (for example start/stop,
suspend/resume).
To our best knowledge, XenServer [4] and VSphere [24] are probably the
most advanced proprietary solutions targeting most of these goals. However,
they are still facing scalability issues and do not address, for instance, IaaS
federation concerns. In this paper, we propose to go further by giving an overview
of the DISCOVERY architecture, a DIStributed and COoperative framework to
manage Virtual EnviRonments autonomicallY.
The remainder of the paper is organized as follows. First, we present current
IaaS frameworks and most advanced mechanisms to manage VEs in Section 2.
Second, we introduce the global architecture and briefly discuss scientific and
technical challenges of the components of the DISCOVERY system in Section 3.
Section 4 presents an overview of the DISCOVERY engine. Finally, Section 5
concludes and highlights the importance of addressing such a proposal through a
solid community composed of experts from each domain (storage, network, fault
tolerance, P2P, security, etc.).

2 Related Work

Due to the recent widespread diffusion of cloud computing (CC), there is a


growing number of software projects that deal with the management of virtual
infrastructures, especially in the context of private CC. Most of these systems
are designed to substantially reduce the administrative burden of managing clus-
ters of virtual machines while simultaneously improving the ability of users to
request, control, and customize their virtual computing environment. Beyond
the previously cited Nimbus, OpenStack and OpenNebula projects, there are many
other open-source projects; among others, we can mention OpenQRM [2],
SnowFlock [23], Usher [26] and Eucalyptus [28]. Although they differ in
technological aspects, most of these systems follow a traditional centralized
approach. As reported in [13], such systems do not scale well and, moreover,
suffer from single points of failure (SPOFs).
Such drawbacks have not really been addressed until now; major improvements
have rather focused on virtualization internals or on particular needs. For
instance, VM live-migration [12] provides flexibility by enabling VEs to be scheduled
dynamically in a cluster-wide context [17]. However, migrating several VMs among
different physical nodes transparently, while ensuring the correctness of their


computation, requires advanced memory management and data transfer strate-
gies [18, 19]. Moreover, when live migration is done at a Wide Area Network
(WAN) scale [9, 10], VM image concerns must also be taken into account. As
we stated in Section 1, failures are an important issue to address, and cloud
resiliency is becoming an important task [16]. Checkpointing is a promising ap-
proach to system reliability [6,11], since it ensures a way for taking snapshots of
the execution of a virtual environment and allows, in case of failure, to restart
computations from a previously saved state. VM images management is another
big concern. While traditional network storage solutions such as the Network
File System (NFS) are perfectly adequate for small clusters, they will not scale
as the number of nodes increases. Apart from the specifics of a given hardware
setup, this is a direct consequence of having an external, fixed storage system
whose bandwidth is independent of the computational cluster size. In contrast,
the use of distributed file systems for VM management [5] seems very promising
and is encouraging the development of dedicated distributed file systems
specifically tailored to VM management [27].
Finally, deploying several VMs in different administrative domains [34], while
providing a unified network overlay, requires new solutions based on the cre-
ation of virtual isolated network environments [25, 31].
A lot of work has been done, and continues to be done, on virtualization in distributed
architectures. However, none of these works focuses on the design and
implementation of a unified system that leverages recent contributions to
efficiently manage VEs across a large-scale infrastructure.

3 The DISCOVERY Proposal


While considering previous and on-going works as foundations for the DISCOV-
ERY initiative, we argue for the design and the implementation of a unified
framework that aims at ensuring scalability, reliability and reactivity in the
management of a significant number of VEs. In this section, we present the
architecture we designed to meet these objectives and then highlight the
scientific and technical challenges of each component.

3.1 Architecture Overview


The DISCOVERY architecture relies on a peer-to-peer model composed of sev-
eral agents (see Figure 1). Each agent cooperates in managing VEs throughout
the DISCOVERY network.
In the DISCOVERY system, we define a VE as a set of VMs that may have
specific requirements in terms of hardware, software and also in terms of place-
ment: a user may express the wish to have particular VMs in the same location
to meet performance objectives, whereas he/she can ask that others not be
collocated in order to ensure high availability, for instance.
In order to be platform agnostic, each agent relies on virtualization technology
wrappers. This makes it possible to start, stop, suspend, resume and relocate VMs
Fig. 1. The DISCOVERY infrastructure

without limiting the DISCOVERY proposal to a particular virtualization plat-


form. Moreover, the adoption of the Open Virtualization Format (OVF) [14] by
major virtualization actors should soon make it possible to deploy any VM on any
virtualization platform. Similarly, we propose to leverage IaaS APIs: by means
of specialized DISCOVERY agents that wrap the IaaS functionalities, IaaS
frameworks can be treated as “super” nodes of the system. Although
this implies a few restrictions, such as the current inability to live-migrate VMs
between an external PM and an IaaS framework, it hides all the underlying
instruments so that the VEs are unaware of the physical resources they are
running on. Regarding the VM snapshotting capability that is required to ensure
the reliability of VEs, we assume that IaaS providers will extend their APIs
to offer it in the medium term.

3.2 The DISCOVERY Agent


Relying on the peer-to-peer approach, on the concept of the VEs and on the
common set of VM operations, we designed the DISCOVERY agent. At a coarse
grain, it is composed of three major services (see Figure 2): (i) the DISCOVERY
Network Tracker (DNT), (ii) the Virtual Environments Tracker (VET) and (iii)
the Local Resources Tracker (LRT).
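To make this decomposition concrete, a minimal structural sketch is given below; all class and attribute names are our assumptions, mirror only the architecture described in the text, and are not the DISCOVERY code.

# Illustrative decomposition of a DISCOVERY agent into its three services.
class DiscoveryNetworkTracker:
    """Maintains a logical, DHT-like view of the DISCOVERY network."""
    def __init__(self, overlay):
        self.overlay = overlay                  # e.g. a ring of node identifiers

    def neighbor_of(self, node_id):
        idx = self.overlay.index(node_id)
        return self.overlay[(idx + 1) % len(self.overlay)]

class VirtualEnvironmentTracker:
    """Manages a set of VEs during their whole life cycle."""
    def __init__(self, vet_id):
        self.vet_id = vet_id
        self.ve_handlers = {}                   # VEH id -> VE description

class LocalResourcesTracker:
    """Monitors the local bare hardware and raises scheduling events."""
    def __init__(self, cpu_capacity, ram_capacity):
        self.cpu_capacity = cpu_capacity
        self.ram_capacity = ram_capacity

class DiscoveryAgent:
    """One agent per peer, composed of the three cooperating services."""
    def __init__(self, node_id, overlay):
        self.node_id = node_id
        self.dnt = DiscoveryNetworkTracker(overlay)
        self.vets = []                          # an agent may host zero or several VETs
        self.lrt = LocalResourcesTracker(cpu_capacity=16, ram_capacity=64)

agent = DiscoveryAgent("node-1", ["node-1", "node-2", "node-3"])
print(agent.dnt.neighbor_of("node-1"))          # node-2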

DISCOVERY Network Tracker. The DNT is in charge of maintaining a


logical view of the DISCOVERY network to make communications and informa-
tion sharing between services transparent and reliable. Leveraging Distributed
Hash Table (DHT) mechanisms [30, 33, 35], it relieves each service of dealing
with the burden of node resiliency. Initial work will focus on reducing as far as
possible the amount of DISCOVERY system state that must be saved into the DHT.
The objective is to minimize performance degradation while ensuring the
reliability of the whole system. Mid-term challenges will concern the definition
of one or several network overlays with respect to the network topologies, so that
when one peer leaves or fails, the one that takes over is “well” located. Finally,
the study of voluntary splitting or merging of overlays may also be relevant.
Fig. 2. Architecture overview

Virtual Environment Tracker. Each VET is in charge of managing a set of VEs
during their whole life cycle. This includes handling user requests, uploading VM
images into the DISCOVERY network and ensuring that the VEs it manages can
start and run correctly until their completion. The main challenges concern:
– The configuration of the network (covering VM IP assignments and use of ad-
vanced technologies to maintain intra-connectivity while ensuring isolation
and avoiding conflicts between the different VEs [25, 31, 34]).
– The management of the VM images that should be (i) consistent with regard
to the location of each VM throughout the DISCOVERY network and (ii)
reachable in case of failures.
– The efficient use of the snapshotting capability to resume a VE from its
latest consistent state in case of failures.
These three concerns are respectively addressed through functionalities available
in the Network, Image and Reliability layers. Each layer will rely on solutions
such as the ones described in Section 2. Our objective is to give developers the
possibility to switch between several mechanisms.

Local Resources Tracker. The LRT is in charge of monitoring the resource
usage of the bare hardware. It notifies events (such as overloaded, underloaded,
shutdown requested, etc.) to other LRTs in order to balance or suspend the VMs of
VEs with respect to the scheduling policy that has been defined (consolidation,
load balancing, etc.). The main challenges concern:
– The management of events (considering that several events may occur
simultaneously throughout the infrastructure, leading to several schedul-
ing/reconfiguration processes).
– The scheduling process itself (keeping in mind that, for scalability reasons, it
will not be able to rely on a global view of resource usage).
– Finally, the application of reconfigurations that may occur concurrently
throughout the infrastructure.

Given the lack of solutions addressing these concerns, and considering that the
LRT component is central to the DISCOVERY architecture, we chose to start our
investigation with it.
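As a purely illustrative sketch of such monitoring (the threshold values, event names and functions are our assumptions, not part of the DISCOVERY prototype), an LRT-like loop could compare local usage against thresholds and notify its peers:

import random, time

# Hypothetical monitoring loop: compares local usage against thresholds
# and notifies other LRTs; all names and values are illustrative only.
OVERLOAD_THRESHOLD = 0.90
UNDERLOAD_THRESHOLD = 0.20

def sample_local_usage():
    # Placeholder for real monitoring (e.g. reading /proc or hypervisor statistics).
    return {"cpu": random.random(), "ram": random.random()}

def notify_peers(event, usage):
    # Placeholder for sending the event to neighboring LRTs over the overlay.
    print(f"notify peers: {event} (cpu={usage['cpu']:.2f}, ram={usage['ram']:.2f})")

def lrt_loop(iterations=3, period=0.1):
    for _ in range(iterations):
        usage = sample_local_usage()
        if usage["cpu"] > OVERLOAD_THRESHOLD or usage["ram"] > OVERLOAD_THRESHOLD:
            notify_peers("overloaded", usage)
        elif usage["cpu"] < UNDERLOAD_THRESHOLD and usage["ram"] < UNDERLOAD_THRESHOLD:
            notify_peers("underloaded", usage)
        time.sleep(period)

lrt_loop()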

4 DISCOVERY in a Nutshell

Acknowledging that a lot of work remains to be done to develop a framework as
complex as the one we described, we present in this section a basic overview of the
DISCOVERY engine. This description is driven by the major events/actions
that may occur throughout the DISCOVERY network.
When a peer joins the DISCOVERY network, it queries the DNT to get a
VET instance. If one peer of the network manages more than one VET, the DNT
will assign one of these VETs to the new peer. Otherwise, a new VET instance
with a unique id is allocated on the new peer, which then becomes active.
A user can query any active peer for the creation of a particular VE. His/her
request is forwarded to one of the VETs available in the system according to
the DISCOVERY balancing policy. Once the request has been assigned to a
particular VET, a VE handler (VEH) is created. This VEH is identified by a
unique id composed of the id of the VET and a local id incremented each time the
VET launches a new VEH. The VEH monitors the VE and applies every operation
required to run it correctly. Similarly to the VEH, a VM handler
(VMH) is created for each VM composing the VE. The VEH and the VMHs
interact during the whole execution of the VE.
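A minimal sketch of such an identification scheme follows; it is our illustration, and the exact id format used by DISCOVERY is not specified in the text.

import itertools

# Illustrative id scheme: a VEH id combines the VET id with a locally
# incremented counter, as described above; the concrete format is an assumption.
def veh_id_factory(vet_id):
    counter = itertools.count(1)
    return lambda: f"{vet_id}:{next(counter)}"

new_veh_id = veh_id_factory("vet-42")
print(new_veh_id())   # vet-42:1
print(new_veh_id())   # vet-42:2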
At the beginning, the VEH locally starts as many VMHs as requested.
The LRT detects these new VMHs and checks whether it will be able to host the
related VMs. Depending on the available resources, each VMH may be relocated
to another peer or be informed that the system cannot satisfy its requirements
due to a lack of resources. When enough resources are available in the DISCOV-
ERY system, each VMH contacts its VEH so that it can effectively start the
VMs. The VEH is then in charge of delivering the VM images to the right loca-
tions and configuring the network (including IP assignments and VLAN setup).
When all VMs are started, the VE switches to the running state. Each time
the LRT decides to relocate (or suspend) a VM, it notifies the VMH, which in
turn informs its VET to perform the requested operation. By preventing direct
interactions on VMs, we ensure that VEs are kept in a consistent state. If one of the
VMs should be suspended due to a lack of resources, the VEH will suspend the
whole VE, keeping it in a consistent state.
When a peer wants to leave the DISCOVERY network, the LRT switches to
an overloaded state where each VMH (and, by transitivity, the related VMs) has
to be relocated somewhere else in the DISCOVERY network. In the meantime,
the DNT associates the VET with another node so that VMHs can continue to
contact it (as illustrated in Figure 2, a DISCOVERY agent can be composed
of several VETs). Once the VET has been assigned to another node and once
all VMHs have been relocated (or suspended), the peer can properly leave.
Regarding reliability, two cases must be considered: the crash of VMs and the
crash of nodes. In the first case, reliability relies on (i) the snapshots of the
VE, which are periodically performed by the VEH, and (ii) the heartbeats that are
periodically sent by each VMH to the VEH. If the VEH does not receive one
of the VMHs’ heartbeats, it has to suspend all remaining VMs and resume the
whole VE from its latest consistent state. This process is similar to the starting
one: the missing VMHs are launched locally and the LRT is in charge of assigning
them throughout the DISCOVERY network. When the LRT completes this op-
eration, the VMHs receive a notification and in turn contact the VET to resume
all VMs from their latest consistent state. Before resuming each VM, the VET
checks whether it has to deliver the snapshot images to the nodes. Regarding
the crash of a node, the recovery process relies on DHT mechanisms used by the
DNT. When a VET starts a new VEH, the description of the associated VE is
stored in the DHT. Similarly, this description is updated/completed each time
the VEH snapshots the VE (mainly to update the locations of the snapshots).
In this way, when the failure of a node is detected (either by leveraging DHT
principles or simply by implementing a heartbeat approach between nodes), the
“neighbor” node is able to restart the VET and the associated VEHs from the
information that has previously been replicated through the DHT. Once all
VEHs have recovered, the VMH heartbeat mechanism is used either to reattach
the VMHs to the VEH or to resume the VE from its latest consistent state
if needed.
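The sketch below illustrates this replication idea with a generic put/get interface standing in for a real DHT; the key format and API are our assumptions, not the DISCOVERY implementation.

# Illustrative use of a DHT to persist VE descriptions so that a neighbor can
# recover them after a node crash. The DHT API and key layout are assumptions.
class InMemoryDHT:
    """Stand-in for a real DHT (e.g. Chord or Pastry); here a plain dictionary."""
    def __init__(self):
        self._store = {}
    def put(self, key, value):
        self._store[key] = value
    def get(self, key):
        return self._store.get(key)

def register_veh(dht, veh_id, ve_description):
    dht.put(f"veh/{veh_id}", ve_description)

def record_snapshot(dht, veh_id, snapshot_locations):
    desc = dht.get(f"veh/{veh_id}")
    desc["snapshots"] = snapshot_locations      # update snapshot locations
    dht.put(f"veh/{veh_id}", desc)

def recover_veh(dht, veh_id):
    # Called by the neighbor node once a crash is detected.
    return dht.get(f"veh/{veh_id}")

dht = InMemoryDHT()
register_veh(dht, "vet-42:1", {"vms": ["vm-a", "vm-b"], "snapshots": {}})
record_snapshot(dht, "vet-42:1", {"vm-a": "node-7:/snap/a", "vm-b": "node-9:/snap/b"})
print(recover_veh(dht, "vet-42:1"))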

5 Conclusion
It is undeniable: virtualization technology has become a key element of dis-
tributed architectures. Although there have been considerable improvements, a
lot of work continues to focus on virtualization internals, and only a few efforts
address the design and implementation concerns of the frameworks that leverage
virtualization technologies to manage distributed architectures. Considering the
growing size of infrastructures in terms of nodes and virtual machines, new
proposals relying on more autonomic and decentralized approaches should be
discussed to overcome the limitations of traditional server-centric solutions.
In this paper, we introduced the DISCOVERY initiative that aims at leverag-
ing recent contributions on virtualization technologies and previous distributed
operating systems proposals to design and implement a new kind of virtualization
framework ensuring scalability, reliability and reactivity of the whole system.
Our proposal relies on micro-kernel approaches and peer-to-peer models.
Starting from the point that each node may be seen as bare hardware providing
basic functionality to manipulate VMs and monitor resource usage, we
designed an agent composed of several services that cooperate in managing virtual
environments throughout the DISCOVERY network.
Although the design may look simple at first sight, the implementation
of each block will require specific expertise. As an example, strong assumptions
on the internals of the Virtual Environments Tracker have been made (assuming
that the three layers, Image, Network and Reliability, are available). Each
of them requires deeper investigation with contributions from the scientific
community. Furthermore, the DISCOVERY framework should be extended with
other concerns, such as security and user quotas, among others, to meet our
objective of designing and implementing a complete distributed OS for VMs.
Again, this cannot be done without involving the scientific community.

References
1. Nimbus is cloud computing for science, http://www.nimbusproject.org/
2. Openqrm, http://www.openqrm.com/
3. Openstack: The open source, open standards cloud. open source software to build
private and public clouds, http://www.openstack.org/
4. XenServer Administrator’s Guide 5.5.0. Tech. rep., Citrix Systems (February 2010)
5. Anedda, P., Leo, S., Gaggero, M., Zanetti, G.: Scalable Repositories for Virtual
Clusters. In: Lin, H.-X., Alexander, M., Forsell, M., Knüpfer, A., Prodan, R.,
Sousa, L., Streit, A. (eds.) Euro-Par 2009 Workshop. LNCS, vol. 6043, pp. 414–423.
Springer, Heidelberg (2010)
6. Anedda, P., Leo, S., Manca, S., Gaggero, M., Zanetti, G.: Suspending, migrating
and resuming hpc virtual clusters. Future Generation Computer Systems 26(8),
1063–1072 (2010)
7. Bolte, M., Sievers, M., Birkenheuer, G., Niehörster, O., Brinkmann, A.: Non-
intrusive virtualization management using libvirt. In: Proceedings of the Con-
ference on Design, Automation and Test in Europe, DATE 2010, pp. 574–579.
European Design and Automation Association, Leuven (2010)
8. Borthakur, D.: The Hadoop Distributed File System: Architecture and Design.
The Apache Software Foundation (2007)
9. Bose, S.K., Brock, S., Skeoch, R., Rao, S.: CloudSpider: Combining Replication
with Scheduling for Optimizing Live Migration of Virtual Machines Across Wide
Area Networks. In: 11th IEEE/ACM International Symposium on Cluster, Cloud,
and Grid Computing (CCGrid), Newport Beach, California, U.S.A (May 2011)
10. Bradford, R., Kotsovinos, E., Feldmann, A., Schiöberg, H.: Live wide-area migra-
tion of virtual machines including local persistent state. In: Proceedings of the
3rd International Conference on Virtual Execution Environments, VEE 2007, pp.
169–179. ACM, San Diego (2007)
11. Chanchio, K., Leangsuksun, C., Ong, H., Ratanasamoot, V., Shafi, A.: An efficient
virtual machine checkpointing mechanism for hypervisor-based hpc systems. In:
High Availability and Performance Computing Workshop, Denver, USA (2008)
12. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I.,
Warfield, A.: Live migration of virtual machines. In: Proceedings of the 2nd con-
ference on Symposium on Networked Systems Design & Implementation, NSDI
2005, vol. 2, pp. 273–286. USENIX Association, Berkeley (2005)
13. Claudel, B., Huard, G., Richard, O.: Taktuk, adaptive deployment of remote exe-
cutions. In: Proceedings of the 18th ACM International Symposium on High Per-
formance Distributed Computing, HPDC 2009. ACM, Munich (2009)
14. DMTF: Open Virtualization Format Specification (January 2010),
http://www.dmtf.org/standards/ovf
15. Figueiredo, R.J., Dinda, P.A., Fortes, J.A.B.: A case for grid computing on virtual
machines. In: Proceedings of the 23rd International Conference on Distributed
Computing Systems (ICDCS). IEEE, Washington, DC (2003)
16. Ghosh, R., Longo, F., Naik, V.K., Trivedi, K.S.: Quantifying resiliency of iaas
cloud. In: SRDS, pp. 343–347. IEEE (2010)
17. Hermenier, F., Lèbre, A., Menaud, J.M.: Cluster-wide context switch of virtual-
ized jobs. In: Proceedings of the 19th ACM International Symposium on High
Performance Distributed Computing, HPDC 2010. ACM, New York (2010)
18. Hines, M.R., Gopalan, K.: Post-copy based live virtual machine migration using
adaptive pre-paging and dynamic self-ballooning. In: Proceedings of the 2009 ACM
SIGPLAN/SIGOPS International Conference on Virtual Execution Environments,
VEE 2009, pp. 51–60. ACM, Washington, DC (2009)
19. Jin, H., Deng, L., Wu, S., Shi, X., Pan, X.: Live virtual machine migration with
adaptive, memory compression. In: IEEE International Conference on Cluster
Computing and Workshops, CLUSTER 2009, pp. 1–10 (September 2009)
20. Keahey, K.: From sandbox to playground: Dynamic virtual environments in the
grid. In: Proceedings of the 5th International Workshop on Grid Computing (2004)
21. Keahey, K., Tsugawa, M., Matsunaga, A., Fortes, J.: Sky computing. IEEE Internet
Computing 13, 43–51 (2009)
22. Kephart, J.O., Chess, D.M.: The vision of autonomic computing. Computer 36(1),
41–50 (2003)
23. Lagar-Cavilla, H.A., Whitney, J., Bryant, R., Patchin, P., Brudno, M., de Lara,
E., Rumble, S.M., Satyanarayanan, M., Scannell, A.: Snowflock: Virtual ma-
chine cloning as a first class cloud primitive. Transactions on Computer Systems
(TOCS) 19(1) (February 2011)
24. Lowe, S.: Introducing VMware vSphere 4, 1st edn. Wiley Publishing Inc., Indi-
anapolis (2009)
25. McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford,
J., Shenker, S., Turner, J.: OpenFlow: Enabling Innovation in Campus Networks.
SIGCOMM Comput. Commun. Rev. 38(2), 69–74 (2008)
26. McNett, M., Gupta, D., Vahdat, A., Voelker, G.M.: Usher: An Extensible Frame-
work for Managing Clusters of Virtual Machines. In: Proceedings of the 21st Large
Installation System Administration Conference (LISA) (November 2007)
27. Nicolae, B., Bresnahan, J., Keahey, K., Antoniu, G.: Going back and forth: Effi-
cient multi-deployment and multi-snapshotting on clouds. In: Proceedings of the
20th ACM International Symposium on High Performance Distributed Computing,
HPDC 2011. ACM, New York (2011)
28. Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L.,
Zagorodnov, D.: The eucalyptus open-source cloud-computing system. In: Pro-
ceedings of the 9th IEEE/ACM International Symposium on Cluster Computing
and the Grid, CCGRID, Washington, DC, USA (2009)
29. Quesnel, F., Lebre, A.: Operating Systems and Virtualization Frameworks: From
Local to Distributed Similarities. In: Cotronis, Y., Danelutto, M., Papadopoulos,
G.A. (eds.) PDP 2011: Proceedings of the 19th Euromicro International Confer-
ence on Parallel, Distributed and Network-Based Computing, pp. 495–502. IEEE
Computer Society, Los Alamitos (2011)
30. Rowstron, A., Druschel, P.: Pastry: Scalable, Decentralized Object Location, and
Routing for Large-Scale Peer-to-Peer Systems. In: Guerraoui, R. (ed.) Middleware
2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001)
31. Ruth, P., Rhee, J., Xu, D., Kennell, R., Goasguen, S.: Autonomic live adaptation
of virtual computational environments in a multi-domain infrastructure. In: IEEE
International Conference on Autonomic Computing, ICAC 2006 (June 2006)
32. Sotomayor, B., Montero, R., Llorente, I., Foster, I., et al.: Virtual infrastructure
management in private and hybrid clouds. IEEE Internet Computing 13(5) (2009)
33. Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.R., Kaashoek, M.F., Dabek,
F., Balakrishnan, H.: Chord: a scalable peer-to-peer lookup protocol for internet
applications. IEEE/ACM Transactions on Networking 11(1), 17–32 (2003)
34. Tsugawa, M., Fortes, J.: A virtual network (vine) architecture for grid computing.
In: International Parallel and Distributed Processing Symposium, p. 123 (2006)
35. Zhao, B.Y., Huang, L., Stribling, J., Rhea, S.C., Joseph, A.D., Kubiatowicz, J.D.:
Tapestry: a resilient global-scale overlay for service deployment. IEEE Journal on
Selected Areas in Communications 22(1), 41–53 (2004)
Cooperative Dynamic Scheduling of Virtual
Machines in Distributed Systems

Flavien Quesnel and Adrien Lèbre

ASCOLA Research Group, Ecole des Mines de Nantes/INRIA/LINA, Nantes, France


firstname.lastname@mines-nantes.fr

Abstract. Cloud Computing aims at outsourcing data and applications


hosting and at charging clients on a per-usage basis. These data and ap-
plications may be packaged in virtual machines (VM), which are them-
selves hosted by nodes, i.e. physical machines.
Consequently, several frameworks have been designed to manage VMs
on pools of nodes. Unfortunately, most of them do not efficiently address
a common objective of cloud providers: maximizing system utilization
while ensuring the quality of service (QoS). The main reason is that
these frameworks schedule VMs in a static way and/or have a centralized
design.
In this article, we introduce a framework that enables VMs to be scheduled
cooperatively and dynamically in distributed systems. We evaluated
our prototype through simulations, to compare our approach with the
centralized one. Preliminary results showed that our scheduler was more
reactive. As future work, we plan to investigate further the scalability of
our framework, and to improve reactivity and fault-tolerance aspects.

1 Introduction
Scheduling jobs has been a major concern in distributed computer systems.
Traditional approaches rely on batch schedulers [2] or on distributed operating
systems (OS) [7]. Although batch schedulers are the most widely deployed solutions,
they may lead to a suboptimal use of resources. They usually schedule processes
statically – each process is assigned to a given node and stays on it until its
termination – according to user requests for resource reservations, which may
be overestimated. In contrast, preemption mechanisms were developed for
distributed OSes to make them schedule processes dynamically, in line with
their effective resource requirements. However, these mechanisms were hard to
implement due to the problem of residual dependencies [1].
Using system virtual machines (VMs) [14] instead of processes allows jobs to be
scheduled dynamically while avoiding the issue of residual dependencies
[4,12]. However, some virtual infrastructure managers (VIM) still schedule
VMs in a static way [6,10]; this conflicts with a common objective of virtual infras-
tructure providers: maximizing system utilization while ensuring the quality of
service (QoS). Other VIMs implement dynamic VM scheduling [5,8,15], which
enables a finer management of resources and resource overcommitment. However,


[Figure 1: (a) scheduling steps on a service node and worker nodes (1. monitoring, 2. computing the schedule, 3. applying the schedule); (b) workload fluctuations during scheduling.]

Fig. 1. Scheduling in a centralized architecture

they often rely on a centralized design, which prevents them from scaling and from
being reactive. Scheduling is indeed an NP-hard problem; the time needed to solve it
grows exponentially with the number of nodes and VMs considered. Besides, it
takes time to apply a new schedule, because manipulating VMs is costly [4]. Dur-
ing the computation and the application of a schedule (cf. Fig. 1(a)), centralized
managers do not enforce the QoS anymore, and thus cannot react quickly to QoS
violations. Moreover, the schedule may be outdated when it is eventually applied
if the workloads have changed (cf. Fig. 1(b)). Finally, centralization can lead to
fault-tolerance issues: VMs may not be managed anymore if the master node
crashes, as it is a single point of failure (SPOF). Considering all the limitations
of centralized solutions, more decentralized ones should be investigated. Indeed,
scheduling takes less time if the work is distributed among several nodes, and
the failure of a node does not stop the scheduling anymore.
Several proposals have been made precisely to distribute dynamic VM man-
agement [3,13,17]. However, the resulting prototypes are still partially central-
ized. Firstly, at least one node has access to a global view of the system. Secondly,
several VIMs consider all nodes for scheduling, which limits scalability. Thirdly,
several VIMs still rely on service nodes, which are potential SPOFs.
In this paper, we introduce a VIM that enables VMs to be scheduled and managed
cooperatively and dynamically in distributed systems. We designed it to be non-
predictive and event-driven, to work with partial views of the system, and to
require no SPOF. We made these choices for the VIM to be reactive, scalable
and fault-tolerant. In our proposal, when a node cannot guarantee the QoS for
its hosted VMs or when it is under-utilized, it starts an iterative scheduling pro-
cedure (ISP) by querying its neighbor to find a better placement. If the request
cannot be satisfied by the neighbor, it is forwarded to the following one until the
ISP succeeds. This approach allows each ISP to consider a minimum number of
nodes, thus decreasing the scheduling time, without requiring a central point. In
addition, several ISPs can occur independently at the same moment throughout
the infrastructure, which significantly improves the reactivity of the system. It
should be noted that nodes are reserved for exclusive use by a single ISP, to
prevent conflicts that can occur if several ISPs do concurrent operations on the
same nodes or VMs. In other words, scheduling is performed on partitions of the
system, that are created dynamically. Moreover, communication between nodes
is done through a fault-tolerant overlay network, which relies on distributed hash
table (DHT) mechanisms to mitigate the impact of a node crash [9]. We eval-
uated our prototype by means of simulations, to compare our approach with
the centralized one. Preliminary results were encouraging and showed that our
scheduler was reactive even if it had to manage several nodes and VMs.
The remainder of this article is structured as follows. Section 2 presents re-
lated work. Section 3 gives an overview of our proposal, while Sect. 4 details its
implementation and Sect. 5 compares it to a centralized proposal [5]. Finally,
Sect. 6 discusses perspectives and Sect. 7 concludes this article.

2 Related Work
This section presents some works that aim at distributing resource management,
especially those related to the dynamic scheduling of VMs. Contrary to previous
solutions that performed scheduling periodically, recent proposals tend to rely
on an event-based approach: scheduling is started only if an event occurs in the
system, for example if a node is overloaded.
In the DAVAM project [16], VMs are dynamically distributed among man-
agers. When a VM does not have enough resources, its manager tries to relocate it
by considering all resources of the system (the manager builds this global view
by communicating with its neighbors).
Another proposal [13] relies on peer-to-peer networks. It is very similar to the
centralized approaches, except that there is no service node, which makes it more
fault-tolerant. When an event occurs on a node, this node collects monitoring
information on all nodes, finds which nodes can help it to fix the problem, and
performs appropriate migrations.
A third proposal [17] relies on the use of a service node that collects mon-
itoring information on all worker nodes. When an event occurs on a worker
node, this node retrieves information from the service node, computes a new
schedule and performs appropriate migrations. This approach does not consider
fault-tolerance issues.
Snooze [3] has a hierarchical design: nodes are dynamically distributed among
managers; a super manager oversees the managers and has a global view of the
system. When an event occurs, it is processed by a manager that considers all
nodes it is in charge of. Snooze's design is close to that of Hasthi [11]; the main
difference is that Snooze targets virtualized systems and single system images,
while Hasthi is presented as system agnostic.

3 Proposal Overview
In this section, we describe the theoretical foundations of our proposal. After
giving its main characteristics, we briefly explain how it works.
3.1 Main Characteristics

Reactivity, scalability and fault-tolerance are desirable properties for a VIM
that provides better QoS management.
Keeping that in mind, we made the VIM follow an event-based approach. In
this context, scheduling is started only when it is required, on the reception of
events, leading to better reactivity. This contrasts with more traditional solu-
tions where scheduling is started periodically. This also differs from a predictive
approach, where new schedules are computed in advance to anticipate work-
load fluctuations; this kind of approach requires knowledge of workload profiles,
which is not always available.
An event may be generated each time a virtualized job (vjob) [4] is submitted
or terminates, when a node is overloaded or underloaded, or when a system
administrator wants to put a node into maintenance mode.
Besides relying on events, our VIM is comparable to peer-to-peer systems.
There is no service node; all nodes are equal. Each node can (i) be used to
submit vjobs, (ii) generate events and (iii) try to solve events generated by other
nodes.
A node monitors only its local resources. However, it can get access on-demand
to a partial view of the system by communicating with its neighbors by means of
an overlay network similar to those used to implement distributed hash tables.
To facilitate understanding, we consider that the communication path is a ring.
Accessing a partial view of the system improves scalability (computing and ap-
plying a schedule is faster) while the DHT mechanisms enhance fault-tolerance
(the nodes can continue to communicate transparently even if several of them
crash).

3.2 The Iterative Scheduling Procedure

When a node Ni retrieves its local monitoring information and detects a problem
(e.g. it is overloaded), it starts a new iterative scheduling procedure by generating
an event, reserving itself for the duration of this ISP, and sending the event to
its neighbor, node Ni+1 (cf. Fig. 2).
Node Ni+1 reserves itself, updates node reservations and retrieves monitoring
information on all nodes reserved for this ISP, i.e. on nodes Ni and Ni+1 . It then
computes a new schedule. If it fails, it forwards the event to its neighbor, node
Ni+2 .
Node Ni+2 performs the same operations as node Ni+1 . If the computation of
the new schedule succeeds, node Ni+2 applies it (e.g. by performing appropriate
VM migrations) and finally cancels the reservations, so that nodes Ni , Ni+1 and
Ni+2 are free to take part in another ISP.
Considering that a given node can take part only in one of these iterative
scheduling procedures at a time, several ISPs can occur simultaneously and
independently throughout the infrastructure, thus improving reactivity.
Note that if a node receives an event while it is reserved, it just forwards it
to its neighbor.
Fig. 2. Iterative scheduling procedure
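To illustrate the procedure, the following sketch grows a partition around a ring until a deliberately simplified feasibility test succeeds; the node data, the reservation model and the test are our assumptions, whereas the real prototype delegates scheduling decisions to Entropy.

# Illustrative sketch of the iterative scheduling procedure on a ring of nodes.
nodes = [
    {"name": "N1", "free_cpu": 0.0, "reserved": False},   # overloaded initiator
    {"name": "N2", "free_cpu": 0.1, "reserved": False},
    {"name": "N3", "free_cpu": 0.8, "reserved": False},
]

def iterative_scheduling(nodes, start, needed_cpu):
    """Reserve nodes one by one around the ring until the event can be solved."""
    partition, i = [], start
    while True:
        node = nodes[i % len(nodes)]
        node["reserved"] = True            # exclusive use for this ISP
        partition.append(node)
        if sum(n["free_cpu"] for n in partition) >= needed_cpu:
            for n in partition:            # schedule found: release reservations
                n["reserved"] = False
            return [n["name"] for n in partition]
        if len(partition) == len(nodes):   # every node visited: the ISP fails
            return None
        i += 1                             # forward the event to the neighbor

print(iterative_scheduling(nodes, start=0, needed_cpu=0.5))  # ['N1', 'N2', 'N3']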

4 Implementation
4.1 Current State
We implemented our proposal in Java. The prototype can currently process
‘overloaded node’ and ‘underloaded node’ events; these events are defined by
means of CPU and memory thresholds by the system administrator. Moreover,
the overlay network is a simple ring (cf. Fig. 3) without any fault-tolerance
mechanism, i.e. it cannot recover from a node crash. Furthermore, the prototype
manipulates virtual VMs, i.e. Java objects.

4.2 Node Agent


The VIM is composed of node agents (NA).
There is one NA on every node, each NA being made of a knowledge base, a
resource monitor, a client, a server and a scheduler (cf. Fig. 3).
The knowledge base contains various types of information. Some information
is available permanently: monitoring information about the local node (resources
consumed and VMs hosted), a stub to contact the neighbor, and a list of events
generated by the node. Other information is accessible only during an iterative
scheduling procedure: monitoring information about the nodes reserved (if a
scheduler is running on the node) and a stub to contact the scheduler that tries
to solve the event.
The resource monitor retrieves node monitoring information periodically and
updates the knowledge base accordingly. If it detects a problem (e.g. the node is
overloaded), it starts a new ISP by generating an event, reserving the node for
this ISP and sending the event to the neighbor by means of a client.
Fig. 3. Implementation overview

A client is instantiated on-demand to send a request or a message to a server.


The server processes requests and messages from other nodes. In particular,
it launches a scheduler when it receives an event.
The scheduler first retrieves monitoring information from the nodes taking
part in an ISP. It then tries to solve the corresponding event by computing a
new schedule and applying it, if possible. If the schedule is applied successfully,
the scheduler finally cancels node reservations. The prototype is designed so that
any dynamic VM scheduler may be used to compute and apply a new schedule.
Currently, the prototype relies on Entropy [5], with consolidation as the default
scheduling policy.

5 Experiments
We compared our approach with the Entropy [5] one by means of simulation.
Basically, the simulator injected a random CPU workload into each virtual VM
and waited until the VIM solved all 'overloaded node' issues. Comparison
criteria included the average time to solve an event, the time elapsed from the
load injection until all 'overloaded node' issues were solved, and the cost of the
schedule to apply. This cost is related to the kind of actions to perform on VMs
(e.g. migrations) and to the amount of memory allocated to the VMs that are
manipulated [5].
The experiments were done on a HP Proliant DL165 G7 with 2 CPUs (AMD
Opteron 6164 HE, 12 cores, 1.7 GHz) and 48 GB of RAM. The software stack
was composed of Debian 6/Squeeze, Sun Java VM 6 and Entropy 1.1.1. The
simulated nodes had 2 CPUs (2 GHz) and 4 GB of RAM. The simulated VMs
had 1 virtual CPU (2 GHz) and 1 GB of RAM. The virtual CPU load could take
only one of the following values (in percentage): 0, 20, 40, 60, 80, 100. Entropy
has timeouts to prevent it from spending too much time computing a new schedule;
these timeouts were set to twice the number of nodes considered (in seconds).
Our VIM considers that a node is overloaded if the VMs hosted try to consume
more than 100% of CPU or RAM; it is underloaded if less than 20% of CPU
and less than 50% of RAM are used.
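For illustration, these overload/underload rules can be expressed as follows (our sketch; the VM demand values are hypothetical and expressed as fractions of the node capacity):

# Illustrative encoding of the overload/underload rules described above.
def node_state(vm_demands):
    cpu = sum(d["cpu"] for d in vm_demands)
    ram = sum(d["ram"] for d in vm_demands)
    if cpu > 1.0 or ram > 1.0:                 # demand exceeds 100% of CPU or RAM
        return "overloaded"
    if cpu < 0.2 and ram < 0.5:                # under 20% CPU and under 50% RAM
        return "underloaded"
    return "normal"

print(node_state([{"cpu": 0.6, "ram": 0.25}, {"cpu": 0.6, "ram": 0.25}]))  # overloaded
print(node_state([{"cpu": 0.1, "ram": 0.25}]))                             # underloaded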
As we can see in Table 1, our VIM is more reactive, i.e. it quickly solved
individual events, especially the 'overloaded node' ones. This can be explained
by the fact that our VIM generally considers a small number of nodes, compared
to Entropy. This leads to a smaller cost for applying schedules.

Table 1. Experimental results

                                              128 VM / 64 nodes     256 VM / 128 nodes
                                              DVMS      Entropy     DVMS      Entropy
Iteration length (s)               Avg           83        198        114        475
(time between two iterations)      Std dev       41         56         82         37
                                   Max          232        240        427        489
Time to solve an event (s)         Avg           12        N/A         12        N/A
                                   Std dev       18        N/A         19        N/A
                                   Max          149        N/A        299        N/A
Time to solve an overloaded        Avg            6        N/A          6        N/A
node event (s)                     Std dev       12        N/A         12        N/A
                                   Max           52        N/A         48        N/A
Number of nodes considered         Avg            8         64         10        128
(partition size)                   Std dev        8          0         14          0
                                   Max           60         64        115        128
Maximum cost for applying the      Avg         7134      24405       8479      39977
schedule (arbitrary unit)          Std dev     2690      12798       2756      20689
                                   Max        13312      49152      18432      87040
Percentage of nodes hosting        Avg           55         53         54         53
VMs (%)                            Std dev        2          2          2          2
                                   Max           58         58         59         59

DVMS: Distributed VM Scheduler; Entropy: centralized approach.
Avg: average value, Std dev: standard deviation, Max: worst case.

In detail, the first row shows the iteration length, which corresponds to the
time required to solve all events occurring during one iteration. The second row
gives the time to solve one event, that is, the time between the appearance of the
event and its resolution. The third row focuses on overloaded node events; these
events correspond to QoS violations and must be solved as quickly as possible.
For these two rows, we do not report values for the centralized approach, since it
relies on a periodic scheme: Entropy monitors the configuration at the beginning
of the iteration, analyzes the configuration and applies the schedule at the end.
The fourth row shows the size of each partition, i.e. the number of nodes considered
for scheduling. As we can see in the fifth row, the smaller the partition, the
cheaper the reconfiguration. However, it is worth noting that the values for the
Entropy approach, as previously, account for the total cost of the whole iteration,
whereas the DVMS values account only for the cost of the reconfiguration related
to one event. As a consequence, the sum of the reconfiguration costs in
the DVMS approach can be higher than the corresponding Entropy cost. However,
since we are trying to solve each event as soon as possible, we are interested not
in the global cost but in the cost for one event. Finally, the last
row presents the consolidation rate, which is the percentage of nodes hosting
at least one VM. We can see that, although our approach is more
reactive, it does not negatively impact the consolidation rate.

6 Future Work
Several ways should be explored to improve the prototype, with regard to event
management, fault-tolerance and network topology.

Event Management. Event management could be enhanced by merging iterative
scheduling procedures, rethinking the event definition and implementing other
kinds of events.
Using ISPs can result in deadlocks, as they rely on dynamic partitions of
the system. A deadlock occurs when each node belongs to a partition and all
partitions need to grow, i.e. each ISP needs more nodes to solve the corresponding
event. Deadlocks can be resolved by merging ISPs, which implies merging the
related events and partitions. A basic algorithm was implemented to do that,
but it will not be detailed in this article due to space limitations.
ISP merging can also be used to combine complementary events (e.g. an ‘over-
loaded node’ event with an ‘underloaded node’ one) to make ISPs converge faster,
thus increasing reactivity.
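As a purely illustrative sketch of the idea (the actual merging algorithm implemented in the prototype is not detailed here), merging two ISPs could amount to taking the union of their partitions and concatenating their pending events:

# Hypothetical sketch of merging two ISPs: the article states that merging
# combines the related events and partitions, but does not detail the algorithm.
isp_a = {"events": ["overloaded:N3"],  "partition": {"N3", "N4"}}
isp_b = {"events": ["underloaded:N7"], "partition": {"N7", "N8"}}

def merge_isps(a, b):
    return {"events": a["events"] + b["events"],
            "partition": a["partition"] | b["partition"]}

merged = merge_isps(isp_a, isp_b)
print(merged["events"])             # ['overloaded:N3', 'underloaded:N7']
print(sorted(merged["partition"]))  # ['N3', 'N4', 'N7', 'N8']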
‘Overloaded node’ and ‘underloaded node’ events are currently defined by
means of CPU and memory thresholds. This may not always be relevant. For ex-
ample, if a load balancing policy is used while the global load is low, many
nodes will send ‘underloaded node’ events that cannot be solved. Refining event
definition by taking the neighbors’ load into account may be a solution.
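
As an illustration of such a refinement, the sketch below raises an ‘underloaded
node’ event only when the neighborhood is loaded enough for consolidation to be
plausible; the thresholds and the NodeLoad structure are assumptions made for this
example, not part of the DVMS prototype.

from dataclasses import dataclass
from statistics import mean

CPU_HIGH, CPU_LOW, MEM_HIGH = 0.90, 0.20, 0.90   # assumed utilization thresholds

@dataclass
class NodeLoad:
    cpu: float   # CPU utilization in [0, 1]
    mem: float   # memory utilization in [0, 1]

def raise_event(node, neighbors):
    # An overload means a potential QoS violation and is always reported.
    if node.cpu > CPU_HIGH or node.mem > MEM_HIGH:
        return "overloaded"
    # Only report an underload if the neighborhood is loaded enough that
    # consolidating VMs onto fewer nodes is actually possible.
    if node.cpu < CPU_LOW and neighbors and mean(n.cpu for n in neighbors) > CPU_LOW:
        return "underloaded"
    return None

# A nearly idle node surrounded by equally idle neighbors stays quiet:
print(raise_event(NodeLoad(0.05, 0.30), [NodeLoad(0.10, 0.20), NodeLoad(0.08, 0.25)]))
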
Other kinds of events should be implemented, like those related to vjob
submissions or terminations, or to a node being put into maintenance mode.
Moreover, it may be interesting to take resources other than CPU and memory
into account, like network bandwidth.

Fault-Tolerance. Currently, the VIM is not fault-tolerant: if a node crashes, it
breaks the overlay network. This can be fixed with mechanisms used in DHTs [9].

Network Topology. The current prototype does not take the network topology
into account. However, the knowledge of network bandwidth between each pair
of nodes could lead to faster migrations in a heterogeneous system.

7 Conclusion
In this article, we proposed a new approach to schedule VMs dynamically and
cooperatively in distributed systems, keeping in mind the following objective:
maximizing system utilization while ensuring the quality of service.

We presented the current state of implementation of a prototype and we
evaluated it by means of simulations in order to compare our approach with the
centralized one. Preliminary results were encouraging and showed that our
solution was more reactive and scalable.
On-going work has focused on performing larger-scale simulations and on
evaluating the prototype with real VMs. Future work will be done with regard
to event management, fault-tolerance and network topology. This work fits into
a broader project that seeks to implement a framework for managing VMs in
distributed systems the same way an OS manages processes on a local machine.

Acknowledgments. Experiments presented in this paper were carried out using
the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN
development action with support from CNRS, RENATER and several Universities
as well as other funding bodies (see https://www.grid5000.fr).

References

1. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I.,
Warfield, A.: Live migration of virtual machines. In: NSDI 2005: Proceedings of the
2nd Conference on Symposium on Networked Systems Design and Implementation,
NSDI 2005, pp. 273–286. USENIX Association, Berkeley (2005)
2. Etsion, Y., Tsafrir, D.: A Short Survey of Commercial Cluster Batch Schedulers.
Tech. rep., The Hebrew University of Jerusalem, Jerusalem, Israel (May 2005)
3. Feller, E., Rilling, L., Morin, C., Lottiaux, R., Leprince, D.: Snooze: A Scalable,
Fault-Tolerant and Distributed Consolidation Manager for Large-Scale Clusters.
Tech. rep., INRIA Rennes, Rennes, France (September 2010)
4. Hermenier, F., Lebre, A., Menaud, J.M.: Cluster-Wide Context Switch of Virtu-
alized Jobs. In: VTDC 2010: Proceedings of the 4th International Workshop on
Virtualization Technologies in Distributed Computing. ACM, New York (2010)
5. Hermenier, F., Lorca, X., Menaud, J.M., Muller, G., Lawall, J.: Entropy: a consoli-
dation manager for clusters. In: Hosking, A.L., Bacon, D.F., Krieger, O. (eds.) VEE
2009: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference
on Virtual Execution Environments, pp. 41–50. ACM, New York (2009)
6. Hoffa, C., Mehta, G., Freeman, T., Deelman, E., Keahey, K., Berriman, B., Good,
J.: On the Use of Cloud Computing for Scientific Workflows. In: ESCIENCE
2008: Proceedings of the 2008 Fourth IEEE International Conference on eScience,
pp. 640–645. IEEE Computer Society, Washington, DC (2008)
7. Lottiaux, R., Gallard, P., Vallee, G., Morin, C., Boissinot, B.: OpenMosix, OpenSSI
and Kerrighed: a comparative study. In: CCGRID 2005: Proceedings of the Fifth
IEEE International Symposium on Cluster Computing and the Grid, vol. 2,
pp. 1016–1023. IEEE Computer Society, Washington, DC (2005)
8. Lowe, S.: Introducing VMware vSphere 4, 1st edn. Wiley Publishing Inc., Indi-
anapolis (2009)
9. Milojicic, D.S., Kalogeraki, V., Lukose, R., Nagaraja, K., Pruyne, J., Richard, B.,
Rollins, S., Xu, Z.: Peer-to-Peer Computing. Tech. rep., HP Laboratories, Palo
Alto, CA, USA (July 2003)
10. Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L.,
Zagorodnov, D.: The Eucalyptus Open-Source Cloud-Computing System. In: Cap-
pello, F., Wang, C.L., Buyya, R. (eds.) CCGRID 2009: Proceedings of the 2009
9th IEEE/ACM International Symposium on Cluster Computing and the Grid,
pp. 124–131. IEEE Computer Society, Washington, DC (2009)
11. Perera, S., Gannon, D.: Enforcing User-Defined Management Logic in Large Scale
Systems. In: Services 2009: Proceedings of the 2009 Congress on Services - I,
pp. 243–250. IEEE Computer Society, Washington, DC (2009)
12. Quesnel, F., Lebre, A.: Operating Systems and Virtualization Frameworks: From
Local to Distributed Similarities. In: Cotronis, Y., Danelutto, M., Papadopoulos,
G.A. (eds.) PDP 2011: Proceedings of the 19th Euromicro International Confer-
ence on Parallel, Distributed and Network-Based Computing, pp. 495–502. IEEE
Computer Society, Los Alamitos (2011)
13. Rouzaud-Cornabas, J.: A Distributed and Collaborative Dynamic Load Balancer
for Virtual Machine. In: Guarracino, M.R., Vivien, F., Träff, J.L., Cannatoro, M.,
Danelutto, M., Hast, A., Perla, F., Knüpfer, A., Di Martino, B., Alexander, M.
(eds.) Euro-Par-Workshop 2010. LNCS, vol. 6586, pp. 641–648. Springer, Heidel-
berg (2011)
14. Smith, J.E., Nair, R.: The Architecture of Virtual Machines. Computer 38(5),
32–38 (2005)
15. Sotomayor, B., Montero, R.S., Llorente, I.M., Foster, I.: Virtual Infrastructure
Management in Private and Hybrid Clouds. IEEE Internet Computing 13(5),
14–22 (2009)
16. Xu, J., Zhao, M., Fortes, J.A.B.: Cooperative Autonomic Management in Dynamic
Distributed Systems. In: Guerraoui, R., Petit, F. (eds.) SSS 2009. LNCS, vol. 5873,
pp. 756–770. Springer, Heidelberg (2009)
17. Yazir, Y.O., Matthews, C., Farahbod, R., Neville, S., Guitouni, A., Ganti, S.,
Coady, Y.: Dynamic Resource Allocation in Computing Clouds Using Distributed
Multiple Criteria Decision Analysis. In: Cloud 2010: IEEE 3rd International Con-
ference on Cloud Computing, pp. 91–98. IEEE Computer Society, Los Alamitos
(2010)
Large-Scale DNA Sequence Analysis
in the Cloud: A Stream-Based Approach

Romeo Kienzler1, Rémy Bruggmann2, Anand Ranganathan3, and Nesime Tatbul1

1 Department of Computer Science, ETH Zurich, Switzerland
  romeok@student.ethz.ch, tatbul@inf.ethz.ch
2 Bioinformatics, Department of Biology, University of Berne, Switzerland
  remy.bruggmann@biology.unibe.ch
3 IBM T.J. Watson Research Center, NY, USA
  arangana@us.ibm.com

Abstract. Cloud computing technologies have made it possible to analyze big
data sets in scalable and cost-effective ways. DNA sequence analysis, where very
large data sets are now generated at reduced cost using the Next-Generation
Sequencing (NGS) methods, is an area which can greatly benefit from cloud-based
infrastructures. Although existing solutions show nearly linear scalability, they
pose significant limitations in terms of data transfer latencies and cloud storage
costs. In this paper, we propose to tackle the performance problems that arise
from having to transfer large amounts of data between clients and the cloud based
on a streaming data management architecture. Our approach provides an
incremental data processing model which can hide data transfer latencies while
maintaining linear scalability. We present an initial implementation and
evaluation of this approach for SHRiMP, a well-known software package for NGS
read alignment, based on the IBM InfoSphere Streams computing platform deployed
on Amazon EC2.

Keywords: DNA sequence analysis, Next-Generation Sequencing (NGS), NGS read
alignment, cloud computing, data stream processing, incremental data processing.

1 Introduction
Today, huge amounts of data are being generated at ever-increasing rates by
a wide range of sources, from networks of sensing devices to social media and
special scientific devices such as DNA sequencing machines and astronomical
telescopes. It has become both an exciting opportunity to use these data sets
in intelligent applications, such as detecting and preventing diseases or spotting
business trends, and a major challenge to manage their capture, transfer,
storage, and analysis.
Recent advances in cloud computing technologies have made it possible to
analyze very large data sets in scalable and cost-effective ways. Various platforms
and frameworks have been proposed to be able to use the cloud infrastructures
for solving this problem such as the MapReduce framework [2], [4]. Most of
these solutions are primarily designed for batch processing of data stored in a
distributed file system. While such a design supports scalable and fault-tolerant
processing very well, it may pose some limitations when transferring data. More
specifically, large amounts of data have to be uploaded into the cloud before the
processing starts, which not only causes significant data transfer latencies, but
also adds to the cloud storage costs [19], [26].
In this short paper, we mainly investigate the performance problems that
arise from having to transfer large amounts of data in and out of the cloud
based on a real data-intensive use case from bioinformatics, for which we pro-
pose a stream-based approach as a promising solution. Our key idea is that data
transfer latencies can be hidden by providing an incremental data processing
architecture, similar in spirit to pipelined query evaluation models in traditional
database systems [15]. It is important though that this is done in a way to also
support linear scalability through parallel processing, which is an indispensable
requirement for handling data and compute-intensive workloads in the cloud.
More specifically, we propose to use a stream-based data management archi-
tecture, which not only provides an incremental and parallel data processing
model, but also facilitates in-memory processing, since data is processed on the
fly and intermediate data need not be materialized on disk (unless it is explicitly
needed by the application), which can further reduce end-to-end response time
and cloud storage costs.
The rest of this paper is outlined as follows: In Section 2, we describe our use
case for large-scale DNA sequence analysis which has been the main motivation
for the work presented in this paper. We present our stream-based solution
approach in Section 3, including an initial implementation and evaluation of our
use case based on the IBM InfoSphere Streams computing platform [5] deployed
on Amazon EC2 [1]. Finally, we conclude with a discussion of future work in
Section 4.

2 Large-Scale DNA Sequence Analysis


Determining the order of the nucleotide bases in DNA molecules and analyzing
the resulting sequences have become essential in biological research and
applications. Since the 1970s, the Sanger method (also known as the dideoxy or
chain-terminator method) had been the standard technique [22]. With this method,
it is possible to read about 80 kilo base pairs (bp) per instrument-day at a total
cost of $150. The Next-Generation Sequencing (NGS) methods, invented in 2004,
dramatically increased this per-day bp throughput, and therefore the amount of
generated data that needs to be stored and processed [27]. Compared with the
Sanger method above, with NGS the cost for sequencing 80 kbp has fallen to less
than $0.01 and the sequencing is done in less than 10 seconds. Table 1 shows an overview
of speed and cost of three different NGS technologies compared to the Sanger
method. The higher throughput and lower cost of these technologies have led
to the generation of very large datasets that need to be efficiently analyzed. As
stated by Stein [25]:
“Genome biologists will have to start acting like the high energy physicists,
who filter the huge datasets coming out of their collectors for a tiny number
of informative events and then discard the rest.”

Table 1. Compared to the Sanger method, NGS methods have significantly higher
throughput at a fraction of the cost

              Sanger       Roche 454   Illumina 2k   SOLID 5
read length   700-900      500         100           75
GB per day    0.00008      0.5         25            42
cost per GB   $2,000,000   $20,000     $75           $75

NGS is used to sequence DNA in an automated and high-throughput process.
DNA molecules are fragmented into pieces of 100 to 800 bp, and digital versions
of the DNA fragments are generated. These fragments, called reads, originate from
random positions of the DNA molecules. In re-sequencing experiments, the reads
are mapped back to a reference genome (e.g., human) [19] or, without a reference
genome, they can be assembled de novo [23]. However, de novo assembly is more
complex due to the short read length as well as potential repetitive regions
in the genome. In re-sequencing experiments, polymorphisms between the analyzed
DNA and the reference genome can be observed. A polymorphism of a single bp
is called a Single Nucleotide Polymorphism (SNP) and is recognized as the main
cause of human genetic variability [9]. Figure 1 shows an example, with a
reference genome in the top row and two SNPs identified on the analyzed DNA
sequences depicted below. As stated by Fernald et al., once NGS technology
becomes available at the clinical level, it will become part of the standard
healthcare process to check patients’ SNPs before medical treatment (a.k.a.,
“personalized medicine”) [12]:
“We are on the verge of the genomic era: doctors and patients will have
access to genetic data to customize medical treatment.”
Aligning NGS reads to genomes is computationally intensive. Li et al. give an
overview of the algorithms and tools currently in use [19]. To align reads
containing SNPs, probabilistic algorithms have to be used, since finding an exact
match between reads and a given reference is not sufficient because of
polymorphisms and sequencing errors. Most of these algorithms are based on a
basic pattern called seed and extend [8], where small matching regions between
reads and the reference genome are identified first (seeding) and then further
extended. Additionally, to be able to identify seeds that contain SNPs, a special
algorithm that allows for a certain difference during seeding needs to be used
[16]. Unfortunately, this adaptation further increases the computational
complexity. For example, on a small cluster used by FGCZ [3] (25 nodes with a
total of 232 CPU compute cores and 800 GB main memory), a single genome alignment
process can take up to 10 hours.
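
To make the seed-and-extend pattern concrete, the following toy Python sketch
indexes the k-mers of a reference, locates exact seeds and extends each candidate
position over the full read while tolerating a bounded number of mismatches. It is
only an illustration of the basic pattern, not the algorithm used by SHRiMP or any
production aligner.

from collections import defaultdict

def build_index(reference, k):
    # Hash index of all k-mers of the reference, used for exact seeding.
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align(read, reference, index, k, max_mismatches):
    hits = set()
    for offset in range(len(read) - k + 1):                 # seeding
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset                            # candidate alignment start
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]     # extension
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:                # tolerate SNPs / errors
                hits.add((start, mismatches))
    return sorted(hits, key=lambda h: h[1])                 # fewest mismatches first

reference = "ACGTACGTGACCTTAGCACGT"
index = build_index(reference, k=4)
print(align("GACCTTTGC", reference, index, k=4, max_mismatches=2))   # [(8, 1)]
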
Fig. 1. SNP identification: The top row shows a subsequence of the reference genome.
The following rows are aligned NGS reads. Two SNPs can be identified. T is replaced
by C (7th column) and C is replaced by T (25th column). In one read (line 7), a
sequencing error can be observed where A has been replaced by G (last column).
Source: http://bioinf.scri.ac.uk/tablet/.

Read alignment algorithms have been shown to have a great potential for
linear scalability [24]. However, sequencing throughput increases faster than
computational power and storage size [25]. As a result, although NGS machines
are becoming cheaper, using dedicated compute clusters for read alignment is
still a significant investment. Fortunately, even small labs can do the alignment
by using cloud resources [11]. Li et al. state that cloud computing might be a
possible solution for small labs, but also raise concerns about data transfer
bottlenecks and storage costs [19]. Thus, existing cloud-based solutions such as
CloudBurst [24] and Crossbow [17], as well as the cloud-enabled version of Galaxy
[14], have a common disadvantage: before processing starts, large amounts of data
have to be uploaded into the cloud, potentially causing significant data transfer
latency and storage costs [26].
In this work, our main focus is to develop solutions for the performance prob-
lems that stem from having to transfer large amounts of data in and out of the
cloud for data-intensive use cases such as the one described above. If we roughly
capture the overall processing time with a function f(n, s) ∝ c·s + s/n, where n is
the number of CPU cores, s is the problem size, and c is a constant for the data
transfer rate between a client and the cloud, our main goal is to bring down the
first component (c·s) in this formula. (The problem size for NGS read alignment
depends on a number of factors, including the number of reads to be aligned, the
size of the reference genome, and the “fuzziness” of the alignment algorithm.)
Furthermore, we would like to do it in a way that supports linear scalability. In
the next section, we will present the solution that we propose, together with an
initial evaluation study which indicates that our approach is a promising one.

3 A Stream-Based Approach

In the following, we first present our stream-based approach in general terms,
and then describe how we applied it to a specific NGS read alignment use case
together with results of a preliminary evaluation study.

3.1 Incremental Data Processing with an SPE

We propose to use a stream-based data management platform in order to reduce
the total processing time of data-intensive applications deployed in the cloud
by eliminating their data transfer latencies. Our main motivation to do so is to
exploit the incremental and in-memory data processing model of Stream Pro-
cessing Engines (SPEs) (such as the Borealis engine [7] or the IBM InfoSphere
Streams (or Streams for short) engine [13]).
SPEs have been primarily designed to provide low-latency processing over
continuous streams of time-sensitive data from push-based data sources. Appli-
cations are built by defining directed acyclic graphs, where nodes represent oper-
ators and edges represent the dataflows between them. Operators transform data
between their inputs and outputs, working on finite chunks of data sequences
(a.k.a., sliding windows). SPEs provide query algebras with a well-defined set
of commonly used operators, which can be easily extended with custom, user-
defined operators. There are also special operators/adapters for supporting ac-
cess to a variety of data sources including files, sockets, and databases. Once an
application is defined, SPEs take care of all system-level requirements to exe-
cute it in a correct and efficient way such as interprocess communication, data
partitioning, operator distribution, fault tolerance, and dynamic scaling.
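
As a language-neutral illustration of this operator/dataflow model (plain Python
generators, not SPL, the development language of InfoSphere Streams), the sketch
below chains a source, a per-tuple transformation and a windowed aggregate, so
that each tuple flows through the graph as soon as it is produced.

def source(lines):
    # Source operator: emits tuples one by one (push-based arrival).
    for line in lines:
        yield line.strip()

def uppercase(stream):
    # Transform operator: processes each tuple as soon as it arrives.
    for tup in stream:
        yield tup.upper()

def window_count(stream, size):
    # Windowed operator: emits an aggregate for every `size` consecutive tuples.
    window = []
    for tup in stream:
        window.append(tup)
        if len(window) == size:
            yield len(window), window[-1]    # (count, last tuple) of the window
            window.clear()

data = ["acgt", "ttga", "ccat", "gggt"]
for result in window_count(uppercase(source(data)), size=2):
    print(result)    # results appear incrementally, window by window
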
In our approach, we do not provide any new algorithms; instead, we provide an
SPE-based platform to bring existing algorithms/software into the cloud in a
way that they can work with their input data in an incremental fashion. One
generic way of doing this is to use the command line tools provided by most of
these software packages. For example, in the NGS software packages that we
looked at, we have so far seen two types of command line tools: those that are
able to read from and write to standard Unix pipes, and those that cannot. We
build custom streaming operators by wrapping the Unix processes. If standard
Unix pipe communication is supported, one thread feeds the Unix process with
the incoming data stream while a second thread reads the results. Otherwise,
data is written in chunks to files residing on an in-memory file system. For each
chunk, the Unix process is run once and the produced output data is read and
passed on to the next operator as a data stream.
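
A minimal Python sketch of this pipe-based wrapping pattern is shown below; the
real operators are implemented inside InfoSphere Streams, and the placeholder
command (cat) merely stands in for a pipe-capable aligner that reads from stdin
and writes alignments to stdout.

import subprocess
import threading

def stream_through(command, input_stream):
    # One thread feeds the wrapped process; the caller reads results as they appear.
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    def feeder():
        for record in input_stream:          # push records as soon as they arrive
            proc.stdin.write(record + "\n")
        proc.stdin.close()                   # signal end of the input stream
    threading.Thread(target=feeder, daemon=True).start()
    for line in proc.stdout:                 # results are consumed incrementally
        yield line.rstrip("\n")
    proc.wait()

reads = ("READ_%d ACGTACGT" % i for i in range(3))
for alignment in stream_through(["cat"], reads):   # 'cat' stands in for an aligner
    print(alignment)
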

Fig. 2. Using SHRiMP on multiple nodes as a standalone application requires splitting
the raw NGS read data into equal-sized chunks, transferring them to multiple cloud
nodes, running SHRiMP in parallel, copying the results back to the client, and finally
merging them into a single file

Fig. 3. With our stream-based approach, the client streams the reads into the cloud,
where they instantly get mapped to a reference genome and results are immediately
streamed back to the client

Figure 2 and Figure 3 contrast how data-intensive applications are typically
being deployed in the cloud today vs. how they could be deployed using our
approach, respectively. Although the figures illustrate our NGS read alignment
use case specifically, the architectural and conceptual differences apply in general.

3.2 Use Case Implementation


We now describe how we implemented our approach for a well-known NGS read
alignment software package called SHRiMP [21] using IBM InfoSphere Streams
[5] as our SPE and Amazon EC2 [1] as our cloud computing platform.

Fig. 4. Operator and dataflow graph for our stream-based incremental processing im-
plementation of SHRiMP

Figure 4 shows a detailed data flow graph of our implementation. A client ap-
plication implemented in Java compresses and streams raw NGS read data into
the cloud, where a master Streams node first receives it. At the master node, the
read stream gets uncompressed by an Uncompress operator and is then fed into a
TCPSource operator. In order to be able to run parallel instances of SHRiMP for
increased scalability, TCPSource operator feeds the stream into a ThreadedSplit
operator. ThreadedSplit is aware of the data input rates that can be handled by
its downstream operators, and therefore, it can provide an optimal load distri-
bution. The number of substreams that ThreadedSplit generates determines the
number of processing (i.e., slave) nodes in the compute cluster, each of which
will run a SHRiMP instance. SHRiMP instances are created by instantiating a
custom Streams operator using standard Unix pipes. The resulting aligned read
data (in the form of SAM output [6]) on different SHRiMP nodes are merged
by the master node using a Merge operator. Then a TCPSink operator passes
the output stream to a Compress operator, which ensures that results are sent
back to the client application in compact form, where they should be uncom-
pressed again before being presented to the user. The whole chain, including the
compression stages, is fully incremental.
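
For illustration, the following sketch outlines what the client side of such a
chain could look like: it compresses the raw reads, streams them over TCP and
consumes the returned alignment records. The host name, port and zlib framing are
hypothetical; the actual client is a Java command line tool.

import socket
import zlib

def stream_reads(read_file, host="streams-master.example.org", port=9000):
    # Note: a fully incremental client would consume results in a separate thread
    # while still uploading; for brevity this sketch reads after the upload ends.
    comp, decomp = zlib.compressobj(), zlib.decompressobj()
    with socket.create_connection((host, port)) as sock:
        with open(read_file, "rb") as raw:
            for chunk in iter(lambda: raw.read(64 * 1024), b""):
                sock.sendall(comp.compress(chunk))       # compressed, chunk by chunk
        sock.sendall(comp.flush())
        sock.shutdown(socket.SHUT_WR)                    # signal end of input stream
        for data in iter(lambda: sock.recv(64 * 1024), b""):
            yield decomp.decompress(data)                # aligned (SAM) output bytes

# Usage (hypothetical): for sam_chunk in stream_reads("reads.fastq"): print(sam_chunk)
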

3.3 Initial Evaluation


In this section, we present an initial evaluation of our approach on the imple-
mented use case in terms of scalability, costs, and ease of use. For scalability, we
have done an experiment that compares the two architectures shown in Figure 2
and Figure 3. In the experiment, we have aligned 30000 reads of Streptococcus
suis, an important pathogen of pigs, against its reference genome. Doing this on
a single Amazon EC2 m1.large instance takes around 28 minutes. In order to
be able to project this to analyzing more complicated organisms (like humans),
we have scaled all our results up by a factor of 60 (e.g., 28 hours instead of
28 minutes). In all cases, data is compressed before being transferred into the
cloud. To serve as a reference point, assuming a broadband Internet connection,
transferring the compressed input data set into the cloud takes about 90 minutes.

Fig. 5. At a cluster with size of 4 nodes and above, the stream-based solution incurs
less total processing time than the standalone application. This is because data transfer
time always adds up to the curve of the standalone application.

Scalability. Figure 5 shows the result of our scalability experiment. The bottom
flat line corresponds to the data transfer time of 90 minutes for our specific input
dataset. This time is included in the SHRiMP standalone curve, where input data
has to be uploaded into the cloud in advance. On the other hand, the stream-
based approach does not transfer any data in advance, thus does not incur this
additional latency. Both approaches show linear scalability in total processing
time as the number of Amazon EC2 nodes is increased. Up to 4 nodes, the
standalone approach takes less processing time. However, we see that as the
cluster size increases beyond this value, the relative effect of the initial data
transfer latency for the standalone approach starts to show itself, reaching
almost a 30-minute difference in processing time over the stream-based approach
for the 16-node setup. We expect this difference to become even more significant
as the input dataset size and the cluster size further increase.

Costs. As our solution allows data processing to start as soon as the data arrives
in the cloud, we can show that the constant c in the formula f(n, s) ∝ c·s + s/n
introduced in the previous section can be brought to nearly zero, leading to
f(n, s) ∝ s/n for the overall data processing time. Since we have shown linear
scale-out, we can calculate the CPU cost using p(n, s) ∝ n·f(n, s) ∝ n·(s/n) ∝ s.
Since the cost ends up being dependent only on the problem size, one can minimize
the processing time f(n, s) by maximizing n without any significant effect on
the cost. Data transfer and storage costs are relatively small in comparison
to the CPU cost; therefore, we have ignored them in this initial cost analysis.
Nevertheless, it is not difficult to see that these costs will also decrease with our
stream-based approach.
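
The following small numerical sketch makes this argument concrete; the transfer
constant, the problem size and the per-node-hour price are hypothetical
placeholders rather than values measured in the experiments.

C = 1.5          # hours of transfer per unit of problem size (assumed)
S = 1.0          # problem size in arbitrary units (assumed)
PRICE = 0.34     # dollars per node-hour (assumed)

def standalone_time(n):   # upload first, then compute: f(n, s) = c*s + s/n
    return C * S + S / n

def streamed_time(n):     # transfer hidden by incremental processing: f(n, s) = s/n
    return S / n

for n in (1, 4, 16):
    # CPU cost p(n, s) = n * f(n, s); for the streamed case it stays proportional
    # to s, so adding nodes shortens the run without raising the bill.
    print("n=%2d  standalone=%5.2fh  streamed=%5.2fh  streamed cost=$%.2f"
          % (n, standalone_time(n), streamed_time(n), n * streamed_time(n) * PRICE))
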

Ease of Use. Our client, a command line tool, behaves exactly the same way as
a command line tool for any read alignment software package. Therefore, existing
data processing chains can be sped up by simply replacing the existing aligner
with our client without changing anything else. Even flexible and more complex
bioinformatics data processing engines (e.g., Galaxy [14] or Pegasus [10]) can be
transparently enhanced by simply replacing the original data processing stages
with our solution.

4 Conclusions and Future Work


In this paper, we proposed a stream-based approach to bringing data- and CPU-
intensive applications into the cloud without transferring data in advance. We
applied this idea to a large-scale DNA sequence analysis use case and showed
that overall processing time can be significantly reduced, while providing linear
scalability, reduced monetary costs, and ease of use.
We would like to extend this work along several directions. At the moment,
only SHRiMP [21] and Bowtie [18] have been enabled to run on our system. We
would like to explore other algorithms (e.g., SNP callers [20]) that can benefit
from our solution. As some of these presuppose sorted input, this will be an
additional challenge that we need to handle. Furthermore, we would like to
take a closer look at recent work on turning MapReduce into an incremental
framework and compare those approaches with our stream-based approach. Last
but not least, we will explore how fault-tolerance techniques in stream processing
can be utilized to make our solution more robust and reliable.

Acknowledgements. This work has been supported in part by an IBM faculty award.

References
1. Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/
2. Apache Hadoop, http://hadoop.apache.org/
3. Functional Genomics Center Zurich, http://www.fgcz.ch/
4. Google MapReduce, http://labs.google.com/papers/mapreduce.html
5. IBM InfoSphere Streams,
http://www.ibm.com/software/data/infosphere/streams
6. The SAM Format Specification, samtools.sourceforge.net/SAM1.pdf
7. Abadi, D., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J.,
Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., Zdonik, S.:
The Design of the Borealis Stream Processing Engine. In: Conference on Innovative
Data Systems Research (CIDR 2005), Asilomar, CA (January 2005)
8. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic Local Align-
ment Search Tool. Journal of Molecular Biology 215(3) (October 1990)
9. Collins, F.S., Guyer, M., Chakravarti, A.: Variations on a Theme: Cataloging Hu-
man DNA Sequence Variation. Science 278(5343) (November 1997)
10. Deelman, E., Mehta, G., Singh, G., Su, M., Vahi, K.: Pegasus: mapping large-scale
workflows to distributed resources. In: Workflows for e-Science, pp. 376–394 (2007)
11. Dudley, J.T., Butte, A.J.: In Silico Research in the Era of Cloud Computing. Nature
Biotechnology 28(11) (2010)
12. Fernald, G.H., Capriotti, E., Daneshjou, R., Karczewski, K.J., Altman, R.B.: Bioin-
formatics Challenges for Personalized Medicine. Bioinformatics 27(13) (July 2011)
13. Gedik, B., Andrade, H., Wu, K.L., Yu, P.S., Doo, M.: SPADE: The System S
Declarative Stream Processing Engine. In: ACM SIGMOD Conference, Vancouver,
BC, Canada (June 2008)
14. Goecks, J., Nekrutenko, A., Taylor, J., Team, G.: Galaxy: A Comprehensive Ap-
proach for Supporting Accessible, Reproducible, and Transparent Computational
Research in the Life Sciences. Genome Biology 11(8) (2010)
15. Graefe, G.: Query Evaluation Techniques for Large Databases. ACM Computing
Surveys 25(2) (June 1993)
16. Keich, U., Ming, L., Ma, B., Tromp, J.: On Spaced Seeds for Similarity Search.
Discrete Applied Mathematics 138(3) (April 2004)
17. Langmead, B., Schatz, M.C., Lin, J., Pop, M., Salzberg, S.L.: Searching for SNPs
with Cloud Computing. Genome Biology 10(11) (2009)
18. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and Memory-efficient
Alignment of Short DNA Sequences to the Human Genome. Genome Biology 10(3)
(2009)
19. Li, H., Homer, N.: A Survey of Sequence Alignment Algorithms for Next-
Generation Sequencing. Briefings in Bioinformatics 11(5) (September 2010)
20. Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., Wang, J.: SNP Detec-
tion for Massively Parallel Whole-Genome Resequencing. Genome Research 19(6)
(June 2009)
21. Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A., Brudno, M.:
SHRiMP: Accurate Mapping of Short Color-space Reads. PLOS Computational
Biology 5(5) (May 2009)
22. Sanger, F., Coulson, A.R.: A Rapid Method for Determining Sequences in DNA by
Primed Synthesis with DNA Polymerase. Journal of Mol. Biol. 94(3) (May 1975)
23. Schatz, M., Delcher, A., Salzberg, S.: Assembly of large genomes using second-
generation sequencing. Genome Research 20(9), 1165 (2010)
24. Schatz, M.C.: CloudBurst: Highly Sensitive Read Mapping with MapReduce.
Bioinformatics 25(11) (June 2009)
25. Stein, L.D.: The Case for Cloud Computing in Genome Informatics. Genome Bi-
ology 11(5) (2010)
26. Viedma, G., Olias, A., Parsons, P.: Genomics Processing in the Cloud. International
Science Grid This Week (February 2011),
http://www.isgtw.org/feature/genomics-processing-cloud
27. Voelkerding, K.V., Dames, S.A., Durtschi, J.D.: Next-Generation Sequencing:
From Basic Research to Diagnostics. Clinical Chemistry 55(4) (February 2009)
Author Index

Abad-Grau, Marı́a M. II-33 Cabel, Tristan II-355


Aktulga, Hasan Metin I-305 Cannataro, Mario II-1, II-43
Aldinucci, Marco II-3 Cantiello, Pasquale II-188
Alexander, Michael II-385 Cardellini, Valeria I-367
Ali, Qasim I-213 Carlini, Emanuele I-159
Altenfeld, Ralph II-198 Carlson, Trevor II-272
Anedda, Paolo II-446 Carrington, Laura II-178
Appleton, Owen I-64, II-53, II-64 Carrión, Abel I-25
Arabnejad, Hamid I-440 Çatalyürek, Ümit V. I-305
Aragiorgis, Dimitris II-407 Charles, Joseph II-355
Arnold, Dorian II-302 Chen, F. II-231
Aversa, Rocco II-106 Chen, Ting II-23
Badia, Rosa M. I-25 Chiara, Rosario De I-460
Bahi, Jacques M. I-471 Cicotti, Giuseppe I-15
Baker, Chris I-315 Ciżnicki, Milosz I-481
Balis, Bartosz II-76 Clarke, David I-450
Barbieri, Davide I-367 Cockshott, Paul W. I-260
Barbosa, Jorge G. I-440 Coppo, Mario II-3
Bataller, Jordi I-502 Coppola, Massimo I-159
Baude, Françoise I-115 Coppolino, Luigi I-15
Belloum, Adam S.Z. II-53, II-64, II-116 Cordasco, Gennaro I-460
Benkner, Siegfried I-54 Cordier, Hélène II-55
Bertolli, Carlo I-139, I-191 Corsaro, Stefania I-293
Berzins, Martin I-324 Costache, Stefania II-426
Besseron, Xavier II-312, II-322 Couturier, Raphaël I-471
Betts, Adam I-191 Cristaldi, Rosario I-15
Biersdorff, Scott II-156 Cruz, Imanol Padillo I-83
Bisbal, Jesus I-54 Cuomo, A. I-94
Bischof, Christian II-198 Cushing, Reginald II-116
Blanchard, Sean II-282
Blażewicz, Marek I-481 D’Ambra, Pasqua I-293
Bode, Arndt II-345, II-375 Damiani, Ferruccio II-3
Boku, Taisuke I-429 Dandapanthula, N. II-166
Boman, Erik I-315 Danelutto, Marco I-113, I-128
Bosilca, George I-417 D’Antonio, Salvatore I-15
Braby, Ryan II-211 Dazzi, Patrizio I-159
Brandt, J. II-231 DeBardeleben, Nathan II-282
Bridges, Patrick G. II-241, II-302 Deelman, Ewa II-23
Briffaut, J. II-416 Desprez, Frédéric I-113
Brightwell, Ron II-166, II-241 Didona, Diego I-45
Bruggmann, Rémy II-467 Di Martino, Beniamino I-1, II-106,
Bubak, Marian II-76, II-116 II-188
Bungartz, Hans-Joachim II-375 Dongarra, Jack II-436
Buyske, Steven II-23 Drocco, Maurizio II-3
Duff, Iain S. I-295 Heroux, Mike I-315


Dünnweber, Jan I-408 Hoemmen, Mark II-241
Honda, Michio II-335
Eeckhout, Lieven II-272 Horikawa, Tetsuro II-335
Engelbrecht, Gerhard I-54 Hose, Rod D. I-54
Engelmann, Christian I-234, II-251 Hupca, Ioan Ovidiu I-355
Espert, Ignacio Blanquer I-25
Ezell, Matthew II-211 Ibtesham, Dewan II-302
Ilic, Aleksandar I-491
Fahringer, Thomas I-169 Iwainsky, Christian II-198
Falcou, Joel I-355
Feldhoff, Kim II-137 Jeanvoine, Emmanuel I-387
Ferreira, Kurt B. II-221, II-241, Jokanovic, Ana II-262
II-251, II-302
Ferschl, Gábor I-83 Kandalla, K. II-166
Fiala, David II-251 Kaniovskyi, Yuriy I-54
Filippone, Salvatore I-367 Kaya, Kamer I-334
Forsell, Martti I-245 Keceli, Fuat I-249
Fortiş, Teodor-Florin I-83 Keir, Paul I-260
Fu, Song II-282 Keiter, Eric I-315
Kelly, Paul I-191
Gabriel, Edgar I-511 Khodja, Lilia Ziane I-471
Gaggero, Massimo II-446 Kienzler, Romeo II-467
Galizia, Antonella II-96 Kilpatrick, P. I-128
Gautier, Thierry II-322 Kim, Hwanju II-387
Gentile, A. II-231 Kim, Sangwook II-387
Gerndt, Michael II-135, II-146 Kiriansky, Vladimir I-213
Getov, Vladimir I-113 Kitowski, Jacek II-76
Giles, Mike I-191 Klein, Cristian I-117
Gimenez, Judit I-511 Klemm, Michael II-375
Glettler, René I-408 Knittl, Silvia II-124
Glinka, Frank I-149 Knowles, James A. II-23
Gogouvitis, Spyridon V. I-35 Kocot, Joanna II-64
Gorlatch, Sergei I-149 Koehler, Martin I-54
Grasso, Luigi II-33 Kolodner, Elliot K. I-35
Greenwood, Zeno Dixon II-292 Kopta, Piotr I-481
Grigori, Laura I-355 Kortas, Samuel II-426
Guan, Qiang II-282 Koulouzis, Spiros II-116
Gustedt, Jens I-387 Kousiouris, George I-35
Guzzi, Pietro Hiram II-43 Kovatch, Patricia II-211
Koziris, Nectarios II-398, II-407
Haitof, Houssam I-73 Kozyri, Elisavet II-398
Harmer, Terence I-104 Krzikalla, Olaf II-137
Hast, Anders II-333 Kurowski, Krzysztof I-481
Hecht, Daniel I-223 Kyriazis, Dimosthenis I-35
Heikkurinen, Matti I-64, II-64
Heinecke, Alexander II-375 Labarta, Jesus II-262
Heirman, Wim II-272 Lanteri, Stéphane II-355
Hernández, Vicente I-25 Lastovetsky, Alexey I-450
Heroux, Michael A. II-241 Laurenzano, Michael A. II-178
Leangsuksun, Chokchai (Box) II-209, Nanos, Anastassios II-398, II-407


II-231, II-292 Naughton, Thomas I-211, I-234
Lèbre, Adrien II-446, II-457 Németh, Zsolt I-181
Lee, Chee Wai II-156 Ng, Esmond G. I-305
Lee, Jinpil I-429 Nicod, Jean-Marc I-419
Lee, Joonwon II-387 Nikoleris, Nikos II-398
Lefebvre, E. II-416
Leser, Ulf II-13 Odajima, Tetsuya I-429
Lezzi, Daniele I-25 Oleynik, Yury II-146
Lichocki, Pawel I-481 Ouyang, Xiangyong II-312
Liljeberg, Pasi I-281, II-365
Lokhmotov, Anton I-270 Palmieri, Roberto I-45
Lopez, Gorka Esnal I-83 Panda, Dhabaleswar K. II-166, II-312
Luszczek, Piotr II-436 Parlavantzas, Nikos II-426
Pebay, P. II-231
Ma, Zhe II-272 Pedrinaci, Carlos I-54
Máhr, Tamás I-83 Peluso, Sebastiano I-45
Maiborn, Volker I-408 Pérez, Christian I-117
Malik, Muhammad Junaid I-169 Perla, Francesca I-293
Malony, Allen D. II-156 Petcu, Dana I-1, II-86
Mancuso, Ada I-460 Pflüger, Dirk II-375
Maris, Pieter I-305 Philippe, Laurent I-419
Mathieu, Gilles II-55 Ploss, Alexander I-149
Matise, Tara II-23 Prodan, Radu I-169
Mayo, J. II-231 Prudencio, Ernesto E. I-398
Mazzeo, Dario I-460 Psomadakis, Stratos II-398
Medina-Medina, Nuria II-33
Meek, Eric II-436 Quaglia, Francesco I-45
Mehta, Gaurang II-23 Quarati, Alfonso II-96
Meiländer, Dominik I-149 Quesnel, Flavien II-446, II-457
Membarth, Richard I-270
Mencagli, Gabriele I-139 Rafanell, Roger I-25
Meng, Qingyu I-324 Rajachandrasekar, Raghunath II-312
Meshram, Vilobh II-312 Rajamanickam, Siva I-315
Metzker, Martin II-64 Rajcsányi, Vilmos I-181
Mey, Dieter an II-198 Rak, M. I-94
Mihaylov, Valentin I-408 Rak, Massimiliano II-106
Montangero, C. I-128 Ranganathan, Anand II-467
Montes-Soldado, Rosana II-33 Rheinländer, Astrid II-13
Moore, Shirley II-436 Ricci, Laura I-159
Moreshet, Tali I-249 Richards, Andrew I-260
Morin, Christine II-292, II-426 Riesen, Rolf II-221
Mudalige, Gihan I-191 Righetti, Giacomo I-159
Mueller, Frank II-251 Riteau, Pierre II-292
Müller-Pfefferkorn, Ralph II-137 Rodrigues, Arun II-221
Muraraşu, Alin II-345 Roe, D. II-231
Romano, Luigi I-15
Nagel, Wolfgang E. II-137 Romano, Paolo I-45
Nakazawa, Jin II-335 Rouet, François-Henry I-334
Nandagudi, Girish I-511 Rouson, Damian I-367
Rouzaud-Cornabas, J. II-416 Teich, Jürgen I-270


Rychkov, Vladimir I-450 Tenhunen, Hannu I-281, II-365
Terpstra, Dan II-436
Sánchez, Ma. Guadalupe I-502 Thanakornworakij, Thanadech II-292
Sancho, José Carlos II-262 Thompson, D. II-231
Sanzo, Pierangelo di I-45 Thornquist, Heidi I-315
Sato, Mitsuhisa I-429 Tiwari, Ananta II-178
Saverchenko, Ilya II-124 Tlili, Raja I-201
Scarano, Vittorio I-460 Toch, Lamiel I-419
Schaaf, Thomas II-53, II-64, II-124 Toinard, C. II-416
Schiek, Rich I-315 Tokuda, Hideyuki II-335
Schiffers, Michael II-96 Torquati, Massimo II-3
Schmidt, John I-324 Träff, Jesper Larsson I-245
Schulz, Karl W. I-398 Tran, Minh Tuan I-429
Sciacca, Eva II-3 Troina, Angelo II-3
Scott, Stephen L. I-211, I-234, II-209
Scroggs, Blaine II-292 Uçar, Bora I-334
Semini, L. I-128
Serebrin, Benjamin I-223 Vafiadis, George I-35
Serrat-Fernández, Joan II-53, II-64 Vallée, Geoffroy I-211, I-234
Sharma, Rajan II-292 Vanneschi, Marco I-139
Shende, Sameer II-156 Varela, Maria Ruiz II-221
Simons, Josh I-213 Vary, James P. I-305
Slawinska, Magdalena I-5 Venticinque, S. I-94
Slawinski, Jaroslaw I-5 Venticinque, Salvatore II-106
Slimani, Yahya I-201 Vidal, Vicente I-502
Slota, Renata II-76 Vienne, J. II-166
Snavely, Allan II-178 Villano, U. I-94
Sørensen, Hans Henrik Brandenborg Vishkin, Uzi I-249
I-377 Vöckler, Jens II-23
Soltero, Philip II-241
Wang, Ying II-23
Sousa, Leonel I-491
Weaver, Vincent M. II-436
Spagnuolo, Carmine I-460
Weidendorfer, Josef II-333, II-345
Spear, Wyatt II-156
Weiss, Jan-Philipp II-333
Spinella, Salvatore II-3
Wolff, Holger I-408
Stewart, Alan I-104
Wong, M. II-231
Stompor, Radek I-355
Wood, Steven I-54
Strijkers, Rudolf II-116
Wright, Peter I-104
Subhlok, Jaspal I-511
Wu, Kesheng I-345
Subramoni, H. II-166
Sun, Yih Leong I-104 Xu, Thomas Canhao I-281, II-365
Sunderam, Vaidy I-5
Sur, S. II-166 Yamazaki, Ichitaro I-345
Sutherland, James C. I-324 Yampolskiy, Mark II-96
Szepieniec, Tomasz II-53, II-64 Yang, Chao I-305

Taerat, N. II-231 Zanetti, Gianluigi II-385


Takashio, Kazunori II-335 Zaroo, Puneet I-213
Tatbul, Nesime II-467 Zhang, Ziming II-282
Taufer, Michela II-221 Ziegler, Wolfgang I-113
