Technical Safety, Reliability and Resilience
Methods and Processes
Ivo Häring
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
The preparation of this textbook has been supported in parts by the Federal Ministry of Education and Research (BMBF) within the Federal Government – German Federal State competition "Advancement through Education: Open Universities."
Acknowledgements
Because of the broad scope covering risk control and resilience enhancement of (socio) technical systems, system modeling, analytical system analysis, functional safety, and the development of reliable and safe systems that show resilience post disruptions, there is a broad range of persons and bodies to be thanked.
Parts of the contents have been presented within study courses to numerous
students generating valuable feedback. This includes continuous academic training
(certificate of advanced study) as developed by Fraunhofer Ernst-Mach-Institut
(EMI), Freiburg, in collaboration with the University of Freiburg and the Fraunhofer
Academy Munich, and numerous classical courses held at the University of Freiburg,
at the Baden-Wuerttemberg Cooperative State University (DHBW), Lörrach, and at
the Hochschule Furtwangen University of Applied Science (HFU) in the areas of
technical safety, reliability, and functional safety and resilience engineering.
Thanks also go to the following bachelor, diploma, master, or Ph.D. students
at Fraunhofer EMI who contributed to applied research projects, in particular, in
the methodology tailoring and development, as referenced throughout the text-
book: Ranganatha (2020), Sudheeran (2020), Satsrisakul (2018), Jain (2018),
Michael (2014), Thielsch (2012), Meier (2011), Ringwald (2010), Voss (2011),
Brocke (2010), Ettwein (2010), Schubert (2009), Barlow (2009), Wolf (2009),
Neumann-Heyme (2008), Glose (2008), Larisch (2008), Hänle (2007), and Schmidt
(2007).
The text incorporates valuable feedback from scientists and heads of research groups of the Department of Security Technologies and Protective Structures at Fraunhofer EMI, including A. Thein, J. Kaufman, B. Kanat, J. Scheidereiter, G. Hormann, S. Ebenhöch, U. Siebold, J. Finger, M. Fehling-Kaschek, and K. Fischer.
The generation as well as the iteration of the text was very much supported by the scientific editorial work of Sina Rathjen, who compiled first versions of the text based on outlines by the author and on content material provided in selected sources generated during research project work. Without her tremendous effort this work would not have been possible. Thanks also go to the library staff and the media and publishing team at the institute: J. Helbling, B. Soergel, U. Galena, B. Bindnagel, and J. Holz. The text owes much to the thorough support of the team at Springer Publishing House.
References
Barlow, L. D. (2009): Unterstützung Sicherheitsnachweis durch Schaltungssimulation, Support of safety assessment using electronic circuit simulation. Diploma Thesis, Naturwissenschaftlich-Technische Akademie Isny (Fachhochschule und Berufskollegs), Fraunhofer EMI, Efringen-Kirchen.
Brocke, Michael vor der (2010): Grundlagen für Sicherheitsanalyse von embedded Software zur
Umsetzung des Auswertealgorithmus eines Radarabstandssensors, Fundamentals for the safety
assessment of embedded software implementing an algorithm for radar distance measurement
determination. Bachelor Thesis. Hochschule Coburg; Fraunhofer EMI Efringen-Kirchen.
Ettwein, Peter (2010): Schädigungsmodelle für stumpfe Verletzungen, Damage models for blunt injuries. Bachelor Thesis, Hochschule für Technik (HFT) Stuttgart, Fraunhofer EMI, Efringen-Kirchen.
Glose, D (2008): Zuverlässigkeitsvorhersage für elektronische Komponenten unter mechanischer
Belastung, Reliability prediction of electronic components for mechanical stress environment.
Hochschule Kempten, University of Applied Sciences; Fraunhofer EMI, Efringen-Kirchen.
Hänle, A. (2007): Modellierung und Spezifikation von Anforderungen eines sicherheitskritischen
Systems mit UML, Modeling and Specification of Requirements of a safety critical System with
UML. Hochschule Konstanz für Technik, Wirtschaft und Gestaltung (HTWG), University of
Applied Sciences; Fraunhofer EMI, Efringen-Kirchen. Computer Science.
Jain, Kumar Aishvarya (2018): Indoor ultrasound location system simulation for assessment of
disruptive events and resilience improvement options. Master Thesis. FAU Erlangen-Nürnberg,
Fraunhofer EMI.
Larisch, M. (2008): Unterstützung des Nachweises funktionaler Sicherheit nach IEC 61508 durch
SysML, Support of the functional safety assessment according to IEC 61508 with SysML.
Diploma Thesis. Hochschule Konstanz für Technik, Wirtschaft und Gestaltung (HTWG),
University of Applied Sciences; Fraunhofer EMI, Efringen-Kirchen.
Meier, Stefan (2011): Entwicklung eines eingebetteten Systems für eine Sensorauswertung gemäß IEC 61508, Development of an embedded system for sensor evaluation according to IEC 61508. Bachelor Thesis. Duale Hochschule Baden-Württemberg, Lörrach. Fraunhofer EMI.
Michael, Rajesh (2014): Electrical characterization of electronic components during high acceler-
ation: analytical and coupled continuum models for characterization of loading and explaining
experiments. Master Thesis. Jena, Ernst-Abbe-Fachhochschule, Fachberreich Sci-Tec: Präzi-
sion, Optik, Materialien. Fraunhofer EMI.
Neumann-Heyme, Hieram (2008): Numerische Simulation zum Versagen elektronischer Kompo-
nenten bei mechanischer Belastung, Numerical simulation of the failure of electronic compo-
nents under mechanical stress. Diploma Thesis. Technical University (TU) Dresden; Fraunhofer
EMI, Efringen-Kirchen.
Ranganatha, Madhusudhan (2020): Robust signal coding for ultrasound indoor localization. Master
Thesis, SRH Hochschule Heidelberg, Fraunhofer EMI.
Ringwald, Michael (2010): Entwicklung der Sensorauswerteschaltung für eine Sicherungseinrichtung nach IEC 61508, Development of the sensor evaluation circuit for a safety device according to IEC 61508. Bachelor Thesis. Duale Hochschule Baden-Württemberg, Lörrach. Fraunhofer EMI.
Schmidt, A. (2007): Analyse zur Softwaresicherheit eines Embedded Systems mit IEC 61508,
Analysis to assess the software safety of an Embedded System using IEC 61508. Diploma
Thesis. Hochschule Furtwangen University (HFU), Fraunhofer EMI.
Schubert, Florian (2009): Maschinenbautechnische Analyse generischer mechanischer Teilsys-
teme bei hoher mechanischer Belastung, Machine engineering construction analysis of
generic mechanical subsystems under high mechanical stress. Diploma Thesis, Rheinische
Fachhochschule (RFH) Köln, Fraunhofer EMI, Efringen-Kirchen.
Thielsch, P. (2012): Risikoanalysemethoden zur Festlegung der Gesamtsicherheitsanforderungen im Sinn der "IEC 61508 (Ed. 2)", Risk analysis methods for the determination of the overall safety requirements in the sense of IEC 61508 (Ed. 2). Bachelor Thesis, Hochschule Furtwangen.
Voss, Martin (2011): Methodik zur Bestimmung des Verhaltens von elektronischen Komponenten
unter hoher Beschleunigung, Method to determine the behaviour of electronic components under
high acceleration. Master Thesis. University of Freiburg, IMTEK; Fraunhofer EMI, Efringen-
Kirchen, Freiburg.
Wolf, Stefan (2009): Verhalten elektronischer Bauteile bei hochdynamischer Beschleunigung,
Behaviour of electronic components under high-dynamic acceleration. Master Thesis, TU
Chemnitz, Fraunhofer EMI, Efringen-Kirchen.
About This Book
The book introduces key quantities of functional safety, including the safety integrity level (SIL), safe failure fraction (SFF), diagnostic coverage (DC), and dangerous undetected (DU) and detected (DD) failures.
Also, some basic methods are introduced for securing the integrity of data within
electronic embedded systems, in particular, data stored with limited resources. It is
shown how to assess the efficiency of the methods in operational hardware contexts.
A further focus is on the description of the functional safety life cycle according
to IEC 61508, as well as the selection, combination, and joint tailoring of methods
for efficiently fulfilling functional safety requirements, in particular, of all tabular
methods introduced in detail.
Application domains of the processes and methods include automotive systems for autonomous driving functions; the process and production industry; industrial automation and control systems; critical infrastructure systems; renewable energy generation, storage, and distribution systems; and energy and water smart grids.
The book addresses developers and students up to Ph.D. level who want to gain a fast overview of and insight into practically feasible processes and methods for designing and developing safe, reliable, and resilient systems, in particular, in the domain of (more) sustainable systems. Some of the chapters summarize selected research project results and should also be of interest to academia, in particular, when relating processes and methods to novel safety and resilience concepts for technical systems.
Contents
2.11 Answers
References
3 Basic Technical Safety Terms and Definitions
3.1 Overview
3.2 System
3.3 Life Cycle
3.4 Risk
3.5 Acceptable Risk
3.6 Hazard
3.7 Safety
3.8 Risk Minimization
3.9 Safety-Relevant and Safety-Critical Systems
3.10 Safety-Relevant Norms
3.11 Systems with High Requirements for Reliability
3.12 Models for the Software and Hardware Development Process
3.13 Safety Function and Integrity
3.14 Safety Life Cycle
3.15 Techniques and Measures for Achieving Safety
3.16 System Description, System Modeling
3.16.1 OPM (Object Process Methodology)
3.16.2 AADL (Architecture Analysis and Design Language)
3.16.3 UML (Unified Modeling Language)
3.16.4 AltaRica
3.16.5 VHDL (Very High-Speed Integrated Circuit Hardware Description Language)
3.16.6 BOM (Base Object Model)
3.16.7 SysML (Systems Modeling Language)
3.17 System Simulation
3.18 System Analysis Methods
3.19 Forms of Documentation
3.20 Questions
3.21 Answers
References
4 Introduction to System Modeling for System Analysis
4.1 Overview
4.2 Definition of a System
4.3 Boundaries of the System
4.4 Theoretical Versus Practical System Analysis
4.5 Inductive and Deductive System Analysis Methods
4.6 Forms of Documentation
4.7 Failure Space and Success Space
4.8 Overview Diagram
Bibliography
Index
Abbreviations
Mathematical Notations
π Adjustment factor
∧ And
ω Angular frequency
x̄ Arithmetic mean
ρ(k) Autocorrelation function
γ̂ (k) Autocovariance
ρk Autocovariance coefficient
|A| Cardinality of A
A Complement of A
Ω Complete space
P(B|A) Conditional probability
C Consequences
rx y Correlation coefficient
Rcrit Critical risk quantity
D Detection rating
∅ Empty set
∪ex Exclusive union
E(t) Expected value
f (t) Failure density
F(t) Failure probability
λ(t) Failure rate
F Feasibility rating
f Frequency
I Impulse
∩ Intersection
x̃ Median
N0 Non-negative integers
¬ Not
# Number of
O Occurrence rating
∨ Or
T Period
N Positive integers
P, p Pressure
P Probability
Pr Probit
Rb Reliability
R Risk
S Severity rating
α Smoothing constant
dΩ Solid angle surface element
sx Standard deviation
⊆ Subset
∪ Union
sx2 Variance
\ Without
List of Figures
Fig. 16.7 Results of the bit error simulation program for three-bit errors for all coding methods tested (Kaufman et al. 2012)
Fig. 16.8 Results of the bit error simulation program for four-bit errors for all coding methods tested (Kaufman et al. 2012)
Fig. 16.9 Results of the bit error simulation program for five-bit errors for all coding methods tested (Kaufman et al. 2012)
List of Tables
Table 8.4 Entries in the work sheet of the different types of hazard analysis
Table 8.5 Explanation of the parameters in Figs. 8.4 and 8.5 (IEC 61508 2010). Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Table 8.6 Example of data relating to risk graph in Fig. 8.4 (IEC 61508 2010). Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Table 8.7 Example of calibration of the general-purpose risk graph in Fig. 8.3 (IEC 61508 2010). Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Table 8.8 Overview of norms and standards treating risk graphs, translated extract of Table 10.3 in Thielsch (2012)
Table 8.9 Example of a risk classification of accidents, graphically adapted, content from IEC 61508 S+ (2010). Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Table 8.10 Comparison risk matrix, translated from Thielsch (2012)
Table 8.11 The SILs for high and low demand rates as defined in IEC 61508 (2010). Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Table 8.12 Allocation of the different types of hazard analysis to the phases of the Safety Life Cycle of IEC 61508, following (Echterling 2011)
Table 8.13 Allocation of the different types of hazard analysis to a general development cycle (Echterling 2011)
Table 8.14 Analysis methods and minimal required table columns (Häring and Kanat 2013)
Table 9.1 Common density functions of reliability prediction (Glose 2008)
Table 9.2 Overview of mathematical definitions for reliability predictions (Meyna and Pauli 2003)
Table 11.1 Concepts and terms from IEC 61508 in previous chapters
Table 11.2 Extract from Table A.1 in IEC 61508 (IEC 61508 2010). Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Table 12.1 Properties of safety requirements. If two properties are not separated by a line only one property may be simultaneously assigned to a safety requirement (Hänle and Häring 2008)
Table 12.2 Evaluation of the properties of the example safety properties (Hänle 2007)
Table 13.1 Explanation of the visibility symbols (Hänle 2007)
Table 13.2 Notation for the number of instances (Hänle 2007; Wikipedia 2012a)
Table 13.3 Overview of the relationships between classes
1.1 Overview
The introductory chapter gathers several narratives as to why systematic processes, techniques, and measures, in particular, classical and modified analytical system analysis methods, are key for the ideation, concept and design, development, verification and validation, implementation, and testing of reliable, safe, and resilient systems.
It provides disciplinary structuring and ordering principles employed by the text,
main dependencies of the chapters, and features of the text to increase readability
and relevancy for practitioners and experts. Chapters are highlighted that can be
understood as stand-alone introductions to key system analysis methods, processes,
and method combination options.
More recent applied research backgrounds are provided in the domains of automotive engineering of electric vehicles, indoor localization systems, and sustainable energy supply of urban quarters. This shows that the book provides selected analysis methods and processes sufficient for achieving reliable, safe, and to a lesser extent also resilient systems by resorting to classical system analysis methods including their tailoring and expected extensions.
In summary, the chapter gathers introductory material ranging from a general motivation (Sect. 1.2), disciplinary structuring and ordering principles employed by the textbook (Sect. 1.3), and main features of the text (Sect. 1.4) to the research background and research project results used (Sect. 1.5).
When considering technical properties of (socio) technical systems and devices, the list of requirements is ever increasing, by now including:
• reliability,
• availability,
• low carbon-dioxide footprint, i.e., sustainability regarding the environmental
effects,
• maintainability, and
• reusability or recyclability.
In addition, modern systems are increasingly connected, intelligent, flexible, used over a long lifetime when compared with the development time, and are increasingly adaptive, reconfigurable, and autonomous. Hence, extending the classical risk landscape, potential risk events that need to be covered range from expected to unexpected or even unexampled (black swans, unknown unknowns), from accidental via intentional (e.g., sabotage) to malicious (e.g., terroristic), and from external to internal. The risk events concern the safety of systems as well as physical and IT security, range from minor to major, unfold on short up to long time scales, and are minor up to disruptive or even catastrophic.
As the threat landscape increases with respect to failure modes and scales of
failures, the further classical requirement that (socio) technical systems adequately
provide
• control of risks, i.e., safety and security,
has to be amended and extended by providing
• resilience with respect to disruptive or (potential) damage events.
Resilience can be generated by engineering for resilience, e.g., resilience by design, by the development approaches followed, by using appropriate techniques and measures during development, and by appropriate maintenance of systems during operation that supports resilient behavior.
Another motivation of the textbook is to enable the reader to flexibly use, tailor, and extend existing and practically established methods (best practices) of technical safety in novel contexts and for pursuing novel system analysis aims. To this end, the assumptions, boundary conditions, and restrictions of the methods and processes are described. A particular focus is on how to adapt the methods for supporting the development of resilient systems.
A further motivation is to support the best practice of processes and methods established for the development of safety-relevant or safety-critical systems. Examples include the automotive domain, e.g., autonomous driving functions, the production, process, and logistics industry, e.g., industrial automation and control systems, and critical infrastructure systems (lifelines), e.g., supply grids for electricity, gas, telecommunication, and the banking system.
1.3 Structure of Text and Chapter Contents Overview
reason is that the former can serve as a statement of the potential of the methods and the latter chapter as a formally well-based starting point to more elementary methods. In particular, the advantages as well as disadvantages of FMEA with respect to FTA can be discussed on safe ground.
From a functional safety perspective according to the generic standard IEC 61508, the generic overall safety generation process of functional safety (safety life cycle) is presented, which includes nested software and hardware development processes (V-models for SW and HW). For the determination of the requirements of safety functions, selected methods are presented, in particular, hazard analysis. For the system analysis of electric/electronic/programmable electronic (E/E/PE) safety-related systems, system analysis methods like failure modes, effects, and diagnostic analysis (FMEDA) as well as further techniques and measures are presented, including reliability prediction, FTA, error detection and correction, and semi-formal graphical system modeling. The overall terminology is strongly aligned with IEC 61508.
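To make the quantities produced by such an FMEDA more concrete, the following minimal sketch computes the diagnostic coverage (DC) and the safe failure fraction (SFF) from assumed failure-rate shares, using the standard IEC 61508 definitions; the numerical values are purely illustrative and not taken from the book.

```python
# Illustrative only: assumed failure rates per hour of one safety-related channel
lambda_safe = 4.0e-7   # safe failures (detected or undetected)
lambda_dd = 3.0e-7     # dangerous detected failures
lambda_du = 1.0e-7     # dangerous undetected failures

lambda_dangerous = lambda_dd + lambda_du
lambda_total = lambda_safe + lambda_dangerous

# Diagnostic coverage: share of dangerous failures detected by diagnostics
dc = lambda_dd / lambda_dangerous

# Safe failure fraction: share of failures that are safe or dangerous but detected
sff = (lambda_safe + lambda_dd) / lambda_total

print(f"DC  = {dc:.2%}")   # 75.00%
print(f"SFF = {sff:.2%}")  # 87.50%
```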
In summary, independent of the functional safety concept, a set of complementary and most frequently and successfully used methods for reliable and safe system development is presented. In addition, for each method it is indicated how it can be used to generate resilience of technical systems. Furthermore, key methods found important for safe and secure system development, like semi-formal graphical modeling, data integrity assessment, and reliability prediction, are presented.
Each chapter starts with a section containing a broad introduction and critical overview of the contents covered. Then content sections follow. The ambition is to provide graphics for the illustration of key ideas as well as worked-out examples. Each chapter concludes with questions and answers. Each chapter can, in principle, be read independently of the other chapters.
The following chapters have stronger links and should be read in sequence. Chapters 3 and 4 both introduce system analysis definitions and key concepts for analytical system modeling and analysis. They should be read before any of the chapters that cover analysis methods, namely Chaps. 5 to 8. Chapter 10 should be read before Chap. 11 since the former provides system development processes and the latter the functional safety life cycle. Chapter 13 should be read before Chap. 14 since both cover semi-formal graphical system modeling approaches for the application case of safety development. All other chapters can be read stand-alone.
Chapter 2 can be read as an overview of application options of the methods for developing resilient systems. It can also be read after completing any of the chapters on analysis methods, techniques, and measures for system development (Chaps. 6, 7, 8, 9, 13, 14, or 16) to start the tailoring and modification of a selected method for supporting the development of resilient technical systems.
1.5 Sample Background Research Projects
The subsequent sections mention selected applied research projects that used methods
as presented within the textbook to show representative application examples and
domains.
The pilot solution demonstrator project on multi-modal efficient and resilient localization for intralogistics, production, and autonomous systems (MERLIN 2020) first applies a systematic approach to analytical resilience analysis of technical systems as described in (Häring et al. 2017a, b; Resilienzmaße 2018). In addition to indoor localization via ultrasound (Bordoy et al. 2020; Ens et al. 2015), the technologies Bluetooth, ultra-wideband (UWB), radio-frequency identification (RFID), and optical methods are assessed. The aim is reliable localization taking application environments and potential disruptions into account.
Due to the generic challenge of finding the best combination of technologies for better risk control and resilience, preliminary hazard lists are a reasonable starting point to identify potential hazards to sufficiently precise localization. For identified potential design options, this application also asks for extensions of (preliminary) hazard analysis methods for each localization technology and the given operational environments, see Chap. 8. For instance, the hazard (disruption) acoustic (ultrasound) noise can be inductively assessed regarding its relevancy for a subset of localization technologies to determine whether each technology is robust against such an expectable disruption, e.g., in production shop floors. Other sample disruptions include barriers preventing line-of-sight communication and coverage of localization tags.
The most promising set of technologies can be assessed to exclude single-point function failures using failure modes, effects, and criticality analysis (FMECA), see Chap. 7, where the criticality level is defined in terms of a certain loss of localization accuracy, as well as by using FTA, see Chap. 6.
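As a hedged illustration of how an FTA of such a localization function could be quantified, the following sketch evaluates a small hypothetical fault tree: localization is lost if all redundant technologies are disrupted at the same time (AND gate) or if a common power supply fails (OR gate). The tree structure and the probabilities are assumptions for illustration only and are not results of the MERLIN project.

```python
# Hypothetical basic event probabilities per mission time (illustrative values only)
p_ultrasound_noise = 0.05   # ultrasound localization disturbed by acoustic noise
p_uwb_interference = 0.02   # UWB localization disturbed by radio interference
p_line_of_sight = 0.10      # optical localization blocked (no line of sight)
p_power_failure = 0.001     # common power supply of the localization unit fails

# AND gate: all redundant localization technologies disrupted simultaneously
p_all_technologies_fail = p_ultrasound_noise * p_uwb_interference * p_line_of_sight

# OR gate (exact for independent events):
# top event = all technologies fail OR common power supply fails
p_top = 1 - (1 - p_all_technologies_fail) * (1 - p_power_failure)

print(f"P(all technologies disrupted) = {p_all_technologies_fail:.2e}")
print(f"P(loss of localization)       = {p_top:.2e}")
```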
References
Bordoy, Joan; Schott, Dominik Jan; Xie, Jizhou; Bannoura, Amir; Klein, Philip; Striet, Ludwig et al.
(2020): Acoustic Indoor Localization Augmentation by Self-Calibration and Machine Learning.
In Sensors (Basel, Switzerland) 20 (4). https://doi.org/10.3390/s20041177.
Ens, Alexander; Hoeflinger, Fabian; Wendeberg, Johannes; Hoppe, Joachim; Zhang, Rui; Bannoura,
Amir et al. (2015): Acoustic Self-Calibrating System for Indoor Smart Phone Tracking. ASSIST.
In International Journal of Navigation and Observation 2015. https://doi.org/10.1155/2015/
694695.
Häring, Ivo; Sansavini, Giovanni; Bellini, Emanuele; Martyn, Nick; Kovalenko, Tatyana; Kitsak,
Maksim et al. (2017a): Towards a generic resilience management, quantification and development
process: general definitions, requirements, methods, techniques and measures, and case studies.
In Igor Linkov, José Manuel Palma-Oliveira (Eds.): Resilience and risk. Methods and application
in environment, cyber and social domains. NATO Advanced Research Workshop on Resilience-
Based Approaches to Critical Infrastructure Safeguarding. Dordrecht: Springer (NATO science
for peace and security series. Series C, Environmental security), pp. 21–80. Available online at
https://www.springer.com/de/book/9789402411225.
Häring, Ivo; Scheidereiter, Johannes; Ebenhoech, Stefan; Schott, Dominik; Reindl, L.; Koehler,
Sven et al. (2017b): Analytical engineering process to identify, assess and improve technical
resilience capabilities. In Marko Čepin, Radim Briš (Eds.): ESREL 2017 (Portoroz, Slovenia,
18–22 June, 2017). The 2nd International Conference on Engineering Sciences and Technologies.
High Tatras Mountains, Tatranské Matliare, Slovak Republic, 29 June – 1 July 2016. Boca Raton:
CRC Press, p. 159.
Häring, Ivo; Gelhausen, Patrick (2018): Technical safety and reliability methods for resilience
engineering: Taylor and Francis Group, pp. 1253–1260. Available online at https://doi.org/
10.1201/9781351174664, https://www.taylorfrancis.com/books/e/9781351174664, checked on
10/17/2019.
InnoTherMS (2020a): Innovative Predictive High Efficient Thermal Management System –
InnoTherMS. AllFraTech – DE-FR German-French Alliance for Innovative Mobility Solutions
(AllFraTech). Available online at https://www.e-mobilbw.de/fileadmin/media/e-mobilbw/Publik
ationen/Flyer/AllFraTech_Infoflyer.pdf, checked on 3/6/2020.
InnoTherMS (2020b): Innovative Predictive High Efficient Thermal Management System –
InnoTherMS. German-French research project. Elektromobilität Süd-West. Available online at
https://www.emobil-sw.de/en/innotherms, checked on 5/15/2020.
MERLIN (2020): Multimodale effiziente und resiliente Lokalisierung für Intralogistik, Produktion
und autonome Systeme. Demonstration Project, Funded by: Ministerium für Wirtschaft, Arbeit
und Wohnungsbau Baden-Württemberg, Ministerium für Wissenschaft, Forschung und Kunst
Baden-Württemberg, Fraunhofer-Gesellschaft für angewandte Forschung, e.V., Albert-Ludwigs-
Universität Freiburg. Edited by Sustainability Center Freiburg. Available online at https://www.
leistungszentrum-nachhaltigkeit.de/demoprojekte/merlin/, checked on 9/27/2020.
OCTIKT (2020): Ein Organic-Computing basierter Ansatz zur Sicherstellung und Verbesserung der
Resilienz in technischen und IKT-Systemen. German BMBF Project. Available online at https://
projekt-octikt.fzi.de/, checked on 5/15/2020.
Resilienzmaße (2018): Resilienzmaße zur Optimierung technischer Systeme, Resilience measures
for optimizing technical systems, Research Project, Sustainability Center Freiburg, 2016–2018.
Available online at https://www.leistungszentrum-nachhaltigkeit.de/pilotphase-2015-2018/resili
enzmasse/, checked on 9/27/2020.
Tomforde, Sven; Gelhausen, Patrick; Gruhl, Christian; Häring, Ivo; Sick, Bernhard (2019): Explicit
Consideration of Resilience in Organic Computing Design Processes. In: ARCS Workshop 2019
and 32nd International Conference on Architecture of Computing Systems. Joint Conference.
Copenhagen, Denmark. Berlin, Germany: VDE Verlag GmbH, pp. 51–56.
Chapter 2
Technical Safety and Reliability Methods for Resilience Engineering
2.1 Overview
The chapter rests on (Häring et al. 2016a, b) but mainly on (Häring and Gelhausen 2018). Further sources include (Häring et al. 2016d; 2017b).
This shows that resilience analysis may benefit from the application of traditional
and more modern risk concepts.
The idea used as a starting point in (Aven 2017) is that resilience behavior can be defined to occur post disruption events. Thus, resilience events, e.g., "system stabilizes post disruption," "system recovers," "system bounces back better," "recovery time shorter than critical time," or "sufficient system performance level reached within t," are always conditional on previous events,

P(B|A), (2.1)

A = A1, A2, ..., An. (2.2)
This approach relates to the often used definition of conditional vulnerability in risk expressions, see, e.g., (Daniel M. Gerstein et al. 2016),

R = P(C) C. (2.4)

Since typical resilience expressions further extend the definitions of (2.1) and (2.3), it is expected that classical system reliability and safety approaches are challenged when used for assessing resilience. In particular, (very) simple tabular approaches resort to risk concepts as described by (2.4) when applied in a traditional way, i.e., they focus on the avoidance of events and on system robustness in case of events only.
Generalizing (2.1), resilience expressions of interest typically are of the form (Häring et al. 2016a)

P(B|A) = \sum_{i=1}^{N} P(B|D_i) P(D_i|A), (2.5)

\sum_{i=1}^{n} P(\cdot|D_i) P(D_i|\cdot), (2.6)

A = "Disruption event",
{D_i}_{i=1,...,n} = set of possible response and recovery events / set of transition states,
B = "Final state of interest". (2.7)
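A minimal numerical sketch of Eq. (2.5) follows, assuming a hypothetical disruption with three possible response-and-recovery paths D_i; all probabilities are invented for illustration only.

```python
# Hypothetical transition states D_i reached after disruption A (illustrative values)
p_d_given_a = [0.6, 0.3, 0.1]   # P(D_i | A): fast response, slow response, no response
p_b_given_d = [0.95, 0.7, 0.1]  # P(B | D_i): final state of interest reached per path

# Law of total probability, Eq. (2.5): P(B|A) = sum_i P(B|D_i) * P(D_i|A)
p_b_given_a = sum(pb * pd for pb, pd in zip(p_b_given_d, p_d_given_a))

print(f"P(recovery | disruption) = {p_b_given_a:.3f}")  # 0.790
```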
When comparing risk and vulnerability expressions of the form (2.3) and (2.4) with resilience expressions of the form (2.1) and (2.5), it becomes obvious that classical analytical system analysis methods cannot be straightforwardly expected to deliver assessment results regarding resilience. This motivates the question of how such methods can contribute to resilience assessment.
For focusing the research question, the following system modeling, classical
system analysis, and system development methods are considered regarding their
suitability for resilience assessment:
• SysML, UML;
• HL, PHA, HA, O&SHA, HAZOP;
• FMEA, FMECA, FMEDA;
• RBD;
• ETA;
• DFM;
• FTA, time-dependent FTA (TDFTA);
• Reliability prediction with standards;
• Methods for HW and SW development; and
• Bit error-correction methods.
To assess the suitability of methods for resilience engineering, the following
resilience dimensions are used:
• Five-step risk management process (AS/NZS ISO 31000:2009); for a review see (Purdy 2010; Luko 2013), and for a critical discussion, mainly regarding the coverage of uncertainty, see (Aven 2011).
{1, 2, 3, 4, 5},
{−−, −, o, +, ++},
{not suited (adaptation useless),
rather not suited (or only with major modifications),
potentially suited (after adaptation),
suited (with minor modifications),
very well suited (straightforward/no adaptions)}. (2.8)
The five-step risk management scheme is only a very generic framework for iden-
tifying risks on resilience objectives. As discussed in the introduction, objectives in
the case of resilience analysis are more second order (e.g., “fast recovery in case of
disruption”) when compared to classical risk analysis and management (e.g., “avoid
disruption”).
Table 2.1 assesses the suitability of analytical system analysis and some devel-
opment methods for resilience analysis sorted along the five-step risk management
scheme using the scale of (2.8).
Table 2.1 Suitability of analytical system analysis and HW/SW development methods for resilience analysis sorted along the five-step risk management scheme

| Method \ five-step risk management process steps | (1) Establish context | (2) Identify risks/hazards | (3) Analyze/compute risks | (4) Evaluate risks | (5) Mitigate risks |
|---|---|---|---|---|---|
| SysML | + | ++ | ++ | o | ++ |
| UML | + | + | + | o | + |
| HL | + | ++ | + | o | − |
| PHA | + | + | ++ | + | + |
| SSHA, HA | o | + | ++ | ++ | ++ |
| O&SHA, HAZOP | o | + | ++ | ++ | ++ |
| FMEA, FMECA | − | o | ++ | + | ++ |
| FMEDA | − | − | ++ | + | ++ |
| RBD | o | + | ++ | + | ++ |
| ETA | + | + | ++ | + | ++ |
| DFM | − | o | ++ | + | ++ |
| FTA, TDFTA | − | o | ++ | + | ++ |
| Reliability prediction with standards | − | − | ++ | + | ++ |
| Methods for HW and SW development | − | − | − | − | ++ |
| Bit error-correction methods | − | − | – | − | ++ |
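To indicate how such a qualitative rating table can be handled programmatically, e.g., to filter methods that are rated at least "suited" for a given risk management step, the following sketch encodes the scale of (2.8) and a small excerpt of Table 2.1; the encoding, the step names, and the function are illustrative choices, not part of the assessment itself.

```python
# Scale of (2.8): symbolic ratings mapped to numerical values 1..5
SCALE = {"--": 1, "-": 2, "o": 3, "+": 4, "++": 5}

STEPS = ["establish context", "identify risks", "analyze risks",
         "evaluate risks", "mitigate risks"]

# Excerpt of Table 2.1 (method -> ratings per five-step risk management step)
TABLE_2_1 = {
    "SysML":       ["+", "++", "++", "o", "++"],
    "PHA":         ["+", "+",  "++", "+", "+"],
    "FMEA, FMECA": ["-", "o",  "++", "+", "++"],
    "FTA, TDFTA":  ["-", "o",  "++", "+", "++"],
}

def methods_suited_for(step: str, minimum: str = "+") -> list[str]:
    """Return methods rated at least `minimum` for the given risk management step."""
    col = STEPS.index(step)
    return [m for m, ratings in TABLE_2_1.items()
            if SCALE[ratings[col]] >= SCALE[minimum]]

print(methods_suited_for("identify risks"))   # ['SysML', 'PHA']
print(methods_suited_for("analyze risks"))    # all four encoded methods
```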
The catastrophe management cycle in four steps (e.g., preparation, prevention and protection, response and recovery, learning and adaptation) as well as in five steps, as shown in Table 2.2, takes advantage of a logical or temporal ordering of events with respect to disruption events (Häring et al. 2016a): (far) before, during, and immediately after. Another typical timeline as well as logic sequence example was given in Sect. 2.5.
The first observation in Table 2.2 is that the analysis methods should be conducted, if considered relevant, mainly in the preparation phase. However, especially fast and simple analytical methods can also be applied during the actual conduct of response and recovery. For instance, during and post events, a PHA scheme could be used to identify further possible second-order events given a disruption.
The second observation in Table 2.2 concerns the coverage of the resilience cycle phases. The assessed suitability of the methods stems from the fact that classical analytical approaches by definition cover prevention and protection when identified with the assessment of event frequencies and of immediate (first-order) damage.
If system failure, in the case of variations of HA, FMEA, and FTA, is defined as failure of adequate response (e.g., absorption and stabilization), of recovery (e.g., reconstruction and rebuilding), or even of improving or bouncing forward using the damage as an optimization opportunity, these methods can be used with adaptations also for these resilience timeline phases. RBD and ETA are assessed similarly.
The system modeling and development methods for hardware and software (HW/SW) can be used for all resilience cycle phases. As in the case of classical system analysis methods, adaptations up to major new developments are believed to be necessary.
Table 2.2 Suitability of system modeling, analytical system analysis, and selected development methods for resilience analysis along the five phases of the resilience cycle: resilience event order logic or timeline

| Method \ Resilience timeline cycle phase | (1) Prepare | (2) Prevent | (3) Protect | (4) Respond | (5) Recover |
|---|---|---|---|---|---|
| SysML | ++ | ++ | ++ | ++ | ++ |
| UML | ++ | + | + | + | + |
| HL | ++ | ++ | ++ | ++ | ++ |
| PHA | ++ | ++ | ++ | + | + |
| SSHA, HA | ++ | ++ | ++ | + | + |
| O&SHA, HAZOP | ++ | ++ | ++ | + | + |
| FMEA, FMECA | ++ | ++ | ++ | + | + |
| FMEDA | ++ | ++ | ++ | + | + |
| RBD | ++ | ++ | ++ | + | + |
| ETA | ++ | ++ | ++ | + | + |
| DFM | ++ | ++ | ++ | + | + |
| FTA, TDFTA | ++ | ++ | ++ | + | + |
| Reliability prediction with standards | ++ | ++ | ++ | + | + |
| Methods for HW and SW development | ++ | ++ | ++ | + | + |
| Bit error-correction methods | ++ | ++ | ++ | ++ | + |
Even if especially the classical tabular system analysis methods were assessed as very relevant for resilience assessment, it is noted that they are part of established processes in practice. Therefore, even when only some additional columns are added, establishing a modified best practice of their use is expected to be challenging in company development environments. In this sense, Table 2.2 is a guideline for the expected usability of the listed methods.
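As a purely illustrative sketch of what "adding only some additional columns" could mean for a tabular method, the following encodes one classical FMEA worksheet row extended by hypothetical response- and recovery-oriented fields; the extra column names and example values are assumptions, not taken from an established standard or from the book.

```python
from dataclasses import dataclass

@dataclass
class ResilienceFmeaRow:
    # Classical FMEA worksheet columns
    item: str
    failure_mode: str
    effect: str
    severity: int            # e.g., 1 (negligible) .. 10 (catastrophic)
    occurrence: int
    detection: int
    # Hypothetical resilience-oriented extension columns
    response_measure: str    # how the system absorbs/stabilizes after the failure
    recovery_measure: str    # how the function is restored
    recovery_time_h: float   # estimated time to recover the function

    @property
    def rpn(self) -> int:
        """Classical risk priority number of the row."""
        return self.severity * self.occurrence * self.detection

row = ResilienceFmeaRow(
    item="ultrasound localization unit",
    failure_mode="receiver saturated by acoustic noise",
    effect="position estimate unavailable",
    severity=6, occurrence=4, detection=3,
    response_measure="switch to UWB localization",
    recovery_measure="restart receiver after noise level drops",
    recovery_time_h=0.25,
)
print(row.rpn)  # 72
```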
Sensor-logic-actor chains are basic functional elements used for active safety applications in safety instrumented systems, especially within the context of functional safety as governed by the IEC 61508 series. The technical resilience capabilities can be considered as a generalization of such functional capabilities. They can also be related to the much more abstract and generic OODA (observe, orient, decide, act) loop, which has found much application also in the catastrophe response arena, see, e.g., (Lubitz et al. 2008; Huang 2015).
Table 2.3 Suitability of system modeling, analytical system analysis, and selected development methods for resilience analysis along the technical resilience capability dimension attributes

| Method \ Technical resilience capabilities | (1) Observation, sensing | (2) Representation, modeling, simulation | (3) Inference, decision-making | (4) Activation, action | (5) Learning, modification, adaption, rearrangement |
|---|---|---|---|---|---|
| SysML | ++ | ++ | ++ | ++ | ++ |
| UML | + | ++ | ++ | + | ++ |
| HL | + | + | + | + | + |
| PHA | + | + | + | + | + |
| HA | + | + | + | + | + |
| O&SHA, HAZOP | + | + | + | + | + |
| FMEA, FMECA | + | + | + | + | + |
| FMEDA | + | + | + | + | + |
| RBD | + | + | + | + | + |
| ETA | + | + | + | + | + |
| DFM | + | + | + | + | + |
| FTA, TDFTA | + | + | + | + | + |
| Reliability prediction with standards | + | + | + | + | + |
| Methods for HW and SW development | + | + | + | + | + |
| Bit error-correction methods | o | o | o | o | o |
Table 2.4 Suitability of system modeling, analytical system analysis, and selected development methods for resilience analysis along system layers or generic management domains

| Method \ System layer, management domain | (1) Physical | (2) Technical, hardware | (3) Cyber, software-wise, protocols | (4) Operational, organizational | (5) Societal, economic, ethical |
|---|---|---|---|---|---|
| SysML | ++ | ++ | ++ | + | + |
| UML | ++ | ++ | ++ | + | + |
| HL | + | ++ | + | o | o |
| PHA | + | ++ | + | o | o |
| HA | + | ++ | + | o | o |
| O&SHA, HAZOP | + | ++ | + | + | + |
| FMEA, FMECA | + | ++ | + | o | o |
| FMEDA | + | ++ | + | o | o |
| RBD | + | ++ | + | o | o |
| ETA | + | ++ | + | + | + |
| DFM | + | ++ | + | + | + |
| FTA, TDFTA | + | ++ | + | o | o |
| Reliability prediction with standards | – | ++ | – | o | o |
| Methods for HW and SW development | + | ++ | ++ | o | o |
| Bit error-correction methods | – | + | ++ | o | o |
The four resilience criteria robustness, redundancy, resourcefulness, and rapidity were introduced in (Bruneau et al. 2003) and technically refined by (Pant et al. 2014). For the suitability of method assessment, the following modified working definitions are used in this chapter:
(1) Robustness: measure for low level of damage (vulnerability) in case of event;
“good” absorption behavior; “good” protection.
(2) Redundancy: measure for low level of overall system effect in case of local (in
space, in time, etc.) disruption event; system disruption tolerance.
(3) Resourcefulness: measure for capability of successful allocation of resources in
the response phase to stabilize the system post disruptions.
(4) Rapidity: measure for fast recovery of system.
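To hint at how these working definitions can be turned into simple quantitative measures, the following sketch evaluates a hypothetical system performance curve around a disruption: robustness as the remaining performance level directly after the event, rapidity via the time to recover nominal performance, and the resilience loss as the area between nominal and actual performance (cf. Bruneau et al. 2003). The performance values are invented solely for illustration.

```python
# Hypothetical performance curve Q(t) in percent of nominal performance,
# sampled hourly around a disruption at t = 2 h (illustrative values only).
times = list(range(0, 11))                       # hours
performance = [100, 100, 40, 50, 65, 80, 95, 100, 100, 100, 100]

drop_index = performance.index(min(performance))
robustness = min(performance)                    # remaining performance after the event

# Rapidity: time from the performance drop until nominal performance is restored
recovery_index = next(i for i in range(drop_index, len(performance))
                      if performance[i] >= 100)
rapidity_h = times[recovery_index] - times[drop_index]

# Resilience loss: area between nominal (100 %) and actual performance (trapezoidal rule)
loss = sum(((100 - performance[i]) + (100 - performance[i + 1])) / 2
           for i in range(len(times) - 1))

print(f"robustness      = {robustness} %")       # 40 %
print(f"time to recover = {rapidity_h} h")       # 5 h
print(f"resilience loss = {loss} %·h")           # 170.0 %·h
```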
The results are similar to the suitability assessment along the logic or timeline
resilience cycle phases as conducted in Sect. 2.5: classical analytical approaches do
not focus beyond the damage events. Robustness, resourcefulness, and rapidity are
Table 2.5 Suitability of system modeling, analytical system analysis, and selected development methods for resilience analysis along modified resilience criteria

| Method \ Modified resilience criteria | (1) Robustness: low initial damage | (2) Redundancy: system property of overall damage tolerance | (3) Resourcefulness: fast stabilization and response | (4) Rapidity: fast recovery and reconstruction |
|---|---|---|---|---|
| SysML | + | ++ | ++ | ++ |
| UML | o | ++ | + | + |
| HL | + | o | – | – |
| PHA | + | + | – | – |
| HA | + | + | – | – |
| O&SHA, HAZOP | – | + | – | – |
| FMEA, FMECA | + | o | – | – |
| FMEDA | o | – | – | – |
| RBD | + | ++ | – | – |
| ETA | + | + | – | – |
| DFM | o | ++ | – | – |
| FTA, TDFTA | + | ++ | + | + |
| Reliability prediction with standards | – | – | – | – |
| Methods for HW and SW development | – | – | – | – |
| Bit error-correction methods | – | – | – | – |
according to the working definitions strongly related to the resilience cycle phases,
absorption/protection, response, and recovery.
Redundancy is understood in the classical way as an overall system property.
Hence, in all cases, sufficient system understanding is required, which is supported
by graphical modeling.
The (also) graphical approaches RBD, ETA, and FTA are strong for redun-
dancy and resourcefulness assessment. Especially time-dependent FTA and under-
lying time-dependent Markov models are believed to be key for resourcefulness and
redundancy assessment, nevertheless with major adaptations.
2.10 Questions
(1) Which resilience analysis phases, concepts, and dimensions do you know?
(2) How can the suitability of a classical technical safety analysis method be assessed regarding its use for resilience engineering?
(3) What are the main results regarding the potential use of classical system analysis
methods for resilience engineering of socio-technical systems?
2.11 Answers
(1) See column headlines of Tables 2.1, 2.2, 2.3, 2.4, 2.5 and respective short
explanations in text.
(2) There are several options, including (i) use of the five-step risk management scheme of ISO 31000 to identify for each step which classical methods can support the control of risks on resilience, e.g., use a hazard list to identify potential disruptions of the system; (ii) use resilience cycle steps and ask which classical method is suited to support the step, e.g., use FTA to identify the combination of events that leads to failed evacuation of people in case of emergency, use a SysML timing diagram to show the time constraints for successful evacuation, use an event tree to identify possible event trajectories post crisis event; and (iii) use system resilience properties to identify any weaknesses of the socio-technical system under consideration with the help of SysML use case or activity diagrams.
(3) See Sects. 2.2 and 2.9.
References
Aven, Terje (2011): On the new ISO guide on risk management terminology. In Reliability
Engineering and System Safety 96 (7), pp. 719–726. https://doi.org/10.1016/j.ress.2010.12.020.
Aven, Terje (2017): How some types of risk assessments can support resilience analysis and manage-
ment. In Reliability Engineering & System Safety 167, pp. 536–543. https://doi.org/10.1016/j.ress.
2017.07.005.
Baum, Eric; Hutter, Marcus; Kitzelmann, Emanuel (Eds.) (2010): Artificial general intelligence.
Proceedings of the Third Conference on Artificial General Intelligence, AGI 2010, Lugano,
Switzerland, March 5–8, 2010. Conference on Artificial General Intelligence; AGI. Amsterdam:
Atlantis Press (Advances in intelligent systems research, 10).
Bruneau, Michel; Chang, Stephanie E.; Eguchi, Ronald T.; Lee, George C.; O’Rourke, Thomas
D.; Reinhorn, Andrei M. et al. (2003): A Framework to Quantitatively Assess and Enhance the
Seismic Resilience of Communities. In Earthquake Spectra 19 (4), pp. 733–752. https://doi.org/
10.1193/1.1623497.
Cai, Zhansheng; Hu, Jinqiu; Zhang, Laibin; Ma, Xi (2015): Hierarchical fault propagation and
control modeling for the resilience analysis of process system. In Chemical Engineering Research
and Design 103, pp. 50–60. https://doi.org/10.1016/j.cherd.2015.07.024.
Daniel M. Gerstein; James G. Kallimani; Lauren A. Mayer; Leila Meshkat; Jan Osburg; Paul Davis
et al. (2016): Developing a Risk Assessment Methodology for the National Aeronautics and
Space Administration. RAND Corporation (RR-1537-NASA). Available online at https://www.
rand.org/pubs/research_reports/RR1537.html.
Fox-Lent, Cate; Bates, Matthew E.; Linkov, Igor (2015): A matrix approach to community resilience
assessment. An illustrative case at Rockaway Peninsula. In Environ Syst Decis 35 (2), pp. 209–218.
https://doi.org/10.1007/s10669-015-9555-4.
Goertzel, Ben; Wang, Pei; Franklin, Stan (Eds.) (2008): Artificial general intelligence, 2008.
Proceedings of the First AGI Conference. ebrary, Inc; AGI Conference. Amsterdam, Washington,
DC: IOS Press (Frontiers in artificial intelligence and applications, v. 171).
Häring, Ivo; Ebenhöch, Stefan; Stolz, Alexander (2016a): Quantifying resilience for resilience
engineering of socio technical systems. In Eur J Secur Res (European Journal for Security
Research) 1 (1), pp. 21–58. https://doi.org/10.1007/s41125-015-0001-x.
Häring, Ivo; Scharte, Benjamin; Hiermaier, Stefan (2016b): Towards a novel and applicable
approach for Resilience Engineering. In Marc Stal, Daniel Sigrist, Susanne Wahlen, Jill Port-
mann, James Glover, Nikki Bernabe et al. (Eds.): 6-th International Disaster and Risk Confer-
ence: Integrative Risk Management – towards resilient cities // Integrative risk management -
towards resilient cities. IDRC. Davos, 28.08-01.09.2016. Global Risk Forum, GRF. Davos: GRF,
pp. 272–276. Available online at https://idrc.info/fileadmin/user_upload/idrc/proceedings2016/
Extended%20Abstracts%20IDRC%202016_final2408.pdf.
Häring, Ivo; Scharte, Benjamin; Hiermaier, Stefan (2016c): Towards a novel and applicable
approach for Resilience Engineering. In: 6-th International Disaster and Risk Conference (IDRC).
Integrative Risk Management – towards resilient cities.
Häring, Ivo; Scharte, Benjamin; Stolz, Alexander; Leismann, Tobias; Hiermaier, Stefan (2016d):
Resilience engineering and quantification for sustainable systems development and assessment:
socio-technical systems and critical infrastructures. In Igor Linkov, Marie-Valentine Florin,
Benjamin Trump (Eds.): IRGC Resource Guide on Resilience, vol. 1. Lausanne: International
Risk Governance Center (1), pp. 81–89. Available online at https://irgc.org/risk-governance/res
ilience/irgc-resource-guide-on-resilience/volume-1/, checked on 1/4/2010.
Häring, Ivo; Scharte, Benjamin; Stolz, Alexander; Leismann, Tobias; Hiermaier, Stefan (2016e):
Resilience Engineering and Quantification for Sustainable Systems Development and Assess-
ment: Socio-technical Systems and Critical Infrastructures. In: Resource Guide on Resilience.
Lausanne: EPFL International Risk Governance Center.
Häring, Ivo; Sansavini, Giovanni; Bellini, Emanuele; Martyn, Nick; Kovalenko, Tatyana; Kitsak,
Maksim et al. (2017a): Towards a generic resilience management, quantification and development
process: general definitions, requirements, methods, techniques and measures, and case studies.
In Igor Linkov, José Manuel Palma-Oliveira (Eds.): Resilience and risk. Methods and application
in environment, cyber and social domains. NATO Advanced Research Workshop on Resilience-
Based Approaches to Critical Infrastructure Safeguarding. Dordrecht: Springer (NATO science
for peace and security series. Series C, Environmental security), pp. 21–80. Available online at
https://www.springer.com/de/book/9789402411225.
Häring, Ivo; Scheidereiter, Johannes; Ebenhoech, Stefan; Schott, Dominik; Reindl, L.; Koehler,
Sven et al. (2017b): Analytical engineering process to identify, assess and improve technical
resilience capabilities. In Marko Čepin, Radim Briš (Eds.): ESREL 2017 (Portoroz, Slovenia,
18–22 June, 2017). The 2nd International Conference on Engineering Sciences and Technologies.
High Tatras Mountains, Tatranské Matliare, Slovak Republic, 29 June–1 July 2016. Boca Raton:
CRC Press, p. 159.
Häring, Ivo; Gelhausen, Patrick (2018): Technical safety and reliability methods for resilience
engineering. In S. Haugen, A. Barros, C. van Gulijk, T. Kongsvik, J. E. Vinnene (Eds.): Safety
and Reliability - Safe Societies in a Changing World. Safety and Reliability – Safe Societies in a
Changing World, Proceedings of the 28-th European Safety and Reliability Conference (ESREL),
Trondheim, Norway, 17–21 June 2018: CRC Press, pp. 1253–1260.
Huang, Yanyan (2015): Modeling and simulation method of the emergency response systems based
on OODA. In Knowledge-Based Systems 89, pp. 527–540. https://doi.org/10.1016/j.knosys.2015.
08.020.
Linkov, Igor; Florin, Marie-Valentine; Trump, Benjamin (Eds.) (2016): IRGC Resource Guide on
Resilience. L’Ecole polytechnique fédérale de Lausanne (EPFL) International Risk Governance
Center. Lausanne: International Risk Governance Center (1). Available online at http://dx.doi.
org/10.5075/epfl-irgc-228206, checked on 1/10/2019.
López-Cuevas, Armando; Ramírez-Márquez, José; Sanchez-Ante, Gildardo; Barker, Kash (2017):
A Community Perspective on Resilience Analytics: A Visual Analysis of Community Mood.
In Risk analysis: an official publication of the Society for Risk Analysis 37 (8), pp. 1566–1579.
https://doi.org/10.1111/risa.12788.
Lubitz, D. K. von; Beakley, James E.; Patricelli, Frederic (2008): ‘All hazards approach’ to disaster
management: the role of information and knowledge management, Boyd’s OODA Loop, and
network-centricity. In Disasters (32, 4), pp. 561–585. https://doi.org/10.1111/j.0361-3666.2008.
01055.x.
Luko, Stephen N. (2013): Risk Management Principles and Guidelines. In Quality Engineering 25
(4), pp. 451–454. https://doi.org/10.1080/08982112.2013.814508.
Meyer, M. A.; Booker, J. M. (2001): Eliciting and Analyzing Expert Judgment. A Practical Guide:
Society for Industrial and Applied Mathematics.
Mock, R.; Lopez de Obeso, Luis; Zipper, Christian (2016): Resilience assessment of internet of
things. A case study on smart buildings. In Lesley Walls, Matthew Revie, Tim Bedford (Eds.):
European Safety and Reliability Conference (ESREL). Glasgow, 25–29.09. London: Taylor &
Francis Group, pp. 2260–2267.
Pant, Raghav; Barker, Kash; Ramirez-Marquez, Jose Emmanuel; Rocco, Claudio M. (2014):
Stochastic measures of resilience and their application to container terminals. In Computers
& Industrial Engineering 70, pp. 183–194. https://doi.org/10.1016/j.cie.2014.01.017.
Purdy, Grant (2010): ISO 31000:2009–Setting a new standard for risk management. In Risk analysis:
an official publication of the Society for Risk Analysis 30 (6), pp. 881–886. https://doi.org/10.
1111/j.1539-6924.2010.01442.x.
Thoma, Klaus (Ed.) (2014): Resilien-Tech: “Resilience by Design”: a strategy for the technology
issues of the future. München: Herbert Utz Verlag; Utz, Herbert (acatech STUDY).
Thorisson, Heimir; Lambert, James H.; Cardenas, John J.; Linkov, Igor (2017): Resilience Analytics
with Application to Power Grid of a Developing Region. In Risk analysis: an official publication
of the Society for Risk Analysis 37 (7), pp. 1268–1286. https://doi.org/10.1111/risa.12711.
Yodo, Nita; Wang, Pingfeng; Zhou, Zhi (2017): Predictive Resilience Analysis of Complex Systems
Using Dynamic Bayesian Networks. In IEEE Trans. Rel. 66 (3), pp. 761–770. https://doi.org/10.
1109/tr.2017.2722471.
Zhao, S.; Liu, X.; Zhuo, Y. (2017): Hybrid Hidden Markov Models for resilience metrics in a
dynamic infrastructure system. In Reliability Engineering & System Safety 164, pp. 84–97. https://
doi.org/10.1016/j.ress.2017.02.009.
Chapter 3
Basic Technical Safety Terms and Definitions
3.1 Overview
Well-defined key terms are crucial for developing safe systems. This holds in particular when developments are interdisciplinary with respect to the (technical) disciplines involved, when several departments of a company participate, or when different companies and industry sectors are involved. It also holds true if originally different safety standard traditions are used, e.g., machine safety versus production process safety. Basic standards to be considered include ISO 31000 (2018), ISO 31010 (2019), and IEC 61508 (2010), where, in most cases, the terminology of the latter is taken up.
The chapter shows that a systematic ordering of key safety and reliability terms can be obtained along the following lines: sorting terms with respect to the input terms they need, ascending from basic to higher-level concepts, for example, from system, to risks of the system, to acceptable risks, to risk reduction, to the safety integrity level of functional safety, to safety-related systems, etc.
Another ordering principle is to start out with basic terms before moving to definitions of timelines and processes, e.g., product life cycle, development life cycle, and safety life cycle.
The more general idea is that methods (techniques and measures, including to some extent also organizational practices) can be understood as supporting and, more often, enabling well-defined processes. For instance, documenting the qualifications of employees is part of the prerequisites of successfully implementing a safety life cycle that results in the development of safe and resilient systems. In a similar way, analytical tabular or graphical methods for determining safety requirements, as well as the multiple methods for developing hardware and software sufficiently free from systematic and statistical errors, can be understood as methods supporting processes.
In the context of technical safety, the chapter also distinguishes between system
understanding, system modeling, system simulation, system analysis, and evaluation
of system analysis results. It shows how to reserve the terms system simulation and system analysis for the application of the core methodologies, excluding the preparation and modeling as well as the executive evaluation of the results of these methods.
3.2 System
Sufficient in this context means that the information necessary as input for classical system analysis methods is available, e.g., possible system states and transitions. The interactions of the elements have to be seen as part of the system; they are often extremely complicated and an essential part of the system.
The life cycle of a system is the time during which the system is capable of being used (Branco Filho 2000). It comprises all phases of the system from conception to disposal. These phases include (Blanchard and Fabrycky 2005) the following:
– the system conception,
– design and development,
– production and/or construction,
– distribution,
– installation/setup,
– operation (in all credible ways),
– maintenance and support,
– retirement,
– phase-out, and
– disposal.
The term life cycle is, however, also often used to refer only to the phases after production.
3.4 Risk
Risk combines a measure for the frequency/probability of events with a measure for their consequences. Most definitions do not ask for a special relation between probability and consequences on the one hand and risk on the other hand. The classical definition of risk has the strong requirement of proportionality (Mayrhofer 2010):
“Risk should be proportional to the probability of occurrence as well as to the extent of damage,” a formulation attributed to Blaise Pascal (1623–1662). Similar definitions have been used in many variations, see, for instance, the discussion in Kaplan and Garrick (1981), Haimes (2009).
Formalized, this reads as follows.
Classical definition of risk: Risk is proportional to a measure for the probability
P of an event (frequency, likelihood) and the consequences C of an event (impact,
effect on objectives):
R = PC. (3.1)
Fig. 3.1 Three examples of combinations of given parameters that can lead to a similar risk although
they are very different. The dotted line can be associated with a car crash. Car crashes occur often
but most of them do not have very big consequences. The dashed line on the contrary represents
an event that occurs very seldom but has enormous consequences, for example, the explosion of a
nuclear power plant. The solid line could, for example, stand for the collapse of a building with
a medium probability of occurrence and a medium level of consequences
Figure 3.1 shows the meaning of the product in Eq. (3.1). It can be seen how
different combinations of levels of hazard, damage, frequency, and exposure can all
lead to the same risk.
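As a hedged numerical illustration of Eq. (3.1), the following Python sketch shows three (P, C) combinations that all lie on the same iso-risk level; the specific probability and consequence values are assumed for demonstration only and are not taken from the examples above.

# Illustration only: three assumed (P, C) combinations with the same risk R = P * C (Eq. 3.1).
scenarios = {
    "frequent, low consequence (e.g., car crash)": (1e-2, 1.0),
    "medium probability, medium consequence (e.g., building collapse)": (1e-4, 1e2),
    "rare, high consequence (e.g., nuclear power plant explosion)": (1e-6, 1e4),
}
for name, (probability, consequence) in scenarios.items():
    risk = probability * consequence
    print(f"{name}: P = {probability:g}, C = {consequence:g}, R = {risk:g}")
# All three cases yield R = 0.01, i.e., they lie on the same iso-risk curve of Fig. 3.1.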
Acceptable risk is the risk that is tolerated by society. Of course, the level of
risk that is considered acceptable can be different for different events. The decision
whether a risk is acceptable is not only based on physical or statistical observations
but also on societal aspects.
Typically, the acceptance does not depend on R only but also on (P,C). This is
further discussed in Sects. 7.9.4 and 8.10.2.
3.6 Hazard
By hazard we mean the physical hazard potential that propagates from a hazard
source. Damage can occur if the hazard physically and temporally meets with objects
or persons. Hence, hazard is not the same as unacceptable risk, as hazard does not
include the frequency of hurting something or someone.
Remark Other sources often use hazard for a non-acceptable level of risk and safety
for an acceptable level of risk. This differs from the definition of hazard used here.
3.7 Safety
The norm IEC 61508 defines safety as “freedom from unacceptable risks” (IEC 61508
2010). In the standard MIL-STD-882D, safety of a system is defined as “the applica-
tion of engineering and management principles, criteria, and techniques to achieve
acceptable mishap risk, within the constraints of operational effectiveness and suit-
ability, time, and cost, throughout all phases of the system life cycle” (Department
of Defense 2000; Mayr et al. 2011).
Remark 1 Both definitions allow the risk to be greater than zero in a safe system.
It only has to be sufficiently small.
Remark 2 All single risks must be acceptable as well as the total risk.
Fig. 3.2 Presentation of the correlation of the terms danger, risk, safety, and risk reduction from
(IEC 61508 2010) with minor graphical modifications. Copyright © 2010 IEC Geneva, Switzerland.
www.iec.ch
A system is safety relevant or safety critical if its risks to humans, the environment, and objects are only acceptable because at least one technical subsystem is used to achieve an acceptable (sufficiently low) overall risk level. This can be achieved, in particular, by safety functions.
Remark 2 Possible distinctions between safety relevant and safety critical include
the level of risk that is controlled by safety functions as well as whether persons
might be severely injured or not, or even if fatalities are feasible.
Examples
– An important norm in this context is the basic norm on functional safety IEC
61508 (DIN IEC 61508 2005; IEC 61508-SER 2005; IEC 61508 S+ 2010).
– Several norms are derived from the IEC 61508, for example, the ISO 26262 (2008)
for the automotive sector (Siebold 2012).
– The railway system has its own norms and standards, for example, EN 50126
(1999) for safety and reliability (Siebold 2012).
Safety is the absence of intolerable risks, see Sect. 3.7. This implies that in a safe
system the components only are active in ways they are supposed to be and at times
when they should be active. Reliability, on the other hand, means that components
are always active when they are supposed to be and at every time they are supposed
to be active.
Example An airbag in a car that cannot be triggered at all is safe but not reliable. An airbag that deploys when initiated but also due to strong mechanical vibrations from rough roads is reliable but not safe. Only an airbag that deploys when it is initiated, and only then, is safe and reliable.
Several well-known models such as the waterfall model, the spiral model, and the
“V” model are used to model the software and/or hardware development process
of a system. Those three listed examples will be introduced in Chap. 10 and their
properties will be explained.
Fig. 3.3 From left to right: A system without safety-related system, a safety-related system that is
included in the system, a separate safety-related system, and a safety-related system that is partly
included in the system but partly outside the system
depending on the mode of operation of the safety function. On the one hand, a SIL
for requirements with low demand rate and, on the other hand, a SIL for modes of
operation with high demand rate or continuous mode, see Table 3.5.
Table 3.5 The SILs for high and low demand rates as defined in IEC 61508 (2010). Copyright ©
2010 IEC Geneva, Switzerland. www.iec.ch
SIL   Low demand rate (per request):                 High demand rate (per hour):
      average probability of a dangerous             average frequency of a dangerous
      failure on demand                              failure
4     ≥ 10−5 to < 10−4                               ≥ 10−9 h−1 to < 10−8 h−1
3     ≥ 10−4 to < 10−3                               ≥ 10−8 h−1 to < 10−7 h−1
2     ≥ 10−3 to < 10−2                               ≥ 10−7 h−1 to < 10−6 h−1
1     ≥ 10−2 to < 10−1                               ≥ 10−6 h−1 to < 10−5 h−1
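As a minimal sketch, the following Python helper maps a given average probability of failure on demand (low demand mode) or average dangerous failure frequency per hour (high demand or continuous mode) to the corresponding SIL band of Table 3.5. The function names and the half-open interval convention are assumptions for illustration, not part of IEC 61508.

def sil_low_demand(pfd_avg):
    """Return the SIL for an average probability of dangerous failure on demand (Table 3.5)."""
    bands = {4: (1e-5, 1e-4), 3: (1e-4, 1e-3), 2: (1e-3, 1e-2), 1: (1e-2, 1e-1)}
    for sil, (lower, upper) in bands.items():
        if lower <= pfd_avg < upper:
            return sil
    return None  # outside the SIL 1 to SIL 4 range

def sil_high_demand(pfh_avg):
    """Return the SIL for an average frequency of dangerous failure per hour (Table 3.5)."""
    bands = {4: (1e-9, 1e-8), 3: (1e-8, 1e-7), 2: (1e-7, 1e-6), 1: (1e-6, 1e-5)}
    for sil, (lower, upper) in bands.items():
        if lower <= pfh_avg < upper:
            return sil
    return None

print(sil_low_demand(5e-3))   # -> 2
print(sil_high_demand(2e-8))  # -> 3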
The Safety Life Cycle described in IEC 61508 represents the life cycle of systems
under development. It can be seen as a systematic concept to gain safety within the
development process of a system. The application of the norm is a way to assure the client that the system is safe. The safety life cycle will be explained in more detail
in Chap. 11.
Verbose system descriptions are one possible way of documenting system under-
standing. Often ad hoc pictures and schemes are added. Such descriptions are often
very intuitive and allow for a description intelligible to all. However, only a certain
level of rigor allows for (less) ambiguous system descriptions. This holds true even
more when system models are used as input for (semi-)automatic system simulation
and analyses.
Especially for the development of safety-critical systems, a consistent syntax in the
documentation is very useful (Siebold 2012). To this end, this section lists different
modeling languages.
Most of the languages listed below can be used at various levels of abstraction.
Hence, also for overall system descriptions in early stages of the system development,
even with rather little information on the system designs, corresponding models are
feasible.
In general, it is advisable to select modeling languages that can be used for many
or even during all development phases. By definition, SysML is such a systems
modeling language. However, in practice, each technical discipline and often also
technology has its own most established modeling language. Examples are UML for
software developments and VHDL for electronic hardware developments.
The Society of Automotive Engineers (SAE) released the standard AS5506 called Architecture Analysis and Design Language (SAE Aerospace 2004). This system modeling language supports system development with early estimates regarding performance, schedulability, and reliability (Feiler et al. 2006). AADL is well suited to model systems which have strong requirements concerning space, storage, and energy, but less suited to document a detailed design or the implementation of components (SAE Aerospace 2004; Siebold 2012).
The first draft of the Unified Modeling Language (UML) was presented to the Object Management Group (OMG) in 1997; UML can be seen as the lingua franca of software development (Rupp et al. 2007; Siebold 2012).
For modeling with UML, several different diagrams can be used. They will be
explained in more detail in Chap. 13.
3.16.4 AltaRica
VHDL was created in the 1980s as a documentation and simulation language and was first standardized in 1987 as IEEE 1076-1987 (Reichardt 2009; Siebold 2012). The language can be used for the design, description, simulation, testing, and hierarchical specification of hardware (Navabi 1997).
The specification of the “Base Object Model” was accepted as a standard by the
Simulation Interoperability Standards Organization (SISO) in 2006 (S.B.O.M.P.D.
Group 2006; Simulation Interoperability Standards Organization 2011; SimVentions
2011; Siebold 2012). BOM contains different components to model different aspects
of a system. The specification contains meta-information about the model, “Pattern
of Interplay” tables to display recurring action patterns, state diagrams, entities, and
events (Siebold 2012).
Example Heß et al. (2010): One simulation tool is “COMSOL Multiphysics” which
is based on the “Finite Element Method” (FEM) (Comsol 2013). Its strength lies
in solving coupled physical systems. It also has a graphical user interface where
geometry and boundary conditions can be easily defined.
System analysis methods are used to systematically gain information about a system
(Vesely et al. 1981).
There are two different concepts for system analysis: inductive and deductive
methods. Inductive system analysis methods analyze in which way components can
malfunction and the consequences on the whole system. Deductive methods analyze
how a potential failure of the whole system can occur and what has to happen for this
to occur. Figure 3.4 visualizes the differences of inductive and deductive methods.
The square represents the system.
In Sects. 3.16–3.18, the terms system modeling, system simulation, and system
analysis were introduced. Their relation is visualized in Fig. 3.5. The modeled system
can be used both for the system analysis and the system simulation. The results of
both methods can go into one overall result.
Figure 3.6 visualizes the three terms system modeling, system simulation, and
system analysis. The square represents the system. The first line shows the modeling
of the system. The different parts of the system are described with a modeling
language. The result is an overall description of the system. The modeled system
Fig. 3.5 Relation between the terms system modeling, system simulation, and system analysis
Fig. 3.6 Comparison of system modeling, system simulation, and system analysis (from top to
bottom)
and some additional input or measurements can be used to simulate the system.
Apart from the simulation of the system, there might be some further output. The
modeled system can also be used as input for system analysis methods. One gets
results about the safety and reliability of a system.
Two forms of documentation of system analysis are widely used: tables and graphics.
Example FMEA is a method where results are documented in a table while the
results of an FTA are documented in a graphic.
3.20 Questions
(1) Which definitions do you need at least to define the safety of a system according
to the functional safety approach? How are they related?
(2) What is the difference between functional safety and more classical (non-
functional) safety definitions?
(3) Name three phases in the life cycle of a system.
(4) What is the classical formula to compute risks?
(5) What is the difference between hazard (hazard source, hazard potential) and unacceptable or high risk?
(6) Relate the terms “residual risk,” “tolerable risk,” “necessary risk minimiza-
tion,” and “actual risk minimization” to each other.
(7) Explain the difference between safety and reliability.
(8) What is a safety life cycle?
(9) What is a safety function according to IEC 61508?
(10) What does SIL stand for? How is it defined according to IEC 61508?
(11) Explain the difference between inductive and deductive system analysis
methods.
(12) Give the upper and lower bounds for SIL 1 to SIL 4 functions in case of low
and high demands. Illustrate them and motivate how they can be related to
each other.
3.21 Answers
(1) Technical terms used are: system, risk, acceptable risk, risk reduction,
qualitative, and quantitative description of safety functions.
In addition, when considering the technical implementation: EEPE-safety-
related system, SIL, HFT, DC, and complexity level (A or B).
A system is safe according to the functional safety approach if all risks are known and reduced individually and in total using appropriate risk reduction measures. At least one risk reduction measure of the system is qualitatively and quantitatively a safety function with at least SIL 1 and realized as an EEPE-safety-related system.
(2) The functional safety approach requires that an EEPE-safety-related system function actively maintains a safe state of a system or mitigates a risk event to an acceptable level during and after events. It requires the development of such functions, which can be described qualitatively and quantitatively.
(3) See the list in Sect. 3.3.
(4) R = PC, see Sect. 3.4.
(5) Hazard was defined as the hazard potential which may cause damage only if the hazard event occurs and objects are exposed, see Sect. 3.6. A non-acceptable risk comprises a non-acceptable combination of the damage caused in case a hazard event occurs and its probability of occurrence.
(6) Necessary risk minimization is the difference between the actual risk and
the tolerable level of risk. The chosen actual minimization measures lead to
a decrease of the risk to the residual risk. This is smaller than or equal to
the tolerable risk. Hazard is often also used in the sense of a high-risk event. See Fig. 3.2.
(7) See Sect. 3.13.
(8) See Sect. 3.14.
(9) See Sect. 3.13.
(10) See Sect. 3.13. The safety integrity level (SIL) measures the reliability of the
safety function.
(11) See Sect. 3.18. Inductive is also termed bottom-up and deductive top-down.
(12) See Table 3.5. When assuming that a 10% probability of failure on demand translates to approximately a maximum failure frequency of 10%/10,000 h = 1.0 · 10−5 h−1, i.e., roughly 10% failures per year, one has a relation for SIL 1. Similar assumptions are made for SIL 2 to SIL 4.
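The arithmetic behind this relation can be checked in one line; the demand interval of 10,000 h is the assumption made in the answer above.

# Relation of answer (12): a 10 % probability of failure on demand, with an assumed
# time of about 10,000 h between demands, corresponds to roughly 1e-5 dangerous
# failures per hour, i.e., the SIL 1 boundaries of the low and high demand modes match.
pfd = 0.1                       # SIL 1 upper bound, low demand mode (per request)
demand_interval_h = 10_000      # assumed time between demands
print(pfd / demand_interval_h)  # -> 1e-05, the SIL 1 upper bound of the high demand mode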
References
Blanchard, B. S. and W. J. Fabrycky (2005). Systems Engineering and Analysis, Prentice Hall.
Branco Filho, G., Confiabilidade e Manutebilidade (2000). Dicionário de Termos de Manutenção.
Rio de Janeiro, Editora Ciência Moderna Ltda.
Comsol. (2013). “Multiphysics Modeling and Simulation Software.” Retrieved 2013-03-08, from
http://www.comsol.com/.
Cressent, R., V. Idasiak, F. Kratz and D. Pierre (2011). Mastering safety and reliability in a model
based process. Reliability and Maintainability Symposium (RAMS).
Department of Defense (2000). MIL-STD-882D - Standard Practice For System Safety.
DIN IEC 61508 (2005). Funktionale Sicherheit sicherheitsbezogener elek-
trischer/elektronischer/programmierbarer elektronischer Systeme, Teile 0 - 7. DIN. Berlin,
Beuth Verlag.
Dori, D. (2002). Object-Process Methodology: A Holistic Systems Paradigm. Berlin, Heidelberg,
New York, Springer.
EN 50126 (1999). Railway applications. The specification and demonstration of reliability, avail-
ability, maintainability and safety (RAMS), Comité Européen de Normalisation Electrotechnique.
EN50126:1999.
Feiler, P. H., D. P. Gluch and J. J. Hudak (2006). The Architecture Analysis & Design Language
(AADL): An Introduction, Carnegie Mellon University.
Griffault, A. and G. Point (2006). “On the partial translation of Lustre programs into the AltaRica
language and vice versa.” Laboratoire Bordelais de Recherche en Informatique.
Grobshtein, Y., V. Perelman, E. Safra and D. Dori (2007). System Modeling Languages: OPM
Versus SysML. International Conference on Systems Engineering and Modeling - ICSEM’07.
Haifa, Israel, IEEE.
Haimes, Yacov Y. (2009): On the complex definition of risk: a systems-based approach. In Risk
Analysis 29 (12), pp. 1647–1654. https://doi.org/10.1111/j.1539-6924.2009.01310.x.
Heß, S., R. Külls and S. Nau (2010). Theoretische und experimentelle Analyse eines neuartigen
hochschockfesten Beschleunigungssensor-Designs. Diplom, Hochschule Ulm.
IEC 61508 (2010). Functional Safety of Electrical/Electronic/Programmable Electronic Safety-
related Systems Edition 2.0 Geneva, International Electrotechnical Commission.
IEC 61508-SER (2005). Functional Safety of Electrical/Electronic/Programmable Electronic
Safety-related Systems Ed. 1.0. Berlin, International Electrotechnical Commission, VDE Verlag.
IEC 61508 S+ (2010). Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems Ed. 2. Geneva, International Electrotechnical Commission.
ISO 26262 (2008). ISO Working Draft 26262: Road vehicles - Functional safety, International
Organization for Standardization.
ISO 31000, 2018-02: Risk management - Guidelines. Available online at https://www.iso.org/sta
ndard/65694.html.
ISO 31010, 2019-06: Risk management - Risk assessment techniques. Available online at https://
www.iso.org/standard/72140.html.
Kaplan, Stanley; Garrick, B. John (1981): On The Quantitative Definition of Risk. In Risk Analysis
1 (1), pp. 11–27. https://doi.org/10.1111/j.1539-6924.1981.tb01350.x.
Mayrhofer, C. (2010). Städtebauliche Gefährdungsanalyse - Abschlussbericht Fraunhofer
Institut für Kurzzeitdynamik, Ernst-Mach-Institut, EMI, Bundesamt für Bevölkerungsschutz
und Katastrophenhilfe, http://www.emi.fraunhofer.de/fileadmin/media/emi/geschaeftsfelder/Sic
herheit/Downloads/FiB_7_Webdatei_101011.pdf.
Mayr, A., R. Plösch and M. Saft (2011). “Towards an operational safety standard for software: Modelling IEC 61508 Part 3.” Proceedings of the 2011 18th IEEE International Conference and Workshops on Engineering of Computer Based Systems (ECBS 2011): 97–104.
Navabi, Z. (1997). VHDL: Analysis and Modeling of Digital Systems, McGraw-Hill, Inc.
Neudörfer, A. (2011). Konstruieren sicherheitsgerechter Produkte, Springer.
Reichardt, J. (2009). Lehrbuch Digitaltechnik, Oldenbourg Wissenschaftsverlag GmbH.
Rupp, C., S. Queins and B. Zengler (2007). UML 2 Glasklar - Praxiswissen für die UML-
Modellierung. München, Wien, Carl Hanser Verlag.
SAE Aerospace (2004). Architecture Analysis & Design Language (AADL), SAE Aerospace.
S.B.O.M.P.D. Group (2006). Base Object Model (BOM) Template Specification. Orlando, FL,
USA, Simulation Interoperability Standards Organization (SISO).
Siebold, U. (2012). Identifikation und Analyse von sicherheitsbezogenen Komponenten in semi-
formalen Modellen. Doktor, Albert-Ludwigs-Universität Freiburg im Breisgau.
Simulation Interoperability Standards Organization. (2011). “Approved Standards.” Retrieved
2011-10-31, from http://www.sisostds.org/ProductsPublications/Standards/SISOStandards.aspx.
SimVentions. (2011). “Resources - Technical Papers.” Retrieved 2011-10-31, from http://www.sim
ventions.net/resources/technical-papers.
Vesely, W. E., F. F. Goldberg, N. H. Roberts and D. F. Haasl (1981). Fault Tree Handbook. Wash-
ington, D.C., Systems and Reliability Research, Office of Nuclear Regulatory Research, U.S. Nuclear Regulatory Commission.
Chapter 4
Introduction to System Modeling
for System Analysis
4.1 Overview
System analysis methods are used to systematically gain information about a system
(Vesely et al. 1981).
System analysis should start as early as possible because costs caused by undetected deficiencies increase exponentially over time (Müller and Tietjen 2003). An accompanying analysis during the development process of a product also helps the analyst to better understand the product, which simplifies the detection of problems.
In this chapter, the general concepts and terms of system analysis are introduced
which can be and should be applied in every system analysis method that is intro-
duced in the following chapters. In Sects. 4.2 and 4.3, the system and its boundaries are introduced. Section 4.4 explains why a theoretical system analysis is often more practicable than testing samples.
Sections 4.5 to 4.8 introduce terms to classify system analysis methods, e.g., to be
able to address fault tree analysis as graphical deductive qualitative and quantitative
methods. Sections 4.9 to 4.11 explain different types of failures in system analysis
contexts. This is sufficient to introduce the terms reliability and safety in Sect. 4.12.
System-redundancy-related terms are covered in Sects. 4.13 to 4.15. Section 4.16 illustrates the need for system analysis approaches for efficient system optimization and Sect. 4.17 stresses the need to cover failure combinations.
The chapter is a summary, translation, and partial paraphrasing of (Echterling
et al. 2009) with some additional new examples and quotations.
Before analyzing a system, the boundaries of the system have to be defined. One
distinguishes between four different types of boundaries, namely,
– External boundaries,
– Internal boundaries,
– Temporal boundaries, and
– Probabilistic boundaries.
A clear allocation to exactly one of these types is, however, not always possible (Vesely et al. 1981; Bedford and Cooke 2001; Maluf and Gawadiak 2006).
External boundaries separate the system from the outside surroundings which
cannot be or are not influenced by the system. For example, it can be useful to see
the temperature of the surrounding area as part of the system and not part of the
surroundings if it is influenced by the system.
Internal boundaries determine the resolution of a system. That is, they declare
the different elements of a system. The resolution can be defined with respect to
geometrical size, functionality, reliability, energy flow, and heat flow. The internal
boundaries help to find a balance between the costs and the level of detail. The most
detailed resolution of the system that can be regarded is not always the most efficient.
Temporal boundaries declare the length of time a system is used. Systems that
are used for decades ask for a different system analysis than systems used for short
time intervals.
Probabilistic boundaries are boundaries that are set because one already has
sufficient information about the safety and reliability of certain parts of the system.
In this case, it might be useful to regard such a part as one element in the analysis or
to exclude it completely.
Figure 4.1 visualizes the different types of boundaries.
Deterministic behavior of systems may be achieved by the extension or refinement of system boundaries. For example, application software tools often receive updates that slightly modify their behavior, and algorithms are often deterministic only at a fine level of resolution.
4.4 Theoretical Versus Practical System Analysis
We give an argument why theoretical system analysis often has to strongly comple-
ment or replace practical system tests. Suppose a system produces N parts with a
tolerance of K damaged parts. For assessing its reliability, the most obvious approach
is to test all N parts. However, in many cases this is not possible. If N is big, testing
all parts is very expensive and time-consuming. Another problem is that there are
products which are destroyed when being tested. In both cases, only a sample should
be tested. Similar arguments hold true when considering a system composed of N
components.
Rule of thumb: To show with 95% confidence that the percentage of damaged elements in a production is at most p = 10^-x, a zero-defect sample of size n = 3 · 10^x is necessary.
Explanation: The rule says that we want to find an n such that

B_{n,p}(0) ≤ 0.05,  (4.1)

where the binomial distribution is used on the left side and p = 10^-x. Computing the left side for n = 3 · 10^x and p = 10^-x yields

B_{n,10^-x}(0) = (n choose 0) · p^0 · (1 − p)^n = (1 − p)^n = (1 − 10^-x)^(3 · 10^x) ≈ e^-3 ≈ 0.0498.  (4.2)
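The following short Python check recomputes Eq. (4.2) for a few values of x; it is only a numerical sketch of the rule of thumb, with the range of x chosen arbitrarily for illustration.

# Numerical check of Eqs. (4.1)-(4.2): for p = 10^-x, a zero-defect sample of size
# n = 3 * 10^x keeps the binomial probability of observing zero defects near 5 %.
import math

for x in range(1, 6):
    p = 10.0 ** (-x)
    n = 3 * 10 ** x
    prob_zero_defects = (1.0 - p) ** n   # B_{n,p}(0) = (1 - p)^n
    print(f"x = {x}: n = {n:>7}, (1 - p)^n = {prob_zero_defects:.4f}")
print(f"limit exp(-3) = {math.exp(-3):.4f}")  # approximately 0.0498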
For systems with strict regulations, the numbers derived from the rule of thumb
can soon lead to unrealistically big samples.
If the method of examining samples fails, theoretical system analysis becomes very important. It is then the only method that can guarantee the dependability of a production entity or the safety of a system.
There are two different concepts for system analysis: inductive and deductive
methods.
Inductive system analysis methods analyze in which way components can
malfunction and the consequences on the whole system. That is, an inductive
approach means reasoning from an individual consideration to a general conclusion.
Deductive methods analyze how a potential failure of the whole system or a critical
system fault, for example, the maximum credible accident, can occur and what has
to happen for this to occur. That is, a deductive approach means reasoning from a
general consideration to special conclusions.
Figure 4.2 visualizes the differences of inductive (top sequence) and deductive
(bottom sequence) methods. The big square represents the system. The smaller
squares are the components and dashed squares are malfunctioning components.
For more details about the classification of methods see Table 4.1. The listed
methods will be explained in Chap. 5. Fault tree analysis (FTA) will be regarded as
an example for a deductive system analysis method in Chap. 6 and failure modes and
effects analysis (FMEA) as an example for an inductive system analysis method in
Chap. 7.
It is also possible to combine inductive and deductive methods. This is, for
example, done in the “Bouncing Failure Analysis” (BFA) (Bluvband et al. 2005).
Two forms of documentation of system analysis are widely used: tables and graphics.
Figure 4.3 shows a first overview of the columns in the worksheets of different
classical inductive methods. The terms failure and fault are defined in Sect. 4.10.
Example FMEA is a method where results are documented in a table while the
results of an FTA are documented in a graphic, see Chaps. 6 and 7, respectively.
Fig. 4.3 Overview of the columns of inductive FMEA like methods according to (Vesely et al.
1981)
Besides choosing the method and the form of documentation, there is another choice
that has to be made. Analysis can take place in the failure space or in the success
space. In the failure space, the possibilities of failures are analyzed, for example,
the event that an airplane does not take off. In the success space, the opposite event
would be analyzed, that is, that the airplane does take off.
Theoretically, both approaches are of equal value but analyzing the failure space
has turned out to often be the better choice in practical applications. One reason is that
it is often easier to define a failure than a success (Vesely et al. 2002). Another reason
is that there usually are more ways to be successful than to fail. Hence, working in
the failure space often is more efficient.
For some methods, it is possible to transfer the results from the failure space into
the success space. For example, in classical fault tree analysis, one can compute the
logical complement of the fault tree which then lies in the success space. However,
in general, this does not yield the same tree as an analysis in the success space.
Use of success space is heavily advocated by Safety 2.0 and resilience engineering
approaches, see, e.g., (Woods et al. 2017).
The necessary steps one has to go through before starting the actual analysis and the
choices one has to make are summarized in Fig. 4.4.
[Fig. 4.4 shows a decision flow: define system, set boundaries, testing (take samples) versus theoretical analysis, failure or success space, inductive or deductive method, documentation as table or graphic, leading to methods such as PHA, the Double Failure Matrix, and FTA.]
Fig. 4.4 Overview of system analysis possibilities. We did not list all tabular inductive methods of
Table 4.1
Two different types of risks are distinguished. Risks of which it is known that they
exist are called known unknowns. Risks of which it is not known that they exist are
called unknown unknowns or black swans (Taleb 2007).
It is important to distinguish between failures and faults (Vesely et al. 1981) since
they play different roles in the system analysis process. A failure is a malfunction
within a component while a fault can also occur because the component is addressed
in the wrong way.
One distinguishes between three different types of failures, “primary, secondary and
command failures” (Vesely et al. 1981).
Primary failures are failures where a component or system fails under conditions
for which it was developed.
Secondary failures are failures where a component or system fails under
conditions for which it was not developed.
Command failures are failures that are caused by a wrong command or a command
at a wrong time. They also comprise failures caused by wrong handling of a
component or the whole system.
Example Suppose a gas tank was developed for a maximal pressure p0 . A primary
failure would be if the tank leaks at a pressure p ≤ p0 . A secondary failure would be
if it leaks at a pressure p > p0 . A command failure would be if a person accidentally
opens the emergency valve.
These failure types are used in a systematic way when constructing fault trees,
see Sect. 6.5.
4.12 Safety and Reliability
Safety and reliability are two aims of system analysis and often contradict each other.
We have already seen a general definition in Sect. 3.11. In that section, safety was
defined as the absence of intolerable risks (IEC 61508 S+ 2010). Reliability meant
that components are always sufficiently active when they are supposed to be.
In terms of systems, this can be expressed in the following way:
Reliability means that the reliability (comfort) functions of the system are performed reliably.
Safety, on the other hand, means that the safety functions of the system are
performed reliably.
This can be achieved if the system is reliable, and active safety functions further
reduce risks if necessary. The safety functions must be sufficiently reliable. So, safety
for typical systems (where safety functions are added to reliability functions or mixed
with them) considers both reliability and safety functions.
Example A machine that cannot be started at all is safe but not reliable. A device that
starts when the power button is pressed but also when being dropped on the ground
is reliable but not safe. Only a device that starts every time the button is pressed and
only when it is pressed is safe and reliable.
The following example shows that it is not always easy to set clear boundaries
between safety and reliability.
Example The reliability function of an airbag is to function in case of a crash event. The safety function of an airbag is to avoid unintended functioning during driving if no crash occurs. When determining the necessary reliability of the safety function of the airbag, the failure rate of the reliability/comfort function has to be considered.
and corresponding safety-related systems. Typical levels are SIL 1 for the reliability
function and SIL 3 for the safety function which can be derived from analyzing
the norms ISO 26262 and IEC 61508 in combination. In the latter case, the same
subsystem of the airbag has to fulfill the requirements for the reliability/comfort
function and the safety function. Obviously, in this case, making the airbag more
reliable might make it less safe.
4.13 Redundancies
Example To ensure that a pump is supplied with enough water, one can install two
pipes that transport water independently of each other. The pump only fails if both
pipes fail. Unfortunately, the probability of a failure of the pump is not the product
of the probabilities of failure of each individual pipe because non-independence is possible. For example, the probability of the remaining pipe failing can increase if the
other pipe fails because it is used more intensely. A common-cause failure would,
for example, be a fire that destroys both pipes.
The example shows that redundancies often only improve the safety or reliability
of a system for certain types of failures and that one has to be careful when calculating
probabilities for a redundant system.
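A small numerical sketch of the two-pipe example may help; the individual pipe failure probability and the common-cause (fire) probability used below are assumed values for illustration only.

# Two redundant pipes: the product rule only covers independent failures; an assumed
# common cause (a fire destroying both pipes) can dominate the overall probability.
p_pipe = 1e-2   # assumed failure probability of one pipe (independent causes only)
p_fire = 1e-3   # assumed probability of a fire that destroys both pipes

p_independent_double_failure = p_pipe * p_pipe
p_total = 1 - (1 - p_independent_double_failure) * (1 - p_fire)
print(f"independent double failure only: {p_independent_double_failure:.1e}")  # 1.0e-04
print(f"including the common cause:      {p_total:.2e}")                       # about 1.10e-03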
Due to the opposing nature of safety and reliability, see Sect. 4.12, redundancies that
make a system safer often make it less reliable and vice versa. To increase the safety
and reliability of a system by installing redundancies, it is, hence, necessary to have
separate redundancies for the safety and the reliability (sub)systems.
Example (compare to the example in Sect. 4.12): A further sensor of the airbag
system increases its reliability but decreases safety.
Example A sensor detecting if only a small child is sitting in front of the airbag
makes the airbag safer but decreases its reliability because it cannot be started if the
child detection has a failure.
4.15 Standby
The term cold standby is used if a redundant component is not used in normal
system operation, but only in case of major system degradation or failure. In case of
hot standby, the component is used in normal operating condition but typically there
is at least some or even significant reduction of capacity. Warm standby means that
the component is in an intermediate status of employment during standard system
operation.
Several system analysis methods, for example, fault tree analysis or the double matrix
method, include the possibility to analyze combinations of failures. This is important
for large and complex systems because minor failures might cause a lot more damage
if they happen simultaneously or in a certain order. An example where this happened
is the development of the American space shuttle (Vesely et al. 2002).
Chapter 4 introduced definitions used within system analysis methods. The most
important terms are roughly ordered according to their use within analysis processes.
An overview of different system analysis methods has also been given in Fig. 4.4.
The following are sample applications of the scheme of Fig. 4.4. The paths
define system—set boundaries—theoretical analysis—failure space—deductive method—graphic—FTA
and
define system—set boundaries—theoretical analysis—failure space—inductive method—table—FMEA
lead to the two most often used system analysis methods that have already been mentioned in this chapter but will be introduced in detail in Chaps. 6 and 7, namely, fault tree analysis (FTA) and failure modes and effects analysis (FMEA).
4.19 Questions
(1) Define system analysis and system simulation. Name typical steps. How are
they related to system definition?
(2) Which type of boundaries determines the resolution of the analysis?
(3) How many zero-defect samples are approximately necessary to show with
95% level of confidence that the percentage of damaged elements is at most
p = 0.001?
(4) How is a system classically defined?
(5) Why is the classical systems definition difficult to fulfill for modern systems?
(6) Give an example for each of the four system boundary types.
(7) What is the difference between the approach of deductive and inductive system
analysis methods?
(8) How can a fault tree be transferred from the failure space to the success space?
(9) What is the difference between failure and fault?
(10) Several mechanical components are slightly out of their tolerance values and combine in an unfavorable way. How would you classify this case in terms of failures and faults?
(11) Distinguish three generic failure types that can be expected in systems!
4.20 Answers
(6) For instance, in case of a high-altitude wind energy system using soft kites
and tethers, the external boundary could include the output interfaces of the
generator and interfaces to the overall system steering. The internal boundaries
could be defined by the components that are not further changed during produc-
tion regarding their physical-engineering properties, e.g., hardware compo-
nents. The time boundary is the overall expected in-service time. Probability
boundaries could be the lowest failure probability of a component that is still
considered, e.g., 10−9 h −1 .
(7) See Table 4.1.
(8) See Sect. 4.7.
(9) See Sect. 4.10.
(10) See Sects. 4.10 and 4.11.
(11) See Sects. 4.10 and 4.11.
References
Bedford, T. and R. Cooke (2001). Probabilistic Risk Analysis: Foundations and Methods, Cambridge
University Press.
Bluvband, Z., R. Polak and P. Grabov (2005). “Bouncing Failure Analysis (BFA): The Unified FTA-
FMEA Methodology.” Annual Reliability and Maintainability Symposium, 2005. Proceedings.:
463–467.
Echterling, N., A. Thein and I. Häring (2009). Fehlerbaumanalyse (FTA) - Zwischenbericht.
ISA4.0 (2020): BMBF Verbundprojekt: Intelligentes Sensorsystem zur autonomen Überwachung
von Produktionsanlagen in der Industrie 4.0 - ISA4.0, Joint research project: Intelligent sensor
system for the autonomous monitoring of production systems in Industry 4.0, ISA4.0, 2020–2023.
IEC 61508 S+ (2010): Functional Safety of Electrical/Electronic/Programmable Electronic Safety
related Systems Ed. 2. Geneva: International Electrotechnical Commission.
Maluf, D. A. and O. Gawadiak (2006). “On Space Exploration and Human Error: A paper on
reliability and safety.”
Müller, D. H. and T. Tietjen (2003). FMEA-Praxis, Hanser.
Taleb, N. N. (2007). The Black Swan: The Impact of the Highly Improbable.
ULTRA-FLUK (2020): Ultra-Breitband Lokalisierung und Kollisionsvermeidung von Flug-
drachen, Ultra-broadband localization and collision avoidance of kites, BMBF KMU-Innovativ
Forschungsprojekt, German BMBF research project, 2020–2023.
Vesely, W. E., F. F. Goldberg, N. H. Roberts and D. F. Haasl (1981). Fault Tree Handbook. Wash-
ington, D.C., Systems and Reliability Research, Office of Nuclear Regulatory Research, U.S. Nuclear Regulatory Commission.
Vesely, W., M. Stamatelatos, J. Dugan, J. Fragola, J. Minarick and J. Railsback (2002). Fault Tree Handbook with Aerospace Applications. Washington, DC 20546, NASA Office of Safety and Mission Assurance, NASA Headquarters.
Woods, David D.; Leveson, Nancy; Hollnagel, Erik (2017): Resilience Engineering. Concepts and
Precepts. 1st ed. Boca Raton: CRC Press.
Zehetner, Julian; Weber, Ulrich; Häring, Ivo; Riedel, Werner (2018): Towards a systematic evalu-
ation of supplementary protective measures and their quantification for use in functional safety.
In S. Haugen, A. Barros, C. van Gulijk, T. Kongsvik, J. E. Vinnem (Eds.): Safety and Reliability
- Safe Societies in a Changing World. Safety and Reliability – Safe Societies in a Changing
World, Proceedings of the 28-th European Safety and Reliability Conference (ESREL), Trond-
heim, Norway, 17–21 June 2018: CRC Press, pp. 2497–2505. Available online at https://www.
taylorfrancis.com/books/9781351174657, checked on 9/23/2020.
Chapter 5
Introduction to System Analysis Methods
5.1 Overview
This chapter gives an overview of classical system analysis methods. Mostly, a representative example is given for each of the methods to convey a general idea of what is hidden behind the name. It is not intended to be sufficient to actually use the method. However, it supports the selection of the correct type of method by listing the main purposes of the methods.
The chapter takes up the methods of Table 4.1 which was introduced in Sect. 4.5.
The methods listed in this table will be briefly introduced in this chapter. The main
categorization available in terms of graphical versus tabular, inductive versus deduc-
tive, and qualitative versus quantitative is refined in this chapter by considering in
addition typical implementation examples, phases of developments and life cycles
where the methods are used as well as typical application examples.
The first method explained in more detail is the fault tree analysis in Chap. 6.
In the Parts Count approach, every component failure is assumed to lead to a critical
system failure. It is a simple version of a failure modes and effects analysis (FMEA).
A very conservative upper bound for the critical failure rate is obtained. It also takes
common-cause and multiple failure modes into account, see Fig. 5.1.
“Suppose the probability of failure of amplifier A is 1 × 10−3 and the probability
of failure of amplifier B is 1 × 10−3 , i.e., fA = 1 × 10−3 and fB = 1 × 10−3 .
Because the parallel configuration implies that system failure would occur only if
both amplifiers fail, and assuming independence of the two amplifiers, the probability
of system failure is 1 × 10−3 × 1 × 10−3 = 1 × 10−6 . By the parts count method,
the component probabilities are simply summed and hence the ‘parts count system failure probability’ is 1 × 10−3 + 1 × 10−3 = 2 × 10−3.” (Vesely et al. 1981)
Fig. 5.1 A system of two amplifiers in parallel (Vesely et al. 1981, Figure II-1)
The failure modes and effects analysis (FMEA) considers for every component the (overall) failure probability, the failure modes and the percentage with which each mode contributes to the failure probability, and the effects on the total system (Vesely et al. 1981). The effects on system level are classified as critical and non-critical.
Table 5.1 shows an example of an FMEA for the system of Fig. 5.1.
Remark Here, the parts count critical system failure for the system of Fig. 5.1 according to the FMEA table given in Table 5.1 is 2 × 10−3 by summing up the values in column 2, and the FMEA critical system failure reads 2 × 10−4 by summing up the values marked as critical in column 5.
Table 5.1 Example of an FMEA for the system of Fig. 5.1 (Vesely et al. 1981, Table II-1)
1            2                     3              4                    5 Effects
Component    Failure probability   Failure mode   % Failures by mode   Critical           Non-critical
A            1 × 10−3              Open           90                                      X
                                   Short          5                    X (5 × 10−5)
                                   Other          5                    X (5 × 10−5)
B            1 × 10−3              Open           90                                      X
                                   Short          5                    X (5 × 10−5)
                                   Other          5                    X (5 × 10−5)
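The two sums of the remark above can be reproduced with a few lines of Python; the data structure below is an assumed encoding of Table 5.1 (failure probability 1 × 10−3 per amplifier, with the 5 % “short” and 5 % “other” modes as the critical ones).

# Parts count versus FMEA estimate of the critical system failure probability
# for the two-amplifier system of Fig. 5.1, encoded from Table 5.1.
components = {
    "A": {"failure_probability": 1e-3, "critical_mode_fractions": [0.05, 0.05]},
    "B": {"failure_probability": 1e-3, "critical_mode_fractions": [0.05, 0.05]},
}

# Parts count: every component failure counts as a critical system failure (column 2).
parts_count = sum(c["failure_probability"] for c in components.values())

# FMEA: only the critical failure modes contribute (critical entries of column 5).
fmea_critical = sum(
    c["failure_probability"] * sum(c["critical_mode_fractions"])
    for c in components.values()
)
print(f"parts count estimate: {parts_count:.1e}")    # 2.0e-03
print(f"FMEA estimate:        {fmea_critical:.1e}")  # 2.0e-04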
Table 5.2 Example for failure modes of electronic components; S = short circuit; O = open
circuit/break; D = drift; F = failure, after Birolini (1994)
Component S (%) O (%) D (%) F (%)
Digital bipolar integrated circuits (IC) 30 30 10 30
Digital MOS-ICs 20 10 30 40
Linear ICs 30 10 10 50
Bipolar transistors 70 20 10 –
Field-effect transistors (FET) 80 10 10 –
Diode 70 30 – –
Zener diode 60 30 10 –
High-frequency (HF) diode 80 20 – –
Thyristors 20 20 60 –
Optoelectronic components 10 50 40 –
Resistor ≈0 90 10 –
Table 5.3 shows the FMEA of the withdrawn DIN 25448 (1990).
Column 5 focuses on the detection of faults and column 6 on countermeasures. It is
interesting that there is no column for the estimation of a failure rate.
FMEA will be treated in more detail in Chap. 7, including more recent
developments.
The failure modes, effects and criticality analysis (FMECA) is very similar to the
FMEA. The criticality of the failure is analyzed in greater detail. Assurances, controls,
and compensations are considered that limit the probability of the critical system
failure (Vesely et al. 1981). A minimum set of columns includes fault identification,
potential effects of the fault, existing or projected compensation and control, and a summary of findings.
A typical FMECA form is shown in Fig. 5.2. It is based on the withdrawn DIN 25448 (1990). Additional material is added from the electronics and automotive industry domains (Leitch 1995; Witter 1995; Birolini 1997).
The risk priority number (RPN) is defined in the withdrawn DIN 25448 (1990). It is the product of the severity (S), the occurrence rating (O), and the detection rating (D):

RPN = S · O · D.  (5.1)
[Fig. 5.2 worksheet column headings recovered from the figure: Element, Function, Failure mode, Failure cause, Failure rate λ, Failure detection, Occurrence, Occurrence rating O, Severity, Detection, RPN (RPZ), Failure mitigation, Responsible person.]
Fig. 5.2 Worksheet implementing an FMECA, based on DIN 25448 (1990) with additional material
from Leitch (1995), Witter (1995), Birolini (1997)
The values S, O, and D are integers between 1 and 10. For instance, S = 1 means that the severity of the consequences is negligible and O = 1 that the occurrence is very low.
For quantitative considerations, a correspondence table is necessary that relates the semi-quantitative numbers, which only provide an ordering, to real numbers. Examples for S are measures for consequences, e.g., fatalities per failure, per hour, per year, or individual risks for the user. Examples for O are failure rates per hour or on demand. An example for D is the percentage of detection.
Often, somewhat arbitrarily, one requires that

RPN ≤ RPN_critical  (5.2)

and, in addition or alternatively, that

S ≤ S_critical, O ≤ O_critical, D ≤ D_critical.  (5.3)
Also, products of two factors can be used. FMECA will be treated in more detail
in Chap. 7.
5.5 Fault Tree Analysis (FTA)
Fault tree analysis (FTA) uses a tree-like structure and Boolean algebra to determine
the probability of one undesired event, the so-called top event. It is also determined
which combination of basic events leads to this top event. It can be used to find out
which components are the most efficient to improve. An example for such a tree is
given in Fig. 5.3.
An event above an “Or” gate occurs if at least one of the events below the gate
occurs. An event above an “And” gate occurs if all events below the gate occur. For
example:
– A student fails the exam if it is the student’s fault or if it is the teacher’s fault.
– It is a student’s fault if the student made too many mistakes and if the exam was
fair.
FTA will be explained in more detail in Chap. 6.
Fig. 5.3 A simple example fault tree, created with the software tool (Relex 2011). It is assumed that
the cause is not a combination of faults by the student and teacher, that is, we assume a simplified
“black and white” world. Otherwise, the tree would be considerably larger
Event tree analysis (ETA) uses a structure which resembles a lying tree where the
root of the tree is at the left-hand side of the worksheet. The root of the tree repre-
sents an initiating event. Moving to the right the possible outcomes of the event
are described, see Fig. 5.4. On the way, so-called chance nodes distinguish between
possible following events, for example, by whether the fire alarm does or does not
go off.
“Note that each chance node must have branches that correspond to a set of
outcomes that are mutually exclusive (i.e. only one of the outcomes can happen) and
collectively exhaustive (i.e. no other possibilities exist, one of the specified outcomes
has to occur). The probabilities at each chance node must add up to 1.” (ReliaSoft
Corporation 2013)
The systematic combination of FTA and ETA is often called bow-tie analysis
(Ruijter and Guldenmund 2016).
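A small numerical sketch illustrates the quoted rule; the initiating event and the branch probabilities below (fire alarm and sprinkler) are assumed for illustration only.

# Event tree sketch: initiating event "fire starts"; chance node 1 "alarm goes off?";
# chance node 2 "sprinkler works?". At each chance node the branch probabilities sum
# to 1, and each outcome path probability is the product along its branches.
p_alarm = 0.95
p_sprinkler = 0.90

paths = {
    "alarm works, sprinkler works": p_alarm * p_sprinkler,
    "alarm works, sprinkler fails": p_alarm * (1 - p_sprinkler),
    "alarm fails, sprinkler works": (1 - p_alarm) * p_sprinkler,
    "alarm fails, sprinkler fails": (1 - p_alarm) * (1 - p_sprinkler),
}
for outcome, probability in paths.items():
    print(f"{outcome}: {probability:.4f}")
print(f"sum over all outcome paths: {sum(paths.values()):.4f}")  # 1.0000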
5.8 FHA
The fault hazard analysis (FHA) starts to take more than only a single component
or mode into account. It looks in more detail at the causes and consequences of faults:
it looks upstream and downstream.
“A typical FHA form uses several columns as follows:
Column (1)-Component identification.
Column (2)-Failure probability.
Column (3)-Failure modes (identify all possible modes).
Column (4)-Percent failures by mode.
Column (5)-Effect of failure (traced up to some relevant interface).
Column (6)-Identification of upstream component that could command or initiate
the fault in question.
Column (7)-Factors that could cause secondary failures (including threshold
levels). This column should contain a listing of those operational and environ-
mental variables to which the component is sensitive.
Column (8)-Remarks” (Vesely et al. 1981).
5.9 DFM
The double failure matrix (DFM) determines the effects of two faults in a systematic
way. The possible outcome of faults is assessed using a verbal qualitative description
of system effects, see, for example, Tables 5.4 and 5.5.
Table 5.6 shows the effects of the double failure modes of Fig. 5.5.
Table 5.5 Fault categories of the NERVA (Nuclear Engine for Rocket Vehicle Application) Project
(Vesely et al. 1981, Table II-3)
Fault category Effect on system
I Negligible
IIA A second fault event causes a transition into Category III (critical)
IIB A second fault event causes a transition into Category IV (catastrophic)
IIC A system safety problem whose effect depends upon the situation (e.g., the
failure of all backup onsite power sources, which is no problem as long as
primary, offsite power service remains on)
III A critical failure and mission must be aborted
IV A catastrophic failure
5.10 Questions
(1) Suppose the independent components A and B with known failure probabilities
are connected in series (see Fig. 5.6) and the Parts Count method is used to
compute the total probability of failure.
In this case, the Parts Count method
(a) underestimates the failure,
(b) overestimates the failure, and
(c) computes the failure correctly.
(2) Assume that the transistor in your project has the failure rate 7.4 × 10−4 h−1 (base
failure rate for bipolar transistors with low frequency from the standard MIL-HDBK
217F).
Take the following extract of an FMEA table:
How would you fill this table for your bipolar transistor?
Table 5.6 Fuel system double failure matrix (Vesely et al. 1981, Table II-4)
BVA CVA BVB CVB
Open Closed Open Closed Open Closed Open Closed
BVA Open (One Way) IIA III IIB IIA IIA or IIB IIA IIA
Closed (Two Ways) IIB IIB IIB IIA or IIB IV IIA or IIB IV
CVA Open III IIB (One Way) IIA IIA IIA or IIB IIA IIA or IIB
Closed IIB IIB (Two Ways) IIB IIA IV IIA or IIB IV
BVB Open IIA IIA or IIB IIA IIA or IIB (One Way) IIA III IIB
Closed IIA or IIB IV IIA or IIB IV (Two Ways) IIB IIB IIB
CVB Open IIA IIA or IIB IIA IIA or IIB III IIB (One Way) IIA
Closed IIA or IIB IV IIA or IIB IV IIB IIB (Two Ways) IIB
Fig. 5.5 Fuel system schematic (Vesely et al. 1981, Figure II-2)
5.11 Answers
(1) The parts count estimate is P(S) = P(A ∪ B) ≈ P(A) + P(B), which is greater than or equal to P(A) + P(B) − P(A ∩ B) = P(A) + P(B) − P(A)P(B). Hence, answer (b) is correct.
(2) The table is as follows:
(3) The basic events 4 and 5 lead to the top event by themselves because there are
only “Or” gates above them. Basic event 1 only leads to the top event if also 2
or 3 occurs. Hence, the answer is (1, 2), (1, 3), 4, 5.
(4) The answer is given in Fig. 5.7.
(5) FMEA, to get to know the system better and to cover all single faults; DFM, to cover all double faults and to start considering combinations of faults; generalizations of FMEA (e.g., with a column stating which additional failure leads to a system failure) and of DFM (e.g., considering triple failures, etc.).
References
Birolini, A. (1994). Quality and Reliability of Technical Systems. Berlin, Heidelberg, Springer-
Verlag.
Birolini, A. (1997). Zuverlässigkeit von Geräten und Systemen, Springer-Verlag.
DIN 25448 (1990). Ausfalleffektanalyse (Fehler-Möglichkeits- und Einfluss-Analyse); Failure
mode and effect analysis (FMEA). Berlin, DIN Deutsches Institut für Normung e.V.; Beuth
Verlag.
Leitch, R. D. (1995). Reliability Analysis for Engineers, An Introduction, Oxford Science
Publications.
Relex (2011). Relex reliability Studio Getting Started Guide.
ReliaSoft Corporation. (2013). “Modeling Event Trees in RENO.” Retrieved 2013-10-28, from
http://www.weibull.com/hotwire/issue63/hottopics63.htm.
Ruijter, A. de; Guldenmund, F. (2016): The bowtie method: A review. In Safety Science 88, pp. 211–
218. https://doi.org/10.1016/j.ssci.2016.03.001.
Vesely, W. E., F. F. Goldberg, N. H. Roberts and D. F. Haasl (1981). Fault Tree Handbook. Wash-
ington, D.C., Systems and Reliability Research, Office of Nuclear Regulatory Research, U.S. Nuclear Regulatory Commission.
Witter, A. (1995). “Entwicklung eines Modells zur optimierten Nutzung des Wissenspotentials einer Prozess-FMEA.” Fortschritt-Berichte VDI, Reihe 20, Nr. 176.
Chapter 6
Fault Tree Analysis
6.1 Overview
Fault tree or root cause analysis (FTA) is the most often used graphical deductive system analysis. It is by now the workhorse in all fields with higher demands regarding reliability, safety, and availability. It is increasingly also used in the domain of engineering resilience for assessing sufficiently efficient responses to expectable disruptions in terms of the time needed to detect the disruption, to stabilize the system, and to repair or respond and recover.
The FTA approach is conducted in several steps including the identification of the
goal of the analysis, of relevant top events to be considered, system analysis, FTA
graphical construction, translation into Boolean algebra, determination of cut sets,
probabilistic calculation of top events, importance analysis, and executive summary.
Each step is illustrated and several computation examples are provided.
Major advantages of FTA include its targeted deductive nature that focuses on
the system understanding and modeling used for the analysis; the option to conduct
the analysis qualitatively and quantitatively; and the targeted evaluation options in
terms of most critical root cause combinations (minimal cut sets), most critical cut
sets, and root causes.
Further developments of FTA briefly discussed include its application to security issues in terms of attack trees, its application to the cybersecurity domain by typically assuming that all (or most) possible events also occur, and time-dependent FTA (TDFTA). TDFTA goes beyond considering system phase or use case-specific applications of FTA by considering time-dependent system states. Typically, such approaches are based on Markov or Petri net models of systems. Another topic is to combine Monte Carlo approaches or fuzzy set theory (fuzzification) with fault trees. Using distributions instead of point failure probabilities allows statistical uncertainties and deep uncertainties to be considered within FTA models of systems.
In summary, this chapter introduces the deductive system analysis method FTA with focus on classical concepts, the process of successful application, and an overview of the many extension options. Section 6.2 gives a brief introduction to the method
and mentions advantages, disadvantages, and application areas. Section 6.3 defines
the most important terms and definitions used in FTA.
Three sections follow on the procedure of the analysis. Section 6.4 introduces the
structure of FTA by dividing the analysis into eight well-defined steps with inputs
and outputs. Section 6.5 explains three key concepts that should be applied during
the analysis to simplify the search of possible events that cause failures, in particular,
to search for immediate, necessary, and sufficient events for construction of the tree
as well as by considering all types of possible failures. Section 6.6 lists more formal
rules that should be followed when constructing a fault tree, e.g., that FTAs should
not contain any gate-to-gate sequence.
Sections 6.7 to 6.11 deal with the mathematics of fault tree analysis. Rules of
the Boolean algebra are introduced and applied in the computation of minimal cut
sets. Here, the top-down and the bottom-up methods are explained with an example.
After a short section about the computation of dual trees, Section 6.10 shows how
the probability of the top event is computed using the inclusion/exclusion principle.
Section 6.11 introduces importance measures which help to identify the compo-
nents of a fault tree in which improvement is most effective. The chapter ends with
a brief outlook on ways to extend classical fault tree analysis in Sect. 6.12.
This chapter is a summary, translation, and partial paraphrasing of (Echterling
et al. 2009) with some additional new examples and quotations.
6.3 Definitions
A basic event is an event that should not or cannot be analyzed further. Basic events
have to be stochastically independent for a meaningful analysis, see Sect. 6.10.
The top event is the event for which the fault tree is set up.
Example A possible top event is “The computer does not start up within 5 seconds after pressing the start button.”
A cut set is a combination of basic events that causes the top event.
A minimal cut set is a cut set where every event in the combination is necessary
in order for the combination to cause the top event. Minimal cut sets that contain one
event are called minimal cut sets of order one, those with two events are of order
two, etc.
Example Consider a chain with n links that is arranged in a circle, that is, a chain
where the ends are also connected by a link. Define the top event as “The chain falls
into two or more pieces.” For this top event, two links have to break. So, there are
n(n − 1) minimal cut sets (of order two) of the form “link a breaks”∩“link b breaks.”
The intersection of order three “link a breaks”∩“link b breaks” ∩“link c breaks”
would be a cut set of order 3, but not minimal. The basic event “link a breaks” is
not a cut set because it does not yield the top event. It would be a minimal cut set of
order one for the top event “the chain breaks.”
The union of all minimal cut sets contains all possible combinations of events that
lead to the top event. In the quantitative analysis, the probability of the union of all
minimal cut sets hence yields the probability of occurrence of the top event. The
minimal cut set with the highest probability of occurrence is called the critical path.
It is essential for the calculation of the probabilities that the basic events are
independent. Otherwise, the fault tree computations in Section 6.10 underestimate
the probability that the top event occurs.
A multiple occurring event (MOE) is a basic event that occurs more than once in the
fault tree.
A multiple occurring branch (MOB) is a branch of the fault tree that occurs more
than once in the fault tree. All basic events in a MOB are by definition MOEs.
Exposure time is the time a component is effectively exposed to a force. It has a big
effect on the calculation of probabilities of failure.
6.4 Process of Fault Tree Analysis
When fault trees were first developed in the 1960s, they were seen as art (Vesely
et al. 1981). It soon turned out that a set of commonly accepted rules eases the
understanding, prevents misunderstandings, and allows different groups of persons
to work in a team without complications.
We follow the structure of (Vesely et al. 2002) by dividing FTA into the following
eight steps where steps 3 to 5 can be done simultaneously.
(1) Identification of the goal of the FTA: The goal of the analysis is phrased in
terms of failures and faults to ascertain that the analysis fulfills the decision-
maker’s expectations.
(2) Definition of the top event: The top event should be phrased as precisely as
possible, ideally using physical units (DIN IEC 61025 Entwurf 2004).
(3) Definition of the scope of the analysis: The outer boundaries of a system have
to be set. If the system exists in different versions or has different modes, the
used version or mode has to be specified.
(4) Definition of the resolution: It has to be decided how detailed the analysis
should be. That is, the inner boundaries of the system have to be defined. It is
also possible to include probabilistic or temporal boundaries, see Section 4.3.
The right balance between a meaningful and precise analysis and cost efficiency
has to be found.
(5) Agreement on a set of rules: To make the FTA accessible to other persons,
one standard has to be chosen, for example, (DIN) IEC 61025 (DIN IEC 61025
2007), and applied consequently.
(6) Construction of the fault tree: The fault tree is constructed according to the
chosen standard. The symbols that are used for events and gates vary in the
different standards. Table 6.1 lists some basic and often used symbols: the circle stands for a basic event that cannot or should not be resolved further; the triangle informs that the fault tree is continued at a different place; the events under an “and” gate cause the event above it if and only if all of them occur; and the events under an “or” gate cause the event above it if and only if at least one of them occurs. (DIN IEC 61025 2007) contains tables with the complete list of symbols for the standard IEC 61025 in its appendix.
(7) Evaluation of the fault tree: The qualitative evaluation yields the minimal cut
sets (see Sect. 6.3.2). A quantitative evaluation additionally gives the probability
of occurrence of the top event and, if desired, of the minimal cut sets. The failure
tree can also be converted into a success tree which is its logical compliment
(Vesely et al. 2002).
(8) Interpretation and presentation of results: The interpretation of the analysis
and the presentation of results should not only show the probability that the top
event occurs but also which minimal cut sets are mainly responsible and how
the situation can be improved.
Table 6.1 lists the basic symbols used in FTA. The triangle is used for convenience.
6.5 Fundamental Concepts

The construction of a fault tree is an iterative process which starts at the top event and
is repeated in every level of the tree down to the minimal cut sets and basic events.
There are different concepts that can be applied on each level. Three important
concepts are described in this section.
On each level one asks: What is immediate (I), necessary (N), and sufficient (S) to
cause the event?
The I-N-S concept ensures that nothing is left out in the analysis by focusing on
immediate rather than indirect reasons.
In the SS-SC concept, failures are divided into two categories: state of system (SS)
and state of component (SC). A fault in an event box is classified by SS or SC
depending on which type of failure causes it.
The advantage of the classification is that SC events always come out of an “or”
gate while SS events can come out of an “or” or an “and” gate.
In the P-S-C concept, failures are divided into primary (P), secondary (S), and
command (C) failures, see Sect. 4.11.
Command failures form the failure flow in a fault tree. If an analysis is complete,
a comparison of the fault tree and the signal flow graph shows similarities.
The following selection of important rules for the construction of a fault tree is from
(Ericson 2005).
– Each event in the fault tree should have a name and be described by a text.
– The name should uniquely identify the event.
– The text should be meaningful and describe failures and faults. Text boxes should
never be left empty.
– The failure or fault should be described precisely. If necessary, the text box should
be extended. Abbreviating words is permitted, abbreviating ideas is not.
– All entrances of a gate should be completed before the analysis is continued on
the next level.
– Gate-to-gate connections are not allowed.
– Miracles should not be assumed. It is not permitted to assume that failures
compensate other failures.
– I-N-S, SS-SC, and P-S-C are concepts. Those concepts should be applied, but
their abbreviations should not be written in the text boxes.
For the computation of probabilities in fault trees, concepts from probability theory
are used. For this, gates are modeled by mathematical symbols. The only two gates
used in this chapter are the “or”- and the “and” gates. They can be expressed by
union and intersection, respectively.
Example An “or” gate below event C with the events A and B as children (see
Fig. 6.1) would be expressed by C = A ∪ B.
For the standard methods used in this chapter, only standard stochastic formulas
and the rules of the Boolean algebra are needed. They are summarized in Tables 6.2
and 6.3. Most of them can be understood with the diagram in Fig. 6.2.
For more complex methods, other mathematical concepts such as Markov chains
are necessary. Those are not covered here. For an introduction to those more complex
mathematical concepts and a more detailed and more elaborate introduction to prob-
ability theory, please refer to standard works on probability theory, for example
(Klenke 2007).
There are various methods to compute minimal cut sets which solve the fault tree.
Two of these methods are the top-down method and the bottom-up method. Those
two methods are explained in this section based on the example fault tree in Fig. 6.3.
The fault tree in Fig. 6.3 can be described by the following system of equations:
T = E1 ∩ E2 ,
E1 = A ∪ E3,
E2 = C ∪ E4,
E 3 = B ∪ C,
E 4 = A ∩ B. (6.1)
Fig. 6.3 An example fault tree created with Relex (Relex 2007; Echterling et al. 2009)
In Boolean notation, writing “·” for the intersection and “+” for the union, the same system of equations reads:
T = E1 · E2 ,
E1 = A + E3,
E2 = C + E4,
E 3 = B + C,
E 4 = A · B. (6.2)
In the following, we use the notation from (6.2) but keep in mind that the rules of
the Boolean algebra have to be applied. For example, A · A = A ∩ A = A and not
A · A = A2 .
To avoid too many brackets, A· B +C · D should be understood as (A· B)+(C · D),
as one would understand it for regular addition and multiplication.
In the top-down method, one starts with the top event and substitutes the equations
in (6.2) from top to bottom into the first equation:
T = E1 · E2
= (A + E 3 ) · (C + E 4 )
= (A + (B + C)) · (C + (A · B)). (6.3)
Applying the rules of Boolean algebra (Table 6.3), this expression is simplified:
T = (A + (B + C)) · (C + (A · B))
= ((A + B) + C) · C + ((A + B) + C) · (A · B)
= C + (A + B) · (A · B) + C · (A · B)
= C + A · B + C · (A · B)
= C + A · B, (6.4)
where we have used associativity and distributivity in the first line, absorption and
distributivity in the second, and absorption in the third and fourth lines.
In (6.4), it can be seen that the minimal cut sets are {C} and {A, B}. Hence, an
equivalent fault tree to the one in Fig. 6.3 is the one in Fig. 6.4.
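The top-down substitution and the application of the absorption law can also be automated. The following minimal Python sketch (not taken from any of the referenced tools; the gate encoding and function names are illustrative) expands the example tree of Fig. 6.3 into its cut sets and removes the non-minimal ones, reproducing the result of (6.4).

```python
from itertools import product

# Example fault tree of Fig. 6.3: gates are ("AND", [...]) or ("OR", [...]),
# strings denote basic events; cf. the system of equations (6.1)/(6.2).
TREE = ("AND", [
    ("OR", ["A", ("OR", ["B", "C"])]),   # E1 = A + E3 with E3 = B + C
    ("OR", ["C", ("AND", ["A", "B"])]),  # E2 = C + E4 with E4 = A * B
])

def cut_sets(node):
    """Expand a (sub)tree into a list of cut sets (frozensets of basic events)."""
    if isinstance(node, str):                              # basic event
        return [frozenset([node])]
    kind, children = node
    child_sets = [cut_sets(child) for child in children]
    if kind == "OR":                                       # union of the children's cut sets
        return [cs for sets in child_sets for cs in sets]
    if kind == "AND":                                      # every combination, merged into one set
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate type {kind!r}")

def minimal_cut_sets(node):
    """Keep only minimal cut sets (absorption law X + X*Y = X)."""
    sets = set(cut_sets(node))
    return {cs for cs in sets if not any(other < cs for other in sets)}

print(minimal_cut_sets(TREE))  # {frozenset({'C'}), frozenset({'A', 'B'})}, order may vary
```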
The bottom-up method calculates the minimal cut sets in a similar way to the top-
down method. The only difference is that one starts by substituting the last equations
in (6.2) into the equations above them and then continues toward the first equation.
This yields
E 1 = A + (B + C),
E2 = C + (A · B), (6.5)

and substituting (6.5) into T = E1 · E2 and simplifying again yields T = C + A · B, in agreement with the top-down result (6.4).
The dual fault tree is the logical complement of the original fault tree and is located
in the success space. It is created by negating all events in the tree and substituting
“union” by “intersection” and vice versa.
Negating the top event yields the minimal cut sets in the success space. For the
fault tree from Fig. 6.3, the calculation would be
¬T = ¬(C + A · B)
   = ¬C · ¬(A · B)
   = ¬C · (¬A + ¬B)
   = ¬C · ¬A + ¬C · ¬B. (6.7)
So, the top event in the tree in Fig. 6.3 does not occur if either A and C or B and
C do not occur.
Suppose we have three minimal cut sets M1 , M2 , and M3 and want to compute the
probability of the top event P(T ) = P(M1 ∪ M2 ∪ M3 ). According to Table 6.2, this
is
P(T ) = P(M1 ∪ M2 ∪ M3 )
= P(M1 ) + P(M2 ) + P(M3 ) − P(M1 ∩ M2 ) − P(M2 ∩ M3 )
− P(M1 ∩ M3 ) + P(M1 ∩ M2 ∩ M3 ). (6.8)
If the minimal cut sets are independent, one can obtain the following:

P(T) = P(M1) + P(M2) + P(M3) − P(M1)P(M2) − P(M2)P(M3) − P(M1)P(M3) + P(M1)P(M2)P(M3). (6.9)
If the minimal cut sets are not independent, one has to look at their inner structure.
For example, if A, B, and C are basic events (and therefore independent) and M1 =
A ∩ B and M2 = B ∩ C, one gets

P(M1 ∪ M2) = P(A ∩ B) + P(B ∩ C) − P(A ∩ B ∩ C) = P(A)P(B) + P(B)P(C) − P(A)P(B)P(C). (6.10)
Here one sees that it is important to assume that the basic events are independent.
Otherwise, it would not be possible to compute the probability of the top event
because one would have to decompose the basic events, which contradicts the definition of a basic event.
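As a minimal sketch of the quantification just described (in Python; the function names are illustrative, the probabilities are those of question (7) at the end of this chapter), the inclusion/exclusion principle can be evaluated directly over a list of minimal cut sets of independent basic events. Merging the basic events of intersected cut sets into one set before multiplying automatically treats shared basic events correctly, as in the example with M1 = A ∩ B and M2 = B ∩ C above.

```python
from itertools import combinations
from math import prod

# Basic event probabilities (values as in question (7) of this chapter).
p = {"A": 2e-3, "B": 3e-3, "C": 1e-5}

# Minimal cut sets of the top event T = A*B + C.
cut_sets = [frozenset({"A", "B"}), frozenset({"C"})]

def p_top_exact(cut_sets, probs):
    """Inclusion/exclusion principle over the union of all minimal cut sets, cf. (6.8)."""
    total = 0.0
    for k in range(1, len(cut_sets) + 1):
        for combo in combinations(cut_sets, k):
            events = frozenset().union(*combo)     # intersection of the chosen cut sets
            term = prod(probs[e] for e in events)  # independence of basic events
            total += term if k % 2 == 1 else -term
    return total

def p_top_rare_event(cut_sets, probs):
    """Rare-event approximation: sum of the minimal cut set probabilities, cf. (6.12)."""
    return sum(prod(probs[e] for e in cs) for cs in cut_sets)

print(p_top_exact(cut_sets, p))       # 1.599994e-05 (exact)
print(p_top_rare_event(cut_sets, p))  # 1.6e-05 (conservative approximation)
```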
Since looking at events is more complex than just multiplying the probabilities,
we can use (6.9) to estimate the minimal computing time of P(T). The estimated number of computing operations ECT(n) for the probability of a top event with n minimal cut sets is

ECT(n) = Σ_{k=1}^{n} k · C(n, k) − 1 = n · 2^(n−1) − 1, (6.11)

where C(n, k) denotes the binomial coefficient and the last equality follows by differentiating the binomial theorem Σ_{k=0}^{n} C(n, k) · x^k = (1 + x)^n with respect to x and then setting x = 1.
Table 6.4 shows that ECT(n) increases very fast. Hence, even with modern computers it is still necessary to find good approximations for the probability of top events with many minimal cut sets. The easiest approximation is

P(T) ≈ P(M1) + P(M2) + · · · + P(Mn). (6.12)

The two sides are equal for minimal cut sets that cannot happen at the same time, and the approximation is good if the probabilities of the minimal cut sets are small, which is normally true for fault trees. Because the right-hand side of (6.12) is always an upper bound for P(T), the approximation is in addition conservative.
Remark The number of theoretically possible minimal cut sets n(b) increases exponentially with the number b of basic events:

n(b) := Σ_{k=1}^{b} C(b, k) = 2^b − 1. (6.14)
6.11 Importance Measures

Let M1, M2, . . . , Mn be the minimal cut sets and T the top event. Then

M(Mi) := P(Mi) / P(T) (6.15)

computes the relative contribution of the minimal cut set Mi to the probability of the top event.
Next, we regard an importance measure for basic events. This importance measure is called Fussell–Vesely importance (also: FV importance) or “top contribution importance” (Zio et al. 2006) and is calculated by

FV(Ei) = P(∪_{j∈Ki} Mj) / P(T), (6.16)
where K i contains the indices of the minimal cut sets M j that contain the event E i
(Zio 2006; Pham 2011). In particular, F V (E j ) = M(E j ) for all minimal cut sets E j
of order 1.
If the minimal cut sets Mj are disjoint for j ∈ Ki, (6.16) can be written as

FV(Ei) := (Σ_{j∈Ki} P(Mj)) / P(T). (6.17)
Remark If the probabilities of the minimal cut sets are small, (6.17) is a good
approximation of (6.16).
The “Risk Reduction Worth (RRW) […] is the decrease in risk if the [component] is
assumed to be perfectly reliable” (PSAM 2002). It is computed by
RRW(Ei) := P(T) / P_{P(Ei)=0}(T), (6.18)

where P_{P(Ei)=0}(T) is the probability that the top event occurs in the modified tree where P(Ei) = 0 (Pham 2011).
The “Risk Achievement Worth (RAW) […] is the increase in risk if the [component]
is assumed to be failed at all times” (PSAM 2002). It is defined as
RAW(Ei) := P_{P(Ei)=1}(T) / P(T), (6.19)

where P_{P(Ei)=1}(T) is the probability that the top event occurs in the modified tree where P(Ei) = 1 (Pham 2011).
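The importance measures defined above can be evaluated with the same inclusion/exclusion machinery. The following self-contained Python sketch (illustrative only) computes M, FV, RRW, and RAW for the fault tree T = A · B + C with the probabilities of question (7); the results can be compared with the answers in Sect. 6.14.

```python
from itertools import combinations
from math import prod

p = {"A": 2e-3, "B": 3e-3, "C": 1e-5}                 # basic event probabilities, question (7)
cut_sets = [frozenset({"A", "B"}), frozenset({"C"})]  # minimal cut sets of T = A*B + C

def p_top(cut_sets, probs):
    """Top event probability via the inclusion/exclusion principle (6.8)."""
    total = 0.0
    for k in range(1, len(cut_sets) + 1):
        for combo in combinations(cut_sets, k):
            term = prod(probs[e] for e in frozenset().union(*combo))
            total += term if k % 2 == 1 else -term
    return total

pT = p_top(cut_sets, p)

# (6.15): relative contribution of each minimal cut set
M = {cs: prod(p[e] for e in cs) / pT for cs in cut_sets}

def fussell_vesely(event):
    """(6.16): probability of the union of all cut sets containing the event over P(T)."""
    return p_top([cs for cs in cut_sets if event in cs], p) / pT

def rrw(event):
    """(6.18): risk reduction worth, event assumed perfectly reliable."""
    return pT / p_top(cut_sets, {**p, event: 0.0})

def raw(event):
    """(6.19): risk achievement worth, event assumed failed at all times."""
    return p_top(cut_sets, {**p, event: 1.0}) / pT

print(M)                    # {A,B}: ~0.375, {C}: ~0.625
print(fussell_vesely("A"))  # ~0.375
print(rrw("A"), rrw("C"))   # ~1.6, ~2.67
print(raw("A"), raw("C"))   # ~188, ~62500
```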
The Birnbaum importance measure computes the change in the probability of occur-
rence of the top event in dependence of a change in the probability of occurrence of
a given event. It is usually calculated as the difference of the top event probabilities obtained when the basic event under consideration is assumed to occur with certainty and when it is assumed not to occur at all, i.e., B(Ei) = P_{P(Ei)=1}(T) − P_{P(Ei)=0}(T). A related cost-oriented variant divides the difference of the top event probability for the initial and the improved probability of failure of the basic event under consideration by the cost increase for this improvement.
Classical fault tree analysis as it has been described so far cannot be used to analyze
time- or mode-dependent components of complex systems. It also has the strong
restriction of independent basic events and one needs sharp probabilities for the
calculations. These limitations are broadly discussed in the recent literature on fault
trees. The present section gives a short introduction to the ideas to extend fault tree
analysis.
A classical fault tree refers to one top event. It does not consider different modes of
the system or different time intervals. The standard software does not cover these
problems.
The most obvious approach to analyzing a system with different modes with
standard software is to divide the occurrence of the top event into the occurrence of
the top event in the different phases using an “exclusive or” gate, that is, T = T1 ∪ex T2 .
This approach does not yield correct results. Consider, for example, a top event
of the form T = A ∩ B. The approach yields

T = (A1 ∩ B1) ∪ex (A2 ∩ B2),

where the indices indicate in which phase an event occurs. This decomposition is correct if A and B must happen simultaneously, i.e., in the same phase, for the top event to occur. In general, a system could also fail if A and B happen in different phases. The correct decomposition is

T = (A1 ∪ A2) ∩ (B1 ∪ B2).
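The difference between the two decompositions can be made explicit by enumerating in which phase A and B occur. The following small Python check (illustrative only) lists the combinations that lead to the top event but are missed by the naive phase-wise split:

```python
from itertools import product

# Phase in which A respectively B occurs: None (does not occur), 1, or 2.
# Top event T = A AND B: the system has failed once both A and B have occurred.
missed = []
for a_phase, b_phase in product([None, 1, 2], repeat=2):
    top_event = a_phase is not None and b_phase is not None
    naive_split = (a_phase == 1 and b_phase == 1) or (a_phase == 2 and b_phase == 2)
    if top_event and not naive_split:
        missed.append((a_phase, b_phase))

print(missed)  # [(1, 2), (2, 1)]: the cross-phase failures that T1 and T2 alone do not capture
```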
Another problem of this approach is that it increases the number of basic events
which drastically increases the number of minimal cut sets, see (6.14). This, on the
other hand, increases the number of computing operations exponentially, see (6.11).
Despite the problems, this approach is still used when dividing the fault tree in
mode- or time-interval-dependent segments to calculate it with the standard software
if specialized software is not available (Ericson 2005).
There is no consistent definition of dynamic fault tree analysis (DFTA) in the literature. Often it is only stated that DFTA is more complex than FTA as documented in the “Fault Tree Handbook” (Vesely et al. 1981).
Here DFTA is understood as FTA that uses Markov analysis to solve sequence-
dependent problems, that is, problems where it is essential in which order events
occur.
Example Consider a system consisting of a component and its backup and the top
event “the system fails.” The system only uses the backup if the component fails. If
the mechanism that switches between the component and its backup fails after the
component has failed and the backup has started, the top event does not occur. If, on
the other hand, the switching mechanism fails before the component fails, the top
event occurs. Hence, the order in which the events occur decides whether the top
event occurs.
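Markov analysis treats such sequence dependence analytically. As a plain illustration of why the order of events matters (and not as a substitute for the Markov method), the following Monte Carlo sketch simulates the standby example with assumed, purely illustrative constant failure rates:

```python
import random

random.seed(1)

# Assumed illustrative constant failure rates per hour and mission time.
LAMBDA_PRIMARY = 1e-3   # primary component
LAMBDA_SWITCH = 5e-4    # switching mechanism
LAMBDA_BACKUP = 1e-3    # backup component, only stressed after switch-over
MISSION_TIME = 1000.0   # hours

def system_fails_once():
    t_primary = random.expovariate(LAMBDA_PRIMARY)
    t_switch = random.expovariate(LAMBDA_SWITCH)
    if t_primary >= MISSION_TIME:
        return False                         # primary survives the whole mission
    if t_switch < t_primary:
        return True                          # switch failed first: backup cannot take over
    t_backup_failure = t_primary + random.expovariate(LAMBDA_BACKUP)
    return t_backup_failure < MISSION_TIME   # backup must last for the remaining time

n = 100_000
p_fail = sum(system_fails_once() for _ in range(n)) / n
print(f"estimated failure probability: {p_fail:.4f}")
```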
For classical FTA, independence of the basic events is essential. There have been
many attempts to extend FTA to dependent basic events, see, for example, (Vesely
1970; Gulati and Dugan 1997; Andrews and Ridley 1998; Andrews and Dugan 1999;
Ridley and Andrews 1999).
Most approaches keep the classical fault tree structure and introduce new gates.
These gates identify the dependent sections of the tree and take the type of dependence into account. The most often applied dependency model is the Beta factor model, see
(Amjad et al. 2014) for an example.
The analysis of such a modified fault tree uses so-called Markov state transition diagrams (Ericson 2005). Here the probabilities of failure for the dependent components are calculated and included in the fault tree. This process is repeated until the fault tree has a structure where all basic events are independent. This tree can then be analyzed with the standard methods.
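For the beta factor model mentioned above, a minimal numeric sketch is given below. Each component failure rate λ is split into an independent part (1 − β) · λ and a common cause part β · λ that fails all redundant components at once; the numbers are assumed illustrative values, and the combination of the two contributions uses a simplified independence argument rather than the full Markov treatment.

```python
from math import exp

lam, beta, t = 1e-4, 0.1, 1000.0   # assumed component failure rate [1/h], beta factor, mission time [h]

p_ind = 1 - exp(-(1 - beta) * lam * t)   # one component fails independently
p_ccf = 1 - exp(-beta * lam * t)         # a common cause failure hits both components

# 1-out-of-2 redundancy fails if both components fail independently or a CCF occurs.
p_both_naive = (1 - exp(-lam * t)) ** 2            # full independence assumed
p_both_beta = 1 - (1 - p_ind ** 2) * (1 - p_ccf)   # beta factor model

print(f"independent model: {p_both_naive:.2e}")    # ~9.1e-03
print(f"beta factor model: {p_both_beta:.2e}")     # ~1.7e-02, dominated by the CCF part
```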
Classical FTA only considers sharp probabilities. If there are inaccuracies, one assumes that they cancel each other out throughout the analysis process, but it is not clear that this assumption holds.
Song, Zhang, and Chan (Hua Song et al. 2009) developed a method, the so-called
TS-FTA (after the Takagi–Sugeno (Takagi and Sugeno 1985) model it is based on),
where events in a classical fault tree can be expressed by fuzzy probabilities. This
works by exchanging normal gates with so-called “T-S fuzzy” gates. They do not
assume the convergence of fuzzy probabilities to a sharp final result but include
the convergence in the model. Furthermore, their TS-FTA model can take into account the significance of an analysis error for the model.
6.13 Questions
(1) For which purpose can FTA be used during development and safety validation?
(2) What are the steps to conduct an FTA?
(3) Which key concepts and construction rules should be applied during construc-
tion of an FTA?
(4) Which mathematical approaches are used to quantify an FTA?
(5) Regard the tree in Fig. 6.5.
(a) How many minimal cut sets does this tree have?
(b) What is the highest order of a minimal cut set in this tree?
(c) Determine the minimal cut sets of the dual tree.
Fig. 6.5 Example fault tree generated with software. Here “wiederho” means “repeated” and denotes basic events which occur more than once in the fault tree. They have a different color than events which occur only once. The tree was created with Relex (Relex 2011)
(6) Consider the event E = A + B + C where A is the event that component a fails,
B the event that component b fails, and C the event that component c fails. Let
P(A) = 10−3 , P(B) = 10−5 , and P(C) = 10−4 .
(a) What is the probability of E if A, B, and C are independent?
(b) What is the probability of E if component b fails only if component c fails
and A and C are still independent?
(c) What is a conservative estimate for the probability of E if one does not
know whether A, B, and C are independent?
(7) Regard independent A, B, and C, P(A) = 2 · 10−3 , P(B) = 3 · 10−3 , P(C) =
10−5 , and T = A · B + C.
(a) Compute
(i) the importance of both minimal cut sets, that is, M(A · B) and M(C),
(ii) the top contribution importance of A, that is, F V (A),
(iii) the risk reduction worth of A and C, that is, R RW (A) and R RW (C),
and
(iv) the risk achievement worth of A and C, that is, R AW (A) and
R AW (C).
(b) Interpret the results of (a).
(8) Consider the following situation (see Fig. 6.6): A train approaches a gated
railroad crossing. Before reaching the crossing, it contacts a sensor. The sensor
sends a signal to an engine which closes the gate.
It is assumed that any car that drives onto the tracks before the gate is completely
closed drives away fast enough not to be hit. Only cars that drive onto the tracks
when the gate should be closed but is not due to a failure get hit.
The following (arbitrary) frequencies are also given, see Table 6.5.
(a) Set up a fault tree for the top event “train hits car” where all “or” gates
represent disjoint unions.
(b) Apply the top-down and bottom-up methods to the fault tree from (a) to
calculate the minimal cut sets.
(c) Compute the probability of the top event with the results from (b).
(d) Set up a fault tree for the top event “train hits car” with only one “or” gate.
(e) Compare the trees from (a) and (d). What are the advantages and
disadvantages?
(f) Compute the minimal cut sets of the tree from (d).
(g) Compute the probability of the top event with the results from (f).
(h) How would the safety mechanism “Emergency stop of the train if there is a
car on the tracks when the train arrives at the sensor” change the probability
of the top event? Which assumption in the initial situation would have to
be changed to make this safety mechanism very valuable?
(i) Compute the relative contribution of each minimal cut set in (d) to the
probability of the top event.
(j) Compute the Fussell–Vesely importance, the RRW, the RAW, and the BB for the event B = “sensor fails” using the tree from (d).
(9) The formulas for the estimated number of computing operations EC T (n) have
only been briefly explained in the chapter. Here they will be treated in more
detail.
(a) Show that ECT(n) = n · 2^(n−1) − 1 holds for n = 4 in the case of independent
minimal cut sets.
(b) Use (a) to explain why the second equality (6.11) looks the way it does.
Hint: the binomial coefficient C(n, k) equals the number of possibilities to choose k out of n.
(c) What does ECT(n)/ECT(n − 1) converge to? What does that mean?
6.14 Answers
(1) Aims of FTA application include to compare different system designs (e.g.,
parallel versus more sequential designs), to determine combinations of failures
that lead to top event of interest (qualitative analysis), to determine the failure
probabilities of systems and subsystems (quantitative analysis), to determine
critical components regarding the occurrence of the top event, to determine the
most efficient improvement options of systems, to comply with standards, and
to identify root causes of system failures.
(2) See Sect. 6.4.
(3) See Sects. 6.5 and 6.6.
(4) The following steps are conducted:
(i) graphical fault tree construction using well-defined symbols (gates and events);
(ii) use of top-down or bottom-up transformations into a Boolean expression;
(iii) use of Boolean algebra to simplify (e.g., absorption, distributive law, etc.);
(iv) use of the inclusion/exclusion formula for the computation of the top event probability; and
(v) use of statistical independence for the final quantification step.
(5) (a) 3, (b) 2,
(c) We express the tree in an equation:
T = E1 + E2 + E3 = A · B + C + D + A · C = A · B + C + D,
so that the dual tree is
¬T = (¬A + ¬B) · ¬C · ¬D = ¬A · ¬C · ¬D + ¬B · ¬C · ¬D,
with the minimal cut sets {¬A, ¬C, ¬D} and {¬B, ¬C, ¬D} in the success space.
(6) (a) If A, B, and C are independent,
P(E) = P(A + B + C)
= P(A) + P(B) + P(C) − P(A · B) − P(A · C) − P(B · C) + P(A · B · C)
= P(A) + P(B) + P(C) − P(A)P(B) − P(A)P(C) − P(B)P(C) + P(A)P(B)P(C)
= 10^−3 + 10^−5 + 10^−4 − 10^−8 − 10^−7 − 10^−9 + 10^−12
≈ 1.10989 · 10^−3.
(b) If component b fails only if component c fails, B ⊆ C and hence
P(E) = P(A + B + C) = P(A + C) = P(A) + P(C) − P(A)P(C) ≈ 1.0999 · 10^−3.
(c) Without knowledge about the dependence of A, B, and C, a conservative estimate is
P(E) = P(A + B + C)
= P(A) + P(B) + P(C) − P(A · B) − P(A · C) − P(B · C) + P(A · B · C)
≤ P(A) + P(B) + P(C) = 1.11 · 10^−3.
One can see that this estimate is quite good when comparing it to (a).
(7) (a) The importance of the minimal cut sets is computed by
(i) M(A · B) ≈ (2 · 10^−3 · 3 · 10^−3) / (1.59999 · 10^−5) ≈ 0.375,
M(C) ≈ 10^−5 / (1.59999 · 10^−5) ≈ 0.625.
(ii) The top contribution importance of A can be derived from (i) because FV(A) = M(A · B) ≈ 0.375.
(iii) RRW(A) = P(T) / P_{P(A)=0}(T) = 1.59999 · 10^−5 / P(C) = 1.59999 · 10^−5 / 10^−5 = 1.59999,
RRW(C) = P(T) / P_{P(C)=0}(T) = 1.59999 · 10^−5 / (6 · 10^−6) ≈ 2.67.
(iv) RAW(A) = P_{P(A)=1}(T) / P(T) = 3.00997 · 10^−3 / (1.59999 · 10^−5) ≈ 188.1,
RAW(C) = P_{P(C)=1}(T) / P(T) = 1 / (1.59999 · 10^−5) ≈ 62,500.
(b) One can see that improving C is more effective than improving A.
(8) (a) The fault tree yields the following equations:
T = E1 · A,
E1 = E2 + B,
E2 = ¬B · E3,
E3 = E4 + C,
E4 = ¬C · D,
where T = train hits car, E1 = a part fails, A = car comes at the wrong moment, E2 = a part other than the sensor fails (and the sensor works), B = sensor fails, E3 = signal or engine fails, E4 = signal works and engine fails, C = signal fails, and D = engine fails.
(b) Bottom-up:
E3 = ¬C · D + C,
E2 = ¬B · (¬C · D + C) = ¬B · ¬C · D + ¬B · C,
E1 = ¬B · ¬C · D + ¬B · C + B,
T = (¬B · ¬C · D + ¬B · C + B) · A = ¬B · ¬C · D · A + ¬B · C · A + B · A.
Top-down:
T = E1 · A
= (E2 + B) · A = E2 · A + B · A
= ¬B · E3 · A + B · A
= ¬B · (E4 + C) · A + B · A = ¬B · E4 · A + ¬B · C · A + B · A
= ¬B · ¬C · D · A + ¬B · C · A + B · A.
(c) All events in each intersection are independent of the other events in that intersection and the unions are disjoint; hence, the probability of the top event follows as the sum of the products of the basic event probabilities of the minimal cut sets from (b), evaluated with the frequencies of Table 6.5.
(d) The fault tree with only one “or” gate is given by
T = E1 · A,
E1 = B + C + D,
where T = train hits car, E1 = a part fails, A = car comes at the wrong moment, B = sensor fails, C = signal fails, and D = engine fails.
(e) The advantage of the tree in (a) is that all unions are disjoint, so the compu-
tation of the probabilities is easier, because P(X + Y ) = P(X ) + P(Y )
can be used. The disadvantage is that it is a more complex tree structure.
(d) has a very simple and intuitive tree structure. The computation of the probability of the top event is more complex though, because the inclusion/exclusion formula (6.8) has to be used.
(f) T = E 1 · A
= (B + C + D) · A
= B · A + C · A + D · A.
P(T ) =P(B · A + C · A + D · A)
=P(B · A) + P(C · A) + P(D · A)
Hence,
RAW(B) = P_{P(B)=1}(T) / P(T) ≈ 10^−2 / (1.11 · 10^−6) ≈ 9,009.01,
(9) (a) Writing out the inclusion/exclusion formula for n = 4 independent minimal cut sets and counting all multiplications and additions yields 4 · 2^3 − 1 = 31 computing operations.
(c) ECT(n) / ECT(n − 1) = (n · 2^(n−1) − 1) / ((n − 1) · 2^(n−2) − 1)
= (2n − 1/2^(n−2)) / ((n − 1) − 1/2^(n−2))
= (2 − 1/(n · 2^(n−2))) / (1 − 1/n − 1/(n · 2^(n−2))) → 2/1 = 2 for n → ∞.
This means that if there already is a big number of minimal cut sets, every
additional minimal cut set doubles the computation time.
References
Amjad, Qazi Muhammad Nouman; Zubair, Muhammad; Heo, Gyunyoung (2014). Modeling of
common cause failures (CCFs) by using beta factor parametric model. In: ICESP (Hg.): Inter-
national Conference on Energy Systems and Policies, Proceedings. ICESP. Islamabad, Pakistan,
24.-26.11.2014, pp. 1–6.
Andrews, J. D. and L. M. Ridley (1998). Analysis of systems with standby dependencies. 16th
International System Safety Conference, Seattle.
Andrews, J. D. and J. B. Dugan (1999). Dependency Modeling Using Fault Tree Analysis. 17th
International System Safety Conference.
Blischke, W. R. and D. N. Prabhakar Murthy (2003). Case Studies in Reliability and Maintenance,
Wiley Series in Probability and Statistics.
Echterling, N., A. Thein and I. Häring (2009). Fehlerbaumanalyse (FTA) - Zwischenbericht.
Ericson, C. (2005). Hazard Analysis Techniques for System Safety, Wiley.
DIN IEC 61025 (2007). Fehlzustandsbaumanalyse (IEC 61025:2006); Fault Tree Analysis (FTA)
(IEC 61025:2006). Berlin, DIN Deutsches Institut für Normung e.V., DKE Deutsche Komission
Elektrotechnik Informationstechnik im DIN und VDE.
Dugan, J. B. and H. Xu (2004). Method and system for dynamic probability risk assessment. U. o.
V. P. Foundation. US 20110137703 A1.
Dugan, J. B., G. J. Pai and H. Xu (2007). “Combining Software Quality Analysis with Dynamic
Event/Fault Trees for High Assurance Systems Engineering.” Fraunhofer ISE.
Gulati, R. and J. B. Dugan (1997). A modular approach for analyzing static and dynamic fault trees.
Reliability and Maintainability Symposium. 1997 Proceedings, Annual.
Huang, C.-Y. and Y.-R. Chang (2007). “An improved decomposition scheme for assessing the
reliability of embedded systems by using dynamic fault trees.” Reliability Engineering and System
Safety 92: 1403-1412.
Hua Song, H., H.-Y. Zhang and C. W. Chan (2009). Fuzzy fault tree analysis based on T-S model
with application to INS/GPS navigation system.
Chapter 7
Failure Modes and Effects Analysis
7.1 Overview
This chapter introduces the inductive system analysis method failure modes and effects
analysis (FMEA). It presents the classical definitions of the FMEA method as well
as the classical application domains on system concepts, system designs, and the
production process. Such distinctions are by now often only considered as application
cases, e.g., in the current German automotive association (VDA) standards.
The steps to conduct successful FMEAs are provided and discussed. It is empha-
sized that modern FMEAs take up many elements of similar tabular analysis such as
hazard analysis as well as classical extensions such as asking for root causes, imme-
diate consequences of failures, or a criticality assessment of failures on system level
(FMECA). Some templates even ask for potential additional failures which, together with the failure considered, would lead to catastrophic events, pointing toward double failure matrix-like approaches.
Modern extensions such as FMEDA are introduced, which assess the diagnostic coverage (DC) and ultimately the safe failure fraction (SFF) of failures as required for functional safety. In all cases, an emphasis is also on valid evaluation options of the obtained results sufficient for decision-making regarding system improvements. For instance, the following distinctions are made: semi-quantitative assessment of risks, semi-quantitative assessment of risks considering detection options, and quantification of risks, in each case of single failure events only.
When comparing with the more formal method fault tree analysis (FTA), major
limitations of FMEA are discussed, in particular, that it considers single failures only,
that it considers immediate effects on system level only, and that it is a rather formalistic, cumbersome inductive approach.
The chapter will introduce the FMEA method as follows. Sections 7.2 and 7.3
focus on the method, in general, and the different steps of the execution of a classical
FMEA. The form sheet that has to be filled out is described in Sect. 7.4 by discussing
each row and column headline. Here, it is explained what has to be written in which
column of the sheet also by providing examples.
The chapter ends with a short summary of methods to extend classical FMEA, see
Sects. 7.9 to 7.11. Examples include how to consider expert ratings and uncertainties
and non-classical evaluation schemes that offer further insights into system criticality.
These last sections also take a more critical look at the method by regarding its
disadvantages. Disadvantages not yet mentioned include the need for knowledgeable and iterative execution and the often lacking acceptance in development teams, in particular, in early phases of development where subsystem or function FMEAs are most advantageous.
FMEA is a good basis for safety analysis, in particular, fault tree analysis. The
method can be used to analyze complex systems on the level of single elements
or system functions but the team also has the possibility to define the depth of the
analysis as needed, e.g., on system concept level to scrutinize ideas. An experienced
and motivated team with a realistic vision or good knowledge of the system is essential
for a successful FMEA.
For more details on FMEA, see, for example, the standards (VDA 2006) and (DIN
2006). The chapter is a summary, translation, and partial paraphrasing of (Hofmann
2009) in Sects. 7.2 to 7.9.3 and of (Ringwald and Häring 2010) in Sects. 7.9.4 and
7.9.5, both with additional new examples and quotations.
FMEA is a standard method for risk analysis that accompanies the design and
development of a system. It is used to systematically detect potential errors in
the design phases of the development process and prevent them with appropriate
countermeasures.
Section 7.2.1 defines FMEA and presents some basic properties of the method.
It is followed by a list of situations where FMEA should be applied, see Sect. 7.2.2.
One distinguishes between system, construction, and process FMEA. Those three
types are explained in Sect. 7.2.3.
After reading this section, one has a brief overview of the method and can look
at the different steps and properties of the method in more detail.
The method was developed by NASA in the 1960s for the Apollo program (Ben-Daya and Raouf 1996).
The Rule of 10, according to which the costs to fix a failure multiply by 10 in each step of the life cycle (Wittig 1993; Müller and Tietjen 2003), indicates that the costs of applying FMEA are recovered in later stages of the development process of a product because with FMEA failures are detected earlier in the process.
FMEA consists of three phases which are important for the success of the analysis
(Müller and Tietjen 2003), which are as follows:
(1) Determination of all system elements, possible failures, and their consequences
in a group process, for example, by using a “mind map” (Eidesen and Aven
2007).
(2) Research on reliable data on frequency, severity, and detection of the failure.
(3) Modification of the system and the process design and the development of a
control process that is based on the FMEA report.
Table 7.1 Comparison of the three main FMEA types (Hofmann 2009)
System FMEA (product concept FMEA): content – superior product/system; requirements – concept of the product; field of responsibility – development areas.
Construction FMEA (design FMEA): content – element/component; requirements – design documents; field of responsibility – construction.
Process FMEA (production FMEA): content – production process; requirements – production plan; field of responsibility – general planning.
System FMEA analyzes the functional interactions of the components in a system and
their interfaces (Dittmann 2007). It is used in early stages of the development process
to detect possible failures and weaknesses on the basis of the requirements speci-
fication and to guarantee a smooth interaction of the system’s components (Müller
and Tietjen 2003). By analyzing wiring and functional diagrams, the early system
FMEA can guarantee the system’s reliability and safety as well as adherence to the requirements.
The construction FMEA starts when the design documents have been completed.
The parts lists and drawings are used to analyze whether the process fulfills the
construction rules of the requirements specification. This analysis takes place on
the component level and examines the system for development and process mistakes whose probability of occurrence can be reduced by a change in the construction (Müller and Tietjen 2003).
The procedure of the construction FMEA is similar to the system FMEA but now
specific elements or components are analyzed instead of bigger parts of the system.
To avoid a too long and expensive analysis, the system is divided into functional
groups and only those with new parts or problematic elements are analyzed.
The process FMEA finds and eliminates failures in the production process. It takes
place when the production plan is completed, hence after the system and construction
FMEA.
The goal of the process FMEA is the quality assurance of the final product. It
has to satisfy the client’s expectations and the process should be optimized and safe
(Deutsche Gesellschaft für Qualität e.V. 2004; Hofmann 2009; Ringwald and Häring
2010).
The process FMEA starts with the development of a so-called process flow
diagram that gives an overview of the production processes. Afterward, the flow
diagram is used to examine the different work steps for possible effects of a defect.
External processes such as storage or shipping can be included in this process.
7.3 Execution of an FMEA

To assure a systematic procedure, FMEA can be divided into the following five steps:
(1) Structural analysis,
(2) Functional analysis,
(3) Failure analysis,
(4) Measure analysis, and
(5) Optimization.
These steps will be explained in Sects. 7.3.2 to 7.3.6. Section 7.3.1 describes the
preparations that are necessary before starting the FMEA.
7.3.1 Preparation
A team of at most eight members should be chosen to execute the FMEA. Here, it
is important to have a good selection of team members. The group should include
technicians or engineers from the different working areas of development, produc-
tion, experiments, quality management, and sales, but also clients of the product. The success depends highly on the experience, imagination, and motivation of the team members, even though this influence often cannot be directly measured (Mariani 2007).
Thinking and imagination are encouraged when working in such a diverse group and
the necessary, extensive knowledge is assured to be present (DIN 2006; Mariani,
Boschi et al. 2007).
In the first step of the FMEA, the system is divided into its system elements and
documented in a hierarchical structure, for example, a tree, to have a systematic
overview of all components, elements, and processes (Ringwald and Häring 2010).
For this, the interfaces of the system elements are determined and systems with many
interactions become comprehensible (Ringwald and Häring 2010). Each element
should only appear once in the diagram (Dittmann 2007). The diagram should be
available to the group members during the analysis or even be developed in the group
(Naude, Lockett et al. 1997).
The structure is used to determine the depth of the analysis. The depth is not fixed
beforehand. In this way, it can be adapted to the system and to the assessment of the
group members on how to subdivide the system (Deutsche Gesellschaft für Qualität
e.V. 2004; Dittmann 2007).
The structural analysis hence lays the foundation for the further procedure because
with the developed diagrams every element can now be regarded concerning its
function or malfunction (Hofmann 2009).
In the second step, every system element is analyzed regarding functionality and
malfunction. For this, it is also important to have information about the environment,
for example, about the temperature, dust, water, or electrical disturbances (VDA
2006).
The description of the function must be unique and comprehensive. If functions of
different system elements work together, this is demonstrated in a functional struc-
ture, for example, a tree, where subfunctions of a function are logically connected
within the structure. The structure also shows the interfaces of system elements. An
example of such a structure is given in Fig. 7.1.
The next step is failure analysis which is based on the system and functional structure.
Here three things are analyzed (Eberhardt 2008; Hofmann 2009) as follows:
(1) Failure/malfunction: Determination of all possible failures of the regarded
system element. These can be derived from the function of the system element.
In particular, more than one failure can occur in a system element.
(2) Failure result: The results of a failure are the failures in superior or related
system elements.
(3) Failure cause: The causes of a failure are failures or malfunctions in subordinate or related system elements.
Measure analysis judges the risk and consequences of measures which are completed
at the time of the analysis. Measures that have not been completed are evaluated
according to the expected result and a responsible person is appointed and a deadline
is set (Pentii and Atte 2002). Measures can be qualitative, semi-quantitative, or
quantitative.
One distinguishes between preventive and detection measures:
Preventive measures are already achieved during the development process of
a system by the right system setup. The probability of occurrence of a failure is
kept small. Preventive measures should be comprehensively documented and the
documentation should be referred to on the form sheet.
Detection measures find possible failures or test implemented preventive
measures. Again, a comprehensive documentation is important. Detection measures
should also be tested for effectiveness.
For every measure, a responsible person and a deadline should be named.
An example of documentation in this step is shown in Fig. 7.2.
Fig. 7.2 Exemplary mind map of the malfunctions of an element, their causes, consequences on
system and subsystem levels, and the established measures (Hofmann 2009)
If the results of an interim or final evaluation are not satisfactory, new measures are proposed which are treated according to Step 4.
Step 5 is repeated until an effectiveness control shows the desired result, e.g.,
regarding system functions in the concept phase, regarding construction solutions in
the design phase, and production process details in case of production processes.
7.4 FMEA Form Sheet

7.4.1 Introduction
There is no clearly defined standard for an FMEA form sheet. Applicable norms
suggest form sheets but those should only be understood as guidelines. The user can
adapt them to his or her project’s requirements. Examples for standards that include
form sheets are (VDA 2006) and (DIN 2006). A further example from the North
American automotive domain is (AIAG 2008).
Fig. 7.3 Form sheet as introduced by (Müller and Tietjen 2003). The sheet provides header fields for the name of the FMEA, the FMEA team, and a running number, and columns for the component, the failure mode, the failure cause, legal requirements, the ratings O, S, and D, and the risk priority number (RPZ, i.e., RPN)
7.4.2 Columns
Here the columns are explained as numbered in Fig. 7.3, the columns in other form
sheets might vary from this list but essentially contain the same information.
The first column varies, depending on the type of FMEA that is executed, see
Sect. 7.2.3.
Fig. 7.4 Form sheet after (VDA 2006) by Ringwald and Häring (2010). It contains, among others, columns for the possible failure, recovering measures, the completion date, the ratings O, S, and D, and the RPN
On a system FMEA form sheet, the first column contains the functions of the
superior system.
On a construction FMEA form sheet, the first column contains the components
of the regarded product. Information about the components can be generated from
the functional block diagram, customer requirements, functional descriptions, parts
lists, drawings, and wiring diagrams.
On a process FMEA form sheet, the first column contains the process steps.
Information can be generated from flowcharts; the requirements specification, norms,
and standards; and work sheets of comparable products.
In Column 2, all possible failures for a component or process are named. To find
possible failures, it is helpful to negate the component’s function and find the plausible
causes.
Example Common failures for electromechanical systems are, for example,
“broken,” “decline in performance,” “interrupted,” “short-circuit,” “fatigue,”
“sound,” “high resistance,” and “corrosion.”
Column 3 lists all potential consequences of the failures from Column 2. Conse-
quences should be regarded for the system, the product, and the production process
and also include the client’s perspective.
Remark It should be kept in mind that a failure might have more than one
consequence.
7.4.2.4 Column 4: Legal Requirements
If the failure affects the adherence to legal requirements, this should be marked in
this column. The corresponding requirement should be mentioned, for example, in
a footnote.
Column 5 lists the failure causes. Again, note that a failure might have multiple
causes.
Column 6 lists preventive and detection measures, see Sect. 7.3.5. Since those
measures decrease the occurrence of the failure or its causes, they are considered
in the failure evaluation.
In Columns 7 to 9, the results of the failure evaluation are noted. This includes the
occurrence rating O, the severity rating S, and the detection rating D. A number from
1 to 10 is assigned to each parameter. For details, see Sect. 7.5. The RPN (risk priority
number) in Column 10 is the product of O, S, and D, see Sect. 7.6.
For the failures with the biggest RPN in the computation from Sect. 7.4.2.7, recom-
mended actions are written in Column 11. The name of the responsible person and
a timeline for implementation should also be noted.
7.6 RPN

The risk is defined as the product of the occurrence and severity ratings,

R := O · S, (7.1)

and the risk priority number additionally includes the detection rating,

RPN := O · S · D. (7.2)
By definition the risk priority number can take values between 1 and 1,000 where
1 stands for very low and 1,000 for very high risk. The RPN makes it possible to
notice possible risks, rate them numerically, and distinguish big from small risks
(Sony Semiconductor 2000).
A fixed bound of the RPN for a project is, in general, not considered reasonable
(Müller and Tietjen 2003) because the same RPN does not imply the same risk. We have also seen in Sect. 7.5 that investigations should follow if one of the
values O, S, and D is higher than eight (Müller and Tietjen 2003; Syska 2006). So,
the RPN should only be seen as an indicator that helps to prioritize the improvement
measures.
Instead of using RPN itself as an indicator, it is also possible to use the Pareto
method to determine critical failures. For this, the RPNs are sorted from highest to
lowest. Then for every RPN one computes the fraction (Satsrisakul 2018)
RPN_i^normalized = RPN_i / Σ_{i=1}^{n} RPN_i, (7.3)

that is, the RPN divided by the sum of all RPNs. The RPNs for which the fraction in (7.3) is high are considered critical failures, and countermeasures are decided in
the team. The advantage is that the normalized RPNs for different systems in the same
domain typically look similar, which allows to better motivate critical thresholds for
countermeasures. Here, preventive measures for the failure cause are given more
priority than detection measures (NASA Lewis Research Center 2003).
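A minimal Python sketch of this normalization is given below; the failure names and the O, S, and D ratings are made-up illustrative values, not data from a real analysis.

```python
# Illustrative failures with (O, S, D) ratings.
failures = {
    "seal leaks":         (4, 7, 3),
    "connector corrodes": (6, 5, 6),
    "sensor drifts":      (2, 8, 7),
}

rpn = {name: o * s * d for name, (o, s, d) in failures.items()}
total = sum(rpn.values())
normalized = {name: value / total for name, value in rpn.items()}   # Eq. (7.3)

# Sort from highest to lowest normalized RPN, as in the Pareto method.
for name, share in sorted(normalized.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} RPN = {rpn[name]:3d}  share = {share:5.1%}")
```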
7.7 Probability of Default

In the case of a quantitative analysis, one can compute the probability of default PD_B of a component B from the occurrence probabilities of its failure modes (cf. Table 7.3) and the total probability of default as

PD_total = Σ_{i=1}^{N} PD_i, (7.5)

where N is the total number of components and PD_total can be used as an estimate for the probability that the system fails.
Remark The computation in (7.5) is only meaningful if all components of the system
are regarded in the FMEA.
Table 7.3 Probabilities of occurrence and detection following (VDA 1996) where 5E–01 is short for 5 · 10−1 (Hofmann 2009). Columns: rating; probability of occurrence (construction or system FMEA); probability of occurrence (process FMEA); detection probability (construction, process, or system FMEA)
10 5E–01 1E–01 1E–01
9 1E–01 5E–02 1E–01
8 5E–02 2E–02 2E–02
7 1E–02 1E–02 2E–02
6 1E–02 5E–03 2E–02
5 1E–03 2E–03 3E–03
4 5E–04 1E–03 3E–03
3 2E–04 5E–04 3E–03
2 1E–04 5E–05 3E–03
1 0 0 1E–04
The method to estimate the probability of a system failure by P Dtotal is also called
Parts Count. It is the simplest and most pessimistic estimate because it is assumed
that every failure of a component causes a failure of the whole system. Redundancies
and the fact that some failures only cause a system failure if they occur in a special
combination are not regarded.
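As a minimal numeric sketch of the Parts Count estimate, the following fragment sums assumed per-component probabilities of default; the component names and values are illustrative only.

```python
# Assumed probabilities of default per component (cf. the occurrence classes of Table 7.3).
pd_components = {"sensor": 1e-3, "controller": 5e-4, "actuator": 2e-4, "cabling": 1e-4}

# Parts Count: every component failure is pessimistically assumed to fail the system, Eq. (7.5).
pd_total = sum(pd_components.values())
print(f"PD_total = {pd_total:.2e}")   # 1.80e-03
```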
The total reliability is defined as the complement of the total probability of default, R_total = 1 − PD_total.
In (Krasich 2007), Krasich notes that even if all components are analyzed in terms
of their failures, the FMEA cannot make a credible statement about the reliability
of the product because all failures are regarded independently of each other. The
failures’ interactions have to be analyzed with different methods such as Markov
analysis, event tree analysis, or fault tree analysis.
There are several norms which define the approach and the implementation of FMEA.
They are divided into
– Worldwide norms (ISO),
– European norms (EN),
– National norms (DIN, ÖNorm, BS, …),
– Sectoral norms (VDA 6 (automotive), GMP (pharmaceutical), …), and
– Company-specific norms (Q101 (Ford), Formel Q (Volkswagen), …).
ISO is short for “International Organization for Standardization.” It was founded
in 1946 and is a voluntary organization based in Geneva. Its decisions are not inter-
nationally binding treaties and the goal is rather to establish international standards.
National institutions for standardization include (Hofmann 2009) the following:
– ANSI—American National Standards Institute: www.ansi.org.
– DIN—Deutsches Institut für Normung: www.din.de.
– BSI—British Standard Institute: www.bsi.org.uk.
– AFNOR—Association Française de Normalisation: www.afnor.fr.
– UNI—Ente Nazionale Italiano di Unificazione: www.uni.com.
– SNV—Schweizerische Normenvereinigung: www.snv.ch.
– ON—Österreichisches Normeninstitut: www.on-norm.at.
Table 7.4 lists examples of FMEA standards with different application areas.
Table 7.4 Examples of some FMEA standards and their application area
EN IEC 60812, “Procedures for failure mode and effect analysis (FMEA)”; application area: generic; 2006/EU.
BS 5760-5 (replaced by EN 60812), “Guide to failure modes, effects and criticality analysis (FMEA and FMECA)”; application area: generic; 1991/GB.
SAE ARP 5580, “Recommended failure modes and effects analysis (FMEA) practices for non-automobile applications”; application area: generic, without automotive; 2001/USA.
MIL STD 1629, “Procedures for performing a failure mode and effect analysis”; application area: military; 1980/USA.
VDA 4.2, “Sicherung der Qualität vor Serieneinsatz”; application area: automotive; 2006/D.
SAE J1739, “Potential failure mode and effects analysis in design (design FMEA), potential failure mode and effects analysis in manufacturing and assembly processes (process FMEA), and potential failure mode and effects analysis for machinery (machinery FMEA)”; application area: automotive; 2002/USA.
Ford Design Institute, “FMEA Handbook”; application area: automotive; 2004/USA.
Section 7.9 presents some possibilities from the literature to eliminate weaknesses
of classical FMEA.
The first method, based on (Chin, Wang et al. 2009) and explained in Sect. 7.9.1,
makes it possible to let some of the team members have more influence on the
evaluation of certain components. This can be very useful, since the teams consist
of persons with different professional responsibilities.
The second method, based on (Bluvband, Grabov et al. 2004), introduces a fourth
rating besides O, S, and D, the feasibility rating F.
Section 7.9.3 suggests the additional use of risk maps to not only depend on the
risk priority number when deciding on improvement measures.
A short introduction to the two extensions FMECA and FMEDA of FMEA is
given in Sects. 7.9.4 and 7.9.5. The C in FMECA stands for an additional criticality
analysis and the D in FMEDA stands for diagnostic coverage. FMEDA determines the safe failure fraction (SFF) in addition to the classical FMEA results, as an evaluation measure for functional safety management according to IEC 61508.
This section summarizes concepts from (Chin, Wang et al. 2009) as presented in
(Hofmann 2009).
In diverse teams, the team members have different professional responsibilities.
So, when rating a failure, it may be desirable that the rating of a team member with
an expertise in that area weighs more than the rating of a team member whose job is
related to a different part of the system.
For this, so-called weighting factors are introduced. For each failure, a number λi
is assigned to the ith team member with 0 ≤ λi ≤ 1 based on his or her competence
in the working area related to the failure. Here, higher values for λi indicate more
competence. The λi of all team members have to add up to 1, that is, for a team with
n members, Σ_{i=1}^{n} λ_i = 1. The total rating is computed by
rating = Σ_{k=1}^{n} λ_k · rating_k. (7.7)
Here, rating_k is the rating given by the kth team member. If a team member is uncertain, his or her rating can in turn be composed of m partial ratings rating_{k,l} that are weighted with risk factors RF_l, which again add up to 1:

rating_k = Σ_{l=1}^{m} RF_l · rating_{k,l}. (7.8)
Example A team member thinks that the failure should be rated with 8 if the temperature is at least t0 and with 7 if the temperature is below t0, and he or she knows that statistically the temperature is at least t0 60% of the time. Then his or her rating would be 0.6 · 8 + 0.4 · 7 = 7.6.
The following example describes an evaluation with weighting factors and risk
factors.
Example Consider a team consisting of three team members TM1 , TM2 , and TM3 .
The weighting factors of the team members and their ratings including risk factors
are shown in Table 7.5.
The total rating is

rating = Σ_{k=1}^{n} λ_k Σ_{l=1}^{m} RF_l · rating_{k,l}
= 0.5 · (0.2 · 7 + 0.8 · 8) + 0.3 · (1 · 8) + 0.2 · (0.25 · 6 + 0.25 · 7 + 0.25 · 8 + 0.25 · 9) = 7.8.
Table 7.5 Example of an evaluation by three team members (TM) with weighting factors and risk
factors
TM1 (λ1 = 0.5): rating 7 with risk factor RF = 0.2; rating 8 with RF = 0.8.
TM2 (λ2 = 0.3): rating 8 with RF = 1.
TM3 (λ3 = 0.2): ratings 6, 7, 8, and 9, each with RF = 0.25.
One can also compute the distribution of the ratings as displayed in Fig. 7.5. For this, one looks at one rating number and adds all products of weighting and risk factor that belong to this number. For example, for the rating 7 this is λ1 · RF_{TM1,7} + λ3 · RF_{TM3,7} = 0.5 · 0.2 + 0.2 · 0.25 = 0.15 = 15%. In total, we have the distribution {(6, 5%), (7, 15%), (8, 75%), (9, 5%)}.
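The following short Python sketch reproduces this calculation for the data of Table 7.5 using (7.7) and (7.8); the data structure and variable names are illustrative.

```python
# (weighting factor lambda_k, [(risk factor RF_l, rating_{k,l}), ...]) per team member.
team = [
    (0.5, [(0.2, 7), (0.8, 8)]),                          # TM1
    (0.3, [(1.0, 8)]),                                    # TM2
    (0.2, [(0.25, 6), (0.25, 7), (0.25, 8), (0.25, 9)]),  # TM3
]

member_ratings = [sum(rf * r for rf, r in parts) for _, parts in team]      # Eq. (7.8)
total_rating = sum(lam * mr for (lam, _), mr in zip(team, member_ratings))  # Eq. (7.7)
print(total_rating)  # 7.8

# Distribution of the ratings: sum of lambda_k * RF_l for each rating value.
distribution = {}
for lam, parts in team:
    for rf, rating in parts:
        distribution[rating] = distribution.get(rating, 0.0) + lam * rf
print(distribution)  # {7: 0.15, 8: 0.75, 6: 0.05, 9: 0.05}
```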
Fig. 7.5 Distribution of the ratings (in %) over the rating values 1 to 10 for the example of Table 7.5
where the measure with the biggest value is the most effective.
As we have seen before, RPN is not the same as risk. Hence, it is sometimes useful to
include both parameters into the process of deciding where improvement measures
are necessary.
When including risk into the decision process, the well-known graphical presen-
tation of risk in a risk map can be used.
The risk of each of the 100 fields in the diagram is computed by multiplying the value on the abscissa with the one on the ordinate. Now each of the 100 fields in the
diagram is colored green, yellow, or red, see Fig. 7.6 for an example.
The colors have the following meaning (Helbling and Weinback 2004):
The classification can be done by defining three intervals for the risk and assigning
a color to each of them, for example, green for a risk of at most 9, yellow for a risk
between 10 and 20, and red for a risk above 20.
It is also possible to give additional restrictions for O and S. One can see that this has been done in Fig. 7.6 because (a) the graph is not symmetric, and (b) (O = 7, S = 3) is yellow with a risk of 21 while (O = 10, S = 2) is red with a risk of 20.
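A minimal sketch of such a classification is given below. It implements only the three risk intervals mentioned above with example thresholds; the additional restrictions on O and S that make the map of Fig. 7.6 asymmetric are not reproduced and would have to be added on top.

```python
def classify(o, s, green_max=9, yellow_max=20):
    """Color of a risk map field from the ratings O and S (illustrative thresholds)."""
    risk = o * s
    if risk <= green_max:
        return "green"
    if risk <= yellow_max:
        return "yellow"
    return "red"

for o, s in [(2, 4), (5, 4), (9, 8)]:
    print(o, s, o * s, classify(o, s))   # green, yellow, red
```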
The classification depends on the company and/or the project.
7.9.4 FMECA
FMECA is short for failure modes, effects and criticality analysis. The “International
Electrotechnical Vocabulary (IEV)” defines FMECA as “a qualitative method of reli-
ability analysis which involves a fault modes and effects analysis together with a
consideration of the probability of their occurrence and of the ranking of the serious-
ness of the faults” (IEC 2008). That is, fault modes which are severe are not allowed
to have a high probability of occurrence (Ringwald and Häring 2010).
Here, one does not rely on the subjective evaluations from Sect. 7.5 but introduces
a method where absolute failure rates are determined and a key figure for the severity
of a fault mode is computed. Since FMECA is based on mathematical models instead
of subjective observations, the reliability of a component and the whole system can be
estimated more precisely and the importance of preventive measures can be judged
better (Ringwald and Häring 2010).
A so-called criticality matrix is used to display the criticality of a component.
The axes are labeled with the probability of occurrence (extremely unlikely, remote,
occasional, reasonably probable, frequent) and the severity (minor, marginal, critical,
catastrophic) (Ringwald and Häring 2010).
An example for a criticality matrix of a component with three possible failures is
given in Fig. 7.7. The colors are defined to have the following meaning:
[Fig. 7.7: criticality matrix with probability of occurrence (extremely unlikely, remote, occasional, reasonably probable, frequent) versus severity (minor, marginal, critical, catastrophic); failure 3 is placed at reasonably probable, failure 1 at remote, and failure 2 at extremely unlikely; the upper right corner indicates high risk, the lower left corner low risk]
Fig. 7.7 Example for a criticality matrix. Classification from DIN EN 60812 with a sample color
coding to show the evaluation of the combinations of severity class and probability class
Green: acceptable.
Yellow: tolerable; improve as low as reasonably practicable (ALARP).
Orange: improvement needed; conduct verifiable significant improvement measures.
Red: not acceptable at any rate.
Note that in Fig. 7.7 events with the same risk value are assessed differently, in
particular, catastrophic events. This is termed risk aversion. In a similar way, risk
priority numbers that are low only due to high detection ratings are often treated as less acceptable.
7.9.5 FMEDA
Failure modes effects and diagnostic coverage analysis (FMEDA) is another exten-
sion of FMEA. The procedure is similar to the classical FMEA but more evaluation
parameters are determined (Ringwald and Häring 2010).
The safety-relevant components are analyzed in terms of criticality, diagnosis, and
failure rates. The potential failures are identified and their influence on the system’s
safety is determined (Ringwald and Häring 2010).
According to IEC 61508, a dangerous failure is a “failure of an element and/or
subsystem and/or system that plays a part in implementing the safety function that:
(a) prevents a safety function from operating when required (demand mode) or
causes a safety function to fail (continuous mode) such that the EUC is put into
a hazardous or potentially hazardous state; or
(b) decreases the probability that the safety function operates correctly when
required” (IEC 61508 2010).
The total failure rate of a component is split into λ_total = λ_S + λ_DD + λ_DU,
where λ_S are the safe failures per hour, λ_DD are the dangerous detected failures per
hour, and λ_DU the dangerous undetected failures per hour (Ringwald and Häring
2010).
Remark Failure rates will be introduced in more detail in Chap. 9 (Sect. 9.8) in the
context of reliability prediction.
The diagnostic coverage (DC) is the probability that a dangerous failure, in case it
occurs, is discovered by a test during system operation, e.g., production operation.
It is computed by
DC = λ_DD / (λ_DD + λ_DU). (7.12)
The total safe failure rate is the sum of all the failure rates of safe and dangerous
detected failures. It is hence computed by
λ_total/safe = λ_S + λ_DD. (7.13)
The safe failure fraction (SFF) is the ratio of the safe and detected dangerous
failures to all failures. It is computed by
SFF = (λ_S + λ_DD) / (λ_S + λ_DD + λ_DU) = λ_total/safe / λ_total. (7.14)
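Equations (7.12)–(7.14) translate directly into code. The following Python sketch uses invented failure rate values (in failures per hour) purely for illustration.

# Illustrative FMEDA evaluation parameters (failure rates in 1/h, values are made up).
lam_s  = 2.0e-7   # safe failures
lam_dd = 7.0e-8   # dangerous detected failures
lam_du = 3.0e-8   # dangerous undetected failures

dc = lam_dd / (lam_dd + lam_du)             # diagnostic coverage, Eq. (7.12)
lam_total_safe = lam_s + lam_dd             # total safe failure rate, Eq. (7.13)
lam_total = lam_s + lam_dd + lam_du         # total failure rate
sff = lam_total_safe / lam_total            # safe failure fraction, Eq. (7.14)

print(f"DC = {dc:.2f}, SFF = {sff:.2f}")    # DC = 0.70, SFF = 0.90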
SFF and DC are the evaluation parameters (as mentioned at the beginning of the
section) which are added to the classical FMEA to get a FMEDA. Table 7.6 shows
an example of a part of a FMEDA form sheet (Ringwald and Häring 2010).
Table 7.6 Example for the structure of a FMEDA table (Ringwald and Häring 2010). Columns: failure
mode, basis failure rate λ, probability of failure, λ_S, λ_DD, λ_DU, DC, SFF
These failure fractions are used as a basis to determine the feasible SIL within
a safety analysis following IEC 61508 by inspection of best practice look-up tables
which give requirements for the possible combinations of hardware fault tolerance
(HFT), SFF (or DC), and complexity of components (Level A/Level B) to reach a
given intended SIL.
It is possible to use system structures, functions, and functional structures for FMEA
that have been determined when executing other system analysis methods.
On the other hand, it is also possible to use, for example, a functional tree that
has been developed during the FMEA as a basis for an FTA.
So, if different teams are analyzing the same system with different methods, they
should compare their work to avoid doing the same analysis twice and to profit from
each other.
Carrying out an FMEA takes a lot of time and has to be done by people with
experience in that work field. Hence, the personnel costs are not negligible.
If an FMEA is only done because the client wishes it and not because it is a usual
part of the development process, and is then carried out not by a group of persons with
a lot of experience but by a single person who happens to be available, it often does not yield
useful, cost-saving results. In those cases, the FMEA report is a waste of time and
money: it increases neither the quality nor the safety of the product, nor does it lead
to higher client satisfaction (Teng and Ho 1996).
If the FMEA does not include flowcharts or other process diagrams, or if those
are not explained to the team members, the team sometimes does not understand the
system well enough to see critical functions and potential problems.
Another problem if the FMEA does not include process diagrams is that joining
elements (soldered, adhesive, welded, and riveted joints, screws) are easily forgotten
in the analysis (Raheja 1981).
In classical FMEA, the probability of occurrence, the severity, and the probability
of detection are rated subjectively, and with that the risk priority number. Hence,
there is no objective way to measure the efficiency of the improvement measures.
7.12 Questions
7.13 Answers
References
AIAG (2008): Potential failure mode and effects analysis (FMEA). Reference manual. 4th ed.
[Southfield, MI]: Chrysler LLC, Ford Motor Co., General Motors Corp.
Ben-Daya, M. and A. Raouf (1996). “A revised failure mode and effects analysis model.”
International Journal of Quality & Reliability Management, 13(1): 43–47.
Bluvband, Z., P. Grabov and O. Nakar (2004). “Expanded FMEA (EFMEA).” Annual Reliability
and Maintainability Symposium: 31–36.
Chin, K.-S., Y.-M. Wang, G. K. K. Poon and J.-B. Yang (2009). “Failure mode and effects analysis
using a group-based evidential reasoning approach.” Computers & Operations Research 36(6).
Deutsche Gesellschaft für Qualität e.V. (2004). FMEA - Fehlermöglichkeits- und Einflussanalyse.
Berlin, Beuth.
DIN (2006). Analysetechniken für die Funktionsfähigkeit von Systemen - Verfahren für die
Fehlzustandsart- und -auswirkungsanalyse (FMEA) (IEC 60812:2006), Beuth Verlag GmbH.
Dittmann, L. U. (2007). “Fehlermöglichkeits- und Einflussanalyse (FMEA).” OntoFMEA: Ontolo-
giebasierte Fehlermöglichkeits- und Einflussanalyse: 35–73.
Eberhardt, O. (2008). Gefährdungsanalyse mit FMEA: Die Fehler-Möglichkeits- und Einfluss-
Analyse gemäß VDA-Richtlinie, Expert Verlag.
Eidesen, K. and T. Aven (2007). “An evaluation of risk assessment as a tool to improve patient
safety and prioritise the resources.” Risk, Reliability and Societal Safety: 171–177.
Helbling and Weinback (2004). “Human-FMEA.” Diplomarbeit.
Hofmann, B. (2009). Fehlermöglichkeits- und Einflussanalyse (FMEA). EMI Bericht E 04/09,
Fraunhofer Ernst-Mach-Institut.
IEC (2008). “International Electrotechnical Vocabulary Online.”
IEC 61508 (2010): Functional Safety of Electrical/Electronic/Programmable Electronic Safety-
related Systems Edition 2.0. Geneva: International Electrotechnical Commission.
Krasich (2007). “Can Failure Modes and Effects Analysis Assure a Reliable Product.” Reliability
and Maintainability Symposium, 277-281.
Mariani, R., G. Boschi and F. Colucci (2007). “Using an innovative SoC-level FMEA methodology
to design in compliance with IEC61508, EDA Consortium.” Proceedings of the conference on
Design, Automation and Test in Europe: 492–497.
Mohr, R. R. (2002). “Failure Modes and Effects Analysis.”
Müller, D. H. and T. Tietjen (2003). FMEA-Praxis, Hanser.
NASA Lewis Research Center. (2003). “Tools of Reliability Analysis - Introduction and FMEAs.”
Retrieved 2015-09-15, from http://www.fmeainfocentre.com/presentations/dfr9.pdf.
Naude, P., G. Lockett and K. Holmes (1997). “A case study of strategic engineering decision
making using judgmental modeling and psychological profiling.” IEEE TRANSACTIONS ON
ENGINEERING MANAGEMENT 44(3): 237–247.
Pentii, H. and H. Atte (2002). "Failure Mode and Effects Analysis of Software-Based Automation
Systems." SÄTEILYTURVAKESKUS STRÅLSÄKERHETSCENTRALEN RADIATION AND
NUCLEAR SAFETY AUTHORITY STUK-YTO-TR 190.
Plato. (2008). “Plato - Solutions by Software.” from www.plato.de.
Raheja, D. (1981). “Failure Mode and Effect Analysis - Uses and Misuses.” ASQC Quality Congress
Transactions - San Francisco: 374–379.
Ringwald, M. and I. Häring (2010). Zuverlässigkeitsanalyse der Elektronik einer Sensorplatine.
Praxisarbeit DHBW Lörrach/ EMI-Bericht, Fraunhofer Ernst-Mach-Institut.
Satsrisakul, Yupak (2018): Quantitative probabilistic safety assessment of autonomous car functions
with Markov models. Master Thesis. Universität Freiburg, Freiburg. INATECH.
Sony Semiconductor. (2000). “Quality and Reliability HandBook.” from http://www.sony.net/Pro
ducts/SC-HP/quality/index.html.
Steinbring, F. (2008). “Sicherheits - und Risikoanalysen mit Verwendung der Platosoftware.”
Vortrag Unterlüß, 11.2008, RMW GmbH.
Syska, A. (2006). Fehlermöglichkeits- und Einflussanalyse (FMEA). Produktionsmanagement: Das
A-Z wichtiger Methoden und Konzepte für die Produktion von heute, Gabler: 46–49.
TechTarget. (2012). "SearchSoftwareQuality." Retrieved 2012-03-26, from http://searchsoftwareq
uality.techtarget.com/definition/software-requirements-specification.
Teng and Ho (1996). “FMEA An integrated approach for product design and process control.”
International Journal of Quality & Reliability Management 13(5): 8–26.
VDA (1996). “Sicherung der Qualität vor Serieneinsatz - System-FMEA.”
VDA (2006). Bd.4 System FMEA im Entwurf Test 2.
Wittig, K.-J. (1993). Qualitätsmanagement in der Praxis. Stuttgart, Teubner.
Chapter 8
Hazard Analysis
8.1 Overview
Figure 8.1 shows the relationship between the different hazard analysis methods. The
abbreviations stand for hazard analysis (HA), preliminary hazard list (PHL), prelim-
inary hazard analysis (PHA), subsystem hazard analysis (SSHA), system hazard
analysis (SHA), and operating and support hazard analysis (O&SHA). FTA and
FMECA are already known from Chaps. 6 and 7.
Depending on the product, the procedure of the hazard analysis can also follow
different schemes, for example, the PHA and HA can be combined for very simple
systems.
Table 8.1 shows which form of safety analysis needs input from which other form.
It reads as follows: the analysis method on the left side uses information from the
methods in the columns marked with an “X”.
Remarks
– The points (8) to (12) are updated according to the measures.
– A hazard log is only added to. Entries are never deleted.
– It has to be clear who is responsible for writing the hazard log (assuring
consistency).
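The append-only character of the hazard log can be illustrated with a small sketch. Class and attribute names below are chosen freely and only mirror a selection of the worksheet columns discussed in this chapter; they are not prescribed by any standard.

from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass(frozen=True)
class HazardEntry:
    """One immutable entry of the hazard log (simplified selection of columns)."""
    hazard_id: str
    hazard: str
    source: str
    cause: str
    consequences: str
    risk_category: str
    status: str = "open"                  # "open" or "completed"
    entry_date: date = field(default_factory=date.today)

class HazardLog:
    """Append-only hazard log: entries are never deleted, updates are appended as new entries."""
    def __init__(self, responsible: str):
        self.responsible = responsible    # person responsible for writing the log (consistency)
        self._entries: List[HazardEntry] = []

    def add(self, entry: HazardEntry) -> None:
        self._entries.append(entry)

    def history(self, hazard_id: str) -> List[HazardEntry]:
        """Full history of one hazard, oldest entry first."""
        return [e for e in self._entries if e.hazard_id == hazard_id]

log = HazardLog(responsible="safety manager")
log.add(HazardEntry("H-001", "tank rupture", "pressurized tank", "overpressure",
                    "injury of operator", risk_category="high"))
log.add(HazardEntry("H-001", "tank rupture", "pressurized tank", "overpressure",
                    "injury of operator", risk_category="medium", status="completed"))
print(len(log.history("H-001")))          # 2 -- the original entry is kept, the update is appended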
The preliminary hazard list (PHL) is the starting point for the following hazard
analyses. The goal is a preliminary list of hazards. The estimates are conservative, that
is, risks are rather over- than underestimated.
The information necessary to execute a PHL and the results of a PHL are displayed
in Table 8.2.
The procedure of the PHL is as follows (Echterling 2011):
(1) Detection of potential hazards by brainstorming, checklists for hazard sources,
and expertise.
(2) Update of the hazard log.
The head of the worksheet should at least contain the following entries (Echterling
2011):
(1) The system under analysis.
(2) Name of the person executing the analysis.
(3) Date of the execution.
The worksheet should at least contain the following columns (Echterling 2011):
(1) Unique identification of the hazard for references, for example, a running
number.
(2) Operation mode.
(3) Hazard (title).
(4) Hazard source.
(5) Cause/trigger.
(6) Consequences.
(7) Comments.
Table 8.2 Tabular overview on types of hazard analyses: name, input, main analysis process tasks,
and output. Based on Ericson (2005), Echterling (2010)
Preliminary hazard list (PHL)
– Input: design knowledge; generic hazard lists, e.g., from standards; experiences
– Hazard analysis tasks: comparison of design and hazards; comparison of design and experiences; documentation process
– Output: specific hazards; safety-critical factors; hazard sources
Preliminary hazard analysis (PHA)
– Input: design knowledge; general hazards; PHL; top events
– Hazard analysis tasks: analysis of the PHL and the top events; identification of new hazards; evaluation of the hazards as far as the knowledge of the design allows it; documentation process
– Output: specific hazards; safety-critical factors; hazard sources; methods for risk reduction; system requirements
Subsystem hazard analysis (SSHA)
– Input: design; specific hazards; PHA; safety-critical functions; gained experiences
– Hazard analysis tasks: evaluation of the causes of all hazards; execution of necessary analyses; documentation process
– Output: hazard sources; risk evaluation; system requirements
System hazard analysis (SHA)
– Input: design; specific hazards; PHA, SSHA; safety-critical functions; gained experiences
– Hazard analysis tasks: identification of subsystem interface hazards; evaluation of the causes of all hazards; execution of necessary analyses; documentation process
– Output: system interface hazards; hazard sources; risk evaluation; system requirements
Operating and support hazard analysis (O&SHA)
– Input: design; system operation; handbooks; hazard lists
– Hazard analysis tasks: gathering of all tasks and steps of a procedure; evaluation of all tasks and steps; use of identified hazards and checklists; identification and evaluation of all hazards; documentation process
– Output: specific hazards; hazard sources; risk evaluation; system requirements; precautionary measures, warnings, procedures
Remarks
– The differentiation between hazard source and cause can, for example, be the
following: the hazard source is an explosive and the cause/trigger of the hazard is
vibration.
– The analysis must be started early in the developing process.
– All potential hazards must be documented. (This includes hazards which are
judged as harmless.)
– The procedure of identifying potential hazards should be well structured.
– A checklist for hazard sources must be used.
– A list of the used hardware, the functions, and the modes of operation should be
available.
– System-specific abbreviations and terms should be avoided.
– The description of the hazard in the worksheet should not be too detailed but
meaningful.
– There should be enough explanatory text in the worksheet.
– The description of a cause has to be clearly related to the corresponding hazard.
Table 8.3 Extract of an exemplary German PHL based on DIN EN ISO 12100 (Thielsch 2012)
Mechanical hazard (kinetic or rotational energy)
– Cause/triggering event: acceleration/deceleration; moving parts; cutting parts; sharp edges; rotating parts
– Potential damage effects/consequences: squeezing; impact; cutting (off); roping in/wrapping/catching; drawing in; chafing/excoriation; cropping; puncture; penetration or perforation
Electrical hazard
– Cause/triggering event: electric overload; short circuit
– Potential damage effects/consequences: injuring or fatal electric shock; fire
Hazards caused by substances/agents
– Cause/triggering event: smoke gas/vapor; CBRN agents
– Potential damage effects/consequences: breathing difficulties; poisoning
Physical stress
– Cause/triggering event: stress; mental/psychological overload; working tasks beyond scope of expert knowledge or skills of employee
– Potential damage effects/consequences: human error; burnout
Human physical hazard
– Cause/triggering event: too high physical stress; carrying of heavy loads
– Potential damage effects/consequences: joint damage; herniated disk
The preliminary hazard analysis (PHA) analyzes hazards, their causes, their conse-
quences, the related risks, and countermeasures to reduce or eliminate those risks.
It is executed when detailed information is not yet available. “A mishap risk index,
see e.g. (MIL-STD-882D 2000), describes the associated severity and probability of
each hazard” (Häring and Kanat 2013). The information from the PHL is used in the
PHA.
The information necessary to execute a PHA and the results of a PHA are displayed
in Table 8.2, see the entry for the PHA.
The procedure of the PHA is as follows (Echterling 2011):
(1) Copying entries from the PHL.
(2) Identification of new potential hazards, their causes, and the related risks.
(3) Estimation of the risk category (qualitatively or quantitatively).
(4) Suggestion of measures to eliminate hazards or reduce risks (it is possible that
no measures can yet be suggested at this point in time).
(5) Evaluation of the planned measures to ensure that they meet the expectations
and possibly suggestions for later verification procedures for the measures.
(6) Update of the hazard log.
The PHA is a living document and transforms into a hazard analysis later (typically
after a “Preliminary Design Review”).
In the following, the entries/columns that were also part of the PHL are written
in gray. Here, information from the PHL can be copied but it has to be updated. The
columns are not left out to emphasize the importance of regarding them again under
consideration of new system knowledge.
The head of the worksheet should at least contain the following entries (Echterling
2011):
(1) The system under analysis.
(2) Part or function of the system under analysis.
(3) Name of the person executing the analysis.
(4) Date of the execution.
The worksheet should at least contain the following columns (Echterling 2011):
(1) Unique identification of the hazard for references, for example, a running
number.
(2) Operation mode.
(3) Hazard (title).
(4) Hazard source.
(5) Cause/trigger.
(6) Consequences.
(7) Estimate of the risk category at the beginning (baseline).
(8) All countermeasures.
Remarks
– The remarks for the PHL still hold.
– The description of the severity classes must be clearly related to the consequences.
– The probability of occurrence must be clearly related to the causes.
– The countermeasures must be clearly related to the causes and hazard sources.
– In most cases, it cannot be expected that the severity class is reduced by
countermeasures.
– The risk category should be evaluated in terms of improvement after the execution
of countermeasures.
– The handling of the hazard can only be finished in the framework of the PHA if
the safety requirements are fulfilled.
The subsystem hazard analysis (SSHA) is executed as soon as there is enough infor-
mation available about the subsystem and as soon as the system is so complex that
the analysis is regarded as useful.
The goal of the SSHA is to identify causes of and countermeasures for previ-
ously determined hazards and the verification of safety on the subsystem level. Risk
estimates are specified with fault tree analysis or FMEA.
For the SSHA, it is especially important to complement the tables with enough
explanatory text.
“The SSHA stays unfinished as long as the risk is higher than the requirements
demand” (Häring and Kanat 2013).
The information necessary to execute a SSHA and the results of a SSHA are
displayed in Table 8.2.
The procedure of the SSHA is as follows (Echterling 2011):
(1) Copying entries from the PHA.
(2) Identification of new potential hazards within the subsystem.
(3) Determination of the risk category, for example, with FMEA or fault tree
analysis.
(4) Determination of measures to eliminate hazards or reduce risks.
(5) Evaluation of the planned measures to ensure that they meet the expectations.
(6) Verification of the final risk category after executing all measures.
(7) Update of the hazard log.
Again, as before: In the following, the entries/columns that were also part of the
PHA are written in gray. Here, information from the PHA can be copied but it has to
be updated. The columns are not left out to emphasize the importance of regarding
them again under consideration of new system knowledge.
The head of the worksheet should at least contain the following entries (Echterling
2011):
(1) The system under analysis.
(2) Parts or subsystem or function of the system under analysis.
(3) Name of the person executing the analysis.
(4) Date of the execution.
The worksheet should at least contain the following columns (Echterling 2011):
(1) Unique identification of the hazard for references, for example, a running
number.
(2) Operation mode.
(3) Hazard (title).
(4) Hazard source.
(5) Cause/trigger.
(6) Consequences.
(7) Risk category at the beginning (baseline).
(8) All countermeasures.
(9) Risk category after executing the countermeasures.
(10) Planned verification procedure or executed verification with reference to the
corresponding document.
(11) Comments.
(12) Status of the hazard (the processing is either still open or completed).
Remarks
– The remarks for the PHL and PHA still hold.
– The SSHA must be executed within the boundaries of the subsystem.
– Detailed references are necessary, in particular, to other system analysis methods.
– The SSHA must analyze the system in as much detail as necessary for the
respective hazard. It is the most detailed hazard analysis.
– The handling of hazards can only be finished when the remaining risk is acceptable
and the safety requirements are fulfilled.
The system hazard analysis (SHA) evaluates risks and ensures the adherence to safety
rules on system level. It is based on the PHA and possibly the SSHA. “It considers
interfaces between subsystems. The SHA includes common cause failures and risks
with regard to interface hazards. […] The SHA is the hazard analysis with the widest
scope. The analysis must be comprehensible by a third party. Hazards may only be
marked as closed when the effectiveness of recommended actions has been verified
and the safety requirements have been met” (Häring and Kanat 2013).
The information necessary to execute a SHA and the results of a SHA are displayed
in Table 8.2.
The procedure of the SHA is as follows (Echterling 2011):
(1) Copying entries from the PHA.
(2) Identification of further hazards on system level.
(3) Identification of hazards from interfaces between subsystems.
(4) Determination of a risk category for the identified hazards before the execution
of measures (baseline). The risk category usually is the same as in the PHA.
(5) Determination of measures to reduce risks.
(6) Monitoring of the execution of the suggested measures.
(7) Evaluation of the planned measures to ensure that they meet the expectations.
(8) Verification of the final risk category after executing all measures.
(9) Update of the hazard log.
The SHA is a living document and is typically presented at the “Critical Design
Review.”
Again, as before: In the following, the entries/columns that were also part of the
PHA are written in gray. Here, information from the PHA can be copied but it has to
be updated. The columns are not left out to emphasize the importance of regarding
them again under consideration of new system knowledge.
The head of the worksheet should at least contain the following entries (Echterling
2011):
(1) The system under analysis.
(2) Name of the person executing the analysis.
(3) Date of the execution.
The worksheet should at least contain the following columns (Echterling 2011):
(1) Unique identification of the hazard for references, for example, a running
number
(2) Operation mode.
(3) Hazard (title).
(4) Hazard source (e.g., interface, subsystem).
(5) Cause/trigger.
(6) Consequences.
(7) Risk category at the beginning.
(8) All countermeasures.
(9) Risk category after executing the countermeasures.
(10) Planned verification procedure or executed verification with reference to the
corresponding document.
(11) Comments.
(12) Status of the hazard (the processing is either still open or completed).
Remarks
– The remarks for the PHL and PHA still hold.
– Hazards can only be marked as completed if the effectiveness of the measures is
verified.
– Failures with common cause and dependent events have to be sufficiently
regarded.
– Fault trees do not replace hazard analyses but do preliminary work for them.
– The handling of hazards can only be finished when the remaining risk is acceptable
and the safety requirements are fulfilled.
– Detailed references are necessary.
– The analysis must be understandable for a qualified third party.
The operating and support hazard analysis (O&SHA) captures hazards that can
occur in operation and maintenance. The goal is to identify causes, consequences,
risks, and measures for risk reduction.
The information necessary to execute an O&SHA and the results of an O&SHA
are displayed in Table 8.2.
The procedure of the O&SHA is as follows (Echterling 2011):
(1) Making a list of all activities that are relevant in the framework of an O&SHA
(this list can normally be generated from existing documents).
(2) Evaluation of all activities related to a hazard and comparison with a checklist
for hazard sources.
(3) Determination of a risk category for the identified hazards before and after the
execution of measures.
(4) Documentation of all measures.
(5) Checking that all necessary warnings are written in the documentation.
(6) Monitoring of the execution of the suggested measures.
(7) Update of the hazard log.
In the following, the entries/columns that were also part of the SHA are written in
gray. Other than before, here information is not copied from the previous analyses.
The O&SHA can, however, be combined with the SHA to a more extended SHA.
The head of the worksheet should at least contain the following entries (Echterling
2011):
(1) The system under analysis.
(2) Operation mode of the system.
(3) Name of the person executing the analysis.
The worksheet should at least contain the following columns (Echterling 2011):
(1) Unique identification of the hazard for references, for example, a running
number.
(2) Description of the procedure.
(3) Hazard (title).
(4) Hazard source (e.g., interface, subsystem).
(5) Cause/trigger.
(6) Consequences.
(7) Risk category at the beginning.
(8) All countermeasures.
(9) Risk category after executing the countermeasures.
(10) Planned verification procedure or executed verification with reference to the
corresponding document.
(11) Comments.
(12) Status of the hazard (the processing is either still open or completed).
Remarks
– All activities must be listed.
– All activities must be completely described.
Table 8.4 summarizes and compares the columns of the worksheets of the different
types of hazard analysis.
The hazard analysis demands a risk evaluation for the identified hazards. The goal is
to analyze whether the identified risk is below a given level of acceptance. The risk
is measured by a combination of probability of occurrence and the extent of damage.
The risk evaluation can be qualitative, quantitative, or semi-quantitative.
A quantitative evaluation expresses risk in terms of numbers, for example, €/year.
A qualitative evaluation uses terms like “low” and “high.” The semi-quantitative
evaluation assigns relative numbers to each risk class, for example, numbers from 1
to 10.
Table 8.4 Entries in the worksheet of the different types of hazard analysis. Columns: hazard log (HL),
preliminary hazard list (PHL), preliminary hazard analysis (PHA), subsystem hazard analysis (SSHA),
system hazard analysis (SHA), operating and support hazard analysis (O&SHA)
System under analysis: X X X X X
System component under analysis: X X
State of operation of system: X
Executing person: X X X X X
Date of the execution: X X X X X
Identification: X X X X X X
Mode of operation: X X X X X X
Description of procedure: X
Hazard: X X X X X X
Hazard source: X X X X X X
Cause: X X X X X X
Consequences: X X X X X X
Risk category at beginning: X X estimate X X X
Countermeasures: X
All countermeasures: X X X X
Risk category after countermeasures: X X estimate X X X
Verification procedure: X X X X X
Comments: X X X X X X
Date and status: X X X X X
An established method for the evaluation of risks is the so-called risk map or risk
matrix which has already been introduced in Sect. 7.9.3.
Figure 8.2 shows a semi-quantitative risk matrix according to MIL-STD-882D.
Each combination of damage and probability of occurrence leads to one risk class.
Fig. 8.2 Risk matrix example as proposed by MIL-STD-882D (Department of Defence 2000)
The acceptance level of each class is marked with a color. Red is high, orange serious,
yellow medium, and green low risk.
The decision which field is marked in which color depends on the situation.
Risk graphs are a semi-quantitative method to evaluate risks. The risk is estimated
with a decision tree typically using three or four risk parameters (depending on which
norm is used).
The IEC 61508 uses four risk parameters to determine technical reliability
requirements for safety functions. At least the first two and the last of these decision
options can be used for the assessment of any risk:
(C) Consequence risk parameter,
(F) Frequency and exposure time risk parameter,
(P) Possibility of failing to avoid hazard risk parameter, and
(W) Probability of the unwanted occurrence.
Fig. 8.3 Risk graph: general scheme (IEC 61508 2010). Copyright © 2010 IEC Geneva,
Switzerland. www.iec.ch
Fig. 8.4 Risk graph—example (illustrates general principles only) (IEC 61508 2010). Copyright
© 2010 IEC Geneva, Switzerland. www.iec.ch
Table 8.5 Explanation of the parameters in Figs. 8.4 and 8.5 (IEC 61508 2010). Copyright © 2010
IEC Geneva, Switzerland. www.iec.ch
Parameters:
C = Consequence risk parameter
F = Frequency and exposure time risk parameter
P = Possibility of avoiding hazard risk parameter
W = Probability of unwanted occurrence
a, b, c, …, h = Estimates of the required risk reduction for the safety-related systems
Necessary minimum risk reduction → Safety integrity level:
– → No safety requirements
a → No special safety requirements
b, c → 1
d → 2
e, f → 3
g → 4
h → An E/E/PE safety-related system is not sufficient
Tables 8.6 and 8.7 show examples of calibrations of the parameters C, F, P, and
W for the example in Fig. 8.4 and the generic risk graph in Fig. 8.3, respectively.
The present risk graph is a visualization option for the successive consideration
of influencing factors relevant for the frequency/probability and the consequences.
Typically, each level means an increase or decrease by an order of magnitude (factor
of 10 or 0.1). The risk graph implicitly applies individual as well as collective
risk criteria. Typically, it uses a single individual risk criterion with some deviations
for large consequence numbers. In the present case, we have
R = W · F · P · C ≤ R_crit, (8.1)
Table 8.6 Example of data relating to risk graph in Fig. 8.4 (IEC 61508 2010). Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Consequence (C):
– C1: Minor injury
– C2: Serious permanent injury to one or more persons; death to one person
– Comment 1: The classification system has been developed to deal with injury and death to people. Other classification schemes would need to be developed for …
Table 8.7 (continued)
Possibility of avoiding the hazardous event (P) if the protection system fails to operate:
– PA: Adopted if all conditions in column 4 are satisfied
– PB: Adopted if all the conditions are not satisfied
Demand rate (W), the number of times per year that the hazardous event would occur in absence of the E/E/PE safety-related system:
– W1: Demand rate less than 0.1 D per year
– W2: Demand rate between 0.1 D and D per year
– W3: Demand rate between D and 10 D per year
– For demand rates higher than 10 D per year, higher integrity shall be needed
– To determine the demand rate it is necessary to consider all sources of failure that can lead to one hazardous event. In determining the demand rate, limited credit can be allowed for control system performance and intervention. The performance which can be claimed if the control system is not to be designed and maintained according to IEC 61508 is limited to below the performance ranges associated with SIL 1
NOTE This is an example to illustrate the application of the principles for the design of risk graphs. Risk graphs for particular applications and particular hazards will be agreed with those involved, taking into account tolerable risk, see Clauses E.1 to E.6
Table 8.8 Overview of norms and standards treating risk graphs, translated extract of Table 10.3
in Thielsch (2012)
ISO 13849-1: Safety of machinery – Safety-related parts of control systems – Part 1: General principles for design (2008)
IEC 61508: Functional safety of electrical/electronic/programmable electronic safety-related systems (2010)
IEC 61511: Functional safety – Safety instrumented systems for the process industry sector (2005)
VDI/VDE 2180: Safeguarding of industrial process plants by means of process control engineering (2010)
where Rcrit can be derived for each combination of the risk graph. Typically, it is the
same for small consequence numbers and decreases for larger consequence numbers.
Often the factor C rates fatalities 10 times more than injuries. Another option to use
the same risk criterion is to rate for instance 10 fatalities 100 times more than 1
fatality, thus implementing risk aversion.
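The multiplicative reading of Eq. (8.1) can be illustrated with a short calculation. All parameter values and the criterion R_crit below are invented order-of-magnitude choices for illustration and are not calibrated values from IEC 61508.

# Illustrative check of Eq. (8.1): R = W * F * P * C <= R_crit.
# Each risk graph decision typically changes the estimate by roughly one order of magnitude.
W = 1e-1       # frequency of the unwanted occurrence per year (assumed)
F = 1e-1       # fraction of time exposed to the hazard (assumed)
P = 1.0        # probability of failing to avoid the hazard (conservative assumption)
C = 1.0        # consequence weight, e.g., 1 for one fatality, 0.1 for an injury (assumed)

R_crit = 1e-5  # assumed tolerable individual risk per year

R = W * F * P * C
required_reduction = R / R_crit if R > R_crit else 1.0
print(f"R = {R:.1e} per year, required risk reduction factor: {required_reduction:.0f}")
# R = 1.0e-02 per year, required risk reduction factor: 1000

The required risk reduction factor then corresponds to one of the risk graph outputs a, …, h, which are estimates of the required risk reduction (see Table 8.5).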
The IEC 61508 is not the only norm using risk graphs. An extract of an overview
of norms and standards treating risk graphs from Thielsch (2012) is shown in Table
8.8.
The presented risk graph approach is very generic and can also be applied to safety
and security-critical systems, e.g., if the safety risks originate from security hazards
(cyber-induced functional safety issues). However, an even more generic approach
would be to assess the resiliency of technical systems rather than only their safety
and security.
This would allow considering in a more explicit and systematic way not only the
effects of the resilience (catastrophe, disruptive event) management phases preparation,
prevention, and protection on risk quantification but also much more explicitly
the phases response and recovery. At the top level, this outlines a possible way to resilient
socio-technical systems, see e.g., (Häring 2013a, c; Häring et al. 2016).
The IEC 61508–5 introduces a method to quantitatively determine the safety integrity
levels (SIL). The following key steps of the method are necessary for every E/E/PE
safety-related system (IEC 61508 S+ 2010):
– “determine the tolerable risk from a table as [Table 8.9];
– determine the EUC [equipment under control] risk;
– determine the necessary risk reduction to meet the tolerable risk;
Table 8.9 Example of a risk classification of accidents, graphically adapted, content from IEC
61508 S+ (2010). Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Consequence
Frequency Catastrophic Critical Marginal Negligible
Frequent I I I II
Probable I I II III
Occasional I II III III
Remote II III III IV
Improbable III III IV IV
Incredible IV IV IV IV
Class I Intolerable risk
Class II Undesirable risk, and tolerable only if risk reduction is impracticable or if the
costs are grossly disproportionate to the improvement gained
Class III Tolerable risk if the cost of risk reduction would exceed the improvement gained
Class IV Negligible risk
– allocate the necessary risk reduction to the E/E/PE safety-related systems, other
technology safety-related systems and other risk reduction measures” (IEC 61508
S+ 2010).
Here, Table 8.9 should be seen as an example of what such a table can look like. It
should not be used in this way since precise numbers, for example, for the frequencies,
are missing (IEC 61508 S+ 2010).
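The qualitative look-up of Table 8.9 can be expressed as a small mapping. The following sketch reproduces the example classification of Table 8.9; the function and variable names are chosen for illustration only.

# Risk class look-up reproducing the example of Table 8.9 (I = intolerable, ..., IV = negligible).
RISK_CLASS = {
    "frequent":   {"catastrophic": "I",   "critical": "I",   "marginal": "I",   "negligible": "II"},
    "probable":   {"catastrophic": "I",   "critical": "I",   "marginal": "II",  "negligible": "III"},
    "occasional": {"catastrophic": "I",   "critical": "II",  "marginal": "III", "negligible": "III"},
    "remote":     {"catastrophic": "II",  "critical": "III", "marginal": "III", "negligible": "IV"},
    "improbable": {"catastrophic": "III", "critical": "III", "marginal": "IV",  "negligible": "IV"},
    "incredible": {"catastrophic": "IV",  "critical": "IV",  "marginal": "IV",  "negligible": "IV"},
}

def risk_class(frequency: str, consequence: str) -> str:
    """Return the risk class for a qualitative frequency and consequence category."""
    return RISK_CLASS[frequency.lower()][consequence.lower()]

print(risk_class("occasional", "critical"))   # 'II' -> undesirable, tolerable only with justification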
In this method, the limit of a safety function of an E/E/PE safety-related system
for a given level of damage is computed by
PFD_avg ≤ F_t / F_np = (C · F_t) / (C · F_np), (8.2)
with the tolerable frequency F_t and the frequency F_np of the hazardous event without
additional protection measures.
Remark If the SIL determined with (8.2) is too high, the probability of occurrence
can be reduced by appropriate protection measures. Afterward, formula (8.2) can be
used again with F p instead of Fnp where
F_p = F_np · p_1, (8.3)
and p1 is the probability that the protection measure prevents a failure (Thielsch
2012).
Example As mentioned before, Table 8.9 cannot be used in this way because the
frequencies do not have numerical values. For the following example, we use Table
8.10 from Thielsch (2012) from whom we also have this example. This table should
not be copied for projects besides this example.
We consider the example of a safety function of a safety-related E/E/PE system
with low demand rate. Experts determined that in case of a failure “severe” damage
would occur. The frequency of such a failure was estimated with F_np = 1.5 · 10^-2 a^-1.
Using Table 8.10 one can see that the risk class is III-C. This is a non-tolerable
risk (red). The highest tolerable frequency for severe consequences is F_t = 10^-4 a^-1
(the frequency belonging to the last green area in the column “severe”).
Hence, the highest possible PFD_avg is computed by
PFD_avg = F_t / F_np = (10^-4 a^-1) / (1.5 · 10^-2 a^-1) ≈ 6.7 · 10^-3. (8.4)
We remember the SIL table from Table 3.5 that we display here again as Table
8.11 to get a better overview.
We can see that for a system with low demand rate, PFD_avg = 6.7 · 10^-3
corresponds to SIL 2.
The experts also found out that it is possible to install an independent non-
functional protection measure which reduces the probability of occurrence by 50
percent. Hence,
F_p = 0.5 · F_np = 7.5 · 10^-3 a^-1. (8.5)
This leads to
Table 8.11 The SILs for high and low demand rates as defined in IEC 61508 (2010). Copyright ©
2010 IEC Geneva, Switzerland. www.iec.ch
Low demand rate (average probability of a dangerous failure per request):
SIL 4: ≥ 10^-5 to < 10^-4
SIL 3: ≥ 10^-4 to < 10^-3
SIL 2: ≥ 10^-3 to < 10^-2
SIL 1: ≥ 10^-2 to < 10^-1
High demand rate (average frequency of a dangerous failure per hour):
SIL 4: ≥ 10^-9 h^-1 to < 10^-8 h^-1
SIL 3: ≥ 10^-8 h^-1 to < 10^-7 h^-1
SIL 2: ≥ 10^-7 h^-1 to < 10^-6 h^-1
SIL 1: ≥ 10^-6 h^-1 to < 10^-5 h^-1
PFD_avg = F_t / F_p = (10^-4 a^-1) / (7.5 · 10^-3 a^-1) ≈ 1.3 · 10^-2. (8.6)
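The computation of Eqs. (8.4)–(8.6) together with the look-up in Table 8.11 can be sketched as follows. The function below only covers the low demand mode and is meant for illustration; the numerical inputs are those of the example above.

from typing import Optional

# Illustrative SIL look-up for low demand mode based on the ranges of Table 8.11.
def sil_low_demand(pfd_avg: float) -> Optional[int]:
    """Return the SIL for a given average probability of dangerous failure on demand."""
    if 1e-5 <= pfd_avg < 1e-4:
        return 4
    if 1e-4 <= pfd_avg < 1e-3:
        return 3
    if 1e-3 <= pfd_avg < 1e-2:
        return 2
    if 1e-2 <= pfd_avg < 1e-1:
        return 1
    return None   # outside the tabulated SIL ranges

F_t  = 1e-4       # tolerable frequency per year (Table 8.10 in the example)
F_np = 1.5e-2     # estimated frequency without protection measures per year

print(sil_low_demand(F_t / F_np))   # Eq. (8.4): approx. 6.7e-3 -> SIL 2

F_p = 0.5 * F_np                    # Eq. (8.5): additional protection measure halves the frequency
print(sil_low_demand(F_t / F_p))    # Eq. (8.6): approx. 1.3e-2 -> SIL 1

According to Table 8.11, the value of Eq. (8.6) falls into the SIL 1 band, i.e., the additional protection measure reduces the required SIL from 2 to 1.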
Alternatively to the SIL determination based on risk map of Sect. 8.10.1, risk graph
of Sect. 8.10.2, and the computational method of Sect. 8.10.3, the necessary SILs
of systems can also be computed from first principles in terms of allowed local
individual risk and collective risk criteria. An example for a vehicle protection system
is provided in Kaufman and Häring (2011).
More generally, any method suited for assessing overall system safety in terms
of overall acceptable risk of a system considering short-term and long-term options
before, during, and/or after events can be used to determine functional safety require-
ments. See, for instance, the often data-driven and learning-based approaches envisioned
for critical infrastructure systems in Bruneau et al. (2020), the process and methods
listed in Häring et al. (2017) for critical infrastructure systems, the approach in Häring
(2015) mainly for hazardous (explosive) events, and the methods proposed in Häring
(2013b, c) for civil crisis response systems.
Example processes to be conducted range from basic in Dörr and Häring (2008)
to current risk analysis and management processes (Häring 2015), performance-
based risk and resilience assessment (Häring et al. 2017) to panarchy process-based
approaches (Häring et al. 2020).
The application of a tabletop analysis to identify challenging risks for mass gath-
erings is provided in Siebold et al. (2015). Such approaches are more efficient if they
are supported with software, see, e.g., (Baumann et al. 2014) for an implementation
example developed within the EU project (BESECURE 2015). An example for a
Table 8.12 shows how the different types of hazard analysis can be assigned to the
phases of the Safety Life Cycle of IEC 61508. A clear allocation to one phase is not
always possible. Table 8.13 shows how the different types of hazard analysis can be
assigned to a more general description of a development cycle.
Table 8.12 Allocation of the different types of hazard analysis to the phases of the Safety Life
Cycle of IEC 61508, following (Echterling 2011)
Abbreviation Method Safety life cycle phase of IEC 61508a
PHL Preliminary hazard list 1 Concept
2 Overall scope definition
PHA Preliminary hazard analysis 3 Hazard and risk analysis
SSHA Subsystem hazard analysis 7 Overall safety validation planning
9 Realization
10 Safety-related systems: other
technology—Realization
11 External risk reduction
facilities—Realization
SHA System hazard analysis 7 Overall safety validation planning
9 Realization
10 Safety-related systems: other
technology—Realization
11 External risk reduction
facilities—Realization
13 Overall safety validation
O&SHA Operation and support hazard analysis 6 Overall operation and maintenance
planning
9 Realization
10 Safety-related systems: other
technology—Realization
11 External risk reduction
facilities—Realization
13 Overall safety validation
a For more details on the Safety life cycle according to IEC 61508 see Chap. 11
Table 8.13 Allocation of the different types of hazard analysis to a general development cycle
(Echterling 2011); phases: concept, preliminary design, design, test, operation, with the Preliminary
Design Review and the Critical Design Review as milestones
PHL: concept
PHA: preliminary design
SSHA: design
SHA: design, test
O&SHA: test, operation
The following standardization approach from Häring and Kanat (2013) can be seen
as a motivation to think about the safety analysis processes used in your company.
In the example analyzed in the paper, Häring and Kanat found out that “very
different approaches [to safety analysis] were used. Even within the same company,
safety analysis techniques often vary significantly from project to project” (Häring
and Kanat 2013). Since many companies work on projects in which they are only a
sub-contractor or which are publicly funded, this situation might occur more often
because “depending on the project set-up, the industry has sometimes to work together
with third-party auditors which require different levels of rigor of analysis”
(Häring and Kanat 2013). The standardization approach developed in
Häring and Kanat (2013) is shown in Fig. 8.5.
“It was decided to agree on a list of safety analysis techniques and to specify
in detail how they were to be executed. This entailed a short description of the
technique, a formal portrait of the process, a list of columns for tabular techniques
and a catalogue of important issues that need to be considered” (Häring and Kanat
2013).
The chapter introduced different types of hazard analysis. They focus on
different aspects of the system and are executed when different amounts of informa-
tion are available. Table 8.14 summarizes the minimal table content of the different
methods, compare also Table 8.4.
We have seen (see Table 8.4) that many columns are common to
the different types of hazard analysis. These, however, might still be filled with
different information because the analyses are executed at different points in time in
the development cycle (see Tables 8.12 and 8.13).
8.14 Questions
Table 8.14 Analysis methods and minimal required table columns (Häring and Kanat 2013)
Columns Analysis methods
PHL PHA SSHA SHA O&SHA FMECA HL
Unique identifier X X X X X X X
Element X
Failure mode X
Failure rate X
Immediate effects X
System effect X
Operating mode X X X X X X
Hazard X X X X X X
Source X X X X X X
Cause/trigger X X X X X X X
Initial risk X X X X X
Recommended action X X X X X X
Residual risk X X X X X
Verification procedure X X X X X
Result X X X X X X
Comment X X X X X X X
Date and status X X X X X
(6) Is a SHA still necessary if SSHA has been done for every subsystem? Why?
Why not?
(7) What is the major difference in the general idea of the SHA and the O&SHA?
Name an example for a hazard and a hazard source that could appear in both
types of analysis. What would be a possible cause for the hazard to appear in the
SHA? What would be a possible cause for the hazard to appear in the O&SHA?
(8) How can you plan the tabular scope of hazard analyses during system
development? Name typical column headings of PHA and SHA.
8.15 Answers
(1) Major advantages of hazard analyses include the following: (i) hazard analyses
can be used in all phases of the development process to assess safety of systems;
(ii) it covers detail up to management level; (iii) systematic inductive approach
on level of kinds of hazards; (iv) it also comprises deductive approach as it
asks for origins of potential hazard events; (v) it can be used to summarize
and coordinate the need for other more detailed system analysis methods; and
(vi) hazard log allows to summarize past experiences in the safety development
process and lessons learnt in the safety management process.
References
Baumann, D; Häring, I; Siebold, U; Finger, J (2014): A web application for urban security
enhancement. In Klaus Thoma, Ivo Häring, Tobias Leismann (Eds.): Proceedings/9th Future
Security - Security Research Conference. Berlin, September 16 - 18, 2014. Fraunhofer-Verbund
Verteidigungs- und Sicherheitsforschung; Future Security Research Conference; Future Security
2014. Stuttgart: Fraunhofer Verl., pp. 17–25.
BESECURE (2015): EU project, Best practice enhancers for security in urban environments,
2012–2015, Grant agreement ID: 285222. Available online at https://cordis.europa.eu/project/
id/285222, checked on 9/27/2020.
Bruneau, Michael; Cimellaro, Gian-Paolo; Didier, Max; Marco, Domaneschi; Häring, Ivo; Lu,
Xilin et al. (2020): Challenges and generic research questions for future research on resilience.
In Zhishen Wu, Xilin Lu, Mohammad Noori (Eds.): Resilience of critical infrastructure systems.
Emerging developments and future challenges. Boca Raton, FL: CRC Press, Taylor & Francis
Group (Taylor and Francis series in resilience and sustainability in civil, mechanical, aerospace
and manufacturing engineering systems), pp. 1–42.
D-BOX (2016): EU Project, Demining tool-box for humanitarian clearing of large scale area
from anti-personnel landmines and cluster munitions, 2013–2016, Grant agreement ID: 284996.
Available online at https://cordis.europa.eu/project/id/284996, checked on 9/27/2020.
Department of Defence (2000). MIL-STD-882D Standard practice for system safety.
Deutsches Institut für Normung (2011). DIN EN ISO 12100. Sicherheit von maschinen - Allgemeine
Gestaltungsgrundsätze - Risikobeurteilung und Risikominderung.
Dörr, Andreas; Häring, Ivo (2008): Introduction to methods applied in hazard and risk analysis.
In N Gebbeken, K Thoma (Eds.): “Bau-Protect” Buildings and Utilities Protection. Bilingual
Proceedings. Berichte aus dem Konstruktiven Ingenieurbau Nummer 2008/02. 3. Workshop Bau-
Protect, Buildings and Utilities Protection. Munich, 28.-29.10.2008. Neubiberg: Universität der
Bundeswehr München.
Echterling, N. (2010). IEC 61508 Phasen und Methoden (presentation). Koblenz Symposium.
Holzen.
Echterling, N. (2011). EMI report E 17/11
EDEN (2016): EU Project, End-user driven demo for CBRNE, 2013–2016, Grant agreement
ID: 313077. Available online at https://cordis.europa.eu/project/id/313077; https://eden-securi
tyfp7.eu/, checked on 9/27/2020.
Encounter (2015): EU Project Encounter, explosive neutralisation and mitigation countermeasures
for IEDs in urban/civil environment, 2012–2015, Grant agreement ID: 285505 (2015-08-20).
Available online at https://cordis.europa.eu/project/id/285505; www.encounter-fp7.eu/.
Ericson, C. (2005). Hazard Analysis Techniques for System Safety, Wiley.
Fehling-Kaschek, Mirjam; Faist, Katja; Miller, Natalie; Finger, Jörg; Häring, Ivo; Carli, Marco
et al. (2019): A systematic tabular approach for risk and resilience assessment and Improvement
in the telecommunication industry. In Michael Beer, Enrico Zio (Eds.): Proceedings of the 29th
European Safety and Reliability Conference (ESREL 2019). ESREL. Hannover, Germany, 22–
26 September 2019. European Safety and Reliability Association (ESRA). Singapore: Research
Publishing Services, pp. 1312–1319. Available online at https://esrel2019.org/files/proceedings.
zip.
Fehling-Kaschek, Mirjam; Miller, Natalie; Haab, Gael; Faist, Katja; Stolz, Alexander; Häring,
Ivo et al. (2020): Risk and Resilience Assessment and Improvement in the Telecommunica-
tion Industry. In Piero Baraldi, Francesco Di Maio, Enrico Zio (Eds.): Proceedings of the 30th
European Safety and Reliability Conference and the 15th Probabilistic Safety Assessment and
Management Conference. ESREL2020 and PSAM15. European Safety and Reliability Aassoci-
ation (ESRA), International Association for Probabilistic Safety Assessment and Management
(PSAM). Singapore: Research Publishing Services. Available online at https://www.rpsonline.
com.sg/proceedings/esrel2020/pdf/3995.pdf, checked on 9/25/2020.
Fischer, K.; Riedel,W.; Häring, I.; Nieuwenhuijs, A.; Crabbe, S.; Trojaborg, S.; Hynes,W. (2012a):
Vulnerability Identification and Resilience Enhancements of Urban Environments. In N. Aschen-
bruck, P. Martini, M. Meier, J. Tölle (Eds.): 7th Security Research Conference, Future Security
2012. Bonn.
Fischer, Kai; Riedel, Werner; Häring, Ivo; Nieuwenhuijs, Albert; Crabbe, Stephen; Trojaborg,
Steen S. et al. (2012b): Vulnerability Identification and Resilience Enhancements of Urban Envi-
ronments. In Nils Aschenbruck, Peter Martini, Michael Meier, Jens Tölle (Eds.): Future Secu-
rity. 7th Security Research Conference, Future Security 2012, Bonn, Germany, September 4-6,
2012. Proceedings, vol. 318. Berlin, Heidelberg: Springer (Communications in Computer and
Information Science, 318), pp. 176–179.
Fischer, K; Häring, I; Riedel, W (2015): Risk-based resilience quantification and improvement
for urban areas. In Jürgen Beyerer, Andreas Meissner, Jürgen Geisler (Eds.): Security Research
Conference. 10th Future Security; Berlin, September 15-17, 2015 Proceedings. Fraunhofer IOSB,
Karlsruhe; Fraunhofer-Verbund Verteidigungs- und Sicherheitsforschung; Security Research
Conference. Stuttgart: Fraunhofer Verlag, pp. 417–424. Available online at http://publica.fraunh
ofer.de/eprints/urn_nbn_de_0011-n-3624247.pdf, checked on 9/25/2020.
Fischer, Kai; Hiermaier, Stefan; Riedel, Werner; Häring, Ivo (2018): Morphology Dependent
Assessment of Resilience for Urban Areas. In Sustainability 10 (6), p. 1800. https://doi.org/
10.3390/su10061800.
Fischer, Kai; Hiermaier, Stefan; Werner, Riedel; Häring, Ivo (2019): Statistical driven Vulnera-
bility Assessment for the Resilience Quantification of Urban Areas. In: Proceedings of the 13-th
International Confernence on Applications of Statistics and Probability in Civil Engineering.
ICASP13. Seoul, South Korea, 26.-30.05.2019. Available online at http://s-space.snu.ac.kr/bitstr
eam/10371/153259/1/26.pdf, checked on 9/25/2020.
Ganter, Sebastian; Srivastava, Kushal; Vogelbacher, Georg; Finger, Jörg; Vamanu, Bogdan;
Kopustinskas, Vytis et al. (2020): Towards Risk and Resilience Quantification of Gas Networks
based on Numerical Simulation and Statistical Event Assessment. In Piero Baraldi, Francesco
Di Maio, Enrico Zio (Eds.): Proceedings of the 30th European Safety and Reliability Confer-
ence and the 15th Probabilistic Safety Assessment and Management Conference. ESREL2020
and PSAM15. European Safety and Reliability Aassociation (ESRA), International Association
for Probabilistic Safety Assessment and Management (PSAM). Singapore: Research Publishing
Services. Available online at https://www.rpsonline.com.sg/proceedings/esrel2020/html/3971.
xml, checked on 9/25/2020.
Häring, I (2013a): Sorting enabling technologies for risk analysis and crisis management. In Bern-
hard Katzy, Ulrike Lechner (Eds.): Civilian Crisis Response Models (Dagstuhl Seminar 13041).
27 pages. With assistance of Marc Herbstritt, p. 71.
Häring, I. (2013b): Sorting enabling technologies for risk analysis and crisis management. In Bern-
hard Katzy, Ulrike Lechner (Eds.): Civilian Crisis Response Models (Dagstuhl Seminar 13041).
27 pages. With assistance of Marc Herbstritt, p. 71.
Häring, I. (2013c): Workshop research topic Resilient Systems. In Bernhard Katzy, Ulrike Lechner
(Eds.): Civilian Crisis Response Models (Dagstuhl Seminar 13041). 27 pages. With assistance
of Marc Herbstritt, pp. 86–87.
Häring, Ivo (2015): Risk analysis and management. Engineering resilience. Singapore: Springer.
Häring, I. and B. Kanat (2013). “Standardization of best-practice safety analysis methods…” Draft.
Häring, I.; Schönherr, M.; Richter, C. (2009): Quantitative hazard and risk analysis for fragments
of high-explosive shells in air. In Reliability Engineering & System Safety 94 (9), pp. 1461–1470.
https://doi.org/10.1016/j.ress.2009.02.003.
Häring, Ivo; Ebenhöch, Stefan; Stolz, Alexander (2016): Quantifying resilience for resilience engi-
neering of socio technical systems. In Eur J Secur Res (European Journal for Security Research)
1 (1), pp. 21–58. https://doi.org/10.1007/s41125-015-0001-x.
Häring, Ivo; Sansavini, Giovanni; Bellini, Emanuele; Martyn, Nick; Kovalenko, Tatyana; Kitsak,
Maksim et al. (2017): Towards a generic resilience management, quantification and development
process: general definitions, requirements, methods, techniques and measures, and case studies.
In Igor Linkov, José Manuel Palma-Oliveira (Eds.): Resilience and risk. Methods and application
in environment, cyber and social domains. NATO Advanced Research Workshop on Resilience-
Based Approaches to Critical Infrastructure Safeguarding. Dordrecht: Springer (NATO science
for peace and security series. Series C, Environmental security), pp. 21–80. Available online at
https://www.springer.com/de/book/9789402411225.
Häring, Ivo; Ramin, Malte von; Stottmeister, Alexander; Schäfer, Johannes; Vogelbacher, Georg;
Brombacher, Bernd et al. (2019): Validated 3D Spatial Stepwise Quantitative Hazard, Risk and
Resilience Analysis and Management of Explosive Events in Urban Areas. In Eur J Secur Res
Safety Assessment and Management Conference; PSAM; Annual European Safety and Reliability
Conference; ESREL. Red Hook, NY: Curran, pp. 5757–5767.
XP-DITE (2017): Accelerated Checkpoint Design Integration Test and Evaluation. Booklet.
Available online at https://www.xp-dite.eu/dissemination/, updated on 2017-2, checked on
9/25/2020.
Chapter 9
Reliability Prediction
9.1 Overview
FIDES, a more recent standard. The chapter ends with questions with answers in
Sects. 9.12 and 9.13.
After introducing the necessary definitions, it will be shown that there are several
standards for reliability predictions. They are based on additive or additive–multi-
plicative models. The standard MIL-HDBK-217 is still widely used, although it is
not updated anymore. However, reliability growth factors are used to take account
of improvement of technology and production. The modern standard 217-Plus is the
successor of this standard. Companies have enriched the standard MIL-HDBK-217
with their own field data to get a more accurate standard for their field of work. The
modern standard FIDES uses a different model than MIL-HDBK-217. Its goal is to
give less conservative estimates of the probability of failure.
The chapter is mostly based on Meier and Häring (2010) and Ringwald and Häring
(2010). Ringwald’s main source is Birolini (2007). Meier’s main sources are Bowles
(1992), IEEE-Guide (2003), Birolini (2007), Glose (2008), Hoppe and Schwederski
(2009).
The reliability of a component up to time t can be written as R(t) = 1 − F(t),
where F(t) = P(component fails in (0, t]) (Ringwald and Häring 2010).
Dependability is “the collective term used to describe the availability performance
and its influencing factors: reliability performance, maintainability performance and
maintenance support performance” (IEC 60050-191 1990).
Please note that, for instance, in German, both dependability and reliability are rendered as "Zuverlässigkeit." So, when deciding on the right English translation of "Zuverlässigkeit," one should keep in mind that reliability as defined above is only a part of the much broader concept of dependability (Glose 2008).
Fig. 9.1 The most common system analysis (failure rate analysis) methods, translated from Meier and Häring (2010): deductive analysis (e.g., fault tree analysis), inductive analysis (e.g., system FMEA, FMECA, event tree analysis), Markov procedures, and the Parts Count and Parts Stress methods
Failure modes analysis determines the potential types of failures that can occur. This
can be done inductively by asking “Which failures can occur and which consequences
do they have?” or deductively by regarding a potential failure and asking for the
causes of this failure (Meier and Häring 2010).
Fault tree analysis (Chap. 6) and FMEA (Chap. 7) are examples for failure mode
analysis methods.
Reliability prediction computes the system’s reliability using the failure rates of
single elements and their interfaces. For this, several standards have been devel-
oped which describe methods to compute failure rates with regard to environmental
influences (Meier and Häring 2010).
Reliability prediction will be introduced in this chapter.
System state analysis computes the system’s dependability and safety by applying
Markov models. Markov models use state transition diagrams to display the system’s
dependability and safety performance. They are used to model the temporal behavior
of the system where a system is understood as a composition of elements with only
two states (“functioning” and “failed”). The system itself can be in many states, each
of which is characterized by a certain combination of functioning and failed elements
(EMPA 2009; Meier and Häring 2010).
System state analysis will not be treated in more detail here.
9.4 Software
There are several software tools for reliability predictions, for example, A.L.D. (RAM-Commander), Reliability Analysis Center (PRISM), and RELEX (Reliability Studio). The tools differ in implementation but in most cases use the same methods and standards to determine component and system safety (Ringwald and Häring 2010).
Besides (failure rate) predictions, the tools are also used to display the functional connections of the components in reliability block diagrams (RBD), in fault trees, and in event trees.
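As a minimal illustration of the kind of computation behind such reliability block diagrams (not tied to any particular tool), the following Python sketch combines assumed constant failure rates of components in series and in parallel; all component names and rates are hypothetical placeholders.

import math

def reliability(failure_rate_per_h, mission_time_h):
    # Survival probability of one component, assuming a constant failure rate
    return math.exp(-failure_rate_per_h * mission_time_h)

def series(reliabilities):
    # Series RBD: the system works only if every block works
    r = 1.0
    for ri in reliabilities:
        r *= ri
    return r

def parallel(reliabilities):
    # Parallel (redundant) RBD: the system fails only if every block fails
    q = 1.0
    for ri in reliabilities:
        q *= (1.0 - ri)
    return 1.0 - q

t = 8760.0                          # one year of operation in hours (assumption)
r_sensor = reliability(2e-6, t)     # hypothetical sensor failure rate
r_logic = reliability(5e-7, t)      # hypothetical logic solver failure rate
r_actuator = reliability(1e-6, t)   # hypothetical actuator failure rate

# Redundant sensors feeding one logic solver and one actuator
r_system = series([parallel([r_sensor, r_sensor]), r_logic, r_actuator])
print(f"System reliability over one year: {r_system:.6f}")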
9.5 Failure
A failure occurs if a component cannot carry out its demanded function. Failure-free
time, that is, the time where no failures occur, can usually be interpreted as a random
variable. Here, it is assumed that the component is free of failures at t = 0. Besides
their probability of occurrence, failures should be analyzed in terms of type, cause, and consequences (Ringwald and Häring 2010).
In the following let

F(t) := P(component fails in (0, t]),

that is, F(t) is defined as the probability that the component fails up to time t (Ringwald and Häring 2010).
When regarding the different parts of a system, one distinguishes between a high and
a low demand mode of operation (IEC 61508 S+ 2010; Ringwald and Häring 2010).
In low demand mode, a system only uses its safety function once or very rarely
(at most once a year) (Birolini 2007). For example, if the system is an airbag system,
the safety function that the airbag deploys in a crash is in low demand mode. On the
other hand, the safety function that the airbag does not deploy during normal driving
is used constantly. It is hence in high demand mode (Ringwald and Häring 2010).
For a system in low demand mode, one regards the probability of failure on
demand (PFD). A system in high demand mode is evaluated by the probability of
failure per hour (PFH). With those, the safety integrity levels (SIL) are defined, see
Table 8.11.
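As an illustration of how such SIL bands can be used in computations, a minimal Python sketch follows; the band limits below are the commonly cited PFD and PFH ranges of IEC 61508 (cf. Table 8.11) and should be checked against the standard before any real use.

def sil_from_pfd(pfd_avg):
    # Low demand mode: average probability of failure on demand mapped to a SIL
    if 1e-5 <= pfd_avg < 1e-4:
        return 4
    if 1e-4 <= pfd_avg < 1e-3:
        return 3
    if 1e-3 <= pfd_avg < 1e-2:
        return 2
    if 1e-2 <= pfd_avg < 1e-1:
        return 1
    return None  # outside the SIL 1-4 bands

def sil_from_pfh(pfh):
    # High demand mode: probability of dangerous failure per hour mapped to a SIL
    if 1e-9 <= pfh < 1e-8:
        return 4
    if 1e-8 <= pfh < 1e-7:
        return 3
    if 1e-7 <= pfh < 1e-6:
        return 2
    if 1e-6 <= pfh < 1e-5:
        return 1
    return None

print(sil_from_pfd(3e-3))  # -> 2 for an assumed average PFD of 3e-3
print(sil_from_pfh(2e-8))  # -> 3 for an assumed PFH of 2e-8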
The failure density f(t) is defined as

f(t) := dF(t)/dt,    (9.3)
where F(t) is the probability that the component fails in (0, t], see Sect. 9.5 (Glose
2008; Hansen 2009; Meier and Häring 2010).
The definition of f (t) can be rewritten as
f(t) := dF(t)/dt
      = lim_{δt→0} [P(failure in (0, t + δt]) − P(failure in (0, t])] / δt
      = lim_{δt→0} P((failure in (t, t + δt]) ∩ (no failure in (0, t])) / δt,    (9.4)
which will be needed in Sect. 9.8. The failure density can be interpreted as the rate
of change of the failure probability.
Common density functions with the corresponding failure probabilities and failure
rates are listed in Table 9.1. An overview of mathematical definitions is given in Table
9.2.
The failure rate λ(t), given that the system is free of failure at time 0, is defined by the following formula (Leibniz-Rechenzentrum 2004):

λ(t) := lim_{δt→0} P(failure in (t, t + δt] | no failure in (0, t]) / δt
      = lim_{δt→0} P((failure in (t, t + δt]) ∩ (no failure in (0, t])) / (δt · P(no failure in (0, t]))
      = f(t) / (1 − F(t)),

where we use (9.4) in the third equation. Informally speaking, the failure rate is the rate at which a failure occurs at time t, given that no failure has occurred up to time t.

Table 9.2 Overview of mathematical definitions for reliability predictions (Meyna and Pauli 2003)

English term | Mathematical definition/notation/formulas
Actual time of failure T | T is a random variable
Survival probability Rb(t) | Rb(t) = P(T > t), with Rb(u) ≤ Rb(t) for u > t, Rb(0) = 1, lim_{t→∞} Rb(t) = 0, and Rb(t) = ∫_t^∞ f(s) ds
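As a numerical illustration of these definitions (not taken from the cited sources), the exponential distribution F(t) = 1 − exp(−λt) has density f(t) = λ·exp(−λt), so the failure rate f(t)/(1 − F(t)) equals the constant λ; the Python sketch below checks this with a finite-difference approximation of (9.4), using an assumed rate.

import math

lam = 2e-4  # assumed constant failure rate in 1/h

def F(t):
    # Failure probability of the exponential distribution
    return 1.0 - math.exp(-lam * t)

def f(t, dt=1e-3):
    # Failure density approximated via the difference quotient in (9.4)
    return (F(t + dt) - F(t)) / dt

def failure_rate(t):
    # lambda(t) = f(t) / (1 - F(t))
    return f(t) / (1.0 - F(t))

for t in (10.0, 1000.0, 50000.0):
    print(f"t = {t:>8.0f} h: lambda(t) ≈ {failure_rate(t):.3e} 1/h (exact {lam:.3e})")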
As unit of measurement, one typically uses FIT (failures in time). The definitions of FIT vary between the different standards. For example, in MIL-HDBK-217 1 FIT = 10⁻⁶/h, while in FIDES 1 FIT = 10⁻⁹/h (Ringwald and Häring 2010).
Example (Ringwald and Häring 2010) Let the probability that a certain component fails during a 30-year-long storage period be 10⁻⁸. This yields a failure rate of

10⁻⁸ / (30 a) ≈ 10⁻⁸ · (1/30) · (1/365) · (1/(24 h)) ≈ 3.81 · 10⁻¹⁴ / h = 3.81 · 10⁻⁵ FIT

with the definition from FIDES.
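The conversion in this example can be reproduced with a few lines of Python; the FIT definitions follow the text above and the numbers are the example's assumptions.

HOURS_PER_YEAR = 365 * 24  # calendar hours per year, leap days neglected

p_fail = 1e-8              # probability of failure during the storage period
storage_years = 30

failure_rate_per_h = p_fail / (storage_years * HOURS_PER_YEAR)
fit_fides = failure_rate_per_h / 1e-9   # FIDES: 1 FIT = 1e-9 / h
fit_mil = failure_rate_per_h / 1e-6     # MIL-HDBK-217 convention as quoted in the text

print(f"failure rate: {failure_rate_per_h:.2e} / h")  # ~3.81e-14 / h
print(f"FIDES FIT:    {fit_fides:.2e}")               # ~3.81e-5 FIT
print(f"MIL FIT:      {fit_mil:.2e}")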
If the failure rate λ(t) is constant, that is, λ(t) ≡ λ, its reciprocal value 1/λ is the mean time to failure (MTTF) (Glose 2008).
9.9 Bathtub Curve

The starting point for determining the failure rate is the observed failures, see Fig. 9.2. These are composed of the early "infant mortality" failures, the constant random failures, and the wear-out failures. Adding the corresponding curves (failure rates as a function of time) yields the bathtub curve (Meyna and Pauli 2003; Ringwald and Häring 2010).
The bathtub curve can, for example, be seen as the sum of two failure rate curves
belonging to two Weibull distributions with parameters b1 = 0.5, b2 = 5, and T = 1
(Glose 2008), see Fig. 9.3.
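A small Python sketch of this construction follows; the Weibull failure rate is assumed in the common form (b/T)·(t/T)^(b−1), and the parameter naming (b1, b2, T) follows the text.

def weibull_hazard(t, b, T=1.0):
    # Weibull failure rate with shape b and characteristic life T
    return (b / T) * (t / T) ** (b - 1.0)

def bathtub(t, b1=0.5, b2=5.0, T=1.0):
    # Sum of an early-failure (b1 < 1) and a wear-out (b2 > 1) Weibull failure rate
    return weibull_hazard(t, b1, T) + weibull_hazard(t, b2, T)

for t in (0.01, 0.1, 0.5, 1.0, 1.5):
    print(f"t = {t:4.2f}: lambda(t) = {bathtub(t):7.3f}")  # decreases, then increases again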
One typically divides the bathtub curve into three parts (Birolini 1991):
– Part 1: The part where the curve λ(t) decreases quickly. A high number of early
failures can be caused by material weaknesses, quality fluctuations during the
production process, or application errors (dimensioning, testing, operation).
– Part 2: The part of the curve where λ(t) is almost constant.
– Part 3: The part of the curve where λ(t) increases due to wear out failures.
Fig. 9.2 Schematic of Bathtub curve, see, e.g., Rajarshi and Rajarshi (1988)
Fig. 9.3 Example bathtub curve as the sum of two Weibull distributions (Glose 2008)
Most standards only refer to the second phase and assume a constant failure rate λ; only some standards, for example, 217-Plus, use a factor to account for the early failures (Glose 2008; Meier and Häring 2010). In practical applications, it has turned out that the assumption of a constant failure rate is realistic for many systems and makes the computations easier (Birolini 2007). On the other hand, there are also several studies, for example, (Wong 1989; Wong 1995; Jones and Hayes 2001), that criticize this assumption (Glose 2008).
9.10 Standards
Besides the logical and structural arrangement of the components, the failure rates of
the single components are essential in order to determine the failure rate of the whole
system. Standards are used to compute those failure rates. The standards are described
in handbooks which introduce generic failure rates that have been gained through
field data and simulation tests. Furthermore, the standards establish the functional
connections between failure rates and influencing factors such as temperature or
humidity of the surroundings (Meier and Häring 2010).
This section gives an overview of the standards’ general design and then lists
examples that are used in the industry.
Standards are normally described in handbooks which contain one or more of the
following prediction methods (IEEE-Guide 2003):
– Tables with constant component-specific failure rates, for operation under load as
well as for operation in idle.
– Factors for the environmental influences to compute the constant failure rates for
operation under load as well as for operation in idle.
– Factors for converting between failure rates for operation under load and operation
in idle.
One distinguishes between multiplicative and multiplicative–additive models. Most standards have adopted the multiplicative model of MIL-HDBK-217, see Sect. 9.10.2. For these multiplicative models, the formula to compute the failure rate of a component is
λ_P = f(λ_G, π_1, π_2, …, π_i) = λ_G · π_1 · π_2 · … · π_i,    (9.7)

where λ_P is the predicted failure rate of the component, λ_G is the generic failure rate, and π_k with 1 ≤ k ≤ i are adjustment factors (Meier and Häring 2010).
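A generic multiplicative prediction of this form can be sketched in a few lines of Python; the generic failure rate and the π values below are placeholders, not values from any handbook.

from math import prod

def predicted_failure_rate(lambda_g, pi_factors):
    # Multiplicative model, Eq. (9.7): lambda_P = lambda_G * pi_1 * ... * pi_i
    return lambda_g * prod(pi_factors)

lambda_g = 5.0e-9  # assumed generic failure rate in 1/h
pi_factors = {"pi_E": 4.0, "pi_Q": 2.5, "pi_T": 1.3}  # hypothetical adjustment factors

lambda_p = predicted_failure_rate(lambda_g, pi_factors.values())
print(f"lambda_P = {lambda_p:.2e} / h")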
Modern standards such as FIDES or 217-Plus use multiplicative–additive models
(Meier and Häring 2010). For these, the formulas to compute the failure rate contain
additions and multiplications. The formulae for the different multiplicative–additive
models do not have a generic form in common like the multiplicative models with
the formula in (9.7). The formulae have more complex structures than (9.7) and have
to be regarded individually for the different standards. Here, this will be done for
FIDES in Sect. 9.10.11.
9.10.2 MIL-HDBK-217
The Parts Stress method is used toward the end of the design process when a detailed
parts list (which includes component loads, environmental conditions, and arrange-
ment of the components) is available. The fundamental equation for the computation
of a component’s predicted failure rate λ p is
λ_p = λ_b · π_E · π_Q · …    (9.8)

with the generic failure rate λ_b (representing the influence of electrical and temperature loads on the component), the environmental adjustment factor π_E, and the quality adjustment factor π_Q.
Environmental conditions are grouped into classes that relate to the typical military operation areas (MIL-HDBK-217E: Table 5.1.1-3). These three general factors (λ_b, π_E, and π_Q) are contained in every component model (Department Of Defense 1986). Further π-factors depend on the type of component (resistor, capacitor, …) and account for additional component-typical stress factors (MIL-HDBK-217E: Table 5.1.1-4) (Ringwald and Häring 2010).
The Parts Count method is used to estimate the failure rate in early design phases. It is especially useful if there is not enough information available to carry out the Parts Stress method (Ringwald and Häring 2010).
The Parts Count method categorizes components by types (resistor, capacitor,
…) and uses a predefined failure rate for each type of component (see the tables in
Sect. 5.2 of MIL-HDBK-217E (Department Of Defense 1986)). The total failure rate
of the system is computed by
λ_total = Σ_{i=1}^{n} N_i (λ_G · π_Q)_i    (9.9)
where λ_total is the failure rate of the whole system, λ_G is the type-specific failure rate of the ith component, π_Q is the quality factor of the ith component, N_i is the number of units of the ith component type incorporated into the system, and n is the number of different component types used in the system. If a component has been in production for less than 2 years or if there are drastic changes in its design or production process, an additional factor π_L, the so-called learning factor or reliability growth factor, is multiplied with λ_G · π_Q for this component (Ringwald and Häring 2010). The learning factor becomes smaller as the product becomes older and more reliable. Experimental data (for example, obtained by analyzing the number of failures before and after the change of a component) and statistical tools are used to determine the learning factor.
In (9.9), adding up all failure rates means that a serial linking of all components
is assumed. In general, this overestimates the actual failure rate. Hence, the Parts
Count method normally yields higher and less precise failure rates than the Parts
Stress method (Glose 2008).
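A Python sketch of the Parts Count summation in (9.9), with the optional learning factor π_L applied per component type, may clarify the structure; all rates and factors below are invented placeholders, not MIL-HDBK-217 values.

def parts_count(parts):
    # Eq. (9.9): lambda_total = sum_i N_i * (lambda_G * pi_Q)_i, optionally times pi_L
    total = 0.0
    for p in parts:
        lam = p["lambda_G"] * p["pi_Q"] * p.get("pi_L", 1.0)
        total += p["N"] * lam
    return total

parts = [  # hypothetical parts list
    {"type": "resistor", "N": 120, "lambda_G": 1.0e-9, "pi_Q": 3.0},
    {"type": "capacitor", "N": 40, "lambda_G": 2.5e-9, "pi_Q": 3.0},
    {"type": "new ASIC", "N": 1, "lambda_G": 5.0e-8, "pi_Q": 2.0, "pi_L": 2.0},
]

print(f"lambda_total = {parts_count(parts):.2e} / h  (series assumption)")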
The Siemens standard can be used for electronic and electromechanical components
(Meier and Häring 2010).
The failure rates are given in the unit "failures per 10⁶ h" and only refer to operation under load (Hoppe and Schwederski 2009). They are determined by
Siemens test and field data as well as external sources such as MIL-HDBK-217. The
structure of the models is similar to the one of MIL-HDBK-217 (Bowles 1992) but
the adjustment factors only cover influences by electrical and thermal loads (ECSS
2006; Meier and Häring 2010).
9.10.4 Telcordia
The Telcordia standard was originally known as Bell Laboratories Bellcore standard
for reliability predictions for usual electronic components (Ringwald and Häring
2010).
Telcordia TR-332 Issue 6 and SR-332 Issue 1 are used as commercial standards in telecommunications. Telcordia comprises three methods to predict a product's reliability: Parts Count, Parts Count with additional use of laboratory data, and reliability prediction using field data. Those methods have additive and multiplicative parts (Ringwald and Häring 2010).
9.10.5 217-Plus
9.10.6 NSWC
IEC TR 62380 was developed by the French company CNET (Centre National
d’Etudes des Télécommunications, today: France Télécom (Glose 2008)). Modeling
of thermal behavior and systems at rest is used to take different environmental
conditions into account (Meier and Häring 2010).
The same data that was used by CNET was also used by two other telecommuni-
cation companies, BT (British Telecom) and Italtel (Italy), for their own standards
HRD5 (Handbook of Reliability Data) and IRPH 2003 (Italtel Reliability Prediction
Handbooks), respectively (Glose 2008).
The IEEE Gold Book is an IEEE recommendation for the design of dependable
industrial and commercial electricity supply systems. It includes data for commercial
voltage distribution systems (Meier and Häring 2010).
Due to bad predictions of MIL-HDBK-217 with respect to failure rates in the field,
the Society of Automotive Engineers (SAE), USA, enriched the standard with their
own data in 1983 (Denson 1998; Glose 2008).
GJB/Z 299B was developed by the Chinese military and comprises, like the MIL-
HDBK-217, the Parts Stress and the Parts Count methods. It was translated to English
in 2001. Sometimes it is also referred to as “China 299B” (Vintr 2007).
9.10.11 FIDES
FIDES is based on field data, tests of component manufacturers, and existing reli-
ability prediction methods. It wants to give realistic (not pessimistic or conservative)
estimates. It is also possible to use the standard on sub-assemblies instead of only
on individual components. This makes it possible to use the standard early in the
development process (Meier and Häring 2010).
In addition to the failures considered in MIL-HDBK-217, FIDES also regards insufficiencies in the development process, failures in the production process, wrong
assumptions, and induced failures. The influences are divided into three categories:
technology, process, and use. The category “technology” covers the component itself
and its installation, “process” includes all procedures from development to decom-
missioning, and “use” considers restrictions in use by the manufacturer and the qual-
ification of the user. Software failures and improper use are not considered (Meier
and Häring 2010).
The total failure rate of the system is computed by

λ_total = Σ_{i=1}^{n} (λ_physical · π_process · π_PM)_i,    (9.10)

where λ_physical is the ith component's physical failure rate; π_process is a measure for the quality of the development, production, and operating process; and π_PM (PM = parts manufacturing) is a quality evaluation of the component and the manufacturer (Meier and Häring 2010).
The physical failure rate λ_physical is determined by the equation

λ_physical = π_induced · Σ_{k=1}^{m} (λ_0 · π_acceleration)_k,    (9.11)
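The structure of (9.10) and (9.11) can be sketched in Python as follows; the π values, stress contributions, and rates are placeholders and do not come from the FIDES guide.

def lambda_physical(contributions, pi_induced):
    # Eq. (9.11): pi_induced times the sum of lambda_0 * pi_acceleration contributions
    return pi_induced * sum(l0 * pi_acc for l0, pi_acc in contributions)

def lambda_total(components):
    # Eq. (9.10): sum over components of lambda_physical * pi_process * pi_PM
    return sum(c["lambda_physical"] * c["pi_process"] * c["pi_PM"] for c in components)

# Hypothetical single component with two physical stress contributions
# (e.g., thermal and vibration), each given as (lambda_0, pi_acceleration):
phys = lambda_physical([(2.0e-9, 1.5), (1.0e-9, 3.0)], pi_induced=1.2)

components = [{"lambda_physical": phys, "pi_process": 2.0, "pi_PM": 1.1}]
print(f"lambda_total = {lambda_total(components):.2e} / h")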
9.11 Comparison of Standards and Software Tools

For a comparison of the coverage of the standards along with the available software implementations, the following criteria can be compared in tabular form for each standard (Glose 2008):
• Scope of application of standard.
• Last edition.
• Expectation of updates.
9.12 Questions
(1) For which system analysis methods can you use reliability prediction?
(2) Which reliability prediction standards can be distinguished?
(3) How are Pi factors determined?
(4) Which information is typically not yet available when applying the Parts Count analysis?
(5) What are reliability growth factors?
(6) What is the difference between multiplicative and multiplicative–additive reliability prediction models?
(7) Which of the following expressions are true?
(a) It is assumed that the component is free of failures at t = 0,
(b) F(t) = P(component fails in (0, t]),
(c) F(t) = P(component fails after time t),
(d) F(t) = P(component does not fail in (0, t]),
(e) Rb(t) = P(component fails in (0, t]),
(f) Rb(t) = P(component fails after time t), and
(g) Rb(t) = P(component does not fail in (0, t]).
(8) Which of the following expressions are true?
(a) lim_{t→∞} Rb(t) = 1,
9.13 Answers
(1) For all quantitative methods, e.g., FMEDA, FTA, and Markov models.
References
Birolini, A. (1991). Qualität und Zuverlässigkeit technischer Systeme. Berlin, Heidelberg, Springer.
Birolini, A. (2007). Reliability Engineering: Theory and Practice, Springer Berlin.
Bowles, J. B. (1992). “A Survey of Reliability-Prediction Procedures For Microelectronic Devices.”
IEEE Transactions on Reliability 41(1): 2–12.
Denson, W. (1998). “The History of Reliability Prediction.” IEEE Transactions on Reliability 47(3-
SP): 321–328.
Department Of Defense (1986). Military Handbook - Reliability Prediction of Electronic Equipment
(MIL-HDBK-217E), U.S. Government Printing Office.
ECSS (2006). Space Product assurance - Components reliability data sources and their use,
Noordwijk, The Netherlands, ESA Publications Division.
EMPA. (2009). "Markoff-Modelle." Retrieved 2012-04-02.
Glose, D (2008): Zuverlässigkeitsvorhersage für elektronische Komponenten unter mechanischer
Belastung, Reliability prediction of electronic components for mechanical stress environment.
Hochschule Kempten, University of Applied Sciences; Fraunhofer EMI, Efringen-Kirchen.
Hansen, P. (2009). Entwicklung eines energetischen Sanierungsmodells für den europäischen
Wohngebäudesektor unter dem Aspekt der Erstellung von Szenarien für Energie- und CO2-
Einsparpotenziale bis 2030, Forschungszentrum Jülich.
Hoppe, W. and P. Schwederski (2009). Comparison of 4 Reliability Prediction Approaches for
realistic failure rates of electronic parts required for Safety & Reliability Analysis. Zuverlässigkeit
und Entwurf - 3. GMM/GI/ITG-Fachtagung. Stuttgart, Germany.
IEC 60050-191 (1990). Internationales Elektrotechnisches Wörterbuch. Genf, International Elec-
trotechnical Commission.
IEC 61508 S+ (2010). Functional Safety of Electrical/Electronic/Programmable Electronic Safety-
related Systems Ed. 2 Geneva, International Electrotechnical Commission.
IEEE-Guide (2003). “IEEE Guide for selecting and using reliability prediction based on IEEE
1413.”
Jones, J. and J. Hayes (2001). “Estimation of System Reliability Using a “Non-Constant Failure
Rate” Model.” IEEE Transaction on Reliability 50(3): 286–288.
Leibniz-Rechenzentrum. (2004). “Verweildaueranalyse, auch: Verlaufsdatenanalyse, Ereignis-
analyse, Survival-Analyse (engl.: Survival Analysis, Analysis of Failure Times, Event History
Analysis).” Retrieved 2012-04-11, from http://www.lrz.de/~wlm/ilm_v8.htm.
Meier, S. and I. Häring (2010). Zuverlässigkeitsanalyse der an der induktiven Datenübertra-
gung beteiligten Schaltungsteile eines eingebetteten Systems. Praxisarbeit DHBW Lörrach/EMI-
Bericht, Fraunhofer Ernst-Mach-Institut.
Meyna, A. and B. Pauli (2003). Taschenbuch der Zuverlässigkeits- und Sicherheitstechnik;
Quantitative Bewertungsverfahren. München, Hanser.
Rajarshi, Sujata; Rajarshi, M. B. (1988): Bathtub distributions: a review. In Communications in
Statistics - Theory and Methods 17 (8), pp. 2597–2621. https://doi.org/10.1080/03610928808829761.
Ringwald, M. and I. Häring (2010). Zuverlässigkeitsanalyse der Elektronik einer Sensorplatine.
Praxisarbeit DHBW Lörrach/EMI-Bericht, Fraunhofer Ernst-Mach-Institut.
Vintr, M. (2007). Tools for Reliability Prediction of Systems, Brno University of Technology.
Wong, K. L. (1989). “The Roller-Coaster Curve is in.” Quality and Reliability Engineering int. 5:
29–36.
Wong, K. L. (1995). “A New Framework For Part Failure-Rate Prediction Models.” IEEE
Transaction on Reliability 44(1): 139–146.
Chapter 10
Models for Hardware and Software
Development Processes
10.1 Overview
what customers and end users really expect in realistic applications, foreseeable
similar applications as well as potential and accessible misuse.
Taking these challenges into account, this chapter introduces different types of
software and hardware development processes. However, instead of following the
doctrines of any of the plethora of existing system development processes and
their communities, first selected generic properties of any development process are
identified.
The chapter starts with a list of attributes or characteristics of development
processes in Sect. 10.2 that may also occur in combination for a single develop-
ment process. Afterward, a few well-known examples of models for software and
hardware development processes are introduced and presented in typical process
schemes, see Sect. 10.3.
The development model properties incremental, iterative, and linear are identified
as key properties of development processes using software development as sample
case. They can be employed in a balanced way in systematic system development
processes to make systems safer. The different software development models introduced can be classified into traditional and agile models. The attributes introduced help to show that agile development methods use approaches very similar to those of classical methods, adapted, however, to greatly increased computational resources, changed workforce expectations, and the need for fast progress validation and monitoring in modern economic contexts.
In the next chapter, we take a closer look at the Safety Life Cycle of the IEC
61508. The life of a system is divided into 16 phases which we will introduce.
The development processes presented here may incorporate activities for
achieving a high level of safety of systems. However, achieving safety and secu-
rity is not the main aim of standard development processes of systems. As system
development processes cover the development of systems, they can be used to struc-
ture the overall life cycle of a system from “cradle to grave,” i.e., from the first idea
to the final dismantling, reuse, or recycling, especially the first phases, see Sect. 3.3.
In Chap. 11, we will see that safety development life cycles can be understood as
special cases or extensions of development life cycles.
The models that will be introduced in Sect. 10.3 describe different types of devel-
opment processes. Three attributes or properties of such models are listed in this
section.
Example A computer program that is completely written before it is compiled for the
first time is not developed incrementally. An incremental development would occur
in several steps with compiling different parts of the program at different times.
Example A computer program that is written once and sold right away is not iter-
atively developed. A program where different functions are improved several times
after having been tested is iteratively developed.
Individuals and interactions over processes and tools. Working software over
comprehensive documentation. Customer collaboration over contract negotiation.
Responding to change over following a plan.
That is, while there is value in the items on the right, we value the items on the
left more” (Beck et al. 2001).
Iterations are short time intervals (normally a few weeks) and involve a team.
After each iteration the results are shown to the client/decision-maker.
An example of agile software development is Scrum, see Sect. 10.3.5.
The Waterfall Model organizes the development process into phases, which have to be validated in milestone meetings. Only when one phase is completed can the development process proceed to the next phase, see Fig. 10.1. Steps back to the previous phase are only possible in the extended Waterfall Model, see Fig. 10.2. See the references Kramer (2018) and Ruparelia (2010) for some recent applications and discussions.
Each phase has an output in the form of documents or reports. The requirements listed in these documents have to be precise enough to be used for monitoring the implementation.
“The sequential phases in Waterfall model are:
– Requirement Gathering and analysis: All possible requirements of the system
to be developed are captured in this phase and documented in a requirement
specification doc.
– System Design: The requirement specifications from first phase are studied in this
phase and system design is prepared. System Design helps in specifying hardware
and system requirements and also helps in defining overall system architecture.
– Implementation: With inputs from system design, the system is first developed
in small programs called units, which are integrated in the next phase. Each unit
is developed and tested for its functionality which is referred to as Unit Testing.
– Integration and Testing: All the units developed in the implementation phase
are integrated into a system after testing of each unit. Post integration the entire
system is tested for any faults and failures.
– Deployment of system: Once the functional and non-functional testing is done,
the product is deployed in the customer environment or released into the market.
– Maintenance: There are some issues which come up in the client environment.
To fix those issues patches are released. Also, to enhance the product some better
versions are released. Maintenance is done to deliver these changes in the customer
environment.
All these phases are cascaded to each other in which progress is seen as flowing
steadily downward (like a waterfall) through the phases. The next phase is started
only after the defined set of goals are achieved for previous phase and it is signed off,
so the name ‘Waterfall Model’. In this model phases do not overlap” (Tutorialspoint
2013),
“Advantages of waterfall model:
– Simple and easy to understand and use.
– Easy to manage due to the rigidity of the model—each phase has specific
deliverables and a review process.
– Phases are processed and completed one at a time.
– Works well for smaller projects where requirements are very well understood.
Disadvantages of waterfall model:
– Once an application is in the testing stage, it is very difficult to go back and change
something that was not well-thought out in the concept stage.
The Spiral Model is an iterative software development process which is divided into
four segments. These four segments are (see also Fig. 10.3) (Boehm 1988)
(1) Determine objectives,
(2) Identify and resolve risks,
(3) Development and Test, and
(4) Plan the next iteration.
The process starts in phase (1) and proceeds through the four phases. After the
fourth phase, the first iteration is completed and the system has been improved. The
process starts again in phase (1), only this time with the improved system. In this
way, the system is improved with every iteration. This continues until the system
meets the expectations. For example, a model with four iterations could divide into
(a) Developing a concept for the production of the desired system,
(b) Setting up a list of requirements that have to be regarded during the development
of the system,
(c) Developing and validating a first design of the system, and
(d) Developing and implementing a final design of the system that then passes the
final inspection.
In Kaduri (2012), the four steps are briefly described as follows:
“Planning: In this phase, the new system requirements are gathered and defined
after a comprehensive system study of the various business processes. This usually
involves interviewing internal and external users, preparation of detailed flow
diagrams showing the process or processes for which going to be developed, the
inputs and outputs in terms of how the data is to be recorded/entered and the form
in which the results are to be presented.
Risk Analysis Phase: In this phase to identify risk and alternate solutions, a
process is followed. The process includes addressing the factors that may risk the
successful completion of the entire project development including alternative strate-
gies and constraints. The issues pertaining to the possibility of the development if not
meeting, for example, user requirements, reporting requirements or the capability of
the development team or the compatibility and functionality of the hardware with
software. To undertake development, the Risk analysis and suggested solutions to
mitigate and eliminate the Risks would thus become a part of the finalized strategy.
A prototype is produced at the end of the risk analysis phase.
Engineering Phase: In this phase, software is produced along with coding and
testing at the end of the phase. After preparation of prototype tested against bench-
marks based on customer expectations and evaluated risks to verify the various
aspects of the development. Until customer satisfaction is achieved before develop-
ment of the next phase of the product, refinements and rectifications of the prototype
are undertaken.
Evaluation Phase: In this phase, the final system is thoroughly evaluated and
tested based on the refined prototype. Evaluation phase allows the customer to eval-
uate the output of the project to date before the project continues to the next spiral.
Routine maintenance is carried out on a continuing basis to prevent large scale failures
and to minimize downtime” (Manual Testing Guide 2013).
Rouse (2007, 2020) subdivides the four steps further and defines the following
nine steps:
1. “The new system requirements are defined in as much detail as possible. This
usually involves interviewing a number of users representing all the external or
internal users and other aspects of the existing system.
2. A preliminary design is created for the new system.
3. A first prototype of the new system is constructed from the preliminary design.
This is usually a scaled-down system, and represents an approximation of the
characteristics of the final product.
4. A second prototype is evolved by a fourfold procedure: (1) evaluating the first
prototype in terms of its strengths, weaknesses, and risks; (2) defining the require-
ments of the second prototype; (3) planning and designing the second prototype;
(4) constructing and testing the second prototype.
5. At the customer’s option, the entire project can be aborted if the risk is deemed
too great. Risk factors might involve development cost overruns, operating-cost
miscalculation, or any other factor that could, in the customer’s judgment, result
in a less-than-satisfactory final product.
6. The existing prototype is evaluated in the same manner as was the previous
prototype, and, if necessary, another prototype is developed from it according to
the fourfold procedure outlined above.
7. The preceding steps are iterated until the customer is satisfied that the refined
prototype represents the final product desired.
8. The final system is constructed, based on the refined prototype.
9. The final system is thoroughly evaluated and tested. Routine maintenance is
carried out on a continuing basis to prevent large-scale failures and to minimize
downtime” (Rouse 2007).
Main observations include the following:
• Classical development steps are only part of the model.
• As an important dimension, cost levels are considered in terms of the distance from the origin.
• The model allows for incremental as well as more waterfall-like development processes.
• Early forms of validation take place in early phases for retargeting development aims.
• Risk and development-target assessment is conducted at all system development levels, in particular in the idea, concept, design, and implementation phases.
• The schematic puts great emphasis on activities that are not pure engineering and development but focuses on refinement of targets, analysis, verification, and validation.
10.3.3 V-Model
The V-Model, see Figs. 10.4 and 10.5, is a graphical model in the form of the letter "V" that represents the steps of the development of a system in its life cycle. It describes the activities that have to be executed and the results that have to be produced during the development process.
The left side of the V covers the identification of requirements and creation
of system specifications. They completely determine the system. The specifica-
tions are verified during the right-hand side steps. Already, in these steps, the
verification/validation plans are developed.
Fig. 10.4 V-Model from IEC 61508-3 (IEC 61508 2010). Copyright © 2010 IEC Geneva,
Switzerland. www.iec.ch
At the bottom of the V, the implementation at the highest level of resolution takes
place.
The right side of the “V” represents integration of parts and their verification
(Forsberg and Mooz 1991, 1998; Forsberg et al. 2005; International Council On
Systems Engineering 2007; SEOR 2007).
Since the verification/validation activities at higher levels of resolution (toward the bottom of the V) are more technical, the term validation is often used only at the highest system level, i.e., at the top of the V.
The left side of the “V” is comparable to the Waterfall Model. It describes the
development of the system from general requirements to a precise draft. The right
side of the “V” describes the verification and validation process. Here the process
starts with the more precise testing, the verification. It is checked whether the actual
requirements for the system have been fulfilled. The validation which follows the
verification is the more general test. Here it is tested whether the general concept has
been implemented correctly (“Does the system really do what was planned?”).
More recent discussions and modifications of the V model include Bröhl and
Dröschel (1995), Dröschel (1998), Clark (2009), Mathur and Malik (2010), Eigner
et al. (2017) and Angermeier et al. (2006).
There also exists a widely used extension to software project management called
V-Model XT, see Angermeier et al. (2006), Friedrich (2009), for which by now a
rich experience of application exists.
10.3.5 Scrum
Scrum is an agile software development method suitable for small groups as well as
development teams (Rising and Janoff 2000; Beck et al. 2001; Schwaber and Beedle
2002). In contrast to the Waterfall or Spiral Model, Scrum consists of many iterations,
so-called sprints. The main idea of this agile software development is to identify and
10.4 Questions
10.5 Answers
References
Mathur, Sonali; Malik, Shaily (2010): Advancements in the V-Model. In IJCA 1 (12), pp. 30–35.
https://doi.org/10.5120/266-425.
Rational (1998): Rational Unified Process. Best Practices for Software Development Teams.
Cupertino, USA (Rational Software White Paper TP026B, Rev 11/01). Available online
at https://www.ibm.com/developerworks/rational/library/content/03July/1000/1251/1251_best
practices_TP026B.pdf, checked on 9/21/2020.
Rising, L.; Janoff, N. S. (2000): The Scrum software development process for small teams. In IEEE
Softw. 17 (4), pp. 26–32. https://doi.org/10.1109/52.854065.
Rouse, Margaret (2007): Spiral model (spiral lifecycle model) (2013-12-10). Available online at
http://searchsoftwarequality.techtarget.com/definition/spiral-model.
Rouse, Margaret (2020): Spiral model. TechTarget, SearchSoftware Quality. Available online
at https://searchsoftwarequality.techtarget.com/definition/spiral-model, updated on 2019-08,
checked on 9/25/2020.
Ruparelia, Nayan B. (2010): Software development lifecycle models. In SIGSOFT Softw. Eng. Notes
35 (3), pp. 8–13. https://doi.org/10.1145/1764810.1764814.
Schwaber, Ken; Beedle, Mike (2002): Agile software development with Scrum. Upper Saddle River,
NJ: Prentice Hall (Series in agile software development).
SEOR. (2007). “The SE VEE.” Retrieved 2007-05-26, from http://www.gmu.edu/departments/seor/
insert/robot/robot2.html.
Shafiee, Sara; Wautelet, Yves; Hvam, Lars; Sandrin, Enrico; Forza, Cipriano (2020): Scrum versus
Rational Unified Process in facing the main challenges of product configuration systems devel-
opment. In Journal of Systems and Software 170, p. 110732. https://doi.org/10.1016/j.jss.2020.
110732.
Takahira, R Y; Laraia, L R; Dias, F A; Yu, A S; Nascimento, P T; Camargo, A S (2012): Scrum
and Embedded Software development for the automotive industry. In Dundar F. Kocaoglu (Ed.):
Proceedings of PICMET ’12: technology management for emerging technologies (PICMET),
2012. Conference; July 29, 2012–Aug. 2, 2012, Vancouver, BC, Canada. Portland International
Center for Management of Engineering and Technology; PICMET conference. Piscataway, NJ:
IEEE, pp. 2664–2672.
Tutorialspoint. (2013). “SDLC Waterfall Model.” Retrieved 2013-12-09, from http://www.tutorials
point.com/sdlc/sdlc_waterfall_model.htm.
Wang, Yang; Ramadani, Jasmin; Wagner, Stefan (2017): An Exploratory Study on Applying a
Scrum Development Process for Safety-Critical Systems. In Michael Felderer, Daniel Méndez
Fernández, Burak Turhan, Marcos Kalinowski, Federica Sarro, Dietmar Winkler (Eds.): Product-
Focused Software Process Improvement. 18th International Conference, PROFES 2017, Inns-
bruck, Austria, November 29-December 1, 2017, Proceedings, vol. 10611. Cham: Springer
International Publishing (Lecture Notes in Computer Science, 10611), pp. 324–340.
Chapter 11
The Standard IEC 61508 and Its Safety
Life Cycle
11.1 Overview
The realization phase of the safety life cycle comprises the development of hard-
ware (see Sect. 11.8.8.1) and software (see Sect. 11.8.8.2) and its respective mutual
integration. An example is given of recommendations of methods for software for a
selected SIL level.
The chapter is mostly based on the standard itself (IEC 61508 2010) and on
(Larisch, Hänle et al. 2008a, b; Hänle 2007).
Fig. 11.1 Structure of the standard IEC 61508, graphic from (IEC 61508 2010) with graphical modifications by Larisch as in (Larisch, Hänle et al. 2008a, b). Copyright © 2010 IEC Geneva, Switzerland.
www.iec.ch
Some examples of further literature on the IEC 61508 are the papers (Redmill
1998a, b; Bell 2005) and the books (Smith and Simpson 2004; Börcsök 2006).
11.4 Reminder
Several concepts from the IEC 61508 have already been mentioned or introduced
throughout the course. They are listed in Table 11.1.
11.5 Definitions
Part 4 of the IEC 61508 lists definitions and abbreviations. We include the definitions
here that are most important for the context of this course.
EUC (Equipment Under Control): “equipment, machinery, apparatus or plant
used for manufacturing, process, transportation, medical or other activities” (IEC
61508 2010).
Element: “part of a subsystem comprising a single component or any group of
components that performs one or more element safety functions” (IEC 61508 2010).
Remark: In Version 2 of IEC 61508, the term “element” is often used where it
said “component” in Version 1.
Tolerable risk: “risk which is accepted in a given context based on the current
values of society” (IEC 61508 2010).
EUC risk: “risk arising from the EUC or its interaction with the EUC control
system” (IEC 61508 2010).
The relationship between tolerable risk and EUC risk was already shown in
Fig. 3.2.
A = {Blade stops when protection is removed within less than 0.1 sec.}. (11.1)
11.7 Safety Life Cycle

The standard IEC 61508 has a systematically structured safety concept, the Safety Life Cycle, see Fig. 11.2. It has to be kept in mind that this figure is only a scheme. In reality, the phases will not be gone through sequentially just once; rather, several steps will be repeated iteratively (Larisch, Hänle et al. 2008a, b).
The Safety Life Cycle helps to carry out all relevant measures in a systematic order to determine and achieve the safety integrity level (SIL) for a safety-related E/E/PE system. The safety life cycle consists of 16 phases which describe the complete life of a product, from the concept via installation and maintenance to decommissioning (Larisch, Hänle et al. 2008a, b).
The safety life cycle also considers safety-related systems from other technologies
and external systems for risk reduction but they are not regarded in the rest of the
standard IEC 61508. To proceed from one phase to the next, the present phase has to
be completed. Work in later phases can result in the necessity of changes in earlier
phases so that they have to be revised. The steps backward are not shown in Fig. 11.2
but iterations are allowed and sometimes necessary as mentioned before (Larisch,
Hänle et al. 2008a, b).
During the first five phases of the Safety Life Cycle, the hazard and risk analysis is executed and, based on it, the safety requirements analysis. In Phases 9 and 10, the software and the hardware are developed and realized. Parallel to this phase, the overall validation, installation, and operation are planned in Phases 6 to 8. The plans from these phases are executed in Phases 12 to 14. After changes to the system in Phase 15, one has to go back to the related phase in the Safety Life Cycle. Phase 16 describes the end of the Safety Life Cycle with the decommissioning or disposal of the system.

Fig. 11.2 Safety Life Cycle of IEC 61508 (IEC 61508 S+ 2010). Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Phase 11 is not treated in detail in the IEC 61508 (Larisch, Hänle et al. 2008a,
b). However, its implications have to be considered throughout the application of
the standard. A typical example is that a mechanical device significantly reduces the
risk of a system, thus also lowering the overall SIL requirements for the remaining
safety functions that are realized using hardware and software.
The standard IEC 61508 gives good and elaborate support for the later phases
concerning the realization of the system. For the first phases up to the allocation of
safety functions, experiences have shown that users often miss precise information
from the standard (Larisch, Hänle et al. 2008a, b). The system analysis in Phases 1 to 3, which is necessary to determine the risks, is not described in much detail; in particular, the evaluation of risk acceptance is only shown by way of example.
11.8 More Detailed Description of Some Phases

In the following sections, some of the 16 phases will be explained in more detail by summarizing the corresponding sections of IEC 61508 (IEC 61508 S+ 2010).
The goal of the concept phase is to get a sufficient understanding of the EUC and
its environment. In this phase, information about the physical environment, hazard
sources, hazard situations, and legal requirements should be collected and docu-
mented (Thielsch 2012). This includes hazards that occur in the interaction of the
system with other systems.
Appropriate methods are, for example, graphics, wiring diagrams, UML/ SysML
models, preliminary hazard list, and hazard list.
The objectives of Phase 2 are to define the boundaries of the system (EUC) and to
determine the scope of the hazard and risk analysis. The external, internal, temporal,
and probabilistic boundaries of the system should be defined. It should be determined
what should be part of the analysis. This refers to the system itself, other systems, external events, types of failures that should be considered, etc.
Appropriate methods are, for example, consultation with the client/user, use case
diagrams, and a finer hazard list.
In Phase 3, hazard events related to the EUC are determined, sequences of events that
lead to those hazard events are identified, and risks that are connected to the hazard
are also determined. A hazard and risk analysis based on Phase 2 should be executed
and adapted if more information is available later in the development process. The
risk should be reduced according to the results of the analysis.
An appropriate system analysis method should be chosen and executed. The
system analysis should consider the correctly functioning system, failures, uninten-
tional and intentional incorrect use of the system, etc. Probabilities and consequences
of failures, including credible incorrect use should also be included in the analysis.
The sequences of events from Phase 2 should be analyzed in more detail. The
risk of the system for different hazards should be determined. Appropriate methods
are, for example, hazard analysis, fault tree analysis, FMEA, QRA (quantitative risk
analysis), and PRA (probabilistic risk analysis).
Figure 3.2 shows the risk concepts of the IEC 61508. By evaluating the risk, it
can be determined how safe the EUC is and the overall safety requirements can be
derived. The determined requirements aim to reduce the EUC risk to a tolerable level by external risk minimization measures and by safety-related systems (Thielsch 2012).
The requirements should be defined in terms of safety functions and safety integrity.
“The overall safety functions to be performed will not, at this stage, be specified in
technology-specific terms since the method and technology implementation of the
overall safety functions will not be known until later” (IEC 61508 2010).
“The first objective of the requirements of this subclause is to allocate the overall
safety functions […] to the designated E/E/PE safety-related systems and other
risk reduction measures” (IEC 61508 2010). Secondly, the target failure measure
(depending on low demand mode/high demand mode/continuous mode of operation)
and the SIL for each safety function are allocated.
Phases 6 to 8 will not be explained here. They are neither typical application areas for
the methods from the previous chapters nor are they different from the first version
of the standard. For more detail on those phases, please refer to the IEC 61508.
Phase 9 from the first version of the IEC 61508 used to be the realization phase
where the software and hardware safety life cycle are executed. In Version 2 of the
IEC 61508, this is now located in Phase 10. Phase 9 now defines the E/E/PE system
safety requirements. They are based on the overall safety requirements defined in
Phase 4 and allocated in Phase 5.
The requirements specification should now “provide comprehensive detailed
requirements sufficient for the design and development of the E/E/PE related
systems” (IEC 61508 2010) and “[e]quipment designers can use the specification
as a basis for selecting the equipment and architecture” (IEC 61508 2010).
In the first version of IEC 61508, Phase 9 contained the software and hardware safety
life cycle. In the second version of the standard, this phase is now Phase 10. Phase
10 of the safety life cycle can again be described in terms of development life cycles,
the E/E/PE system hardware development process, and the software development
process, for which, respectively, the requirements of V-Models need to be fulfilled.
The software life cycle is shown in Fig. 11.4. It interacts with the similarly structured E/E/PE system life cycle shown in Fig. 11.3.
Fig. 11.3 E/E/PE system life cycle from IEC 61508 (IEC 61508 2010) with minor graphical
changes. Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Fig. 11.4 Software life cycle from IEC 61508 (IEC 61508 2010) with minor graphical changes.
Copyright © 2010 IEC Geneva, Switzerland. www.iec.ch
Table 11.2 Extract from Table A.1 in IEC 61508 (IEC 61508 2010). Copyright © 2010 IEC Geneva,
Switzerland. www.iec.ch
Technique/Measure | Ref. | SIL 1 | SIL 2 | SIL 3 | SIL 4
Semi-formal methods | Table B.7 | R | R | HR | HR
Formal methods | B.2.2, C.2.4 | – | R | R | HR
Forward traceability between […] | C.2.11 | R | R | HR | HR
For the different phases, there are tables in the Annex of IEC 61508 part 3 indi-
cating which methods should be used depending on the SIL. For each method, the
standard distinguishes between highly recommendable (HR), recommendable (R),
no recommendation for or against the measure (–), and not recommendable (NR).
The standard provides methods (techniques and measures) for systematic and
statistical hardware failures and for systematic software failures. The methods cover
prevention of failures as well as control in case of occurrence. Statistical hard-
ware failures are, for instance, opening of solder connections or degradations of
components due to vibrations.
Example Table A.1 treats the software safety requirements specification, that is,
Phase 10.1. An extract of the table is shown in Table 11.2. These are, in particular,
methods for prevention of systematic hardware and software failures.
“Provided that the software safety lifecycle satisfies the requirements of Table
11.1 [listing objectives, scope, requirements subclause, inputs and outputs of the
steps in the software life cycle], it is acceptable to tailor the V-model (see Fig. 10.4)
to take account of the safety integrity and the complexity of the project” (IEC 61508
2010).
Actual development and safety assessment process steps as well as techniques and
measures recommended for software safety can be found to be rather similar when
comparing approaches and methods recommended by IEC 61508 with other stan-
dards. An example comparison approach and its application is provided in (Larisch et al. 2009). It shows that compliance listings are useful for comparison.
Phases 11 to 16 will not be explained here. They are neither typical application areas
for the methods from the previous chapters nor are they different from the first version
of the standard. For more detail on those phases, please refer to the IEC 61508.
Finally, some concepts, both already discussed and still missing, are listed that need to be fulfilled for each safety function. A safety function needs to fulfill the following requirements:
• A qualitative description of what the safety function should achieve, including time limits and the required transition from an unsafe to a safe state or the maintenance of a safe state.
• A quantification of the reliability of the safety function on demand or continuously, i.e., the determination of the expected SIL level.
• For the safety architecture of the safety-related system (e.g., series system or parallel system or a combination), as achieved within the allocation of the safety function, the SIL level must be provided for each element.
• The hardware fault tolerance (HFT) must be sufficient for the level of complexity of the elements. One distinguishes between components for which all failure modes are known (level A of complexity) and complex components (level B) for which not all failure modes are known. For instance, HFT = 1 means that any single element of the safety-related system may fail without failure of the overall safety-related system, see the sketch following this list.
• The diagnostic coverage (DC) must fit the level of complexity of the component, the HFT, and the intended SIL.
• For the hardware and software development of each element as well as for all the assessment tasks, the appropriate methods must be selected as recommended for the selected SIL level.
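The following minimal Python sketch (not taken from the standard) illustrates the probabilistic effect of redundancy behind the HFT requirement: assuming independent element failures and neglecting common-cause failures, diagnostics, and proof-test effects, a redundant (HFT = 1) arrangement is unavailable only if both elements are unavailable, whereas elements in series add their unavailabilities.

def series_unavailability(qs):
    # Elements that must all function: unavailability approximated by the sum (rare-event approximation)
    return sum(qs)

def redundant_unavailability(qs):
    # Redundant group (e.g., 1oo2 for HFT = 1): unavailable only if all elements are unavailable
    p = 1.0
    for q in qs:
        p *= q
    return p

q = 1e-3  # assumed unavailability of a single channel (hypothetical value)
print(f"two channels in series (HFT = 0): {series_unavailability([q, q]):.1e}")
print(f"two redundant channels (HFT = 1): {redundant_unavailability([q, q]):.1e}")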
11.10 Questions
(1) Define the safety life cycle according to IEC 61508 and name important phases that lead to the specification of the safety function.
(2) Which quantities need to be provided for each safety function?
(3) What are techniques and measures (methods) in the context of functional
safety?
(4) How can the quantities SIL, DC/SFF, and HFT be determined?
(5) Which development model for HW and SW is recommended by IEC 61508?
(6) Classify into statistical and systematic failures: (a) radiation damage of HW,
(b) SW error, and (c) HW component selection does not fit to environmental
stress.
(7) Of how many parts does the standard IEC 61508 consist?
(8) When was the second version of the IEC 61508 published?
(9) How many parts of the standard are normative?
(10) Relate the terms hazardous event, hazardous situation, hazard, and harm to
each other.
11.11 Answers
(1) See Sect. 3.14 and Fig. 11.2 focusing on the early phases and their description
in the text.
(2) For each safety function, a qualitative and quantitative description needs to be
provided. The qualitative description covers what is done to reduce a risk. It may
contain application-specific quantities. Example: “Open airbag within 100 ms
after detection of collision.” The quantitative description of a safety function is
given by the safety integrity level (SIL), defined via the probability of failure on demand or the probability of failure per hour of the safety function.
When realizing a safety function by a related E/E/PE system, minimum combinations of (i) hardware fault tolerance (HFT), (ii) diagnostic coverage (DC) or safe failure fraction (SFF), and (iii) complexity of the components used in the E/E/PE system
need to be fulfilled. The Standard IEC 61508 contains best practice tables where the
maximum achievable SIL is provided given HFT, Level A/B, and DC. High SILs are
only achievable for high HFT, Level A, and high DC.
(3) They are methods that are recommended or should be used to fulfill general
or step-specific requirements of the Safety Life Cycle and the Development
Processes of the IEC 61508.
(4) The quantities SIL, DC/SFF, and HFT can be determined using FTA and
approximated when using FMEDA or DFM. HFT can only be determined
with FTA for HFT > 2.
(5) V-Model.
(6) (a) statistical hardware failure, (b) systematic software failure, and (c)
systematic hardware design failure.
(7) Seven.
(8) 2010.
References
Redmill, F. J. (2000a). Installing IEC 61508 and Supporting Its Users - Nine Necessities. Workshop
for Safety Critical Systems and Software, Australia.
Redmill, F. J. (2000b). Understanding the Use, Misuse and Abuse of Safety Integrity Levels,
Proceedings of the Eighth Safety-critical Systems Symposium, pp. 20–34, Springer.
Schmidt, Andreas; Häring, Ivo (2007): Ex-post assessment of the software quality of an embedded
system. In Terje Aven, Jan Erik Vinnem (Eds.): Risk, Reliability and Societal Safety, European Safety and Reliability Conference (ESREL) 2007, vol. 2. Stavanger, Norway: Taylor and Francis
Group, London, pp. 1739–1746.
Smith, D. J. and K. G. Simpson (2004). Functional Safety - A Straightforward Guide to Applying
IEC 61508 and Related Standards. London, Elsevier.
Thielsch, P. (2012). Risikoanalysemethoden zur Festlegung der Gesamtsicherheitsanforderungen
im Sinn der “IEC 61508 (Ed. 2)”. Bachelor, Hochschule Furtwangen.
Chapter 12
Requirements for Safety-Critical Systems
12.1 Overview
In detail, the chapter covers the following contents. Section 12.2 further explains
the importance of safety requirements and their role in system development
processes.
Section 12.3 gives a list of definitions which are important for this chapter. The
definitions of risk and safety are repeated. Harm, hazardous events and situations,
hazard, safety functions, highly available systems and safety-critical system, and
safety requirements are introduced in parts with slight variations when compared to
the definitions already given in Chaps. 3 and 9, because the present chapter also covers
safety requirements beyond the context of functional safety.
Section 12.4 lists in 10 subsections properties of safety requirements. Most of
them are defined in opposing pairs. Examples are given. Section 12.4.11 does not
directly give properties of safety requirements but system safety properties such as
reliability or availability which can play an important role for safety requirements.
Section 12.5 determines which combinations of properties of Sect. 12.4 appear
most often, based on the examples of Sect. 12.4. Typical combinations are, for
instance, safety requirements of structural nature (non-functional) that aim at
preventing damage events which are specified in detail. Another combination often
encountered is functional safety requirements that aim at mitigating damage effects
during or post event occurrence which may be specifically focused on the hazard
cause or effect. By contrast, functional non-specific safety requirements that apply before
an event occurs are rare.
In summary, the aim of this chapter is to introduce terms for a clear and comprehen-
sible characterization (classification) of safety requirements and to determine which
of these terms typically appear in combination with each other. This leads to a
classification of safety requirements by their properties as well as to potentially
promising combinations of such properties that are not yet employed very often.
The chapter is a summary, translation, and partial paraphrasing and extension of
(Hänle 2007; Hänle and Häring 2008).
12.2 Context
Many accidents occur due to problems with system requirements for hardware and
software. In particular, empirically based statements confirm the generally assumed
hypothesis that the majority of safety problems for software originates in software
requirements and not in programming errors (Firesmith 2004).
The requirements are often incomplete (Hull et al. 2004). They normally do not
specify prevention of rare dangerous events. The specification often also does not
describe how one should react in exceptional situations. In the cases where dangers
and exceptions are described, the requirements usually are not specific enough or
parts are missing, which is only noticed by specialists in the respective field (Firesmith
2004).
Earlier, safety technologies were invented as a reaction to technological innova-
tions. Nowadays, it is often also the other way around. Technological innovations
take place due to safety requirements. Practice has shown that it is more expensive to
install safety later on than to include it as part of the concept from the start (Zypries
2003).
12.3 Definitions
Risk and safety have already been defined in Sects. 3.4 and 3.7. As a reminder, safety
is defined as “freedom from unacceptable risks” (IEC 61508 2010) and following
the classical definition of risk attributed to Blaise Pascal, we define risk as the product of
probability and consequences, R = P · C.
Highly available systems are systems that have to work reliably. A failure of the
system can cause high damage. Safety-critical systems can also be highly available
(Hänle 2007).
For safety-critical systems, there is always the risk of a hazardous event involving
persons. A highly available system may be free of risk for persons and “only” lead
to financial damage. For example, the failure of a bank’s data center can lead to costs
of millions per hour (Lenz 2007), but does not cause harm to persons.
IEC 61508 distinguishes between safety requirements for functional safety and non-
functional safety. Functional safety includes every function of the system which is
executed to guarantee the safety of the system. Non-functional safety is fulfilled by
properties, parts, or concepts of the system which guarantee safety by their existence
(Hänle 2007).
The automotive sector uses the terms active and passive safety (Exxon Mobil Corpo-
ration 2007; Hänle 2007). Active safety is anything that prevents accidents. This
comprises effective brakes, smooth steering, antilock brakes, seats that prevent driver fatigue,
and clearly arranged, uncomplicated control and indicating elements.
Passive safety requirements are requirements for those functions and properties
of the system which prevent or minimize damage in case of an accident. Examples
for those functions/properties are airbags, emergency tensioning seat belt retractors,
and deformation zones (Hänle 2007).
This shows that the classification into active and passive is not the same as the
one into functional and non-functional although this might first be expected by the
choice of terms. The difference can also be seen in the following examples.
– Active functional safety requirements are safety requirements for the functions
of the system that have to be executed as positive action to prevent a dangerous
situation, for example, turning off the engine (DIN IEC 61508 2005).
– Passive functional safety requirements are safety requirements for safety functions
that have a surveillance function and only interfere if there is a problem. An
example is the requirement to prevent a start of the engine in case of a dangerous
event (DIN IEC 61508 2005).
– An example for an active non-functional safety requirement is to provide a special
insulation to withstand high temperatures (DIN IEC 61508 2005).
– Another example is the use of redundant components which still guarantee the
correct functioning of the system in case of a failure in one component. This can
be an example for an active non-functional as well as for a passive non-functional
safety requirement, depending on where the concept is used.
Example (Hänle 2007): The requirement that an airbag should inflate after a spec-
ified time is an abstract, general definition. A concrete (but non-technical) require-
ment would be that the airbag should inflate 30–40 ms after the collision was detected
(Hänle 2007).
For safety-critical systems, risks determine the major part of the safety requirements
(Storey 1996). A “Safety Requirement” document is usually created to describe how
the system has to be secured to guarantee safety (Hänle 2007).
Risk analysis of a system identifies risks and leads to the definition of safety
requirements for the system (Grunske 2004). According to Leveson (1995), the
requirements defined in order to prevent identified risks together with the quantitative
determination of tolerable hazard rates for each risk form the safety requirements for
the system (Hänle 2007).
Here two types of safety requirements can be distinguished. The one type refers to
which risks have to be regarded and the other to how the risks can be prevented. We
will call the first a cause-oriented safety requirement and the latter an effect-oriented
safety requirement (Hänle 2007).
Effect-oriented safety requirements can partly consist of cause-oriented safety
requirements. The cause-oriented requirements are used to name the reason for the
effect-oriented requirement. Those composed safety requirements are counted as
effect-oriented because effect-oriented requirements are stronger requirements for
the realization than cause-oriented requirements (Hänle 2007).
Static safety requirements are requirements which have to be fulfilled during the
product’s entire life cycle, for example, constructive safety requirements or quality
requirements (Hänle 2007).
It is rather easy to verify during or after the development process that static safety
requirements are fulfilled. For this, it is essential to know the whole system (Hänle
2007).
It often is not possible to prove that a requirement is fulfilled correctly in all dynamic
situations due to the complexity of modern systems. So, only a certain number of
possible system runs can be tested and a fault can still be undetected after the test
(Buschermöhle et al. 2004).
A formal description of the requirements can be used for a formal proof of
correctness for all possible system runs. This can be very elaborate and expensive
(Buschermöhle et al. 2004; Hänle 2007).
Under the term standardized, we summarize all safety requirements whose fulfillment
yields additional safety or reliability. Typically, those requirements ask for
approved production processes and methods (Hänle 2007).
Standardized requirements are only a small part of the safety requirements. Many
requirements are not standardized, for example, the functional safety requirements
of IEC 61508.
Qualitative safety requirements describe in words the way in which a system achieves safety
to minimize or prevent hazards to persons and the environment, for
example, that a machine stops if the protection of the blade is removed (Hänle 2007).
System-specific safety requirements are requirements which refer to the whole system
(Ober et al. 2005).
Module-specific safety requirements only refer to parts of the system, down to
single elements.
There are safety requirements which ask for something to occur in a certain period
of time. The velocity with which the task is completed is not very important as long
as the task is finished in the fixed time period (Stallings 2002). We call those safety
requirements time-critical.
According to Storey (1996), there are seven different properties of system safety.
Their importance varies with the specific system under consideration, and with it
their influence on the safety requirements for the system.
Those seven properties are reliability, availability, fail-safe state, system integrity,
data integrity, system recovery, and maintainability (Hänle 2007).
Reliability means that a function should work correctly over a certain period
of time (minutes, hours, days, years, decades). An example for such a system is a
machine used in medicine to monitor the cardiac function (Hänle 2007). See the
example given in Sect. 4.12.
Availability requires that a system works when it is needed. An example is the
emergency shutdown of a nuclear power plant. It is not required that this function
works correctly for a long period of time but that it works sufficiently close to the
point in time when it is needed (Hänle 2007). See also Sect. 9.2.
Some systems can be in states which are regarded as safe. Fail-safe means that
the system transfers to such a safe state in case of a failure (Ehrenberger 2002). An
example is the stopping of a train in case of a failure. Some systems do not have safe
states, for example, an airplane while it is in the air (Hänle 2007).
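The fail-safe idea can be illustrated by a minimal Python sketch (illustrative only, not taken from the cited sources; the function names are chosen freely):

def operate():
    raise RuntimeError("sensor failure")  # simulated failure of the normal operation

def enter_safe_state():
    print("train braked to standstill (safe state)")

def run_with_fail_safe(operation, safe_state):
    # Execute the normal operation; on any detected failure, command the safe state.
    try:
        operation()
    except Exception:
        safe_state()

run_with_fail_safe(operate, enter_safe_state)  # prints the safe-state message

Systems without a safe state, such as the airplane example, cannot use this pattern and have to rely, for example, on system recovery (see below).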
System integrity describes the property of a system to detect and display failures
in its own operations. An example is an autopilot system which shows a warning if
it doubts the correct execution of its own functions (Hänle 2007).
Data integrity describes the property of a system to prevent, recognize, and
possibly correct mistakes in its own database (including file-IO). Incorrect data can
lead to wrong decisions of the system. An example for data integrity is to store data
three times and only use it if two stored values are exactly the same (Hänle 2007).
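A minimal sketch of this storage example (illustrative Python, not from the cited sources) is a 2-out-of-3 majority vote over the three stored copies:

def vote_2oo3(copies):
    # Return the majority value of three stored copies, or None if no two copies agree.
    a, b, c = copies
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # no two copies agree: the data must not be used

print(vote_2oo3([42, 42, 41]))  # -> 42, the corrupted copy is outvoted
print(vote_2oo3([1, 2, 3]))     # -> None, the data is rejected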
Especially for systems without safe state it is important to be able to restore the
system (system recovery). It depends on the type of system whether this is possible.
An example would be to restart a system after a failure (Hänle 2007).
Maintainability also has to be considered because not all systems can be main-
tained regularly or even at all. This should be regarded when deciding on safety
requirements. An example for a system that cannot be maintained is a satellite (Hänle
2007).
Table 12.1 Properties of safety requirements. Properties that form a pair (e.g., functional/non-functional) are mutually exclusive: only one of the two may be assigned to a safety requirement at a time (Hänle and Häring 2008). A safety requirement has the property
– Functional: if the system should have a specific function or behavior
– Non-functional: if the system should have a specific characteristic or a specific concept should be used
– Active: if something is required that prevents a hazardous event
– Passive: if something is required that mitigates the consequences of an occurring hazardous event
– Concrete/precise: if the requirement contains detailed information
– Abstract: if the requirement is described in general terms
– Technical: if the requirement describes how something has to be realized
– Non-technical: if the requirement does not describe how something should be realized
– Cause-oriented: if it describes one or more hazards that the system should prevent
– Effect-oriented: if something is described that mitigates or prevents hazards
– Static: if the requirement may be verified at any time
– Dynamic: if the requirement may only be verified at a specific point in time in the product life cycle of the system
– Quantitative: if a numerical or a proportional specification is given
– Qualitative: if a verbal requirement describes the method by which a system satisfies safety
– Standardized: if the use of something proven is requested, e.g., the use of a standard, proven components, proven procedures, or the prohibition of something specific
– System-specific: if the requirement applies to the system as a whole
– Module-/component-specific: if the requirement applies only to a part of the system
– Time-critical: if a task must be fulfilled by a specific date or within a specific time period
– Time-uncritical: if it is not time-critical
– Reliability: if the requirement demands reliable functioning on demand
– Availability: if the requirement calls for the functioning of a system function at the correct moment
– Fail-safe state: if the requirement calls for a safe state into which the system can be set in case of a failure; this is also called »fail-safe« (Ehrenberger 2002)
– System integrity: if the requirement demands a self-analyzing capability so that the system can recognize failures of its own operations
– Data integrity: if the requirement demands recognition of failures in the database of the system
– System recovery: if the requirement calls for a reset function so that the (sub-)system can be recovered from a failure state
– Maintainability: if the requirement describes how the system can be maintained
Authority 2003; Grell 2003; Benoc 2004; DIN EN ISO 12100 2004; Safety Guide
NS-G-1.8 2005; Safety Guide NS-G-1.12 2005) and literature (Leveson 1995; Storey
1996; Lano and Bicarregui 1998; Jacobson et al. 1999; Robertson and Robertson
1999; Lamsweerde 2000; Tsumaki and Morisawa 2001; Bitsch and Göhner 2002;
Firesmith 2004; Grunske 2004; Glinz 2005, 2007; Fontan et al. 2006; Ober et al.
2006, Freitas et al. 2007; Sommerville 2007)” (Hänle and Häring 2008).
The analysis was executed in a spreadsheet application (Microsoft Excel) using
“AutoFilter” based on a table as in the representative Table 12.2 with one line for
each of the 32 examples. In the table, “−” means that the property is not relevant
and “?” that it is not clear to which of the properties it should be related.
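Such a filtering analysis can also be reproduced outside a spreadsheet; the following illustrative Python sketch uses the pandas library instead of Excel (the column selection and row excerpt follow the abbreviations of Table 12.2, everything else is chosen freely here):

import pandas as pd

df = pd.DataFrame(
    [
        ("Category III systems", "nf", "p", "s", "u"),
        ("Single device", "nf", "a", "s", "u"),
        ("Airbag in 30-40 ms", "f", "p", "d", "c"),
    ],
    columns=["requirement", "functional", "active", "static", "time_critical"],
)

mask = (df["functional"] == "nf") & (df["static"] == "s")  # analogue of an AutoFilter
print(df.loc[mask, "requirement"].tolist())  # -> ['Category III systems', 'Single device']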
Due to the small dataset of 32 safety requirements, the evaluation cannot be seen
as a proof of which combinations usually occur but the results can be seen as an
indicator. Hänle’s suggestions are
– Functional passive and non-functional active safety requirements occur much
more often than functional active and non-functional passive requirements, see
Fig. 12.1. This means that more safety requirements are defined to require func-
tions which intervene in case of an accident in order to minimize the damage
rather than to approach accidents in a functional active way. On the other hand,
properties that are defined for the system usually prevent accidents, such as special
insulation or high-quality components. Properties that reduce damage in case of
an accident, for example, deformation zones in cars, are rare.
– Non-functional effect-oriented safety requirements are static which implies that
they can always be tested and they are always time-uncritical.
– Cause-oriented safety requirements are, due to their description of risks, non-
functional abstract non-technical dynamic safety requirements.
– Time-critical safety requirements always are functional dynamic non-technical
safety requirements. This is supported by Liggesmeyer (2002) who classifies
time-critical requirements as functional properties of a technical system.
– Technical safety requirements are concrete time-uncritical requirements. Non-
technical requirements can be concrete or abstract.
– Dynamic safety requirements, that is, requirements that can only be tested at
certain time points, normally are functional safety requirements but not always.
– Standardized safety requirements are static safety requirements. This means that
they can be verified at any point in the system life cycle (possibly using development
documentation).
– Concrete/abstract and system-/module-specific properties do not show accumu-
lations and can be associated with any type of requirement.
– Due to the small dataset and the long list of system safety properties, one cannot
say much about the association of safety requirements to those properties. It can
only be seen that safety requirements with the system safety property “availability”
normally are functional and safety requirements with the system safety property
“reliability” normally are non-functional.
Table 12.2 Evaluation of the properties of the example safety requirements (Hänle 2007). Column codes: functional (f)/non-functional (nf); active (a)/passive (p); concrete (c)/abstract (a); technical (t)/non-technical (nt); cause- (c)/effect-oriented (e); static (s)/dynamic (d); qualitative (ql)/quantitative (qt); standardized (s); system- (S)/module-specific (M); time-critical (c)/time-uncritical (u); system safety property (rel = reliability, av = availability)
– Category III systems (see Sect. 12.4.3): nf, p, c, t, e, s, ql, s, S, u, rel
– Single device (see Sect. 12.4.3): nf, a, c, t, e, s, ql, s, M, u, rel
– Airbag in 30–40 ms (see Sect. 12.4.3): f, p, c, nt, e, d, qt, –, M, c, av
– Airbag without time restriction (see Sect. 12.4.6): f, p, a, nt, e, d, qt, –, M, c, av
– Nuclear power plant safety (see Sect. 12.4.11): nf, ?, a, nt, c, d, ql, –, S, u, ?
12.6 Questions
(1) Explain the difference between functional and active as presented in this chapter.
(2) Regard the pairs of properties of safety requirements:
• Functional versus non-functional,
• Active versus passive,
• Technical versus non-technical,
• Concrete versus abstract,
• Cause-oriented versus effect-oriented,
(5) The data center of a big company with sensitive data about their products is a
… system.
12.7 Answers
(1) Functional refers to functions. They have to be actively executed. Active refers
to the prevention of accidents, i.e., active prevention.
(2) See definitions of Sect. 12.4.
(3) Non-functional, active, technical, concrete, static, and time-uncritical should be
easier than their respective counterparts. For cause-/effect-oriented, qualitative/quantitative,
and system-/module-specific it is hard to say.
One can also prevent an accident by a safe property of a system (active, non-
functional) and an actively executed function can react in case of an accident
instead of preventing it (functional, passive).
(4) Closing the gate: functional, passive, concrete, non-technical, effect-oriented,
dynamic, quantitative, likely standardized, system-specific, time-critical, and
availability.
Locked person: non-functional, active, abstract, non-technical, cause-oriented,
static, qualitative, not standardized, system-specific, time-critical, availability.
(5) (c) A failure in the data center causes high damage but no harm to persons.
References
Antón, A. (1997). Goal Identification and Refinement in the Specification of Information Systems.
PhD.
Benoc. (2004). “Sicherheitsanforderungen an das Airbag System.” Retrieved 2007-09-15, from
http://www.benoac.de/kompetenzen_sicherheit_de.html.
Beyer, D. (2002). Formale Verifikation von Realzeit-Systemen mittels Cottbus Timed Automata.
Berlin, Mensch & Buch Verlag.
Bitsch, F. and P. Göhner (2002). Spezifikation von Sicherheitsanforderungen mit Safety-Patterns.
Software Engineering in der industriellen Praxis. Düsseldorf, VDI Verlag GmbH.
Buschermöhle, R., M. Brörkens, I. Brückner, W. Damm, W. Hasselbring, B. Josko, C. Schulte and
T. Wolf (2004). “Model Checking - Grundlagen und Praxiserfahrungen.” Informatik-Spektrum
27(2): 146–158.
Civil Aviation Authority (2003). CAP 670 - Air Traffic Services Safety Requirements. London,
Civil Aviation Authority.
DIN EN 418 (1993). NOT-AUS-Einrichtung, funktionelle Aspekte, Gestaltungsleitsätze. D. E.-S.
v. Maschinen. Berlin, Beuth Verlag GmbH.
DIN EN ISO 12100 (2004). Teil 2: Technische Leitsätze. DIN EN ISO 12100 - Sicherheit von
Maschinen - Grundbegriffe, allgemeine Gestaltungsleitsätze. Berlin, Beuth Verlag GmbH.
DIN IEC 61508 (2005). Funktionale Sicherheit sicherheitsbezogener elek-
trischer/elektronischer/programmierbarer elektronischer Systeme, Teile 0 - 7. DIN. Berlin,
Beuth Verlag.
Ehrenberger, W. (2002). Software-Verifikation - Verfahren für den Zuverlässigkeitsnachweis von
Software. Munich, Vienna, Carl Hanser Verlag.
European Directive 70/311/EWG (1970). Angleichung der Rechtsvorschriften der Mitgliedstaaten
über die Lenkanlagen von Kraftfahrzeugen und Kraftfahrzeuganhängern. Richtlinie 70/311/EWG
Luxemburg, Der Rat der Europäischen Gemeinschaften.
European Directive 97/23/EG (1997). Anhang I - Grundlegende Sicherheitsanforderungen, Das
Europäische Parlament und der Rat der Europäischen Union.
Exxon Mobil Corporation. (2007). “Active Safety Technologies.” Retrieved 2012-06-
04, from http://www.mobil1.com/USAEnglish/MotorOil/Car_Care/Notes_From_The_Road/Saf
ety_System_Definitons.aspx.
Firesmith, D. G. (2004). “Engineering Safety Requirements, Safety Constraints, and Safety-Critical
Requirements.” Journal of Object Technology 3(3): 27–42.
Fontan, B., L. Apvrille, P. d. Saqui-Sannes and J. P. Courtiat (2006). Real-Time and Embedded
System Verification Based on Formal Requirements. IEEE Symposium on Industrial Embedded
Systems (IES’06). Antibes, France.
Freitas, E. P., M. A. Wehrmeister, C. E. Pereira, F. R. Wagner, E. T. S. Jr. and F. C. Carvalho
(2007). Using Aspect-oriented Concepts in the Requirements Analysis of Distributed Real-Time
Embedded Systems. IFIP International Federation for Information Processing. Boston, Springer.
Glinz, M. (2005). Rethinking the Notion of Non-Functional Requirements. Proceedings of the Third
World Congress for Software Quality. Munich, Germany, Department of Informatics, University
of Zurich.
Glinz, M. (2006). Requirements Engineering I. Lecture Notes. Universität Zürich.
Glinz, M. (2007). On Non-Functional Requirements. Proceedings of the 15th IEEE International
Requirements Engineering Conference. Delhi, India, Department of Informatics, University of
Zurich.
Grell, D. (2003). Rad am Draht - Innovationslawine in der Autotechnik. c’t - magazin für
computertechnik. 14.
Grunske, L. (2004). Strukturorientierte Optimierung der Qualitätseigenschaften von softwarein-
tensiven technischen Systemen im Architekturentwurf. Potsdam, Hasso Plattner Institute for
Software Systems Engineering, University Potsdam.
Hänle, A. (2007). Modellierung und Spezifikation von Anforderungen eines sicherheitskritischen
Systems mit UML, Modeling and Specification of Requirements of a safety critical System with
UML. Diploma Thesis, Hochschule Konstanz für Technik, Wirtschaft und Gestaltung (HTWG),
University of Applied Sciences; Fraunhofer EMI, Efringen-Kirchen.
Hänle, A. and I. Häring (2008). UML safety requirement specification and verification. Safety,
Reliability and Risk Analysis: Theory, Methods and Applications, European Safety and Reliability
Conference (ESREL) 2008. S. Martorell, C. G. Soares and J. Barnett. Valencia, Spain, Taylor and
Francis Group, London. 2: 1555–1563.
Hull, E., K. Jackson and J. Dick (2004). Requirements Engineering, Springer.
IEC 61508 (1998-2005). Functional Safety of Electrical/Electronic/Programmable Electronic
Safety-related Systems. Geneva, International Electrotechnical Commission.
IEC 61508 (2010). Functional Safety of Electrical/Electronic/Programmable Electronic Safety-
related Systems Edition 2.0 Geneva, International Electrotechnical Commission.
IEC 61508 S+ (2010). Functional Safety of Electrical/Electronic/Programmable Electronic Safety-
related Systems Ed. 2 Geneva, International Electrotechnical Commission.
ISO 26262 Parts 1 to 12 Ed. 2, 2018-12: Road vehicles - Functional safety. Available online at
https://www.iso.org/standard/68383.html.
ISO 21448, 2019-01: Road vehicles — Safety of the intended functionality. Available online at
https://www.iso.org/standard/70939.html, checked on 9/21/2020.
Jacobson, I., G. Booch and J. Rumbaugh (1999). Unified Software Development Process.
Amsterdam, Addison-Wesley Longman.
Lamsweerde, A. v. (2000). Requirements Engineering in the Year 00: A Research Perspective.
ICSE’2000 - 22nd International Conference on Software Engineering. Limerick, Ireland, ACM
Press.
Lano, K. and J. Bicarregui (1998). Formalising the UML in Structured Temporal Theories. Second
ECOOP Workshop on Precise Behavioral Semantics. Brussels, Belgium.
Lenz, U. (2007). “IT-Systeme: Ausfallsicherheit im Kostenvergleich.” Retrieved 2012-05-31, from
http://www.centralit.de/html/it_management/itinfrastruktur/458076/index1.html.
Leveson, N. G. (1995). Safeware: System Safety and Computers. Boston, Addison-Wesley.
Liggesmeyer, P. (2002). Software-Qualität: Testen, Analysieren und Verifizieren von Software.
Heidelberg, Berlin, Spektrum Akademischer Verlag.
Ober, I., S. Graf and I. Ober (2006). “Validating timed UML models by simulation and verification.”
International Journal on Software Tools for Technology Transfer (STTT) 8(2): 128.
Ober, I., S. Graf and I. Ober (2005). “Validating timed UML models by simulation and verification.”
International Journal on Software Tools for Technology Transfer (STTT).
Richtlinie 97/23/EG (1997). Anhang I - Grundlegende Sicherheitsanforderungen, Das Europäische
Parlament und der Rat der Europäischen Union.
Robertson, S. and J. Robertson (1999). Mastering the Requirements Process. Amsterdam, Addison-
Wesley Longman.
Safety Guide NS-G-1.8 (2005). Design of Emergency Power Systems for Nuclear Power Plants.
Safety Guide NS-G-1.8 - IAEA Safety Standards Series. Vienna, International Atomic Energy
Agency.
Safety Guide NS-G-1.12 (2005). Design of the Reactor Core for Nuclear Power Plants. Safety
Guide NS-G-1.12 - IAEA Safety Standards for protecting people and the environment. Vienna,
International Atomic Energy Agency.
Smarandache, I. M. and N. Nissanke (1999). “Applicability of SIGNAL in safety critical system
development.” IEEE Proceedings - Software 146(2).
Sommerville, I. (2007). Software Engineering 8. Harlow, Pearson Education Limited.
Stallings, W. (2002). Betriebssysteme - Prinzipien und Umsetzung. München, Pearson Education
Deutschland GmbH.
STANAG 4187 Ed. 3 (2001). Fuzing Systems - Safety Design Requirements. Standardization Agree-
ments (STANAG), North Atlantic Treaty Organization (NATO), NATO Standardization Agency
(NSA).
Storey, N. (1996). Safety-Critical Computer Systems. Harlow, Addison Wesley.
Tsumaki, T. and Y. Morisawa (2001). A Framework for Requirements Tracing using UML. Seventh
Asia-Pacific Software Engineering Conference. Singapore, Republic of Singapore, Nihon Unisys.
Zypries, B. (2003). “Bundesjustizministerin Zypries anlässlich der Eröffnung der Systems 2003:
Sicherheitsanforderungen sind der Motor von Innovationen.” Retrieved 2007-09-28, from http://
www.bmj.bund.de/enid/0,5cfd7d706d635f6964092d09393031093a0979656172092d093230
3033093a096d6f6e7468092d093130093a095f7472636964092d09393031/Ministerin/Reden_
129.html.
Chapter 13
Semi-Formal Modeling
of Multi-technological Systems I: UML
13.1 Overview
in SysML: class diagram, composite structure diagram, state diagram, and sequence
diagram.
Section 13.7 applies those diagrams in examples of modeling different types of
safety requirements. The examples are used for a classification of safety requirements
into four types in Sect. 13.8.
The goal of this chapter is to apply selected UML diagrams and one SysML diagram
in examples of verifying safety requirements; it is not to give a full overview
of all UML diagrams. The chapter provides examples of how to use well-established
system development tools for safety analyses and for supporting the implementation
of safety properties. We introduce in this chapter all diagrams that are used for
modeling the safety requirements.
The chapter summarizes, translates, and partially paraphrases (Hänle 2007) and
(Hänle and Häring 2008). This work itself is extensively based on (Rupp, Hahn et al.
2005; Kecher 2006; Oestereich 2006).
13.3 History
The concept of object orientation has been known since the 1970s (Oestereich 2006) but
was not popular at first. It became widely used in the 1990s when software-producing
companies started to realize the advantages of object-oriented programming (Kecher
2006).
With the increasing use of object-oriented programming, a large variety of
object-oriented analysis and design methods was developed (Rupp,
Hahn et al. 2005; Kecher 2006; Oestereich 2006). Among those methods were
the methods by Grady Booch, James Rumbaugh, and Ivar Jacobson (Rupp, Hahn
et al. 2005; Kecher 2006), who can be seen as the founding fathers of UML (Hänle
2007). At the beginning of the 1990s, they developed, independently of each other,
modeling languages which support object-oriented software development (Hänle
2007). Of those methods, the methods by Booch and Rumbaugh were preferably
used (Oestereich 2006).
In the middle of the 1990s, the software company Rational Software (IBM 2012)
recruited first Booch and Rumbaugh and later also Jacobson (Rupp, Hahn et al. 2005).
While working for the same company, they had the idea to unify their modeling
approaches (Oestereich 2006), and hence the name “Unified Modeling Language.”
Many important companies realized that the unification introduced a quasi-
standard and that a uniform modeling language simplified the communication
between the developers (Rupp, Hahn et al. 2005; Kecher 2006). Some of those
companies, for example, IBM, Microsoft, and Oracle, founded the consortium “UML
Partners” to improve the use of UML in the industry (Kecher 2006).
In 1997, UML Version 1.1 was submitted to and accepted by the OMG (Object
Management Group, “an international, open membership, not-for-profit computer
industry consortium [for standardization] since 1989” (Object Management Group
2012)). Since then, the further development of UML has been done by the OMG
(Rupp, Hahn et al. 2005; Kecher 2006; Oestereich 2006). The following versions
included some corrections and extensions of UML.
UML 2.0 from 2004 was an in-depth revision of UML. This was necessary
because the corrections and extensions did not only improve UML but also made
it more complex and harder to learn (Kecher 2006; Oestereich 2006). Unnecessary
or unused elements, for example, parameterizable collaborations, were left out, and
other elements, for example, appropriate notations to specify real-time systems, were
added (Rupp, Hahn et al. 2005). Areas with weak semantics were specified (Rupp,
Hahn et al. 2005). The changes are described in more detail in (Rupp, Hahn et al.
2005).
The most recent version is UML 2.4.1. A major difference between the newer
versions of UML (UML 2.2 and later versions) compared to UML 2.0 is the so-called
profile diagram which did not exist in the earlier versions of UML.
The literature database WTi (WTI-Frankfurt eG 2011) contains 1379 entries for
the search request “UML” since 1998. Most of the recent articles do not describe
the UML itself or extensions of it. They focus on applications. The articles about
UML itself often treat the verification of diagrams or the automation of generating
code from diagrams. A few examples of articles are listed in this section. Corre-
spondingly, reviews tend to be domain-specific, for instance, on executable UML
We will only mention some diagrams here. For a more detailed introduction, please
refer to the English book (Fowler 2003), the German books (Rupp, Hahn et al. 2005;
Oestereich 2009; Kecher 2011), and the UML specification (Object Management
Group 2011).
If one wants to study UML in more detail or wants to study one specific diagram
more intensely, the book (Rupp, Hahn et al. 2005) can be suggested. In this book,
the diagrams are described comprehensively and in detail (Hänle 2007). The book
can be used by beginners as a study guide or by advanced users as a reference (Hänle
2007). Since the book is from 2005, it does not cover the newest version of UML
and does not include the profile diagram.
Oestereich (2009) only describes the diagrams that he considers the most impor-
tant and most commonly used, so the UML specification (Object Management Group
2011) or an additional book, for example, (Rupp, Hahn et al. 2005), might be needed.
The book, judged on the basis of an earlier edition, is well suited for a quick introduction
to UML (Hänle 2007).
Kecher (2006) gives a compact and comprehensive summary of UML. For each
diagram type, he has a section about the most common mistakes made when applying
the diagram (Hänle 2007). All diagrams are explained but not always in all details. For
the details, one additionally needs (Rupp, Hahn et al. 2005) or the UML specification
(Object Management Group 2011). The newest version was released in 2011 (Kecher
2011).
(Fowler 2003) is a well-known and comprehensive English introduction to UML.
Since it is a compact book like (Oestereich 2009), one again additionally needs the
UML specification (Object Management Group 2011). The book is also translated
into German.
There are seven diagrams to represent the structure: the class diagram, the composite
structure diagram, the component diagram, the deployment diagram, the object
diagram, the package diagram, and the profile diagram.
There are also seven diagrams to represent the behavior: the activity diagram, the
use case diagram, the interaction overview diagram, the communication diagram,
the sequence diagram, the timing diagram, and the state machine diagram.
Fig. 13.1 Different forms of representing the class “testclass,” following (Hänle 2007)
A class diagram shows the classes used in a system and their relations where a class
is an abstraction of a set of objects with the same properties (Hänle 2007).
13.6.1.1 Class
A class can be understood as a construction plan for objects with the same attributes
and the same operations (Kecher 2006). The objects which are constructed according
to the construction plan are called instances (Kecher 2006).
A class is represented by a rectangle. This can be divided into up to three segments
(the class name, the attributes, and the operations), see Fig. 13.1.
The attributes describe properties of the class in the form of data. For example, the
attribute “activated” of Fig. 13.1 stores a truth value.
Operations describe the class’s behavior. In Fig. 13.1 the objects in “testclass”
can be programmed and activated.
The classes can be described in more detail by adding symbols for the visibility and
the return type to the descriptions in the diagram, see Fig. 13.2. Visibility symbols
are written in front of the attribute’s/operation’s name. The return type is written
behind the attribute’s/operation’s name, separated from it by a colon. If there is no
return type behind an operation, it is assumed that this operation does not return a
value.
The meaning of the visibility symbols can be found in Table 13.1.
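To relate the notation to code, the class of Fig. 13.1 could, for example, be written as the following Python sketch (illustrative only; Python has no keywords for the UML visibility symbols, so visibility can only be indicated in comments):

class TestClass:
    def __init__(self):
        self.activated = False      # attribute "activated" storing a truth value

    def program(self):              # operation without a return value
        print("programming the device")

    def activate(self):             # operation returning a truth value
        self.activated = True
        return self.activated

instance = TestClass()              # objects built from this "construction plan" are instances
instance.program()
print(instance.activate())          # -> True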
13.6.1.2 Association
The relationship between two classes is represented by a line which connects the two
classes. It is called a (bidirectional) association. A plain line says that the classes
know each other and can interact with each other, that is, they can access public
methods and attributes of the other class (Hänle 2007).
Numbers at the ends of the line indicate the so-called multiplicity of the class.
The multiplicity of a class is the number of instances of this class which interact
with the other class, see Fig. 13.3 and Table 13.2.
If the line between two classes is an arrow, the association is unidirectional, see
Fig. 13.4. The class at the tip of the arrow does not know the class at the other end
of the arrow and does not have access to it.
An association that links a class to itself is called a reflexive association (Rupp,
Hahn et al. 2005; Kecher 2006), see Fig. 13.5.
Fig. 13.3 Example of an association where two instances of “testclassA” interact with one instance
of “testclassB,” following (Hänle 2007)
Fig. 13.4 Example of a unidirectional association where “testclassB” does not have access to
“testclassA” but the reverse is true, following (Hänle 2007)
Remark (Hänle 2007): The reflexive association does not link an instance to itself.
It links instances of the same class to each other.
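In code, these association examples can be mimicked by object references; the following Python sketch (illustrative only; class names follow Figs. 13.3 and 13.4 where possible, otherwise they are chosen freely) shows the bidirectional association with multiplicities 2 and 1, a unidirectional association, and a reflexive association:

class TestClassB:
    def __init__(self):
        self.partners = []           # bidirectional: B keeps references to its A instances

class TestClassA:
    def __init__(self, b):
        self.b = b                   # each A interacts with exactly one B (multiplicity 1)
        b.partners.append(self)      # two A instances per B (multiplicity 2)

class Sensor:
    def __init__(self):
        self.peers = []              # reflexive association: links instances of the same class

class Logger:
    pass

class Machine:
    def __init__(self):
        self.logger = Logger()       # unidirectional: Machine knows Logger, not vice versa

b = TestClassB()
a1, a2 = TestClassA(b), TestClassA(b)
print(len(b.partners))               # -> 2, matching the multiplicities of Fig. 13.3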
13.6.1.3 Generalization
In a generalization, the more general class is the superclass and the more specialized class
is the subclass. The relationship is represented by a line with a hollow triangle at the
end of the superclass (Wikipedia 2012a).
13.6.1.4 Realization
13.6.1.5 Dependency
A dependency is weaker than an association. It only states that a class uses another
class at some point in time. “One class depends on another if the latter is a parameter
variable or local variable of a method of the former” (Wikipedia 2012a). It is repre-
sented by a dashed arrow from the dependent to the independent class (Wikipedia
2012a).
For an overview of all relationships between classes, see also Table 13.3.
13.6.2 Classifier
The composite structure diagram shows the interior of a classifier and its interactions
with the surroundings. It presents the structure of the system and a hierarchical
organization of the individual system components and their intersections (Hänle
2007).
The advantage compared to the class diagram is that in the composite structure
diagram classes can be nested into each other. This describes the relations between
the classes better than in the class diagram where they could only be modeled by
associations (Hänle 2007).
The composite structure diagram can be seen as a concrete possible implemen-
tation of the class diagram (Hänle 2007). In a composite structure diagram, every
object has a role, that is, a task in the system that it has to fulfill (Weilkiens 2006).
This role level closes the gap between the class level and the object level (Weilkiens
2006).
Each object of a class can only have one role but a class can be used for several
roles by assigning them to different objects of the class (Hänle 2007).
In the composite structure diagram, the following notational elements are used
(see also Fig. 13.6):
– Structured classes,
– Parts,
– Connectors,
– Ports, and
– Collaborations.
13.6.3.2 Connectors
13.6.3.3 Ports
13.6.3.4 Collaborations
Fig. 13.8 Example of a collaboration application to the collaboration of Fig. 13.7, based on (Hänle
2007)
the collaboration application is written into the ellipse, followed by a colon and the
name of the collaboration.
A state machine describes the behavior of a part of the system (Object Management
Group 2004). This part of the system is normally a class (Weilkiens 2006) but it can
also be any other classifier, for example, UseCases (Rupp, Hahn et al. 2005).
A state diagram describes a state machine. The elements represented in a state
diagram usually are:
– states,
– transitions,
– pseudostates, and
– signals and events.
See, for example, Fig. 13.9.
State machines are based on the work of David Harel who combined the theories
of general Mealy and Moore models to create a model that can be used to describe
complex system behavior (Object Management Group 2004; Weilkiens 2006).
13.6.4.1 Transitions
The guard is a condition that has to be fulfilled for the state transition. If it is not
fulfilled, the system stays in the old state. The guard is written in square brackets and
evaluates to True or False at any point in time (Rupp, Hahn et al. 2005).
The behavior describes which function is executed during the state transition
(Hänle 2007).
Events can occur in several places in the diagram. An event can be a trigger for a
state transition. It is also possible to define an event that leads to the execution of a
behavior within a state. In this case, the event does not lead to a state transition but
to the execution of a function (Hänle 2007).
An event is called a signal if it is created at and sent from a signal station (i.e.,
the arrow starts at a signal station) (Hänle 2007).
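A guarded transition of the form “trigger [guard] / behavior” can be sketched in Python as follows (illustrative only; state, trigger, and guard names are chosen freely and are not from the cited sources):

class EngineControl:
    def __init__(self):
        self.state = "idle"
        # (current state, trigger) -> (guard, behavior, target state)
        self.transitions = {
            ("idle", "start"): (
                lambda ctx: ctx["power_on"],              # guard, written in square brackets in UML
                lambda ctx: print("spinning up engine"),  # behavior executed during the transition
                "running",
            ),
        }

    def fire(self, trigger, ctx):
        key = (self.state, trigger)
        if key in self.transitions:
            guard, behavior, target = self.transitions[key]
            if guard(ctx):           # if the guard is not fulfilled, the old state is kept
                behavior(ctx)
                self.state = target

machine = EngineControl()
machine.fire("start", {"power_on": False})   # guard False: state remains "idle"
machine.fire("start", {"power_on": True})    # guard True: behavior runs, state becomes "running"
print(machine.state)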
13.6.4.3 State
“A state models a situation during which some (usually implicit) invariant condition
holds” (Object Management Group 2004).
A state is represented by a rounded rectangle. The rectangle is divided by a
straight horizontal line. The upper part contains the name. The lower part can be
used to describe the state’s internal behavior. The internal behavior also has a trigger
but does not lead to a state transition.
Remark Rupp, Hahn et al. (2005) suggest not to use triggers on the outgoing
transition because this is normally not necessary.
“A message sent from outside the diagram can be represented by [an arrow]
originating from a filled-in circle (‘found message’ in UML) or from a border of the
sequence diagram (‘gate’ in UML)” (Wikipedia 2012b).
Sequence diagrams can become complex when there are many interactions.
Chapter 12 of (Rupp, Hahn et al. 2005) explains methods to modularize the diagrams
to have a clearer structure.
Like the sequence diagram, the timing diagram also displays interactions between
objects. It describes the temporal condition of state transitions of one or more involved
objects (Oestereich 2006; Hänle 2007).
A timing diagram has a horizontal time axis. The participants of the interaction
and their states are listed above each other in the diagram. An example of a timing
diagram is shown in Fig. 13.11 (Hänle 2007).
The state of a participant can be seen on its lifeline, which is read as follows. As
long as the lifeline is a horizontal line, the participant is in the same state. A vertical
or diagonal line represents a state transition (Hänle 2007).
Remark The UML specification does not specifically assign a term to the lines.
Hence, different names are used in the literature for what is called lifeline here.
Messages that lead to a state transition are represented by arrows like in the
sequence diagram.
The time axis at the bottom of the diagram can be labeled with actual time units
but it can also be kept abstract. If it is abstract, the time unit can be written in a
note, see, for example, Fig. 13.26. Time points can be defined by assigning a time
to a parameter, for example, t = 10:56. Time intervals can be described by two
time points in curly braces, for example, {10:56..11:30}. Relations in time can be
described through the parameter, for example, t + 3, which would be 10:59 for the
previous example (Hänle 2007).
In order to have the option of variable time points, the keyword “now” is available
in UML (Object Management Group 2004). This keyword always refers to the current
time. From this, time dependencies can be named, see, for example, Fig. 13.26 (Hänle
2007).
Another keyword of UML is “duration” to determine time spans that can be used
for further tasks (Hänle 2007).
This subsection briefly introduces further UML diagrams. For a deeper introduction,
please refer to the literature, for example, the books presented in Sect. 13.5.
A deployment diagram in the unified modeling language serves to model the physical
deployment of artifacts on deployment targets.
Activity diagrams are, loosely defined, diagrams for showing workflows of stepwise
activities and actions, where one can, for example, display choice, iteration, and
concurrency. In the unified modeling language, activity diagrams can be used to
describe the business and operational step-by-step workflows of components in a
system. An activity diagram shows the overall flow of control.
Interaction overview diagrams in the unified modeling language are a type of activity
diagram. They are a high-level structuring mechanism for sequence diagrams.
13.6.8 Profiles
13.6.8.1 Constraints
A constraint defines restrictions for the whole model. Constraints are written in curly
braces and have to be evaluable in terms of a truth value. A constraint can be written in
programming code or in natural language. An example of the first is {size > 5}, while
an example of the second is {A class can have at most one upper class} (Hänle
2007).
A tool-based verification of constraints is only possible if they were defined in
a formal language (Object Management Group 2004) and if the UML tool in use
supports this function (Hänle 2007).
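As an illustration of constraints in a formal, machine-checkable form (a Python sketch, not from the cited sources; the model element is a simple dictionary chosen for this example), each constraint is a predicate that evaluates to a truth value:

constraints = {
    "{size > 5}": lambda element: element.get("size", 0) > 5,
    "{a class has at most one upper class}": lambda element: len(element.get("superclasses", [])) <= 1,
}

element = {"name": "testclass", "size": 3, "superclasses": ["Device"]}

for text, holds in constraints.items():
    print(text, "->", holds(element))  # first constraint -> False, second -> True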
13.6.8.2 Stereotypes
A stereotype extends the description of an element in the model. In this way, the
elements can be defined and labeled more precisely (Hänle 2007).
Example (Hänle 2007): A class that describes hardware and software components
could use the stereotypes «hardware» and «software» to distinguish the two types of
component.
UML does not have a diagram to describe requirements. This can be found in the
system modeling language (SysML), the so-called requirement diagram. It shows
all requirements and their relations with each other and with other elements of the
model (Hänle 2007). SysML will be described in detail in Chap. 14.
The requirements are described with the same notation as classes, extended by
the stereotype «requirement». They are only allowed to have two attributes, “text”
and “id” (Weilkiens 2006).
The attributes “text” and “id” are simple strings that the modeling person can
define. The attribute “id” should be a unique ID for the requirement; the attribute
“text” should describe the requirement. For an example, see Fig. 13.24 (Hänle 2007).
Figure 13.12 shows the abstract syntax for requirements stereotypes as presented
in the SysML specification (OMG 2010).
Weilkiens (2006) suggests that one uses specialized «requirement» stereotypes for
specific projects because SysML does not categorize the requirements. Special-
ized «requirement» stereotypes can distinguish between different categories of
requirements. For example, safety requirements can be labeled by the stereotype
«safetyRequirement» (Hänle 2007).
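The «requirement» element with its two attributes and a specialized «safetyRequirement» stereotype can be mirrored by a small Python sketch (illustrative only, not a SysML tool API; the example requirement text is taken from the airbag example of Sect. 12.4):

from dataclasses import dataclass

@dataclass
class Requirement:                     # corresponds to the stereotype «requirement»
    id: str
    text: str

@dataclass
class SafetyRequirement(Requirement):  # specialized stereotype «safetyRequirement»
    pass

req = SafetyRequirement(id="1", text="Open airbag 30-40 ms after collision detection.")
print(req)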
13.6.9.2 Relationships
It is also possible to denote the satisfy relationship on a note. Here, the note
contains the word “Satisfies” in bold letters in the headline and below the stereo-
type «requirement» followed by the requirement. The note is attached to the design
element by a dashed line, see Fig. 13.14 (Hänle 2007).
The namespace containment relationship describes that one requirement is
contained in another requirement (Weilkiens 2006). This relationship is represented
by a line from the contained requirement to the containing requirement with a circle
with a cross at the end of the containing requirement, see Fig. 13.15 (Hänle 2007).
The refine relationship describes that an element of the model describes the prop-
erties of a requirement in more detail (Weilkiens 2006). It is modeled in the same way
Fig. 13.14 Example of a satisfy relationship displayed in a note, extract from Fig. 13.23
Fig. 13.15 Example of a namespace containment relationship, extract from Fig. 13.23
as the satisfy relationship, only with the stereotype «refine» on the arrow instead of
«satisfy». In the alternative notation, “Satisfies” is replaced by “RefinedBy” (Hänle
2007).
The trace relationship is a general relationship that only states that two elements
are related without further specification. It is also modeled in the same way as the
satisfy relationship, only with the stereotype «trace» on the arrow instead of «satisfy».
In the alternative notation, “Satisfies” is replaced by “TraceFrom” or “TraceTo”
(Hänle 2007).
Instead of a use case diagram, we use a SysML requirement diagram to display the
safety requirement, see Fig. 13.16. We define the stereotype “safetyRequirement” as
a specialization of the usual “requirement” stereotype. For simplicity, we use the ID
“1” (Hänle 2007).
The next structural level is visualization through a UML collaboration. Regarding
the requirement one notices that there are two types of elements which play an
important role: the whole safety system and the separate elements of the safety
system.
The distribution of roles is an important aspect of safety requirements. In UML,
roles can be represented by collaborations in the composite structure diagram, see
Sect. 13.6.3.4.
Figure 13.17 shows the collaboration for our safety requirement. Apart from the
newly defined stereotype “safetyRequirement” only standard notation is used. The
collaboration states exactly what the requirement asks. It says that it is a parts–whole
relation and that all elements with the role “Element of Safety System” have to be
located within the safety system.
We applied the safety requirement to a system from a project of the Fraun-
hofer Ernst-Mach-Institut. The system under consideration, with component names removed, is
displayed in Fig. 13.18. In the figure, new names are assigned to some of the compo-
nents to be able to refer to them. For example, the safety system is called “deviceA”
(see Fig. 13.18).
Fig. 13.17 Representation of the safety requirement “single device” in a collaboration (Hänle 2007)
Fig. 13.18 The safety requirement collaboration applied to an example system, following (Hänle
2007)
For every module, it had to be decided which role in the collaboration it has. For
this, it was important that a person with substantial knowledge of the system was
involved (Hänle 2007).
Remark A UML simulation can also help to identify the roles of the components
(Storey 1996). This option was also used for the system of the project.
It turned out that one element (“elementA” in Fig. 13.18) was an element of the
safety system “deviceA” but was not located within “deviceA.” Hence, the safety
requirement was not fulfilled.
During the analysis of the system of Fig. 13.18 it turned out that there was no role
that could have been assigned to “elementB” or “elementC.” Hence, the collaboration
was extended to the one shown in Fig. 13.19. In the extension, the new role “Not
a Safety System Element” is defined. It is connected to the safety system by two
associations with the new stereotypes “outside of” and “inside of” which indicate
whether the element that is not part of the safety system is located within or outside of
the safety system. With this extension, one can assign the role “Not a Safety System
Element” to “elementB” and “elementC” which are outside of the safety system.
The whole safety requirement is summarized in Fig. 13.20.
The idea of such semi-formalized safety requirements is that they have to be
fulfilled in UML models of safety-critical systems, see (Hänle and Häring 2008).
We can address them as safety patterns.
A similar analysis can be done with the safety requirement of Fig. 13.21.
The safety requirement is represented in a collaboration in Fig. 13.22. The asso-
ciation between the roles “Safety System” and “Sensor” says that a safety system
must be connected to at least two sensors. This is not explicitly stated verbally
Fig. 13.20 Summary of the safety requirement “single device” (Hänle 2007)
«safetyRequirement» Separate Device
Text = “Sensors of a safety system which should detect the same event have to be independent of each other and spatially separated.”
Id = “1”
Fig. 13.21 SysML requirement diagram for the safety requirement “separate devices,” following
(Hänle 2007)
Fig. 13.22 Collaboration representing the safety requirement “separate devices” (Hänle 2007)
in the safety requirement but follows from it, see the symbol “2..*”. The association
does not say whether the sensors are within or outside of the safety system. The
association of the role “Sensor” to the role “Sensor” is reflexive. However, it says
that no sensor (“0”) is allowed to be in contact with another sensor.
The safety requirement can be applied to projects in a similar way as the safety
requirement in Fig. 13.18.
Fig. 13.23 Complete description of the safety requirement “separate devices with independent
physical criteria,” following (Hänle 2007)
«safetyRequirement» Stop Blade When Protection Removed
Text = “If the protection is removed from the blade while the machine is in use, the blade has to stop at the latest after 1 second.”
Id = “1”
Fig. 13.24 Safety requirement “Stop Blade When Protection Removed,” following (Hänle 2007)
The roles that are described in this safety requirement are the blade and the
protection of the blade. “Removing the protection” and the time period of 1 s (which
is a sample time only) are also important but cannot be represented in a collaboration.
The safety requirement does not say how the two roles “blade” and “protection” are
related to each other and to the bread cutter. In the collaboration, they are hence
presented as parts of the bread cutter (Hänle 2007), see Fig. 13.25.
The requirements that could not be represented in the collaboration (the time
period of 1 s and the removal of the protection) can be shown in a timing diagram.
This is done in Fig. 13.26 (Hänle 2007).
Fig. 13.25 Collaboration “stop blade when protection removed” (Hänle 2007)
Fig. 13.26 Timing diagram for the safety requirement “stop blade when protection removed”
(Hänle 2007)
The event of removing the protection and the resulting state transition of the role
“protection” are indicated by the word “remove” and the vertical line from “protection
closed” to “protection open.” The arrow from “protection open” to “blade rotating”
indicates the function that stops the blade. The parameter t is set to the current time
(t = now) and can be used to express the time period of “at most 1 s” (displayed as
{t..t + 1}) (Hänle 2007).
In order to verify the requirement, the UML model can be simulated. The sequence
diagram from the simulation can be used to test whether the removal of the protection
leads to a stop of the rotating blade. If the model behaves in the way it should, the
time between the occurrence of the event and the stop of the blade has to be measured
in a second step, for example, by adding constructs to measure the time to the model.
Both steps together would show the fulfillment of the requirements (Hänle 2007).
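The two verification steps can be sketched as a check over a simulated event trace (illustrative Python; the trace values and event names are assumptions of this example, not output of an actual UML tool):

trace = [
    (0.00, "protection_removed"),   # (time in seconds, event) as logged by a simulation run
    (0.42, "blade_stopped"),
]

def stop_within_limit(trace, limit_s=1.0):
    # Step 1: check that the blade stops at all; step 2: check that it stops within the limit.
    t_removed = next(t for t, e in trace if e == "protection_removed")
    t_stopped = next((t for t, e in trace if e == "blade_stopped" and t >= t_removed), None)
    if t_stopped is None:
        return False                 # the blade never stops: the requirement is violated
    return t_stopped - t_removed <= limit_s

print(stop_within_limit(trace))      # -> True for this sample trace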
The complete description of the requirement can be found in Fig. 13.27.
This type of requirement is dynamic. That is, this technique only has a testing
character because only concrete situations can be regarded. Hence, this method can
only show the presence of failures but not their absence (Liggesmeyer 2002).
Regarding the examples of Sect. 13.7, one can assign four types of safety
requirements to those examples (Hänle 2007; Hänle and Häring 2008):
(1) Construction requirement,
(2) Time requirement,
(3) Requirement of a system’s property, and
(4) General safety requirement.
Of course, this list is not complete. Further examples are expected to yield more types
of safety requirements that can be modeled with UML/SysML. For details on the
four types, see Tables 13.4 and 13.5.
13.9 Questions
(1) Which types of diagrams are used in UML for basic system modeling?
(2) Name sample diagrams that can be used to graphically represent safety
requirements.
(3) Who can see which part of the following class? (Fig. 13.28)
Fig. 13.27 Complete description of the safety requirement “stop blade when protection removed,”
following (Hänle 2007)
(4) Symbolize the following relations between classes A and B:
(a) B does not have access to A,
(b) B is a part of A,
(c) As composition: B is a part of A,
(d) As realization: A is the client of B,
(e) B is the superclass of A, and
(f) A depends on B.
Table 13.4 Four types of safety requirements with examples. The descriptions are quoted from (Hänle and Häring 2008); the examples are assigned as in (Hänle 2007) and (Hänle and Häring 2008)

Type 1: "Construction requirement". Example: Sections 13.7.1 and 13.7.2
Type 2: "Time Requirement or chronological order of functions, states or events". Example: Section 13.7.3
Type 3: "Property of the system that has to be fulfilled and must be verifiable at all time". Example: Section 13.7.2, compliance with a norm
Type 4: "Requirement that describes the safety of the system, e.g., prevention of a hazard or risk". Examples: "The system should be safe" (Hänle and Häring 2008), "The system should prevent the occurrence of the hazardous event X" (Hänle and Häring 2008)
(7) What would be a correct notation for a transition to describe the situation that an airbag is released after a shock but only if the car slows down by more than p0 percent of its speed within time t0?
13.10 Answers
(1) See Sects. 13.6.1, 13.6.3, 13.6.4, 13.6.7.5, 13.6.5, and 13.6.7.6.
(2) See, for instance, the diagrams used in Sect. 13.7.
(3) Activated: only inside the class + derived classes,
Active instances: only inside the class,
Activate: classes in the same package, and
Nothing: outside of the class.
(4) Use Table 13.3.
(5) See Sect. 13.6.3.2.
(6) Class diagram: class, reflexive association, realization;
Composite structure diagram: connector, port, structured class, collaboration;
All/none: classifier, profile, constraint, stereotype;
State diagram: event, pseudo state, trigger;
Sequence diagram: message, lifeline;
Table 13.5 Four types of safety requirements and their representation and verification with UML/SysML (Hänle and Häring 2008)

Type 1
Properties: non-functional, static, concrete, effect-oriented, technical, time-critical, (standardized)
Description: Construction requirement
Required UML/SysML elements: UML collaboration, UML composite structure diagram, UML/SysML sequence diagram, SysML requirement diagram
Approach in short form:
1. Transform safety requirement into a requirement diagram
2. Identify and define existing roles
3. Describe the roles in a collaboration
4. Verification with «CollaborationUse» in model of the system under control
5. By simulating the system, it is possible to get a better understanding about modules which could not be assigned to a role

Type 2
Properties: functional, dynamical, effect-oriented, non-technical, (time-critical)
Description: Time Requirement or chronological order of functions, states or events
Required UML/SysML elements: UML collaboration, UML timing diagram, UML/SysML sequence diagram
Approach in short form:
1. Transform safety requirement into a requirement diagram
2. Identify and define existing roles
3. Describe the roles in a collaboration
4. Show the time behavior in a timing diagram
5. Verification or at least an educated guess can be achieved by simulating the system's UML model

Type 3
Properties: static, effect-oriented, non-technical, time-uncritical, (standardized)
Description: Property of the system that has to be fulfilled and must be verifiable at all times
Required UML/SysML elements: UML/SysML satisfy relationship
Approach in short form:
1. Transform safety requirement into a requirement diagram
2. Verification by analyzing verbal safety requirement with already existing methods
3. Verification should be possible all the time (static/dynamic)
4. Only the conformance with the satisfy relation can be noted
5. Alternatively, the safety requirement can be assigned to another safety requirement that is describable in UML and in the same context

(continued)
References
Ahmad, Tanwir; Iqbal, Junaid; Ashraf, Adnan; Truscan, Dragos; Porres, Ivan (2019): Model-based
testing using UML activity diagrams: A systematic mapping study. In Computer Science Review
33, pp. 98–112. https://doi.org/10.1016/j.cosrev.2019.07.001.
Ciccozzi, Federico; Malavolta, Ivano; Selic, Bran (2019): Execution of UML models: a systematic
review of research and practice. In Softw Syst Model 18 (3), pp. 2313–2360. https://doi.org/10.
1007/s10270-018-0675-4.
da Silva, João Pablo S.; Ecar, Miguel; Pimenta, Marcelo S.; Guedes, Gilleanes T. A.; Franz,
Luiz Paulo; Marchezan, Luciano (2018): A systematic literature review of UML-based domain-
specific modeling languages for self-adaptive systems. In: 2018 ACM/IEEE 13th International
Symposium on Software Engineering for Adaptive and Self-Managing Systems. SEAMS 2018:
28-29 May 2018, Gothenburg, Sweden: proceedings. With assistance of Jesper Andersson.
SEAMS; Association for Computing Machinery; Institute of Electrical and Electronics Engineers;
ACM/IEEE International Symposium on Software Engineering for Adaptive and Self-Managing
Systems; IEEE/ACM International Symposium on Software Engineering for Adaptive and Self-
Managing Systems; International Symposium on Software Engineering for Adaptive and Self-
Managing Systems; International Conference on Software Engineering (ICSE). Piscataway, NJ:
IEEE, pp. 87–93.
Elamkulam, Janees; Glazberg, Ziv; Rabinovitz, Ishai; Kowlali, Gururaja; Gupta, Satish Chandra;
Kohli, Sandeep et al. (2007): Detecting Design Flaws in UML State Charts for Embedded Soft-
ware. In Eyal Bin, Avi Ziv, Shmuel Ur (Eds.): Hardware and software, verification and testing.
Second International Haifa Verification Conference, HVC 2006, Haifa, Israel, October 23-26,
2006; revised selected papers, vol. 4383. Berlin: Springer (Lecture Notes in Computer Science,
4383), pp. 109–121.
Engels, Gregor; Küster, Jochen M.; Heckel, Reiko; Lohmann, Marc (2003): Model-Based Verifi-
cation and Validation of Properties. In Electronic Notes in Theoretical Computer Science 82 (7),
pp. 133–150. https://doi.org/10.1016/s1571-0661(04)80752-7.
Fontan, B.; Apvrille, L.; Saqui-Sannes, P. de; Courtiat, J.-P. (2007): Real-Time and Embedded
System Verification Based on Formal Requirements. In: International Symposium on Industrial
Embedded Systems, 2006. IES ’06; Antibes Juan-Les-Pins, France, 18–20 Oct. 2006. Institute of
Electrical and Electronics Engineers; International Symposium on Industrial Embedded Systems;
IES 2006. Piscataway, NJ: IEEE Service Center, pp. 1–10.
Fowler, M. (2003). UML Distilled: A Brief Guide to the Standard Object Modeling Language,
Addison-Wesley Professional.
Gogolla, M., F. Büttner and M. Richters (2006). USE: A UML-based Specification Environment.
Bremen, Universität Bremen.
Hänle, A. (2007): Modellierung und Spezifikation von Anforderungen eines sicherheitskritischen
Systems mit UML, Modeling and Specification of Requirements of a safety critical System with
UML. Hochschule Konstanz für Technik, Wirtschaft und Gestaltung (HTWG), University of
Applied Sciences; Fraunhofer EMI, Efringen-Kirchen. Computer Science.
Hänle, A.; Häring, I. (2008): UML safety requirement specification and verification. In Sebastian Martorell, C. Guedes Soares, Julie Barett (Eds.): Safety, Reliability and Risk Analysis: Theory, Methods and Applications, European Safety and Reliability Conference (ESREL) 2008, vol. 2. Valencia, Spain: Taylor and Francis Group, London, pp. 1555–1563.
He, Weiguo; Goddard, S. (2000): Capturing an application’s temporal properties with UML for
Real-Time. In: Proceedings, Fifth IEEE International Symposium on High Assurance Systems
Engineering (HASE 2000). November 15–17, 2000, Albuquerque, New Mexico. IEEE Interna-
tional High-Assurance Systems Engineering Symposium; IEEE Computer Society. Los Alamitos,
Calif: IEEE Computer Society, pp. 65–74.
IBM. (2012). “Rational Software.” Retrieved 2012-04-24, from http://www-01.ibm.com/software/
de/rational/.
Kecher, C. (2006). UML 2.0: Das umfassende Handbuch, Galileo Computing.
Kecher, C. (2011). UML 2.0: Das umfassende Handbuch, Galileo Computing.
Larisch, M; Siebold, U; Häring, I (2011): Safety aspects of generic real-time embedded software
model checking in the fuzing domain. In Christophe Berenguer, Antoine Grall, Carlos Guedes
Soares (Eds.): Advances in Safety, Reliability and Risk Management. Esrel 2011. 1st ed. Baton
Rouge: Chapman and Hall/CRC, pp. 2678–2684.
Lavagno, L. and W. Mueller. (2006). “UML: A Next-Generation Language for SoC Design.”
Retrieved 2012-05-16, from http://electronicdesign.com/Articles/Index.cfm?ArticleID=12552&
pg=1.
Liggesmeyer, P. (2002). Software-Qualität: Testen, Analysieren und Verifizieren von Software.
Heidelberg, Berlin, Spektrum Akademischer Verlag.
Lucas, Francisco J.; Molina, Fernando; Toval, Ambrosio (2009): A systematic review of UML
model consistency management. In Information and Software technology 51 (12), pp. 1631–1645.
https://doi.org/10.1016/j.infsof.2009.04.009.
McCabe, J. (1976). "A Complexity Measure." IEEE Transactions on Software Engineering SE-2.
Mukhtar, M I; Galadanc, B S (2018): Automatic code generation from UML diagrams: the state of
the art 13 (4), pp. 47–60.
Ober, I., S. Graf and I. Ober (2006). “Validating timed UML models by simulation and verification.”
International Journal on Software Tools for Technology Transfer (STTT) 8(2): 128.
Object Management Group (2004). Unified Modeling Language Superstructure.
14.1 Overview
The chapter combines and extends (Larisch et al. 2008a, b; Siebold et al. 2010a,
b; Siebold and Häring 2012), which mainly focus on embedded multi-technological
systems. In a similar way, it can be applied to checkpoint systems (XP-DITE 2017;
TRESSPASS 2020) and automated industrial inspection systems of chemical plants
(ISA4.0 2020).
In SysML, there are nine different diagrams which are divided into two categories, behavior diagrams and structure diagrams, with the exception of the requirement diagram, which belongs to neither category, see Fig. 14.2.
To model the static aspects of a system, e.g., the architecture of a system, the
structure diagrams are used. Structure diagrams are the block definition diagram,
SysML’s extension
to UML
the internal block diagram, the parametric diagram, and the package diagram. The
behavior diagrams are used to model the dynamic aspect of the system, e.g., interac-
tions between components. Behavior diagrams are the activity diagram, the sequence
diagram, the state machine diagram, and the use case diagram.
“The Block Definition Diagram is based on the UML Class Diagram. It describes
the structure of a system […]. System decomposition can be represented by using
composition relationships and associations. Therefore, so called ‘blocks’ are used.
A block represents software, hardware, personnel facilities, or any other system
element (OMG 2008). The description of a system in a Block Definition Diagram
can be simple by only mentioning the names of several modules or detailed with the
description of all information, features and relationships” (Larisch et al. 2008a, b).
The example in Fig. 14.3 shows all the interfaces and quantities relevant for SIL
determination of safety-related systems: component type (A,B), safe failure fraction
(SFF), and hardware failure tolerance (HFT), see Sect. 11.9. It has been generated by
a SysML modeling environment as described in (Siebold 2013). A similar approach
has also been used in (Siebold et al. 2010a, b).
The block definition diagram (bdd, BDD) shows the principal physical compo-
nents of a system and its hierarchical structure in a tree-like diagram. It allows
developing a number of conceptual architectures that can be used as a starting point
for analysis. Block definition diagrams are based upon class diagrams. It is only
meant to show composition relationships between elements and their multiplicities.
Main components are called system parts or system blocks.
The block definition diagram is similar to the class diagram because it gives the
allowed classes (called blocks) but does not yet show the relation between the blocks
in a real system.
The bdd is well suited to provide an overview of all possible system components,
their interfaces, and internal states. Often it is also used to show inclusions and part
relations. For safety analysis, it is important to know all components, their internal
states, and quantities that are accessible from outside.
Fig. 14.3 Example for a block definition diagram (Siebold et al. 2010a, b)
Fig. 14.4 Example for an internal block diagram (Siebold et al. 2010a, b)
“The Activity Diagram is modified from UML 2.0. With the Activity Diagram the
flow of data and control between activities can be modeled. It represents the work flow
of activities in a system. The sequence can be sequential or split in several concurrent
sequences. Causes of conditions can be branched into different alternatives. With the
use of activity diagrams one can show the behavior of a system, which is useful in
the concept phase of the safety lifecycle. Duties and responsibilities of modules can
be modeled […]” (Larisch et al. 2008a, b).
They are similar to flowcharts and show the flow of behavior, decision points, and
operation calls throughout the model. Often used diagram elements (see Fig. 14.5) are
InitialNode (or Start Node), Action, Events, DecisionNode, MergeNode, ForkNode,
JoinNode, and ActivityFinal (or Final Node). The activity starts at the initial or start
node and ends at the final node (OMG 2014).
Figure 14.5 depicts the symbols for action, start node, final node, decision node (if-then-else), merge node, and join node.
SysML activity diagrams (ad) show all allowed state transitions as well as the forking and merging of the allowed state flows. This can be used to identify dynamic system behavior and to assess the implications of system and subsystem state transitions for safety. The activity diagram should be used after and together with the SysML state machine diagram (smd) to first identify the component, subsystem, and system states before determining all allowed transitions.
As Fig. 14.5 only shows the expected system behavior, it cannot model the
effects of failures explicitly. However, it supports the inspection of system designs
to determine potential failures and their effects.
“State Machine Diagrams are also adopted unmodified from UML. They describe
the behavior of an object as the superposition of all possible state histories in terms
of its transitions and states” (Larisch et al. 2008a, b). See Fig. 14.6 for an example.
State machine diagrams (SMD) provide means of showing state-based behavior
of a use case, block, or part. Behavior is seen as a function of inputs, present states,
and transitions. A transition is the path between states indicated by a trigger/action
syntax.
SysML state machine diagrams determine all internal and external states of
components and allowed transitions. Thus, they are well suited to assess systemati-
cally possible transitions regarding safety. Whereas activity diagrams tend to show only the intended transitions on various levels, state machine diagrams list all possible transitions. However, a potential drawback of the state machine diagram is that it focuses more on single modeling elements.
Use case diagrams are as in UML. Use case diagrams can be used to describe the
main functions of a system, the internal processes, and the persons and systems that
are connected with the system (Larisch et al. 2008a, b), see Fig. 14.7 for an example.
Figure 14.8 provides a sequence diagram for the example showing a sequence of the
use of the system elements in case of a collision.
Fig. 14.6 Example of a state machine diagram (Siebold and Häring 2012)
(Fig. 14.7, recoverable content: for a car, the use case "Safety while driving" includes the use cases "Airbag", "Electronic Stability Control (ESC)", and "Other safety features"; annotated requirements include a probability of airbag malfunction on demand of less than 1.0E-3, a check for malfunctions on ignition with a warning light shown in case of a malfunction, and opening the airbag within 40 ms after a collision.)
Use case diagrams give the system its context. They set the system boundaries.
Use case diagrams show the relationship between external actors and the expected
uses of the system. They define the expected uses of the system from the point of
view of the people/systems that use it. Use cases can be found from the perspective
of the actors or users and from the expected system behavior, e.g., system support
user 1 with three services and user 2 with four services two of which overlap.
Use case diagrams are appropriate to identify expected and potential (mis)use contexts of safety-related and safety-critical systems. They may also be used to relate
the main functions of a system with its safety functions. They are versatile tools in
early phases of system development to clarify the main aims and respective safety
goals.
“Sequence Diagrams are adopted unmodified from UML. They describe the infor-
mation flow between elements in certain scenarios […]. Sending and receiving of
messages between entities are represented in sequence diagrams” (Larisch et al.
2008a, b).
Sequence diagrams (SD) are used to specify the messages that occur between
model elements. They can show timing information. Interaction operators added in UML 2.0 expand the functionality of sequence diagrams and reduce the number of scenarios needed to describe a system's behavior.
SysML sequence diagrams are of main interest for safety assessments because
they show (un)intended sequencing of activities, they can be quantified using time
information, and they show which components are expected to be active in which
way. This information can be used to assess safety issues, for instance, all critical
time requirements.
14.4 Questions

(1) Identify the most recently released specification documents of UML and SysML and respective tutorials.
(2) Which additional semi-formal languages that conform to the UML standard are currently available or under development?
(3) How are UML and SysML related?
(4) Name all SysML diagrams. What is, respectively, their main purpose?
(5) When in the safety life cycle would you recommend to use which types of
SysML diagrams?
14.5 Answers
(1) Starting point should be the official website of the Object Management Group (OMG) for UML and its documentation material.
(2) See answer to (1) as starting point.
References
15.1 Overview
Chapter 2 showed for all approaches presented in the book how to potentially use them for technically driven resilience engineering of socio-technical systems. While focusing on key concepts of resilience, neighboring concepts such as classical risk analysis were also used, which can be understood as a fundamental framework for technical safety assessment. The focus was on single-method assessment. Complementary to Chap. 2, Chap. 15 provides exemplary discussions on method combination for system reliability, safety and resilience analysis and improvement. However, the focus of this chapter is not on details of methods and their extension options but on method combination.
The previous chapters introduced different methods of technical safety with focus
on tabular and graphical system analysis methods, reliability prediction, and SysML
graphical system modeling. So far, how to combine methods efficiently has only been discussed in more detail in Sect. 8.2 (see, in particular, Table 8.1) and Sect. 8.11 (see Table 8.12), with a focus on tabular approaches. Hence, this chapter provides further recommendations on method combinations.
Chapter 10 categorized and exemplified development processes and Chap. 11 the sample safety development process of functional safety. Both types of processes need to be supported with appropriate techniques and measures. This is shown in more detail for methods for hardware and software in Sect. 11.8.8 by using an example of the method tables of (IEC 61508 S+ 2010). This refers to a very elaborate approach for method selection and combination as stipulated by the generic functional safety standard and implemented in many application standards.
This chapter continues the discussion of the combination of and interaction
between four different basic methods: SysML, HA, FMEA, and FTA. It uses the
example of an electric vehicle and the identification of faults in after-sales scenarios.
This can be understood as a use case of engineering resilience in the sense of fast
stabilization, response, and recovery post potential disruptive events during opera-
tion of modern green transport systems. In particular, since electrical vehicles are
Chapter 8 showed the relationship between the different hazard analysis methods,
see Fig. 8.1. The abbreviations used in the figure stand for hazard analysis (HA),
preliminary hazard list (PHL), preliminary hazard analysis (PHA), subsystem hazard
analysis (SSHA), system hazard analysis (SHA), and operating and support hazard
analysis (O&SHA).
Depending on the product, the procedure of the hazard analysis can also follow
different schemes, for example, the PHA and HA can be combined for very simple
systems. Table 8.1 shows which form of safety analysis needs input from which other
form. It reads as follows: the analysis method on the left side uses information from
the methods in the columns marked with an “X”.
Section 8.2 discussed how FMEA and FTA results are used to support the hazard
analyses. How they are related in detail to each other is shown exemplarily in
Sects. 15.4 and 15.5.
Fault tree analysis can be used to further analyze the results of an FMEA. In this
way, the interactions of components, interfaces, and functions can be considered
which are not covered in a classical FMEA, i.e., everything beyond single point of
failure analysis. An FMEA also does not consider the interaction of failures and
consequences.
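A minimal sketch of this interaction with purely illustrative data (component names, failure modes, and effects are assumptions, not taken from the referenced project): system-level effects collected from FMEA rows become candidate top events of fault trees, and the contributing failure modes become candidate basic events.

fmea_rows = [
    {"component": "battery cell",    "failure_mode": "open circuit", "system_effect": "loss of battery system"},
    {"component": "battery charger", "failure_mode": "no output",    "system_effect": "loss of battery system"},
    {"component": "inverter",        "failure_mode": "no AC output", "system_effect": "vehicle does not start"},
]

top_events = {}                       # candidate FTA top events -> candidate basic events
for row in fmea_rows:
    top_events.setdefault(row["system_effect"], []).append(
        row["component"] + ": " + row["failure_mode"])

for effect, causes in top_events.items():
    print(effect, "<-", causes)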
Figure 15.1 shows that FMEA results can be used in an FTA and vice versa. The
consequences from the FMEA can be used as top events for an FTA. An example
of such a failure on system level is the loss of the battery system. The idea is that on system level a similar failure as caused by a single point of failure can also be caused by a combination of several failures, which is captured by the fault tree.
Fig. 15.1 Interaction of FMEA and FTA. Inter alia, the combination of different lines of the FMEA, i.e., of failure modes, is considered by the FTA. The FMEA takes up all newly identified failure modes of the FTA. Based on Kanat (2014)
Complex systems are often not analyzed in one fault tree. Again, we consider the
example of an electric vehicle, see Fig. 15.2. The top event “vehicle does not start”
can have causes in different parts of the vehicle. It is likely that different components
(Fig. 15.2, recoverable content: top event "Vehicle unfit to drive" with the contributing failures "Failure of battery cell" and "Failure of battery charger".)
like the battery and the engine are analyzed individually and that the subtrees are
later combined to a system analysis.
When combining subsystem fault trees to a system fault tree, it is important
to consider interfaces between subsystems and dependencies of states. Note that
Fig. 15.2 does not consider interfaces explicitly. Often replacement parts are available
only at subsystem level. The final assessment needs to take the level of replacement
into account.
After combining all subsystem fault trees, minimal cut sets of the whole system and failure causes for system failures can be determined.
Figure 15.2 shows a system where no additional faults are expected due to the combination of subsystems. However, assume that the battery is close to a critical state (e.g., with respect to overall charging capacity) as well as the inverter (e.g., with respect to efficiency of the DC/AC conversion). The combination of both might lead to an overall system behavior that is critical in certain scenarios (e.g., long distances between charging stations or active safety systems that need high accelerations). Even more obvious are failures arising from interrupted interconnections of subsystems.
In Fig. 15.2, a simple example was chosen for illustration. The same method works when there are also other gates, for example, AND gates.
Example Consider a system of two cordless phones (two docking stations and two identical, exchangeable receivers). Assume you defined the subsystems "docking station" and "receiver." For each of them, you set up a fault tree with the top event
“subsystem is broken.” To analyze your system and the top event “no phone calls
possible,” you have to analyze the interactions. For example, you have to consider
failures where a damaged docking station damages one or both receivers. Hence, the
probability that the system fails is not simply the sum of the probabilities that one
of the components fails. It is also possible that you have an event in your subsystem
fault tree for the receiver that says “does not receive electricity from station.” In the
system fault tree, this has to be linked with the subsystem fault tree of the docking
station. You also have dependencies of events. For example, if there is no electricity,
both docking stations will not work.
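The following minimal sketch, with assumed (purely illustrative) probabilities per demand, shows why the system failure probability is not simply the sum of the individual failure probabilities: the power supply acts as a common cause for both docking stations, and the redundant receivers and stations must both fail.

p_power    = 0.01   # assumed probability of a mains power outage (common cause for both stations)
p_station  = 0.05   # assumed hardware failure probability of one docking station
p_receiver = 0.03   # assumed failure probability of one receiver

# Top event "no phone calls possible": no usable docking station OR no usable receiver
p_no_station  = p_power + (1 - p_power) * p_station ** 2   # power out, or both stations defective
p_no_receiver = p_receiver ** 2                            # both receivers defective
p_top = 1 - (1 - p_no_station) * (1 - p_no_receiver)

# A naive sum over all individual failure probabilities ignores redundancy and the common cause
p_naive_sum = 2 * p_station + 2 * p_receiver + p_power
print(round(p_top, 5), round(p_naive_sum, 5))   # about 0.01336 versus 0.17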
System-level fault trees can be used to create fault isolation procedures. The fault
isolation procedure for an exemplary fault tree is shown in Fig. 15.3.
The example shown is simple because the fault tree only has OR gates. The fault isolation template can be realized similarly to a questionnaire, with statements like "If you answered yes, please proceed to question 5." Especially in a digital version of the fault isolation template, this can easily be realized with filter questions. In a paper version, where one has to find the next question manually, this often leads to confusion.
It is more difficult to transfer a fault tree into a graphically supported fault isolation procedure if the fault tree comprises several different types of gates. One could, for example, add a horizontal level for AND gates and let the further direction in the template depend on the combination of answers in a line. On paper, this is even more confusing than for OR gates and cannot be recommended. Priority AND gates can only be described with text, without introducing special symbols.
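A minimal sketch of such a transformation for an OR-only fault tree; the tree structure and event names are illustrative assumptions, not the tree of Fig. 15.3:

fault_tree = {                                    # event -> list of sub-events (OR gates only)
    "vehicle unfit to drive": ["failure of battery cell", "failure of battery charger"],
    "failure of battery cell": [],                # leaves have no sub-events
    "failure of battery charger": [],
}

def fault_isolation_questions(tree, top_event):
    # Depth-first walk over the OR-only tree; every leaf becomes one check question.
    questions, stack = [], [top_event]
    while stack:
        event = stack.pop()
        children = tree.get(event, [])
        if children:
            stack.extend(reversed(children))       # preserve left-to-right order of the tree
        else:
            questions.append("Is '" + event + "' present? "
                             "If yes, the cause is isolated; if no, proceed to the next question.")
    return questions

for number, question in enumerate(fault_isolation_questions(fault_tree, "vehicle unfit to drive"), 1):
    print(number, question)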
Fig. 15.3 Transformation of a fault tree into a fault isolation procedure (Kanat 2014)
15.7 Questions
(1) What are the main mutual benefits when conducting an FMEA and an FTA
jointly?
(2) Is it sufficient to conduct FMEAs for subsystems and to use FTA only on the system level?
(3) How can you use an FTA to generate an efficient fault isolation procedure for
the top event?
15.8 Answers
References
Automotive World (2015): Bosch: DINA research project successfully completed – Consor-
tium researches first integrated diagnostic system for electromobility. Available online
at https://www.automotiveworld.com/news-releases/bosch-dina-research-project-successfully-
completed-consortium-researches-first-integrated-diagnostic-system-electromobility/, updated
on 12/22/2015, checked on 9/21/2020.
Barcin, Bülent; Freuer, Andreas; Kanat, Bülent; Richter, Andreas (2014): Wettbewerbsfähige Diag-
nose und Instandsetzung. In ATZ Extra 19 (11), pp. 14–19. https://doi.org/10.1365/s35778-014-
1285-6.
David, P., V. Idasiak and F. Kratz (2008). Towards a better interaction between design and depend-
ability analysis: FMEA derived from UML/SysML models. European Safety and Reliability
Conference (ESREL 2009) Safety, Reliability and Risk Analysis: Theory, Methods and Appli-
cations. S. Martorell, C. G. Soares and J. Barnett, CRC Press/Balkema, Taylor and Francis:
2259–2266.
IEC 61508 S+ (2010). Functional Safety of Electrical/Electronic/Programmable Electronic Safety-
related Systems Ed. 2 Geneva, International Electrotechnical Commission.
Jain, Aishvarya Kumar; Satsrisakul, Yupak; Fehling-Kaschek, Mirjam; Häring, Ivo; van Rest, Jeroen
(2020): Towards Simulation of Dynamic Risk-Based Border Crossing Checkpoints. In Piero
Baraldi, Francesco Di Maio, Enrico Zio (Eds.): Proceedings of the 30th European Safety and
Reliability Conference and the 15th Probabilistic Safety Assessment and Management Confer-
ence. ESREL2020 and PSAM15. European Safety and Reliability Association (ESRA), Inter-
national Association for Probabilistic Safety Assessment and Management (PSAM). Singapore:
Research Publishing Services. Available online at https://www.rpsonline.com.sg/proceedings/esr
el2020/pdf/4000.pdf, checked on 9/25/2020.
Kanat, B. (2014). FTA - Personal communication with S. Rathjen.
Kanat, B; Ebenhöch, S (2015): BMBF-Verbundprojekt DINA - Diagnose und Instandsetzung im
Aftersales für Elektrofahrzeuge, Teilvorhaben: Zuverlässigkeitsvorhersagen von “High Voltage”-
Systemen im Elektrofahrzeug für die Umsetzung von Diagnoseservices. Diagnose und Instandset-
zung im Aftersales für Elektrofahrzeuge, Teilvorhaben: Zuverlässigkeitsvorhersagen von “High
Voltage”-Systemen im Elektrofahrzeug für die Umsetzung von Diagnoseservices. With assis-
tance of TIB-Technische Informationsbibliothek Universitätsbibliothek Hannover, Technische
Informationsbibliothek (TIB). Edited by Fraunhofer EMI (Bericht E 38/2015). Available online
at https://www.tib.eu/de/suchen/id/TIBKAT:881610984/, checked on 9/21/2020.
Larisch, M., A. Hänle, I. Häring and U. Siebold (2008a). Unterstützung des Nachweises funktionaler
Sicherheit nach IEC 61508 durch SysML. Dipl. Inform. (FH), HTWG-Konstanz.
Larisch, M., A. Hänle, U. Siebold and I. Häring (2008b). SysML aided functional safety assessment. Safety, Reliability and Risk Analysis: Theory, Methods and Applications, European Safety and Reliability Conference (ESREL) 2008. S. Martorell, C. G. Soares and J. Barett. Valencia, Spain, Taylor and Francis Group, London. 2: 1547–1554.
Larisch, Matthias; Siebold, Uli; Häring, Ivo (2009): Assessment of functional safety of fuzing
systems. In: International system safety conference // 27th International System Safety Confer-
ence and Joint Weapons System Safety Conference 2009 (ISSC/JWSSC 2009). Huntsville,
Alabama, USA, 3–7 August 2009. Huntsville, Alabama, USA: Curran.
Li, G. and B. Wang (2011). SysML Aided Safety Analysis for Safety-Critical Systems. Artificial
Intelligence and Computational Intelligence.
Liu, Chi-Tang; Hwang, Sheue-Ling; Lin, I-K. (2013): Safety Analysis of Combined FMEA and FTA
with Computer Software Assistance – Take Photovoltaic Plant for Example. In IFAC Proceedings
Volumes 46 (9), pp. 2151–2155. https://doi.org/10.3182/20130619-3-ru-3018.00370.
Mhenni, F., N. Nguyen and J.-Y. Choley (2014). Automatic fault tree generation from SysML
system models. Advanced Intelligent Mechatronics (AIM). Besacon: 715–720.
Peeters, J.F.W.; Basten, R.J.I.; Tinga, T. (2018): Improving failure analysis efficiency by combining
FTA and FMEA in a recursive manner. In Reliability Engineering & System Safety 172, pp. 36–44.
https://doi.org/10.1016/j.ress.2017.11.024.
Renger, P; Siebold, U; Kaufmann, R; Häring, I (2015): Semi-formal static and dynamic modeling
and categorization of airport checkpoints. In Tomasz Nowakowski (Ed.): Safety and reliability.
Methodology and applications; [ESREL 2014 Conference, held in Wrocław, Poland. ESREL.
London: CRC Press.
Schoppe, C A; Häring, I; Siebold, U (2014): Semi-formal modeling of risk management process
and application to chance management and monitoring. In R. D. J. M. Steenbergen (Ed.): Safety,
reliability and risk analysis: beyond the horizon. Proceedings of the European Safety and Reli-
ability Conference, Esrel 2013, Amsterdam, The Netherlands, 29 September–2 October 2013.
Proceedings of the European Safety and Reliability Conference. Boca Raton, Fla.: CRC Press,
pp. 1411–1418.
Schoppe, C; Zehetner, J; Finger, J; Siebold, U; Häring, I (2015): Risk assessment methods for
improving urban security. In Tomasz Nowakowski (Ed.): Safety and reliability. Methodology
and applications; [ESREL 2014 Conference, held in Wrocław, Poland. ESREL. London: CRC
Press, pp. 701–708.
Shafiee, Mahmood; Enjema, Evenye; Kolios, Athanasios (2019): An Integrated FTA-FMEA Model
for Risk Analysis of Engineering Systems: A Case Study of Subsea Blowout Preventers. In
Applied Sciences 9 (6), p. 1192. https://doi.org/10.3390/app9061192.
Siebold, U. and I. Häring (2012). Semi-formal safety requirement specification using SysML state
machine diagrams. ESREL. Helsinki, Finland.
Siebold, U.; Larisch, M.; Häring, I. (2010): Using SysML Diagrams for Safety Analysis with IEC
61508. In: Sensoren und Messsysteme. Nürnberg: VDE Verlag GmbH, pp. 737–741.
Siebold, Uli; Larisch, Matthias; Häring, Ivo (2009): SysML modeling of safety critical multi-
technological system. In R. Bris, C. Guedes Soares, S. Martorell (Eds.): European Safety and Reli-
ability Conference (ESREL) 2009. Prague, Czech Republic: Taylor and Francis Group, London,
pp. 1701–1706.
Chapter 16
Error Detecting and Correcting Codes
16.1 Overview
The simplest method of error detection is the parity bit. A parity bit records whether the number of "1"s in a binary string is even or odd; correspondingly, "even" and "odd" parity schemes are distinguished.
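A minimal sketch of an even parity bit (the function name and the example string are illustrative):

def even_parity_bit(bits: str) -> int:
    # 0 if the number of "1"s is already even, 1 otherwise, so that the total becomes even
    return bits.count("1") % 2

data = "1101"                                   # three "1"s, so the parity bit is 1
word = data + str(even_parity_bit(data))        # transmitted word "11011"
# Receiver check: the number of "1"s in data plus parity bit must be even
print(word, "ok" if word.count("1") % 2 == 0 else "bit error detected")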
There are several variations of parity bits, for example, considering parity bits of
substrings.
p1 = d1 + d2 + d4 , (16.1)
p2 = d1 + d3 + d4 , (16.2)
p3 = d2 + d3 + d4 . (16.3)
Example If there is a bit flip in d3, the parity bits p2 and p3 no longer match the sums in (16.2) and (16.3). Hence, the bit flip can be noticed in those two parity bits.
Table 16.2 shows that, given that there are only one-bit errors, an error in one bit
of the overall string can be uniquely identified in the parity bits. There is a unique
pattern at the right-hand side of Table 16.2 that can be uniquely mapped to the left-
hand side of Table 16.2. That is, if there are only one-bit errors, they can not only be identified but also corrected.
A well-defined unique operational approach derived from Table 16.2 reads as follows. After the data string transmission, the program computes the parity bits according to the Hamming(7,4) scheme using (16.1) to (16.3). According to Table 16.2, and assuming a one-bit flip of any of the string bits (a small code sketch follows the list below):
– If p1 and p2 do not fit to the data (i.e., the d j ), d1 is erroneous and has to be
corrected (i.e., flipped), etc.
– If p1 , p2 , and p3 do not fit to the data, d4 is erroneous and has to be corrected.
– If p1 does not fit to the data, do not correct any of the data bits, etc.
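A minimal sketch of these rules (the word layout [d1, d2, d3, d4, p1, p2, p3] and the function names are assumptions; this is not the implementation used in the referenced work):

def recomputed_parities(d1, d2, d3, d4):
    return ((d1 + d2 + d4) % 2, (d1 + d3 + d4) % 2, (d2 + d3 + d4) % 2)   # (16.1) to (16.3)

def correct_single_bit_error(word):
    d1, d2, d3, d4, p1, p2, p3 = word
    q1, q2, q3 = recomputed_parities(d1, d2, d3, d4)
    mismatch = (int(q1 != p1), int(q2 != p2), int(q3 != p3))   # which parity bits "do not fit"
    data_index = {(1, 1, 0): 0, (1, 0, 1): 1, (0, 1, 1): 2, (1, 1, 1): 3}.get(mismatch)
    corrected = list(word)
    if data_index is not None:          # two or three mismatches point at a data bit
        corrected[data_index] ^= 1
    # a single mismatch means the parity bit itself flipped; the data bits are left unchanged
    return corrected

received = [1, 0, 1, 1, 0, 0, 1]        # data 1001 was sent, d3 flipped during transmission
print(correct_single_bit_error(received))   # [1, 0, 0, 1, 0, 0, 1]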
If there are up to two-bit errors, it can still be identified that there is an error but,
in general, not which type and where it is. Hence, a correction is not possible.
Example If d1 and d2 are wrong, p1 is correct (since the variables only take the
values 0 and 1) and p2 and p3 are wrong. But p2 and p3 being wrong could also
mean that only d3 is incorrect. Hence, assuming one- and two-bit errors are occurring,
identifying that p2 and p3 are incorrect does not tell which data bits are incorrect.
Even if there are only two-bit errors, it can still be identified in all cases that a two-bit error occurred. However, only in most of the cases can it be identified that a two-bit error occurred in the data bits. In none of the cases can it be detected where the errors occurred and hence how to correct them.
Example When inspecting the first and last line of Table 16.3 it becomes obvious
that it cannot be decided whether the data bits (d1 and d2 ) or the parity bits ( p2 and
p3 ) have been both corrupted.
Table 16.3 shows all combinations of two-bit errors and in which parity bits they can be seen.
Table 16.3 Two-bit errors and their consequences, for example, Hamming(7,4)

Erroneous bits: error shows in
d1, d2: p2, p3
d1, d3: p1, p3
d1, d4: p3
d1, p1: p2
d1, p2: p1
d1, p3: p1, p2, p3
d2, d3: p1, p2
d2, d4: p2
d2, p1: p3
d2, p2: p1, p2, p3
d2, p3: p1
d3, d4: p1
d3, p1: p1, p2, p3
d3, p2: p3
d3, p3: p2
d4, p1: p2, p3
d4, p2: p1, p3
d4, p3: p1, p2
p1, p2: p1, p2
p1, p3: p1, p3
p2, p3: p2, p3
Remark (Hamming 1950): An extended code that corrects one-bit errors and detects
two-bit errors can be constructed by adding another parity number that adds all ci .
The additional number can tell whether there is an even or odd number of bit flips.
Exemplary codes can be found online. The code used for the Hamming error-
correction method in Sect. 16.4 is from (Morelos-Zaragoza 2006, 2009).
100100110000 / 11001 = 11101001, remainder 0001

Step by step (mod-2 subtraction, i.e., XOR, bringing down one bit of the dividend at a time):
10010 XOR 11001 = 01011; bring down 0: 10110
10110 XOR 11001 = 01111; bring down 1: 11111
11111 XOR 11001 = 00110; bring down 1: 01101
01101 (leading 0, subtract 00000); bring down 0: 11010
11010 XOR 11001 = 00011; bring down 0: 00110
00110 (leading 0, subtract 00000); bring down 0: 01100
01100 (leading 0, subtract 00000); bring down 0: 11000
11000 XOR 11001 = 00001
The important part is the remainder of the division, 0001. It is appended to the data 10010011 in place of the four zeros. So, the binary code we transmit is 100100110001.
By construction, this number is divisible by 11001 without remainder. So, the
received message can be divided by 11001 and if there is no remainder, the message
is very likely to be correct.
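A minimal sketch of this construction (generator and data taken from the example above; the function name is an assumption):

def mod2_remainder(bits: str, generator: str) -> str:
    # Binary polynomial (mod-2) long division; returns the remainder as a bit string.
    work = list(map(int, bits))
    gen = list(map(int, generator))
    for i in range(len(work) - len(gen) + 1):
        if work[i]:                         # quotient bit 1: subtract (XOR) the generator
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(map(str, work[-(len(gen) - 1):]))

data, gen = "10010011", "11001"
remainder = mod2_remainder(data + "0000", gen)     # "0001", as in the division above
codeword = data + remainder                        # transmitted word "100100110001"
# Receiver check: the remainder of the received word must be zero if no error is present
print(remainder, mod2_remainder(codeword, gen))    # 0001 0000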
10000 / 101 = 101, remainder 01
10001 / 101 = 101, remainder 00
Case 2: There were transmission mistakes (11011) but the remainder is still 0.
11011 / 101 = 111, remainder 00
10101 / 101 = 100, remainder 01
10011 / 101 = 101, remainder 10
Kaufman et al. (2012) analyze the following procedure of transmitting a time value
from a programming tool to a microprocessor.
– Programming tool with data transmission interface:
Step 1: Enter time value tprog [value between 0.0 and 999.9 s in increments of 0.1 s].
Step 2: Store as floating point.
Step 3: Encoding 1 of 2: Convert to packed BCD (Binary Coded Decimal).
Step 4: Further encoding 2 of 2 (as needed): Hamming, 2oo3, XOR;
– Transmission of encoded data to EEPROM (Electrically Erasable Programmable
Read-Only Memory);
– Transmission of encoded data to microprocessor.
– In microprocessor:
Step 5: Decoding (in the following cases): Hamming, XOR.
Step 6: Convert packed BCD to float.
Step 7: Received time value tprog.
(Fig. 16.1, recoverable content: one byte comprises two nibbles of 4 bits each; for the example floating point value of 215.3 s, the first byte holds the nibbles for hundreds (2) and tens (1), the second byte those for ones (5) and tenths (3).)
Fig. 16.1 Packed BCD representation of four-digit decimal value for an example value of 215.3 s
(Kaufman et al. 2012)
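A minimal sketch of the packed BCD step (function names are assumptions; the value range 0.0 to 999.9 s in 0.1 s steps is taken from the procedure above):

def to_packed_bcd(t: float) -> bytes:
    tenths_total = round(t * 10)                  # 215.3 s -> 2153 tenths of a second
    hundreds, rest = divmod(tenths_total, 1000)
    tens, rest = divmod(rest, 100)
    ones, tenths = divmod(rest, 10)
    return bytes([(hundreds << 4) | tens, (ones << 4) | tenths])   # two bytes, four nibbles

def from_packed_bcd(b: bytes) -> float:
    hundreds, tens = b[0] >> 4, b[0] & 0x0F
    ones, tenths = b[1] >> 4, b[1] & 0x0F
    return hundreds * 100 + tens * 10 + ones + tenths / 10

encoded = to_packed_bcd(215.3)
print(encoded.hex(), from_packed_bcd(encoded))    # "2153" and 215.3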
16.5.2 Assumptions
To analyze bit error-detecting codes, Kaufman et al. (2012) “make the following
assumptions:
• Each bit has the same probability of flipping to the incorrect state.
• Only bit errors are considered, i.e., no bit shifts.
• Bit errors occur statistically independent of each other, common-cause errors are
not considered, e.g., any single radiation event only affects one bit. There is no
spatial local clustering of damage events.
• Each bit can only have a single bit error, i.e. a double bit error must involve two
distinct bits and cannot involve a single bit which first changes bit state and then
changes bit state again.
• All initial time values within the available number range will be considered
[without assuming any possibly existing (set of) problem specific intervals] […].
We test 10,000 values ranging from 0.0 to 999.9 in 0.1 increments.
• We do not discern between soft (transient or intermittent) and hard (permanent)
bit errors.”
For the analysis, they wrote a simulation program with different error-detecting
codes:
“A simulation program was written and executed in C++ using Visual Studio
Premium 2010 on a PC outfitted with an AMD Athlon 64 × 2 processor. We imple-
mented the Hamming(7,4) code, as published in (Morelos-Zaragoza 2006). The other
Hamming methods tested were implemented by making minor adjustments to the
original Hamming(7,4) code. We wrote the remaining methods BCD, XOR and BCD
two out of three (2oo3) ourselves. [Fig. 16.2] shows the sequence of decisions of the
program” (Kaufman et al. 2012).
(Fig. 16.2, recoverable content: on Level 1, the time value tprog is entered as a floating point value; on Level 6, the error correction approach is assessed with four possible results: program termination / dud, not reliable (Result 1); correct time value, tprog' = tprog, reliable and safe (Result 2); time too short, tprog' < tprog, not safe (Result 3); time too long, tprog' > tprog, not reliable (Result 4).)
Fig. 16.2 Decision chart for error simulation program. Source Modified from (Kaufman et al.
2012)
In Table 16.4, we want to give examples for the four cases on Level 5 for the
Hamming(7,4) code with an additional parity bit
Table 16.4 Examples of how a program for Hamming(7,4) with additional parity bit reacts

Detected but uncorrected: errors in d1, d2; shows in p2, p3. Because p4 is correct, it is known that it is a two-bit error, but it could also be in d4, p1.
Detected and correctly corrected: errors in d1; shows in p2, p3, p4. Because p4 is incorrect, it is an odd number of bit errors. If the program assumes a one-bit error, it is correctly corrected.
Detected, corrected, but correction incorrect: errors in d2, d3, d4; shows in p2, p3, p4. The program thinks it is in the case above and corrects d1.
Undetected and uncorrected: errors in d1, d2, d3, p4; shows in none of the parity bits. The program does not recognize the error.
(Fig. 16.3, recoverable content: START; initialize tprog to 0.0; encode tprog using the given scheme; inject bit errors into tprog, from 1-bit to 5-bit errors, performed for all permutations; decode and evaluate tprog (results); increment tprog by 0.1 (next tprog); if tprog > 999.9, END, otherwise repeat.)
Fig. 16.3 Flow diagram for simulation program: variable t prog (Kaufman et al. 2012)
p4 = d1 + d2 + d3 + d4 + p1 + p2 + p3 (16.4)
for a program that assumes as few errors as possible, that is, it believes that one-bit errors are the most likely to occur.
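A minimal sketch of this decision logic (the function name and return strings are illustrative; the behavior corresponds to the four cases of Table 16.4):

def decode_decision(syndrome_nonzero: bool, p4_mismatch: bool) -> str:
    if not syndrome_nonzero and not p4_mismatch:
        return "accept the word, no error visible"
    if p4_mismatch:
        # odd number of bit flips: assume a single error; the syndrome (or p4 itself,
        # if the syndrome is zero) indicates the bit to be flipped back
        return "assume a one-bit error and correct it"
    # syndrome non-zero while the overall parity fits: an even (e.g., two-bit) error
    return "error detected, but no correction attempted"

print(decode_decision(syndrome_nonzero=True, p4_mismatch=False))   # e.g., errors in d1 and d2
print(decode_decision(syndrome_nonzero=True, p4_mismatch=True))    # e.g., a single data bit error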
Figure 16.3 shows how the simulation program works. It starts with tprog = 0. Then this fixed time value is encoded, bit errors are simulated, the value is decoded, and it is evaluated whether the error-detecting code found the simulated bit errors (compare Fig. 16.2). Then the same is done with the next time value tprog = 0.1, etc., up to tprog = 999.9.
Kaufman et al. (2012) also considered how much memory was needed for the
encoded data. For this, they distinguish between information bits and redundancy
bits.
Example We go back to the example from Sect. 16.2. We had the binary string
1101 and one parity bit. In this case, the total number of bits n is 5, the number of
information bits m is 4, and the number of redundancy bits k is 1.
In Fig. 16.4, the ratio of information bits to the total number of bits m/n is used to compare the different encoding schemes used. Here m = 16, compare
Fig. 16.1. Kaufman et al. (2012) “used the notation Hamming(n,m) for the single
error correcting (SEC) Hamming code. Here the number of bits in the Hamming
word corresponds to the total number of bits n, and m is the number of information
bits. For the single error correcting double error detecting (SEC-DED) code, we use
Fig. 16.4 Error-detection schemes to be tested and their corresponding amounts of allocated
memory used (Kaufman et al. 2012)
Example For BCD with Hamming(21,16), there are 21 bits where the bit errors can
occur. That makes for a fixed time value,
(21 choose 1) = 21 combinations with a one-bit error,
(21 choose 2) = 210 combinations with two-bit errors,
(21 choose 3) = 1330 combinations with three-bit errors,
(21 choose 4) = 5985 combinations with four-bit errors, and
(21 choose 5) = 20349 combinations with five-bit errors.
There are 10,000 time values (000.0 to 999.9 in 0.1 steps). So, for example, the
total number of combinations with three-bit errors is 1,330,000.
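The combination counts can be reproduced with a few lines of Python (a small check, not part of the referenced study):

import math

for k in range(1, 6):
    per_value = math.comb(21, k)                  # combinations of k erroneous bits out of 21
    print(k, "bit errors:", per_value, "per time value,",
          per_value * 10_000, "for all 10,000 time values")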
16.5.4 Results
Figures 16.5, 16.6, 16.7, 16.8 and 16.9 show which encoding schemes are most
efficient for the sample system and one- to five-bit errors.
“The simulation program […] only considered cases with bit errors. In reality,
bit errors will only occur with a certain probability. As such, the probability of an
error actually occurring needs to be overlaid with the above simulation. Here we
explore the conditions under which we believe it is acceptable to use bit correcting
techniques. Keeping in mind our assumption from [Sect. 16.5.2] that bit errors occur
statistically independently, we let f be the frequency of a single bit error, e.g., in
units of h−1. In time interval T we claim that Tf is approximately the probability of a single bit error assuming Tf << 1. Let r_i^cc be the ratio of the correctly corrected (green shading in [Figs. 16.5 to 16.9]) and r_i^ec of erroneous corrected (see [red] and blue shading) of the i injected bit errors in the n-bit data, and n_bit the maximum number of corrected bits. Then

P^{ec} = \sum_{i=1}^{n_{bit}} \binom{n}{i} r_i^{ec} (Tf)^i (1 - Tf)^{n-i}   (16.5)
Fig. 16.5 Results of the bit error simulation program for one-bit errors for all coding methods
tested (Kaufman et al. 2012)
Fig. 16.6 Results of the bit error simulation program for two-bit errors for all coding methods
tested (Kaufman et al. 2012)
Fig. 16.7 Results of the bit error simulation program for three-bit errors for all coding methods
tested (Kaufman et al. 2012)
Fig. 16.8 Results of the bit error simulation program for four-bit errors for all coding methods
tested (Kaufman et al. 2012)
Fig. 16.9 Results of the bit error simulation program for five-bit errors for all coding methods
tested (Kaufman et al. 2012)
is the probability that bit errors were erroneously corrected and the same expression for P^cc that they were correctly corrected. For a given [safety-critical] system, one has to show that the increase of the probability of reliability by P^cc is not offset by an intolerable increase of the probability […] P^ec due to erroneous correction" (Kaufman et al. 2012).
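A small numerical sketch of (16.5) with purely hypothetical numbers (the bit error frequency f, time interval T, and ratios r_i^ec are assumptions chosen only to illustrate the overlay of error probability and simulation results):

import math

f, T, n, n_bit = 1e-6, 10.0, 21, 5                     # f in 1/h, T in h, n-bit data word
r_ec = {1: 0.0, 2: 0.3, 3: 0.5, 4: 0.6, 5: 0.7}        # assumed ratios of erroneously corrected cases

p = T * f                                              # approximate single bit error probability (Tf << 1)
P_ec = sum(math.comb(n, i) * r_ec[i] * p ** i * (1 - p) ** (n - i) for i in range(1, n_bit + 1))
print(P_ec)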
The results of Figs. 16.5 to 16.9 do not distinguish between undetected errors and errors that were detected but incorrectly corrected, compare Fig. 16.2. However, the percentage of detected errors is at least as large as the green and violet columns of Figs. 16.5 to 16.9 together. For Hamming(7,4), this percentage is only around 50–60% for three or more bit errors.
Error-detecting and error-correcting codes are also described in the IEC 61508:
“Aim: To detect and correct errors in sensitive information.
Description: For an information of n bits, a coded block of k bits is generated which
enables r errors to be detected and corrected. Two example types are Hamming codes
and polynomial codes.
16.7 Questions
(1) Which error-detection and error-correction codes do you know? Explain their
main ideas!
(2) What has to be considered in case of use of error detection and correction in
safety–critical applications, also according to IEC 61508?
(3) Give an example that a parity checksum does not detect a multiple bit flip in a
nibble!
(4) Give an example that 2oo3 (2 out of 3) does not detect two bit flips in a nibble!
(5) How can one systematically assess the detection and correction capability of
Hamming codes?
16.8 Answers
(1) See Sects. 16.2, 16.3, and 16.4 and respective explanations. See also the MooN
redundancy introduced in Sect. 16.5.1.
(2) Since automatic self-correcting codes are not recommended by IEC 61508, it
has to be shown that correction is better than non-correction considering all
possible cases, i.e., also multiple bit flips of higher order as well as the given
operational context. See also the argumentation for the sample case in Sect. 16.5.
(3) Nibble plus parity bit example: 11000. Example after a double bit flip: 00000.
(4) Nibble example: 1100. Redundant data after double bit flip: 1000, 1000, 1100.
Hence, 2oo3 will select 1000.
(5) Options include a complete assessment of all combinations for expected param-
eter values, combinatorial analysis, or use a weighted Monte Carlo approach
with focus on low-order number of bit flips and bit shifts.
References
Hamming, R. W. (1950). “Error Detecting and Error Correcting Codes.” The Bell System Technical
Journal 29(2): 147–160.
Huffman, W. C. and V. Pless (2003). Fundamentals of Error-Correcting Codes, Cambridge
University Press.
Index
B
Base Object Model, 37 D
Basic event, 73 Dangerous detected failures, 122
Bathtub curve, 168 Dangerous undetected failures, 122
Behavior, 241 Data integrity (system safety property), 218
Behavioral analysis, 245 Deductive, 38, 46
Behavior diagram, 266, 268 Demand mode, 165
Big-bang development, 181 Dependability, 162
Birnbaum importance measure, 86 Dependency, 236
Black swan, 50 Deployment Diagram, 244
Block Definition Diagram (BDD), 265–269 Detection measures, 107
Boolean algebra, 77 Detection rating, 112
Bottom-up method, 82 Diagnostic Coverage (DC), 122
Boundaries of a system, 44 Disruption, 5, 6, 9, 13–17, 21, 23, 24, 71, 150
Documentation, 39
Double failure matrix, 65
C Dual fault tree, 83
Cause-oriented safety requirement, 214 Dynamic Fault Tree Analysis (DFTA), 47,
Class, 233 88
N R
Namespace containment relationship, 248 Realization ( (in the context of UML)), 236
Non-functional safety, 212 Redundancy, 52
Non-independent failures, 52 Refine relationship, 248
Non-technical safety requirements, 213 Reflexive association, 234
Numerical simulation, 37 Relationship, 247
Reliability, 32, 51, 162
Reliability prediction, 161
O Reliability (system safety property), 217
Object diagram, 244 Requirement diagram, 246
Object process methodology, 36 Requirements engineering, 265
Occurrence rating, 112 Requirements specification, 104
Operating and Support Hazard Analysis, 137 Requirements tracing, 227, 265
Optimization, 108 Resilience, 2, 3, 5, 6, 9–24, 27, 43, 49, 57,
Order (of a minimal cut set), 74 71, 101, 127, 146, 149, 150, 161, 179,
“or” gate, 75 193, 209, 227, 265, 277, 287
Resilience engineering, 3, 5, 10, 11, 14, 23,
24, 49, 277
P Resilience generation, 3
Package, 234 Resilience improvement, 277
Package diagram, 244 Restriction (of UML or SysML), 246
Part (in the context of UML), 238 Risk, 29
Parts Count, 57 Risk Achievement Worth, 86
Parts Count method, 171 Risk control, 3, 6, 9
Parts Stress method, 170 Risk factor, 117
Passive component, 52 Risk map, 119
Passive safety requirement, 212 Risk matrix, 139, 140, 148
Port (in the context of UML and SysML), Risk minimization, 31
238 Risk Priority Number (RPN), 113
Preliminary Hazard Analysis (PHA), 14–18, Risk Reduction Worth, 86
20–22, 48, 49, 64, 128, 129, 131, Rule of 10, 103
133–137, 139, 151, 279
Preliminary hazard list, 130
Preventive measures, 107 S
Primary failure, 50 Safe Failure Fraction (SFF), 122
Private (in the context of UML), 234 Safe failures, 122
Probabilistic boundaries, 44 Safety, 31, 51
Probability of default, 114 Safety-critical, 211
Safety function, 33, 197 247, 250, 253, 257, 259–261, 265–
Safety function aggregation, 274 269, 271, 272, 274, 275, 277–279
Safety function allocation, 199 System-specific safety requirement, 217
Safety integrity level, 33, 165 System state analysis, 164
Safety Life Cycle (SLC), 3, 4, 6, 27, 28, 35,
40, 128, 150, 151, 180, 193, 194, 196,
198, 199, 201, 205, 274, 275 T
Safety Of Intended Functionality (SOIF), Technical safety requirement, 213
209 Telcordia (reliability prediction standard),
Safety-relevant system, 32 172
Safety requirement, 211 Temporal boundaries, 44
Satisfy relationship, 247 Time boundary, 55
Secondary failure, 50 Time-critical safety requirement“, 217
Sequence diagram, 242, 274 Time-Dependent Fault Tree Aanalysis
Severity rating, 112 (TDFTA), 14–16, 18, 20–21, 71
Siemens (reliability prediction standard), Timeline, 9, 17, 18, 21, 23, 111
177 Timing diagram, 243
Signal, 241 Tolerable risk, 196
Simulation, 11, 20, 27, 28, 35, 37–39, 44, 54, Top-down method, 81
150, 169, 252, 257, 294, 295–300 Top event, 73
Spiral model, 184 Total failure rate, 122
SS-SC concept, 76 Total reliability, 115
Standard, 169 Trace relationship, 249
Standardized safety requirement, 215 Transient error, 294
State, 241 Transition (of states), 240
State diagram, 240 Trigger, 240
State machine, 240 Type A component, 268
State machine diagram, 272 Type B component, 268
State (of system), 76
Static safety requirement, 215
Statistical error, 27 U
Stereotype (in the context of UML), 246 Unidirectional association, 234
Structural analysis, 106 Unified Modeling Language (UML), 3, 6,
Structured class, 238 36, 227, 229, 244, 245, 266
Structure diagram, 228, 232, 237, 250, 259, Unacceptable risk, 30, 31, 211
260, 266, 267 Unknown unknown, 50
Subsystem Hazard Analysis (SSHA), 16– Use case diagram, 245, 272
18, 127–129, 131, 134, 135, 139,
151–154, 279
Success space, 48 V
Survival probability, 167 Very High-Speed Integrated Circuit Hard-
System, 28, 43 ware Description Language (VHDL),
System analysis methods, 38 37
Systematic error, 27 Visibility, 233
System failure (state of system failure), 17, V-Model, 187
19, 47, 57–61, 69, 92, 115, 200, 280, V-Model XT, 187, 188
282
System FMEA, 104
System hHazard analysis, 135 W
System integrity (system safety property), Waterfall model, 182
218 Weibull distribution, 168, 169, 176
System recovery (system safety property), Weighting factors, 117
218 W-Model (combination of V-model for HW
Systems modeling language (SysML), 3, 14, and V-model for SW), 204
16–22, 24, 35, 37, 199, 227, 228, 246,