Fault Tree Analysis
Topics Covered
! Fault Tree Definition
! Developing the Fault Tree
! Structural Significance of the Analysis
! Quantitative Significance of the Analysis
! Diagnostic Aids and Shortcuts
! Finding and Interpreting Cut Sets and Path Sets
! Success-Domain Counterpart Analysis
! Assembling the Fault Tree Analysis Report
! Fault Tree Analysis vs. Alternatives
! Fault Tree Shortcomings/Pitfalls/Abuses
First – A Bit of Background
Origins
The Fault Tree is
Fault Tree Analysis Produces
Some Definitions
– FAULT
• An abnormal undesirable state of a system or a system element*
induced 1) by presence of an improper command or absence of a
proper one**, or 2) by a failure (see below). All failures cause
faults; not all faults are caused by failures. A system which has
been shut down by safety features has not faulted.
– FAILURE
• Loss, by a system or system element*, of functional integrity to
perform as intended, e.g., relay contacts corrode and will not pass
rated current closed, or the relay coil has burned out and will not
close the contacts when commanded – the relay has failed; a
pressure vessel bursts – the vessel fails. A protective device which
functions as intended has not failed, e.g., a blown fuse.
* System element: a subsystem, assembly, component, piece part, etc.
** This (#1) also describes a “command” failure, which is one of the features of the “state of component” approach to fault tree analysis.
Definitions
[Figure: tree-development steps – identify second-level contributors; repeat/continue to the limit of resolution.]
A Basic Event (“Leaf,” “Initiator,” or “Basic”) indicates the limit of analytical resolution.
Some Rules and Conventions
[Figure: paired NO/YES examples illustrating the construction rules.]
Some Conventions Illustrated
! MAYBE – Flat Tire: a gust of wind will come along and correct the skid.
Example TOP Events
! Wheels-up landing
! Mid-air collision
! Subway derailment
! Turbine engine FOD
! Rocket failure to ignite
! Irretrievable loss of primary test data
! Dengue fever pandemic
! Sting failure
! Inadvertent nuke launch
! Reactor loss of cooling
! Uncommanded ignition
! Inability to dewater buoyancy tanks
TOP events represent potential high-penalty losses (i.e., high risk). Either severity of the outcome or frequency of occurrence can produce high risk.
An Example Fault Tree
[Figure: TOP (Undesirable Event) “Late for Work,” developed through “No ‘Start’ Pulse,” with contributors “Natural Apathy,” “Biorhythm Fails,” and “Artificial Wakeup Fails.”]
Verifying Logic
Does this “look” correct? Should the gate be OR?
[Figure: TOP “Oversleep,” developed through “No ‘Start’ Pulse,” with contributors “Natural Apathy,” “Biorhythm Fails,” and “Artificial Wakeup Fails.”]
[Figure: checking the gate in the success domain – the failure-domain events “No ‘Start’ Pulse” (the “trigger”) and “Natural Apathy” (the “motivation”) become, in the success domain, “‘Start’ Pulse Works” and “Natural Torque High.”]
If it was wrong here… …it’ll be wrong here, too!
[Figure: “Artificial Wakeup Fails,” developed through “Alarm Clocks Fail” and “Nocturnal Deafness.” “Alarm Clocks Fail” develops through “Main Plug-in Clock Fails” (Electrical Fault; Mechanical Fault) and “Backup (Windup) Clock Fails” (Hour Hand Falls Off; Hour Hand Jams Works).]
What does the tree tell us about system vulnerability at this point?
! Relating PF to R
! The Bathtub Curve
! Exponential Failure Distribution
! Propagation through Gates
! PF Sources
Reliability and Failure Probability Relationships
! S = Successes
! F = Failures
! Reliability… R = S / (S + F)
! Failure Probability… PF = F / (S + F)
R + PF = S/(S + F) + F/(S + F) = (S + F)/(S + F) ≡ 1
! λ = Fault Rate = 1 / MTBF
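The identities above can be sketched in a few lines. The success/failure counts and MTBF below are made-up, illustrative values:

```python
# Sketch of the R/PF identities, using illustrative (made-up) counts.
S, F = 997, 3                    # observed successes and failures
R = S / (S + F)                  # Reliability
PF = F / (S + F)                 # Failure Probability
assert abs(R + PF - 1.0) < 1e-12 # R + PF = 1, always

MTBF = 5000.0                    # hours (assumed value)
fault_rate = 1.0 / MTBF          # lambda = 1/MTBF
print(R, PF, fault_rate)
```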
Significance of PF
Fault probability is modeled acceptably by the exponential failure distribution in the Random region, where the fault rate is constant: λ = 1/MTBF.
[Figure: the “bathtub curve” of fault rate vs. time – Infant Mortality (Burn-In), Random, and Burnout regions – and the exponentially modeled failure probability PF(t) = 1 − e^(−λt), rising from 0 toward 1.]
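A minimal sketch of the exponentially modeled failure probability, PF(t) = 1 − e^(−λt); the MTBF value is illustrative, not from the slides:

```python
import math

MTBF = 5000.0                  # hours, assumed for illustration
lam = 1.0 / MTBF               # constant fault rate in the Random region

def pf(t):
    """Probability of failure by time t under the exponential model."""
    return 1.0 - math.exp(-lam * t)

# At t = MTBF, PF = 1 - 1/e, about 0.632 -- not 0.5, a common surprise.
print(pf(MTBF))
```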
R and PF Through Gates (For 2 Inputs)
OR Gate – either of two independent element failures produces system failure:
RT = RA RB
PF = 1 − RT = 1 − (RA RB)
AND Gate – both of two independent elements must fail to produce system failure:
RT = RA + RB − RA RB
PF = 1 − RT = 1 − (RA + RB − RA RB)
In each case, R + PF ≡ 1.
[Figure: AND gate – PT = P1 P2; OR gate – PT = P1 + P2 − P1 P2, where the cross-product term is usually negligible. Events 1 and 2 are INDEPENDENT events.]
“Ipping” Gives Exact OR Gate Solutions
For an OR gate with inputs e = 1 … n:
Success domain: P̄T = Π P̄e = Π (1 − Pe)
Failure domain: PT = 1 − Π (1 − Pe) = 1 − [(1 − P1)(1 − P2)(1 − P3) … (1 − Pn)]
The ip operator (Π̄) is the co-function of pi (Π). It provides an exact solution for propagating probabilities through the OR gate. Its use is rarely justifiable.
Exclusive OR Gate
Opens when any one (but only one) input event occurs:
PT = P1 + P2 − 2 (P1 P2)
Undeveloped Event – an event not further developed.
Conditioning Event – applies conditions or restrictions to other symbols.
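The gate-propagation formulas above can be collected into a short sketch. Inputs are assumed independent; the probabilities are illustrative:

```python
from math import prod

def and_gate(ps):
    """All inputs must occur: product of probabilities."""
    return prod(ps)

def or_exact(ps):
    """Exact OR ("ipping"): one minus the product of complements."""
    return 1.0 - prod(1.0 - p for p in ps)

def or_rare(ps):
    """Rare-event approximation to the OR gate: simple sum."""
    return sum(ps)

def xor_gate(p1, p2):
    """Exclusive OR: one, but only one, of two events occurs."""
    return p1 + p2 - 2.0 * p1 * p2

ps = [1e-3, 2e-3, 3e-3]
print(and_gate(ps))   # tiny: all three must occur
print(or_exact(ps))   # exact OR
print(or_rare(ps))    # nearly identical for small Pe
```

For small input probabilities the exact and rare-event OR results differ only in the cross-product terms, which is why the exact "ip" form is rarely worth the effort.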
! Manufacturer’s Data
! Industry Consensus Standards
! MIL Standards
! Historical Evidence – Same or Similar Systems
! Simulation/testing
! Delphi Estimates
! ERDA Log Average Method
Log Average Method*
If probability is not estimated easily, but upper and lower credible bounds can be judged…
• Estimate upper and lower credible bounds of probability for the phenomenon in question.
• Average the logarithms of the upper and lower bounds.
• The antilogarithm of the average of the logarithms of the upper and lower bounds is less than the upper bound and greater than the lower bound by the same factor. Thus, it is geometrically midway between the limits of estimation.
Example – Lower Probability Bound PL = 10–2; Upper Probability Bound PU = 10–1:
Log Average = Antilog [(Log PL + Log PU) / 2] = Antilog [((–2) + (–1)) / 2] = 10–1.5 = 0.0316228
Note that, for the example shown, the arithmetic average would be…
(0.01 + 0.1) / 2 = 0.055
i.e., 5.5 times the lower bound and 0.55 times the upper bound
* Reference: Briscoe, Glen J.; “Risk Management Guide;” System Safety Development Center; SSDC-11; DOE 76-45/11; September 1982.
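The log-average computation above is just the geometric midpoint of the two bounds, sketched here with the slide's example values:

```python
import math

# Log Average Method: geometric midpoint of credible probability bounds.
PL, PU = 1e-2, 1e-1   # lower and upper credible bounds (slide's example)

log_avg = 10 ** ((math.log10(PL) + math.log10(PU)) / 2)
print(log_avg)        # 10**-1.5, about 0.0316228

# Geometrically midway: the same factor above PL as below PU.
assert abs(log_avg / PL - PU / log_avg) < 1e-9
```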
Typical Component Failure Rates
Failures Per 10⁶ Hours
[Figure: the alarm-clock subtree annotated with failure rates – Main Plug-in Clock Fails: 2.03 x 10–2; Backup (Windup) Clock Fails: 2.04 x 10–2; Electrical Fault: 3. x 10–4; Mechanical Fault: 8. x 10–8; Hour Hand Falls Off: 4. x 10–4; Hour Hand Jams Works: 2. x 10–4; with fractional annotations 1/15, 1/10, 1/20.]
HOW Much PT is TOO Much?
Consider “bootstrapping” comparisons with known risks…
Human operator error (response to repetitive stimulus) ≈ 10–2–10–3/exp MH†
Internal combustion engine failure (spark ignition) ≈ 10–3/exp hr†
Pneumatic instrument recorder failure ≈ 10–4/exp hr†
Distribution transformer failure ≈ 10–5/exp hr†
U.S. motor vehicle fatalities ≈ 10–6/exp MH†
Death by disease (U.S. lifetime avg.) ≈ 10–6/exp MH
U.S. employment fatalities ≈ 10–7–10–8/exp MH†
Death by lightning ≈ 10–9/exp MH*
Meteorite (>1 lb) hit on 10³ x 10³ ft area of U.S. ≈ 10–10/exp hr‡
Earth destroyed by extraterrestrial hit ≈ 10–14/exp hr†
† Browning, R.L., “The Loss Rate Concept in Safety Engineering”
* National Safety Council, “Accident Facts”
‡ Kopecek, J.T., “Analytical Methods Applicable to Risk Assessment & Prevention,” Tenth International System Safety Conference
Apply Scoping
What power outages are of concern? Not all of them! Only those that…
• Are undetected/uncompensated
• Occur during the hours of sleep
• Have sufficient duration to fault the system
[Figure: “Power Outage” event, P = 1 x 10–2]
True Contributors
[Figure: the redundant “Alarm Failure” events traced to their true, independent contributors – Hour Hand Falls Off, Hour Hand Jams Works, Electrical Fault, Gearing Jams/Fails, Other Mechanical Fault.]
Common Cause Oversight – An Example
[Figure: TOP “Unannunciated Intrusion by Burglar,” fed through an OR gate by the detector/alarm failures and by “Detector/Alarm Power Failure.”]
Here, power source failure has been recognized as an event which, if it occurs, will disable all four alarm systems. Power failure has been accounted for as a common cause event, leading to the TOP event through an OR gate. OTHER COMMON CAUSES SHOULD ALSO BE SEARCHED FOR.
Example Common Cause Fault/Failure Sources
! Utility Outage
– Electricity
– Cooling Water
– Pneumatic Pressure
– Steam
! Moisture
! Corrosion
! Seismic Disturbance
! Dust/Grit
! Temperature Effects (Freezing/Overheat)
! Electromagnetic Disturbance
! Single Operator Oversight
! Many Others
! Separation/Isolation/Insulation/Sealing/
Shielding of System Elements.
! Using redundant elements having differing
operating principles.
! Separately powering/servicing/maintaining
redundant elements.
! Using independent operators/inspectors.
Missing Elements?
Contributing elements must combine to satisfy all conditions essential to the TOP event. The logic criteria of necessity and sufficiency must be satisfied.
[Figure: TOP “Unannunciated Intrusion by Burglar” requires both “Detector/Alarm Failure” and the SYSTEM CHALLENGE “Intrusion by Burglar”; “Detector/Alarm Failure” develops through “Detector/Alarm System Failure,” “Detector/Alarm Power Failure,” and “Burglar Barriers Present Fail.”]
[Example: 1% of infected cases test falsely negative, receive no treatment, and succumb to disease; 2% of uninfected cases test falsely positive, receive treatment, and succumb to side effects.]
Cut Sets
AIDS TO…
! System Diagnosis
! Reducing Vulnerability
– Construct a matrix, starting with the TOP “A” gate.
A Cut Set Example
[Figure: step-by-step matrix substitution from the TOP “A” gate. D (top row) is an OR gate; 2 & 4, its inputs, replace it vertically – each requires a new row. D (second row) is an OR gate; replace as before.]
These Boolean-Indicated Cut Sets… …reduce to these Minimal Cut Sets: 1·2, 2·3, 1·4. Minimal Cut Set rows are the least groups of initiators which will induce TOP.
The Minimal Cut Sets – 1·2, 2·3, 1·4 – …represent this Fault Tree… …and this Fault Tree is a Logic Equivalent of the original, for which the Minimal Cut Sets were derived.
Equivalent Trees Aren’t Always Simpler
This Fault Tree (4 gates, 6 initiators) has this logic equivalent (9 gates, 24 initiators). [Figure: compare with the earlier example and note the differences; the TOP gate here is OR.]
Another Cut Set Example
Construct the matrix – make step-by-step substitutions, expanding gates A, B, C, D, E, F, and G from the TOP…
Resulting Minimal Cut Sets: 1·2, 1·3, 1·4, 3·4·5·6.
From Tree to Reliability Block Diagram
Blocks represent functions of system elements. Paths through them represent success. “Barring” a term (n̄) denotes consideration of its success properties.
The tree models a system fault, in failure domain. Let that fault be System Fails to Function as Intended. Its opposite, System Succeeds in Functioning as Intended, can be represented by a Reliability Block Diagram in which success flows through system element functions from left to right. Any path through the block diagram, not interrupted by a fault of an element, results in system success.
[Figure: the example fault tree (gates A–G, initiators 1–6) redrawn as a Reliability Block Diagram.]
[Figure: the Reliability Block Diagram with the Minimal Cut Sets – 1·2, 1·3, 1·4, 3·4·5·6 – overlaid.]
Each Cut Set (horizontal rows in the matrix) interrupts all left-to-right paths through the Reliability Block Diagram. Note that 3/5/1/6 is a Cut Set, but not a Minimal Cut Set. (It contains 1/3, a true Minimal Cut Set.)
Cut Set Uses
! Evaluating PT
! Finding Vulnerability to Common Causes
! Analyzing Common Cause Probability
! Evaluating Structural Cut Set “Importance”
! Evaluating Quantitative Cut Set “Importance”
! Evaluating Item “Importance”
[Figure: the “System Fault” TOP becomes an OR gate – the tree analyzed as usual, ORed with “Common-Cause Induced Fault” events: Moisture, Vibration, Human Operator, Heat, …others. These must be ORed into the tree.]
…where Pk = Π Pe = P3 x P4 x P5 x P6
[Figure: the example tree with its Minimal Cut Sets – 1·2, 1·3, 1·4, 3·4·5·6.]
Analyzing Quantitative Importance enables numerical ranking of contributions to System Failure. To reduce system vulnerability most effectively, attack Cut Sets having greater Importance. Generally, short Cut Sets have greater Importance, long Cut Sets have lesser Importance.
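As an illustration, the ranking can be computed from the example's minimal cut sets with assumed initiator probabilities; the Pe values below are made up, not from the slides:

```python
# Quantitative Cut Set Importance sketch with hypothetical Pe values.
P = {1: 1e-2, 2: 2e-3, 3: 3e-3, 4: 4e-3, 5: 5e-2, 6: 1e-3}

minimal_cut_sets = [(1, 2), (1, 3), (1, 4), (3, 4, 5, 6)]

def cut_set_prob(cs):
    p = 1.0
    for e in cs:
        p *= P[e]        # independent initiators: multiply
    return p

# Rare-event approximation: PT is roughly the sum of cut-set probabilities.
PT = sum(cut_set_prob(cs) for cs in minimal_cut_sets)

for cs in sorted(minimal_cut_sets, key=cut_set_prob, reverse=True):
    importance = cut_set_prob(cs) / PT   # share of TOP probability
    print(cs, f"P = {cut_set_prob(cs):.2e}  Importance = {importance:.3f}")
```

With these values the two-element cut sets dominate and the four-element set is negligible, illustrating the rule that short cut sets generally carry greater Importance.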
Item “Importance”
The quantitative Importance of an item (Ie) is the numerical probability that, given that the TOP event has occurred, the item has contributed to it.
Path Sets
Aids to…
! Further Diagnostic Measures
! Linking to Success Domain
! Trade/Cost Studies
[Figure: the Reliability Block Diagram with the Path Sets overlaid – 1·3, 1·4, 1·5, 1·6, 2·3·4.]
Each Path Set (horizontal rows in the matrix) represents a left-to-right path through the Reliability Block Diagram.
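The cut-set/path-set relationship can be checked mechanically: every Path Set must share at least one element with every Minimal Cut Set (otherwise a cut set could fail the system while a path still succeeded), and every Cut Set interrupts every path. A sketch using this example's sets:

```python
# Duality check between the example's minimal cut sets and path sets.
minimal_cut_sets = [{1, 2}, {1, 3}, {1, 4}, {3, 4, 5, 6}]
path_sets = [{1, 3}, {1, 4}, {1, 5}, {1, 6}, {2, 3, 4}]

for ps in path_sets:
    # Each path set intersects every cut set.
    assert all(ps & cs for cs in minimal_cut_sets)

for cs in minimal_cut_sets:
    # Each cut set blocks every success path.
    assert all(cs & ps for ps in path_sets)

print("duality checks pass")
```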
For all new countermeasures, THINK…
• COST
• EFFECTIVENESS
• FEASIBILITY (incl. schedule)
AND – does the new countermeasure…
• Introduce new HAZARDS?
• Cripple the system?
Some Diagnostic Gimmicks
Using a “generic” all-purpose fault tree…
[Figure: a generic tree – TOP (PT) over numbered gates and initiators 1–34.]
[Figure: initiator 22, P22 = 3 x 10–3, visualized as 1,000 peg spaces – 997 white, 3 red.]
Use Sensitivity Tests
Gauging the “nastiness” of untrustworthy initiators…
[Figure: the generic tree, TOP (PT), with untrusted initiator probabilities varied to observe the effect on PT.]
How Far Down Should a Fault Tree Grow?
Where do you stop the analysis? The analysis is a Risk Management enterprise. The TOP statement gives severity. The tree analysis provides probability. ANALYZE NO FURTHER DOWN THAN IS NECESSARY TO ENTER PROBABILITY DATA WITH CONFIDENCE. Is risk acceptable? If YES, stop. If NO, use the tree to guide risk reduction. SOME EXCEPTIONS…
1.) An event within the tree has alarmingly high probability. Dig deeper beneath it to find the source(s) of the high probability.
2.) Mishap autopsies must sometimes analyze down to the cotter-pin level to produce a “credible cause” list.
State-of-Component Method
WHEN – Analysis has proceeded to the device level – i.e., valves, pumps, switches, relays, etc.
HOW – Show the device fault/failure in the mode needed for upward propagation, e.g., “Relay K-28 Contacts Fail Closed.”
Fault Tree Analysis vs. FMECA*
Selection Characteristic – Preferred Method
Safety of public/operating/maintenance personnel – FTA
Small number/clearly defined TOP events – FTA
Indistinctly defined TOP events – FMECA
Full-mission completion critically important – FMECA
Many, potentially successful missions possible – FMECA
“All possible” failure modes are of concern – FMECA
High potential for “human error” contributions – FTA
High potential for “software error” contributions – FTA
Numerical “risk evaluation” needed – FTA
Very complex system architecture/many functional parts – FTA
Linear system architecture with little human/software influence – FMECA
System irreparable after mission starts – FMECA
*Adapted from “Fault Tree Analysis Application Guide,” Reliability Analysis Center, Rome Air Development Center.
Fault Tree Constraints and
Shortcomings
! Undesirable events must be foreseen and are only analyzed
singly.
! All significant contributors to fault/failure must be anticipated.
! Each fault/failure initiator must be constrained to two
conditional modes when modeled in the tree.
! Initiators at a given analysis level beneath a common gate
must be independent of each other.
! Events/conditions at any analysis level must be true,
immediate contributors to next-level events/conditions.
! Each Initiator’s failure rate must be a predictable constant.
Closing Caveats
! Be wary of the ILLUSION of SAFETY. Low probability does not mean
that a mishap won’t happen!
! THERE IS NO ABSOLUTE SAFETY! An enterprise is safe only to
the degree that its risks are tolerable!
! Apply broad confidence limits to probabilities representing human
performance!
! A large number of systems having low probabilities of failure means
that A MISHAP WILL HAPPEN – somewhere among them!
P1 + P2 + P3 + P4 + ⋯ + Pn → 1
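This last caveat can be sketched numerically, assuming n independent systems each with the same small failure probability p (values illustrative):

```python
# "A mishap WILL happen somewhere": probability that at least one of n
# independent low-probability systems fails approaches 1 as n grows.
p = 1e-3                              # per-system failure probability (assumed)
for n in (10, 1000, 10_000):
    p_any = 1 - (1 - p) ** n          # P(at least one failure among n)
    print(f"n = {n:6d}: P(at least one failure) = {p_any:.3f}")
```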
Caveats
Do you REALLY have enough data to justify QUANTITATIVE ANALYSIS?
Assumptions:
! Stochastic System Behavior
! Constant System Properties
! Constant Service Stresses
! Constant Environmental Stresses
For 95% confidence…
We must have no failures in   to give PF ≤ …   and R ≥ …
1,000 tests                   3 x 10–3         0.997
300 tests                     10–2             0.99
100 tests                     3 x 10–2         0.97
30 tests                      10–1             0.9
10 tests                      3 x 10–1         0.7
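The table's entries are consistent with the zero-failure demonstration bound: if no failures are observed in n trials, the upper bound on PF at confidence C satisfies (1 − PF)^n = 1 − C, i.e. PF = 1 − (1 − C)^(1/n), roughly 3/n at 95%. A sketch:

```python
# Zero-failure test bound: upper limit on PF after n failure-free trials.
def pf_upper_bound(n, confidence=0.95):
    # Solve (1 - PF)**n = 1 - confidence for PF.
    return 1 - (1 - confidence) ** (1 / n)

for n in (1000, 300, 100, 30, 10):
    pf = pf_upper_bound(n)
    print(f"{n:5d} tests: PF <= {pf:.3f}, R >= {1 - pf:.3f}")
```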
Bibliography
Selected references for further study…
! Center for Chemical Process Safety; “Guidelines for Hazard Evaluation Procedures, 3rd Edition with Worked Examples;” John Wiley & Sons; 2008 (576 pp).
! Mannan, Sam (Ed.); “Lees’ Loss Prevention in the Process Industries, 3rd Edition;” Butterworth-Heinemann; 2004 (3680 pp – three volumes).
! Henley, Ernest J. and Hiromitsu Kumamoto; “Reliability Engineering and Risk Assessment;” 1981 (568 pp).
Additional Reading
! Clemens, P.L. and R.J. Simmons; “System Safety and Risk Management;” Cincinnati, OH: National Institute for Occupational Safety and Health; 208 pp; 1998.
! Clemens, P.L. and R.J. Simmons; “The Exposure Interval: Too Often the Analysts’ Trap;” Journal of System Safety, Vol. 37, No. 1, pp 8–11, 1st Quarter, 2001.
! Clemens, P.L., Pfitzer, T.F., Simmons, R.J., Dwyer, S., Frost, J. and E. Olson; “The RAC Matrix: A Universal Tool or a Toolkit?” Journal of System Safety, Vol. 41, No. 2, pp 14–19, 41–42, March/April 2005.