Module 2 Modeling and Evaluation
NETWORKS
NTUEE 1 KUO
Evaluation
Motivation for Evaluation
− Determine whether reliability specifications are met
− Determine most cost-effective design technique
− Determine whether redundancy is required
− Useful for comparing redundancy techniques
Evaluation Criteria
A method of evaluation is required in order to compare
the redundancy techniques and make subsequent design
tradeoffs
Modeling techniques are a vital means of obtaining
reasonable predictions of system reliability and
availability
─ Combinatorial: series/parallel, K-of-N, nonseries/nonparallel
─ Markov: time invariant, discrete time, continuous time, hybrid
─ Queuing
Using these techniques probabilistic models of systems
can be created and used to evaluate system reliability
and/or availability
Two approaches
Qualitative evaluation
− aims to identify, classify and rank the failure modes, or
event combinations that would lead to system failures
Quantitative evaluation
− aims to evaluate in terms of probabilities the attributes
of dependability (reliability, availability, safety)
Concepts from Probability Theory
Probability density function: pdf
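$f(t) = \mathrm{prob}[t \le x \le t + dt] / dt = dF(t)/dt$
Cumulative distribution function: CDF
$F(t) = \mathrm{prob}[x \le t] = \int_0^t f(x)\,dx$
Expected value of x
$E_x = \int_{-\infty}^{+\infty} x f(x)\,dx$ (continuous) or $E_x = \sum_k x_k f(x_k)$ (discrete)
Variance of x
$\sigma_x^2 = \int_{-\infty}^{+\infty} (x - E_x)^2 f(x)\,dx$ (continuous) or $\sigma_x^2 = \sum_k (x_k - E_x)^2 f(x_k)$ (discrete)
Covariance of x and y
$\psi_{x,y} = E[(x - E_x)(y - E_y)] = E[xy] - E_x E_y$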
Some Simple Probability Distributions
[Figure: CDF F(x) and pdf f(x) of several simple probability distributions]
Failure Rate
failure rate λ(t)
− expected number of failures per unit time
− example
» 1000 controllers working at t0
» after 10 hours: 950 working
» failure rate for each controller:
50 failures / (1000 controllers × 10 hours) = 0.005 failures / hour
Failure rate
[Figure: bathtub curve of failure rate versus time t: I infant mortality, II useful life, III wear-out]
For the useful-life period λ is constant and the reliability is
given by
$R(t) = e^{-\lambda t}$
Exponential failure law
[Figure: reliability under the exponential failure law, vertical axis from 0 to 1]
Reliability and MTTF of a Single
Component (Module)
Probabilistic Interpretation of f(t) and F(t)
F(t) - probability that the component will fail at or
before time t
F(t) = Prob {T ≤ t}
f(t) – not a probability, but the momentary rate of
probability of failure at time t
f(t)dt = Prob {t ≤ T ≤ t+dt}
Like any density function (defined for t ≥ 0):
$f(t) \ge 0$ for all $t \ge 0$, and $\int_0^{\infty} f(t)\,dt = 1$
$f(t) = dF(t)/dt$ and $F(t) = \int_0^t f(s)\,ds$
Reliability and Failure Rate
The reliability of a single module - R(t)
− R(t) = Prob {T>t} = 1- F(t)
The conditional probability per unit time that the module will fail at
time t, given it has not failed before, is the failure rate
$\lambda(t) = f(t) / (1 - F(t)) = f(t) / R(t)$
Constant Failure Rate
If the module has a failure rate which is constant over
time, $\lambda(t) = \lambda$, then
$dR(t)/dt = -\lambda R(t)$, with $R(0) = 1$
The solution of this differential equation is
$R(t) = e^{-\lambda t}$
$f(t) = \lambda e^{-\lambda t}$
$F(t) = 1 - e^{-\lambda t}$
A module has a constant failure rate if and only if T, the
lifetime of the module, has an exponential distribution
Time varying failure rate
Failure rate is not always constant
– software failure rate decreases as the package matures
Weibull distribution:
$z(t) = \alpha\lambda(\lambda t)^{\alpha-1}$
if α=1, then z(t) = constant = λ
Failure rate calculation
determined for components
– systems: combination of components
– λ of the system = sum of λ of the components
determine λ experimentally
– slow
• e.g. 1 failure per 100 000 hours (=11.4 years)
– expensive
• many components required for significance
use standards for λ
The unit of failure rate is the FIT (failures in time)
x FIT = x failures per $10^9$ hours
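The FIT arithmetic above can be sketched as follows. This is only an illustration, assuming 1 FIT = 1 failure per 10^9 hours and the rule that the λ of a system is the sum of the λ of its components. The component names and FIT values are made up.

```python
# Hypothetical FIT values, chosen only to illustrate the arithmetic.
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 hours


def fit_to_rate(fit):
    """Convert a FIT value to a failure rate in failures per hour."""
    return fit / HOURS_PER_FIT_UNIT


def system_fit(component_fits):
    """FIT of a system in which every component must work: the sum."""
    return sum(component_fits)


def mttf_hours(fit):
    """MTTF = 1 / lambda, valid for a constant failure rate."""
    return 1.0 / fit_to_rate(fit)


components = {"cpu": 200.0, "memory": 50.0, "power_supply": 250.0}
total = system_fit(components.values())
print(total, "FIT, MTTF =", mttf_hours(total), "hours")
```

The same conversion shows why measuring λ experimentally is slow: a rate of 1 failure per 100 000 hours corresponds to 10 000 FIT.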
Empirical Formula for λ - Failure Rate
λ = πL πQ (C1 πT πV + C2 πE)
− πL: Learning factor, (how mature the technology is)
− πQ: Manufacturing process Quality factor (0.25 to 20.00)
− πT: Temperature factor, (from 0.1 to 1000), proportional to exp(-Ea/kT)
where Ea is the activation energy in electron-volts associated with the
technology, k is the Boltzmann constant and T is the temperature in Kelvin
− πV: Voltage stress factor for CMOS devices (from 1 to 10 depending on
the supply voltage and the temperature); does not apply to other
technologies (set to 1)
− πE: Environment shock factor: from about 0.4 (air-conditioned
environment), to 13.0 (harsh environment - e.g., space, cars)
− C1, C2: Complexity factors; functions of number of gates on the chip and
number of pins in the package
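MTTF is defined in terms of reliability as:
$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt$
For a constant failure rate λ this gives $\mathrm{MTTF} = \int_0^{\infty} e^{-\lambda t}\,dt = 1/\lambda$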
MTTF
[Figure: $R(t) = e^{-\lambda t}$ versus t, with t marked at 1/λ, 2/λ and 3/λ; vertical axis from 0 to 1]
Probabilistic Models
Hardware component failure rate function: Experimentally
observed Bathtub Curve
Bathtub curve for failure rate
− implies constant failure rate during useful life
− infant mortality and wear-out periods have variable
failure rates
[Figure: bathtub curve of component failure rate versus time, showing infant mortality, useful life and wear-out]
Weibull distribution - Equation
The Weibull distribution has two parameters, λ and β
The density function of the component lifetime T:
$f(t) = \lambda\beta\, t^{\beta-1} e^{-\lambda t^{\beta}}$
The failure rate for the Weibull distribution is
$\lambda(t) = \lambda\beta\, t^{\beta-1}$
λ(t) is decreasing with time for β<1, increasing with
time for β>1, constant for β=1, appropriate for infant
mortality, wear-out and middle phases, respectively
Reliability and MTTF for Weibull
Distribution
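Reliability for Weibull distribution is
$R(t) = e^{-\lambda t^{\beta}}$
MTTF for Weibull distribution is
$\mathrm{MTTF} = \int_0^{\infty} e^{-\lambda t^{\beta}}\,dt = \frac{\Gamma(1/\beta)}{\beta\,\lambda^{1/\beta}}$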
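As a sanity check on the Weibull MTTF, the sketch below compares the closed form Γ(1/β) / (β λ^(1/β)) with a numerical integral of R(t) = exp(−λ t^β). The closed form is derived here from MTTF = ∫ R(t) dt and is an assumption about what the slide showed. The parameter values and the integration cutoff `t_max` are illustrative.

```python
import math


def weibull_reliability(t, lam, beta):
    """R(t) = exp(-lam * t**beta)."""
    return math.exp(-lam * t ** beta)


def weibull_mttf_closed_form(lam, beta):
    """Gamma(1/beta) / (beta * lam**(1/beta)); reduces to 1/lam for beta = 1."""
    return math.gamma(1.0 / beta) / (beta * lam ** (1.0 / beta))


def mttf_numeric(reliability, t_max, steps=200_000):
    """Trapezoidal estimate of the integral of R(t) from 0 to t_max.

    t_max must be large enough that R(t_max) is negligible.
    """
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h


lam, beta = 0.01, 2.0  # hypothetical wear-out parameters (beta > 1)
numeric = mttf_numeric(lambda t: weibull_reliability(t, lam, beta), t_max=100.0)
print("closed form:", weibull_mttf_closed_form(lam, beta))
print("numeric    :", numeric)
```

With β = 1 the closed form collapses to 1/λ, the MTTF of the exponential failure law.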
Mean Time to Repair (MTTR)
The average time required to repair a system.
− difficult to calculate; determined experimentally
− normally specified in terms of a repair rate µ, which is the average number of
repairs that occur per time period (e.g., repairs per hour)
$\mathrm{MTTR} = 1/\mu$
MTBF
Reliability computation: mean time between failures (MTBF)
− use heuristic arguments to conclude MTBF = (total time T)/(average number of failures)
− can also argue MTBF = MTTF + MTTR
− Note: often λ << µ and hence MTTF >> MTTR, so the terms MTTF and MTBF are used
interchangeably by some practitioners
Single Parameter Measures
MTBF=MTTF+MTTR
Combinatorial Modeling
System is divided into non-overlapping modules
Each module is assigned either a probability of working, Pi, or a probability
as function of time, Ri(t)
The goal is to derive the probability, Psys, or function Rsys(t): Prob that the
system survives until time t
Assumptions:
− module failures are independent
− once a module has failed, it is always assumed to yield incorrect
results
− System considered failed if it does not contain a minimal set of
functioning modules
− once system enters a failed state, other failures cannot return system
to functional state
Models typically enumerate all the states of the system that meet or
exceed the requirements for a correctly functioning system
Combinatorial counting techniques are used to simplify this process
Canonical Structures
Reliability of a Series System
A series system - a set of modules such that the failure
of any one module causes the entire system to fail
$R_s(t) = \prod_{i=1}^{N} R_i(t)$
Ri(t) is the reliability of module i
Series System – Modules Have
Constant Failure Rates
Every module i has a constant failure rate λi
$R_i(t) = e^{-\lambda_i t}$
$R_s(t) = e^{-\lambda_s t} = e^{-\sum_i \lambda_i t}$
λs = Σλi is the constant failure rate of the series system
− Effect is summation of failure rates of components
Mean Time To Failure of a series system:
$\mathrm{MTTF}_s = \frac{1}{\lambda_s} = \frac{1}{\sum_i \lambda_i}$
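A small sketch of the series formulas, assuming independent modules with constant failure rates. The three rates are hypothetical values in failures per hour.

```python
import math


def series_reliability(rates, t):
    """R_s(t) = product of exp(-lambda_i * t) over all modules."""
    r = 1.0
    for lam in rates:
        r *= math.exp(-lam * t)
    return r


def series_mttf(rates):
    """MTTF_s = 1 / sum(lambda_i)."""
    return 1.0 / sum(rates)


rates = [1e-4, 2e-4, 5e-4]  # hypothetical per-hour failure rates
print("R_s(1000 h) =", series_reliability(rates, 1000.0))
print("MTTF_s      =", series_mttf(rates), "hours")
```

The product of exponentials equals a single exponential with rate Σλi, which is why the series system again has a constant failure rate.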
Reliability of a Parallel System
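A parallel system - a set of modules connected so
that all the modules must fail before the system fails
With independent module failures, $R_p(t) = 1 - \prod_{i=1}^{N} (1 - R_i(t))$
For N identical modules with constant failure rate λ:
$\mathrm{MTTF}_p = \sum_{i=1}^{N} \frac{1}{i\lambda}$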
Series-Parallel Systems
Consider combinations of series and parallel
systems
Example, two CPUs connected to two memories
in different ways
[Figure: block diagrams of modules a, b, c, d]
For the chain a-b in parallel with the chain c-d:
$R_{sys} = 1 - (1 - R_a R_b)(1 - R_c R_d)$
A Simple Example
Consider dynamic redundant system with spares (dynamic
redundancy)
As soon as fault occurs, a faulty component is replaced by a spare
Up to n-1 spare modules
[Figure: modules 1, 2, 3, ..., n connected in parallel]
$R_{sys} = 1 - (1 - R_1)(1 - R_2) \cdots (1 - R_n)$
Consider identical modules with Ri = 0.9
How can you increase Rsys to 0.999999 = $1 - 10^{-6}$?
Prob. of module i to survive = Ri
Number of modules $n = \ln 10^{-6} / \ln(1 - R_i) = 6$
Hence, need 5 spares to make reliable system
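The spare count in this example can be reproduced with a short loop. This sketch assumes identical, independent modules and perfect switching. It multiplies the unreliability (1 − R) until the target is met, with a small relative tolerance (an implementation choice) so that floating-point rounding does not add a spurious module.

```python
def modules_needed(module_reliability, target_unreliability):
    """Smallest n with (1 - R)**n <= target_unreliability."""
    if not 0.0 < module_reliability <= 1.0:
        raise ValueError("module reliability must be in (0, 1]")
    q = 1.0 - module_reliability
    n = 1
    unreliability = q
    # Relative tolerance guards against rounding in 1.0 - R and in the product.
    while unreliability > target_unreliability * (1.0 + 1e-9):
        n += 1
        unreliability *= q
    return n


n = modules_needed(0.9, 1e-6)
print("modules:", n, "spares:", n - 1)
```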
Non Series/Parallel
Systems
Expanding about C
[Figure: panels (a) and (b)]
The process of expanding can be repeated until the
resulting diagrams are of the series/parallel type
Figure (a) needs further expansion about E
Figure (a) should not be viewed as a parallel connection of
A and B, connected serially to D and E in parallel. Such a
diagram will have the path BCDF which is not a valid path
Expanding about C and E
[Figure: panels (a) and (b)]
Lower Bound on Reliability
A lower bound is calculated based on minimal cut sets of
the system diagram
A minimal cut set: a minimal list of modules such that the
removal (due to a fault) of all modules will cause a working
system to fail
Minimal cut sets: F, AB, AE, DE
and BCD
The lower bound is
$R_{system} \ge \prod_i (1 - Q_{cut_i})$
− $Q_{cut_i}$ - probability that the minimal
cut i is faulty (i.e., all its modules are faulty)
Example - RA=RB=RC=RD=RE=RF=R
$R_{system} \ge R^5 (24 - 60R + 62R^2 - 33R^3 + 9R^4 - R^5)$
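The lower bound can be checked numerically from the listed minimal cut sets (F, AB, AE, DE, BCD). The sketch below assumes independent module failures and compares the product over cut sets with the expanded polynomial for equal module reliabilities. The function names are illustrative.

```python
CUT_SETS = ["F", "AB", "AE", "DE", "BCD"]  # minimal cut sets from the slide


def cut_set_lower_bound(cut_sets, reliability):
    """Product over cut sets of (1 - Q_cut), Q_cut = prod of (1 - R_module)."""
    bound = 1.0
    for cut in cut_sets:
        q_cut = 1.0
        for module in cut:
            q_cut *= 1.0 - reliability[module]
        bound *= 1.0 - q_cut
    return bound


def expanded_polynomial(r):
    """R^5 (24 - 60R + 62R^2 - 33R^3 + 9R^4 - R^5) for equal reliabilities."""
    return r**5 * (24 - 60*r + 62*r**2 - 33*r**3 + 9*r**4 - r**5)


for r in (0.5, 0.9, 0.99):
    same_r = {module: r for module in "ABCDEF"}
    print(r, cut_set_lower_bound(CUT_SETS, same_r), expanded_polynomial(r))
```

Both columns agree because (1 − Q)(1 − Q²)³(1 − Q³) with Q = 1 − R factors as R⁵(2 − R)³(3 − 3R + R²).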
Example – Comparison of Bounds
Example - RA=RB=RC=RD=RE=RF=R
Lower bound here is a very good estimate for a
high-reliability system
Reliability Block Diagram
M-out-of-N Systems
Cascading TMR Systems
Consider n stages of original system
Each stage replaced by TMR with Voter
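$R_{simplex}(t) = e^{-\lambda t}$
$\mathrm{MTTF}_{simplex} = \int_0^{\infty} e^{-\lambda t}\,dt = 1/\lambda$
$R_{TMR}(t) = e^{-3\lambda t} + \binom{3}{2} e^{-2\lambda t}(1 - e^{-\lambda t})$
$\mathrm{MTTF}_{TMR} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$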
Pitfalls Using Single Model (cont.)
[Figure: reliability versus λt (0 to 5) for TMR and simplex; the two curves cross at λt0]
$R_{TMR}(t) \ge R(t)$ for $0 \le t \le t_0$
$R_{TMR}(t) \le R(t)$ for $t_0 \le t < \infty$
where $t_0 = \frac{\ln 2}{\lambda} \approx \frac{0.7}{\lambda}$
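The crossover between TMR and simplex can be reproduced with a few lines. This is a sketch assuming a perfect voter and a constant failure rate λ per module. The value of `lam` is hypothetical.

```python
import math


def r_simplex(lam, t):
    """R(t) = exp(-lam * t) for a single module."""
    return math.exp(-lam * t)


def r_tmr(lam, t):
    """All three modules work, or exactly two of the three work."""
    r = r_simplex(lam, t)
    return r**3 + 3.0 * r**2 * (1.0 - r)


def mttf_simplex(lam):
    return 1.0 / lam


def mttf_tmr(lam):
    """3/(2 lam) - 2/(3 lam) = 5/(6 lam), which is below 1/lam."""
    return 3.0 / (2.0 * lam) - 2.0 / (3.0 * lam)


def crossover_time(lam):
    """t0 = ln 2 / lam, where R_TMR(t0) = R(t0) = 1/2."""
    return math.log(2.0) / lam


lam = 1e-3  # hypothetical failure rate per hour
t0 = crossover_time(lam)
for t in (0.1 * t0, t0, 3.0 * t0):
    print(t, r_tmr(lam, t), r_simplex(lam, t))
print("MTTF:", mttf_tmr(lam), "vs", mttf_simplex(lam))
```

This is the pitfall of a single-number measure: TMR has the lower MTTF yet the higher reliability for every mission shorter than t0.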
Mission Time
Suppose we wish to have a super-reliable
system with reliability r > 0.9999. MT(r)
gives the time at which system reliability
falls below r.
$R(MT(r)) = r$
$MT(R(t)) = t$
For constant failure rate
$R(t) = e^{-\lambda t} = r$
$MT(r) = \frac{-\ln r}{\lambda}$
For a nonredundant system with n components
$MT(r) = \frac{-\ln r}{\sum_{i=1}^{n} \lambda_i}$
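A minimal sketch of the mission-time formula for a nonredundant system with constant failure rates. The rates and the target reliability are hypothetical.

```python
import math


def mission_time(r, rates):
    """MT(r) = -ln(r) / sum(lambda_i) for a nonredundant (series) system."""
    if not 0.0 < r <= 1.0:
        raise ValueError("target reliability must be in (0, 1]")
    return -math.log(r) / sum(rates)


single = mission_time(0.9999, [1e-4])        # one component
double = mission_time(0.9999, [1e-4, 1e-4])  # two such components in series
print("MT(0.9999), one component :", single, "hours")
print("MT(0.9999), two components:", double, "hours")
```

For r close to 1, −ln r ≈ 1 − r, so the mission time is roughly (1 − r) times the MTTF.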
Availability
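Probability that the system will be operating at time t, A(t).
Steady-state availability: proportion of total
time that the system is operational.
Steady-state availability with constant λ and µ:
$\frac{\mathrm{MTTF}}{\mathrm{MTBF}} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{\mu}{\lambda + \mu}$
where $\mathrm{MTTF} = 1/\lambda$ and $\mathrm{MTTR} = 1/\mu$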
Comparative Measures
Pitfalls Using Single Model (cont.)
Instead of MTTF, look at mission time
Reliability of M-out-of-N systems is very high in the beginning
− spare components tolerate failures
Reliability falls sharply toward the end
− the system has exhausted its redundancy, and there is more hardware that can fail
Such systems are useful in aircraft control
− very high reliability over a short time
− 0.99999 over a 10-hour period
Improve “Vanilla” TMR: TMR with Recovery, TMR Simplex
Effect of Coverage
One spare, coverage c: $R_{sys} = R_1 + c\,(1 - R_1) R_2$
Identical modules with three spares: $R_{sys} = R_m \sum_{i=0}^{3} c^i (1 - R_m)^i$
Effect of Coverage (cont.)
If coverage is 100%, then given low module reliability, can
increase system reliability arbitrarily
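The effect of coverage can be explored with the sketch below. It assumes identical modules of reliability Rm and the sum Rm Σ cⁱ (1 − Rm)ⁱ over i = 0..spares. The exponent i on (1 − Rm) is reconstructed here so that the one-spare case reduces to R1 + c(1 − R1)R2. The numeric values are illustrative.

```python
def standby_reliability(r_m, c, spares):
    """R_sys = R_m * sum_{i=0}^{spares} (c * (1 - R_m))**i."""
    return r_m * sum((c * (1.0 - r_m)) ** i for i in range(spares + 1))


def limit_many_spares(r_m, c):
    """Geometric-series limit R_m / (1 - c (1 - R_m)) as spares grow."""
    return r_m / (1.0 - c * (1.0 - r_m))


r_m = 0.5  # deliberately low module reliability
for c in (1.0, 0.99, 0.9):
    values = [standby_reliability(r_m, c, n) for n in (1, 3, 10, 50)]
    print("c =", c, values)
print("limit for c = 0.9:", limit_many_spares(r_m, 0.9))
```

With c = 1 the sum equals 1 − (1 − Rm)^(spares+1), which approaches 1 as spares are added. With c < 1 the reliability saturates below 1 no matter how many spares are available.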
Effect of Voter
$R_{TMRV} = R_V \left( R_m^3 + \binom{3}{2} R_m^2 (1 - R_m) \right)$
Compensating & Non-overlapping Faults
Conservative assumption - every failure of voter leads
to an erroneous output and any failure of two modules
is fatal
Counter Example - one module produces a permanent
logical 1 and a second module has a permanent
logical 0 - TMR will function properly
− These are compensating faults
A similar situation may arise regarding certain faults
within the voter circuit
Another example - non-overlapping faults - one
module has a faulty adder and another module has a
faulty multiplier
If the circuits are disjoint, they are unlikely to generate
wrong outputs simultaneously
Voters
A voter receives inputs X1, X2,...,XN from an M-of-N
cluster and generates a representative output
Simplest voter - bit-by-bit comparison of the outputs
producing the majority vote
This only works when all functional processors
generate outputs that match bit by bit
− Processors must be identical, be synchronized and
use the same software
Otherwise - two correct outputs can diverge slightly, in
the least significant bits
Active/Dynamic Redundancy
Example:
Reliability of Dynamic Redundancy -
Powered Spares
All N spare modules are active (powered) and have the
same failure rate – resulting in a basic parallel system
with N+1 modules
System reliability is
$R_{dynamic}(t) = R_{dru}(t)\,[1 - (1 - R(t))^{N+1}]$
R(t) - reliability of a module
Rdru(t) - reliability of the Detection &
Reconfiguration unit
Dynamic Redundancy with
Unpowered (Standby) Spares
Spare modules are not powered (e.g., to conserve
energy) and cannot fail until they become active
C – coverage factor – probability that faulty active module
is correctly diagnosed and disconnected, and good spare
successfully connected
Calculating exact reliability for the general case is
complicated
Reliability for a special case:
− Very large N ; constant failure rate λ per active module
− Rate of nonrecoverable faults is (1-C)λ
− Reliability at time t – probability of no nonrecoverable faults up to
time t
$R_{dynamic}(t) = R_{dru}(t)\, e^{-(1-C)\lambda t}$
Hybrid Redundancy
NMR masks permanent and intermittent failures but its
reliability drops below that of a single module for very
long mission times
Hybrid redundancy overcomes this by adding spare
modules to replace active modules once they become
faulty
A hybrid system consists of a core of N processors (NMR), and K spares
Hybrid Redundancy - Reliability
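Reliability of a hybrid system with a TMR core and K
spares is
$R_{hybrid}(t) = R_{vot}(t) R_{rec}(t) [1 - m R(t)(1 - R(t))^{m-1} - (1 - R(t))^m]$
where m = K + 3 is the total number of modules; the system fails only when fewer than two modules remain operational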
Duplex Systems
Duplex - Constant Failure Rates
Duplex reliability -
$R_{duplex}(t) = e^{-2\lambda t} + 2C e^{-\lambda t}(1 - e^{-\lambda t})$
$\mathrm{MTTF}_{duplex} = \frac{1}{2\lambda} + \frac{C}{\lambda}$
More Complex Systems
NMR systems in which failing processors are identified
and replaced from an infinite pool of spares - similar
calculation to duplex
Finite set of spares - the summation in the reliability
derivation is capped at that number of spares, rather
than going to infinity
Other variations of duplex systems -
− One processor is active while the second is a standby spare
− Processors can be repaired when they become faulty
Combinatorial arguments may be insufficient for
reliability calculation in more complex systems
If failure rates are constant, we can use Markov Models
for reliability calculations
Markov Approach
Who was Markov?
Andrei A. Markov graduated from Saint
Petersburg University in 1878 and subsequently
became a professor there.
His early work dealt mainly in number theory and
analysis, etc.
Markov is particularly remembered for his study
of Markov chains.
These chains are sequences of random variables
in which the future variable is determined by the
present variable but is independent of the way in
which the present state arose from its
predecessors.
Markov Chains - Introduction
Markov Models provide a structured approach for the
derivation of the reliability of complex systems
A Markov Chain is a stochastic process X(t) - an infinite
sequence of random variables indexed by time t , with a
special probabilistic structure
For a stochastic process to be a Markov Chain, its future
behavior must depend only on its present state, and not on
any past state
X(t+s) depends on X(t), but given X(t), X(t+s) does not
depend on any X(τ) for τ < t
If X(t)=i - the chain is in state i at time t
We deal only with Markov Chains with continuous time
(0≤t≤ ∞ ) and discrete state (X(t)=0,1,2,…)
Markov Chain - Probabilistic Interpretation
Prob{X(t+s)=j | X(t)=i,X(τ)=k} =
Prob{X(t+s)=j | X(t)=i} (τ<t)
Once the chain moves into state i, it stays there for a
length of time which has an exponential distribution with
parameter λi - it has a constant rate λi of leaving state i
The probability that when leaving state i the chain will
move to state j (with j≠i) - Pij
Transition rate from state i to state j is $\lambda_{ij} = P_{ij} \lambda_i$
$\sum_{j \ne i} P_{ij} = 1$, so $\sum_{j \ne i} \lambda_{ij} = \lambda_i$
Markov chains
Markov chains
− illustrated by state transition diagrams
idea:
− states
» components working or not
− state transitions
» when components fail or get repaired
Single-component system, no repair
Only two states
− one operational (state 1) and one failed (state 2)
− if no repair is allowed, there is a single, non-reversible transition
between the states (used in reliability analysis)
− label λ corresponds to the failure rate of the component
Single-component system with repair
If repair is allowed (used in availability analysis)
− then a transition between the failed and the operational
state is possible
− the label is the repair rate µ
Failed-safe and failed-unsafe
In safety analysis, we need to distinguish
between failed-safe and failed-unsafe states
− let 2 be a failed-safe state and 3 be a failed-unsafe
state
− the transition between the 1 and 2 depends on
failure rate and the probability that, if a fault
occurs, it is detected and handled appropriately
(i.e. fault coverage C)
− if C is the probability that a fault is detected, 1-C is
the probability that a fault is not detected
Two-component system
State transition diagram simplification
[Figure: simplified state transition diagram with states 1, 2, 3]
Model Selection
There is a wide range of models available; each
has its strengths and weaknesses.
Combinatorial models (reliability block diagrams,
fault trees) are straightforward and easy to
understand.
However, it is not easy to represent
non-independent behavior using combinatorial
models.
Markov: Pros and Cons
What is Markov Analysis?
Markovian property:
Given the present state, the future is
independent of the past.
Markov chain analysis
The aim is to compute Pi(t), the probability that
the system is in the state i at time t
Once Pi(t) is known, the reliability, availability or
safety of the system can be computed as a sum
taken over all operating states
To compute Pi(t), we derive a set of differential
equations, called state transition equations, one
for each state of the system
Transition matrix
State transition equations are usually
presented in matrix form
Transition matrix M has entries mij,
representing the rates of transition
between the states i and j
index i denotes the column (the source state)
index j denotes the row (the destination state)
Single-component system, no repair
Single-component system with repair
Single-component system, safety analysis
Two-component parallel system
Important properties of matrix M
State transition equations
Two-component parallel system
Using transition matrix derived earlier, we get:
Solving state transition equations
By solving these equations, we get
$P_1(t) = e^{-2\lambda t}$
$P_2(t) = 2e^{-\lambda t} - 2e^{-2\lambda t}$
$P_3(t) = 1 - 2e^{-\lambda t} + e^{-2\lambda t}$
Since the Pi(t) are known, we can compute the
reliability of the system as a sum of probabilities
taken over all operating states
$R_{parallel}(t) = P_1(t) + P_2(t) = 2e^{-\lambda t} - e^{-2\lambda t}$
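The closed-form solution can be cross-checked by integrating the state transition equations dP/dt = M·P numerically. The matrix below is an assumption (the slide's own matrix did not survive extraction): it is the one consistent with the solutions above, for two identical components with failure rate λ and no repair. The integrator is a plain fixed-step fourth-order Runge-Kutta in standard-library Python, not any particular solver from the course.

```python
import math


def transition_matrix(lam):
    """Assumed M for two identical parallel components, no repair.

    Column i is the source state, row j the destination state, so every
    column sums to zero. States: 1 both working, 2 one failed, 3 both failed.
    """
    return [
        [-2.0 * lam, 0.0, 0.0],
        [2.0 * lam, -lam, 0.0],
        [0.0, lam, 0.0],
    ]


def mat_vec(m, p):
    return [sum(row[k] * p[k] for k in range(len(p))) for row in m]


def axpy(p, k, scale):
    """Return p + scale * k, element by element."""
    return [pi + scale * ki for pi, ki in zip(p, k)]


def solve_state_equations(m, p0, t_end, steps=2000):
    """Integrate dP/dt = M P from 0 to t_end with fixed-step RK4."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = mat_vec(m, p)
        k2 = mat_vec(m, axpy(p, k1, 0.5 * h))
        k3 = mat_vec(m, axpy(p, k2, 0.5 * h))
        k4 = mat_vec(m, axpy(p, k3, h))
        p = [pi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for pi, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


def closed_form(lam, t):
    e1, e2 = math.exp(-lam * t), math.exp(-2.0 * lam * t)
    return [e2, 2.0 * e1 - 2.0 * e2, 1.0 - 2.0 * e1 + e2]


lam, t = 1e-3, 1000.0  # hypothetical failure rate (per hour) and mission time
p = solve_state_equations(transition_matrix(lam), [1.0, 0.0, 0.0], t)
print("numeric:", p)
print("closed :", closed_form(lam, t))
print("R_parallel(t) =", p[0] + p[1])
```

The same integrator works unchanged for models without a convenient closed form, such as the load-sharing case, by swapping in a different matrix.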
Comparison to RBD result
Dependent component case
The value of Markov chains becomes evident
when component failures cannot be assumed to
be independent
− load-sharing components
− examples: electrical load, mechanical load, information
load
If two components share the same load and one
fails, the additional load on the second
component increases its failure rate
Parallel system with load sharing
$\frac{d}{dt} P_4(t) = \lambda'_2 P_2(t) + \lambda'_1 P_3(t)$
Effect of the load
Availability evaluation
Difference with reliability analysis:
− in reliability analysis components are allowed to be
repaired as long as the system has not failed
− in availability analysis components can also be repaired
after the system failure
Two-component standby system
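[Figure: state diagram with 1 → 2 at rate λ1, 2 → 3 at rate λ2, and 2 → 1 at rate µ]
state 1: both OK
state 2: primary failed and replaced by spare
state 3: both failed
$M = \begin{pmatrix} -\lambda_1 & \mu & 0 \\ \lambda_1 & -\lambda_2-\mu & 0 \\ 0 & \lambda_2 & 0 \end{pmatrix}$

[Figure: the same diagram with an added repair transition 3 → 2 at rate µ]
States are the same. Repair replaces a broken component by a working one.
Here we assume that there is only one repair team.
$M = \begin{pmatrix} -\lambda_1 & \mu & 0 \\ \lambda_1 & -\lambda_2-\mu & \mu \\ 0 & \lambda_2 & -\mu \end{pmatrix}$

[Figure: the same diagram with the repair transition 3 → 2 at rate 2µ]
If we assume that there are two independent repair teams,
then µ on the edge from 3 to 2 gets the coefficient 2 (the rate doubles).
$M = \begin{pmatrix} -\lambda_1 & \mu & 0 \\ \lambda_1 & -\lambda_2-\mu & 2\mu \\ 0 & \lambda_2 & -2\mu \end{pmatrix}$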
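[Figure: state diagram with 1 → 2 at rate λC and 1 → 3 at rate λ(1−C)]
$\frac{d}{dt} \begin{pmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{pmatrix} = \begin{pmatrix} -\lambda & 0 & 0 \\ \lambda C & 0 & 0 \\ \lambda(1-C) & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{pmatrix}$
$P_1(t) = e^{-\lambda t}$
$P_2(t) = C(1 - e^{-\lambda t})$
$P_3(t) = (1-C) - (1-C)e^{-\lambda t}$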
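$\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$
• For example, $\binom{4}{2} = \frac{4!}{(4-2)!\,2!} = 6$
So, we get
$P_{>2\ \text{failed}} = 4\,P_{1\ \text{works},\,3\ \text{failed}} + P_{4\ \text{failed}}$
where
$P_{1\ \text{works},\,3\ \text{failed}} = R(1-R)^3$, $P_{4\ \text{failed}} = (1-R)^4$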
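The counting argument above generalizes to any M-out-of-N system of identical, independent modules. The sketch below uses `math.comb` for the binomial coefficient. The function names and the choice R = 0.9 are illustrative.

```python
from math import comb


def prob_exactly_k_failed(n, k, r):
    """C(n, k) * (1 - R)^k * R^(n - k) for identical independent modules."""
    return comb(n, k) * (1.0 - r) ** k * r ** (n - k)


def m_of_n_reliability(m, n, r):
    """Probability that at least m of the n modules are working."""
    return sum(prob_exactly_k_failed(n, k, r) for k in range(n - m + 1))


r = 0.9
more_than_two_failed = prob_exactly_k_failed(4, 3, r) + prob_exactly_k_failed(4, 4, r)
print("C(4, 2) =", comb(4, 2))
print("P(>2 of 4 failed) =", more_than_two_failed)
print("2-of-4 reliability =", m_of_n_reliability(2, 4, r))
print("TMR (2-of-3) reliability =", m_of_n_reliability(2, 3, r))
```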
[Figure: chain of states 0, 1, 2, ..., n with transition rates λ0, λ1, λ2, ..., λn]
Non-identical p1 and p2
[Figure: Processor 1 and Processor 2]
p1 has failure rate λ1, repair rate µ1
p2 has failure rate λ2, repair rate µ2
Other measurements.
$P_F(s) = \frac{\lambda/(\lambda+\mu)}{s} - \frac{\lambda/(\lambda+\mu)}{s + (\lambda+\mu)}$
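Analytical Solution - cont.
use the inverse Laplace transform:
$p_F(t) = \frac{\lambda}{\lambda+\mu}\left(1 - e^{-(\lambda+\mu)t}\right)$
note that when t goes to infinity, the above has the steady-state
solution $p_F = \lambda/(\lambda+\mu)$
Also, if we change the initial conditions to pw(0)=0 and
pF(0)=1, then we have
$p_F(t) = \frac{\lambda}{\lambda+\mu} + \frac{\mu}{\lambda+\mu}\, e^{-(\lambda+\mu)t}$, with the same steady-state solution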
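[Figure: two-state diagram, W → F at rate λ and F → W at rate µ]
$\lambda p_w = \mu p_F$
$p_F + p_w = 1$
$p_w = (\mu/\lambda)\, p_F$
$p_F + (\mu/\lambda)\, p_F = 1$
$[(\mu+\lambda)/\lambda]\, p_F = 1$
$p_w = \mu/(\mu+\lambda)$
$p_F = \lambda/(\mu+\lambda)$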
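The steady-state result can be compared with the transient behaviour of the two-state model. The sketch below assumes constant λ and µ and uses the general solution of dpF/dt = λ − (λ+µ)pF for an arbitrary initial pF(0). The rate values are hypothetical.

```python
import math


def p_failed(t, lam, mu, p_failed_0=0.0):
    """p_F(t) = steady + (p_F(0) - steady) * exp(-(lam + mu) t)."""
    total = lam + mu
    steady = lam / total
    return steady + (p_failed_0 - steady) * math.exp(-total * t)


def availability(t, lam, mu, p_failed_0=0.0):
    """A(t) = p_W(t) = 1 - p_F(t)."""
    return 1.0 - p_failed(t, lam, mu, p_failed_0)


def steady_state_availability(lam, mu):
    """p_W = mu / (mu + lam), from lam * p_W = mu * p_F and p_W + p_F = 1."""
    return mu / (lam + mu)


lam, mu = 1e-3, 1e-1  # hypothetical failure and repair rates per hour
for t in (0.0, 10.0, 100.0, 1000.0):
    print(t, availability(t, lam, mu), availability(t, lam, mu, p_failed_0=1.0))
print("steady state:", steady_state_availability(lam, mu))
```

Both starting conditions converge to the same value, µ/(µ+λ), on a time scale of about 1/(λ+µ).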
n = 20: $2^n$ = 1,048,576
n = 30: $2^n$ = 1,073,741,824