Module 2 Modeling and Evaluation

The document discusses modeling and evaluating dependable computer systems and networks. It describes two approaches to evaluation - qualitative and quantitative. Qualitative evaluation aims to identify and classify failure modes, while quantitative evaluation aims to evaluate reliability, availability, and safety using probabilities. The document also discusses concepts from probability theory used in modeling, such as probability density functions, cumulative distribution functions, expected value, and variance. It describes common probability distributions and the failure rate of components over time. Mean time to failure is defined as the expected time until the first failure occurs.

Uploaded by

ynjuan
DEPENDABLE COMPUTER SYSTEMS AND NETWORKS

Module 2 – Modeling and Evaluation


Sy-Yen Kuo
郭斯彥

NTUEE 1 KUO
Evaluation
 Motivation for Evaluation
− Determine whether reliability specifications are met
− Determine the most cost-effective design technique
− Determine whether redundancy is required
− Useful for comparing redundancy techniques
 To decide, for a given application, which fault-tolerance scheme to use
− Long-life applications with low power and prolonged reliability need dynamic redundancy
− Aircraft control needs very high reliability over a short term: static or masking redundancy

Evaluation Criteria
 A method of evaluation is required in order to compare
the redundancy techniques and make subsequent design
tradeoffs
 Modeling techniques are a vital means of obtaining reasonable predictions of system reliability and availability
─ Combinatorial: series/parallel, K-of-N, nonseries/nonparallel
─ Markov: time invariant, discrete time, continuous time, hybrid
─ Queuing
 Using these techniques, probabilistic models of systems can be created and used to evaluate system reliability and/or availability

Two approaches

 Qualitative evaluation
− aims to identify, classify and rank the failure modes, or
event combinations that would lead to system failures
 Quantitative evaluation
− aims to evaluate in terms of probabilities the attributes
of dependability (reliability, availability, safety)

Concepts from Probability Theory
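Probability density function (pdf):
$f(t) = \mathrm{Prob}[t \le x \le t + dt] / dt = dF(t)/dt$
Cumulative distribution function (CDF):
$F(t) = \mathrm{Prob}[x \le t] = \int_0^t f(x)\,dx$

Expected value of x (integral for a continuous x, sum for a discrete x):
$E_x = \int_{-\infty}^{+\infty} x\, f(x)\,dx$, or $E_x = \sum_k x_k f(x_k)$

Variance of x:
$\sigma_x^2 = \int_{-\infty}^{+\infty} (x - E_x)^2 f(x)\,dx$, or $\sigma_x^2 = \sum_k (x_k - E_x)^2 f(x_k)$

Covariance of x and y:
$\psi_{x,y} = E[(x - E_x)(y - E_y)] = E[xy] - E_x E_y$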

Some Simple Probability Distributions
[Figure: CDF F(x) and pdf f(x) of the uniform, exponential, normal and binomial distributions]

Failure Rate
 failure rate λ(t)
− expected # of failures per time-unit
− example
» 1000 controllers working at t0
» after 10 hours: 950 working
» failure rate for each controller:
50 failures / (1000 controllers × 10 hours) = 0.005 failures / hour

Failure rate

 typical evolution of λ(t) for hardware:
[Figure: failure rate λ(t) versus time t, the bathtub curve]
 bathtub: I infant mortality, II useful life, III wear-out
 for the useful-life period λ is constant, and the reliability is given by
$R(t) = e^{-\lambda t}$

Exponential failure law

If λ is constant, R(t) varies exponentially as a function of time:
$R(t) = e^{-\lambda t}$
[Figure: plot of R(t) falling from 1 toward 0 over time]

Reliability and MTTF of a Single Component (Module)

 Module operational at time t=0


 Remains operational until it is hit by a failure
 All failures are permanent
 T - lifetime of module - time until it fails
− T is a random variable
 f(t)- density function of T
 F(t) - cumulative distribution function of T

Probabilistic Interpretation of f(t) and F(t)
 F(t) - probability that the component will fail at or
before time t
F(t) = Prob {T ≤ t}
 f(t) – not a probability, but the momentary rate of
probability of failure at time t
f(t)dt = Prob {t ≤ T ≤ t+dt}
 Like any density function (defined for t ≥ 0):
$f(t) \ge 0$ for all $t \ge 0$, and $\int_0^{\infty} f(t)\,dt = 1$
 The functions F and f are related through
$f(t) = dF(t)/dt$ and $F(t) = \int_0^t f(s)\,ds$

Reliability and Failure Rate
 The reliability of a single module - R(t)
− R(t) = Prob {T>t} = 1- F(t)
 The conditional probability that the module will fail in the interval [t, t+dt], given that it has not failed before t, is
$\mathrm{Prob}\{t \le T \le t+dt \mid T \ge t\} = \mathrm{Prob}\{t \le T \le t+dt\} / \mathrm{Prob}\{T \ge t\} = f(t)\,dt / (1 - F(t))$
 The failure rate (or hazard rate) of a component at time t, λ(t), is defined as
$\lambda(t) = f(t) / (1 - F(t))$
 Since $dR(t)/dt = -f(t)$, we get $\lambda(t) = -\frac{1}{R(t)} \cdot \frac{dR(t)}{dt}$

Constant Failure Rate
 If the module has a failure rate that is constant over time, λ(t) = λ, then
$dR(t)/dt = -\lambda R(t), \quad R(0) = 1$
 The solution of this differential equation is
$R(t) = e^{-\lambda t}$
$f(t) = \lambda e^{-\lambda t}$
$F(t) = 1 - e^{-\lambda t}$
 A module has a constant failure rate if and only if T, the
lifetime of the module, has an exponential distribution
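The constant-failure-rate relations above can be checked numerically. The following is a minimal sketch, assuming a hypothetical failure rate of 0.001 failures per hour; the function names are illustrative and not part of the original slides.

```python
import math

def reliability(t, lam):
    """R(t) = exp(-lam * t): probability the module survives past time t."""
    return math.exp(-lam * t)

def density(t, lam):
    """f(t) = lam * exp(-lam * t): density of the lifetime T."""
    return lam * math.exp(-lam * t)

def cdf(t, lam):
    """F(t) = 1 - exp(-lam * t): probability of failure at or before t."""
    return 1.0 - math.exp(-lam * t)

def hazard(t, lam):
    """lambda(t) = f(t) / (1 - F(t)); stays equal to lam for all t."""
    return density(t, lam) / (1.0 - cdf(t, lam))

lam = 0.001  # hypothetical failure rate, failures per hour
for t in (0.0, 100.0, 1000.0, 5000.0):
    print(t, reliability(t, lam), hazard(t, lam))
```

The printed hazard stays at the chosen λ for every t, which is the "constant failure rate if and only if exponential lifetime" statement in one direction.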

Time varying failure rate
 Failure rate is not always constant
– software failure rate decreases as package matures
 Weibull distribution:
$z(t) = \alpha \lambda (\lambda t)^{\alpha - 1}$
 if α=1, then z(t) = constant = λ
 if α>1, then z(t) increases as time increases
 if α<1, then z(t) decreases as time increases
$R(t) = e^{-(\lambda t)^{\alpha}}$

Failure rate calculation
 determined for components
– systems: combination of components
– λ of the system = sum of λ of the components
 determine λ experimentally
– slow
• e.g. 1 failure per 100 000 hours (=11.4 years)
– expensive
• many components required for significance
 use standards for λ
 The unit of failure rate is FIT (failures in time)
 x FIT = x failures per $10^9$ hours

Empirical Formula for λ - Failure Rate
 $\lambda = \pi_L \pi_Q (C_1 \pi_T \pi_V + C_2 \pi_E)$
− $\pi_L$: Learning factor (how mature the technology is)
− $\pi_Q$: Manufacturing process quality factor (0.25 to 20.00)
− $\pi_T$: Temperature factor (from 0.1 to 1000), proportional to $\exp(-E_a / kT)$, where $E_a$ is the activation energy in electron-volts associated with the technology, k is the Boltzmann constant and T is the temperature in Kelvin
− $\pi_V$: Voltage stress factor for CMOS devices (from 1 to 10 depending on the supply voltage and the temperature); does not apply to other technologies (set to 1)
− $\pi_E$: Environment shock factor: from about 0.4 (air-conditioned environment) to 13.0 (harsh environment, e.g., space, cars)
− $C_1$, $C_2$: Complexity factors; functions of the number of gates on the chip and the number of pins in the package

− Further details: MIL-HDBK-217E handbook


Mean Time to Failure (MTTF)
 MTTF: mean time to failure
− expected time until the first failure occurs

 MTTF is the expected value of the lifetime T:
$MTTF = E[T] = \int_0^{\infty} t \cdot f(t)\,dt$
 If the failure rate is a constant λ, then $R(t) = e^{-\lambda t}$ and
$MTTF = \int_0^{\infty} t \cdot \lambda e^{-\lambda t}\,dt = \int_0^{\infty} e^{-\lambda t}\,dt = \frac{1}{\lambda}$
 MTTF is defined in terms of reliability as:
$MTTF = \int_0^{\infty} R(t)\,dt$

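The identity MTTF = ∫ R(t) dt can be illustrated by numerical integration. This is a rough sketch only: the trapezoidal rule, the truncation point `t_max`, and the value of λ are assumptions chosen for illustration.

```python
import math

def mttf_numeric(reliability, t_max, steps=200_000):
    """Approximate the integral of R(t) over [0, t_max] with the trapezoidal rule.

    t_max must be large enough that R(t_max) is negligible.
    """
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h

lam = 0.002  # hypothetical constant failure rate, failures per hour
approx = mttf_numeric(lambda t: math.exp(-lam * t), t_max=40.0 / lam)
print(approx, 1.0 / lam)  # the two values should agree closely
```

The same helper accepts any reliability function, so it can also be used where no closed-form MTTF is available.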
MTTF
$R(t) = e^{-\lambda t}$
[Figure: plot of R(t) versus t, with the time axis marked at 1/λ, 2/λ and 3/λ]

MTTF

 MTTF is meaningful only for systems which operate without repair until they experience a failure
 Most mission-critical systems undergo a complete check-up before the next mission
− all failed redundant components are replaced
− system is returned to fully operational state

Probabilistic Models
 Hardware component failure rate function: experimentally observed bathtub curve
 Bathtub curve for failure rate
− implies constant failure rate during useful life
− infant mortality and wear-out periods have variable failure rates
[Figure: component failure rate versus time, showing infant mortality, useful life and wear-out; time axis marked at 20 weeks and 5-25 years]


Weibull Distribution - Introduction
 Most calculations of reliability assume that a module
has a constant failure rate λ (or equivalently - an
exponential distribution for the module lifetime T)
 There are cases in which this simplifying assumption
is inappropriate
 Example: during the "infant mortality" and "wear-out" phases of the bathtub curve
 Weibull distribution for the lifetime T can be used
instead

Weibull distribution - Equation
 The Weibull distribution has two parameters, λ and β
 The density function of the component lifetime T:
$f(t) = \lambda \beta t^{\beta - 1} e^{-\lambda t^{\beta}}$
 The failure rate for the Weibull distribution is
$\lambda(t) = \lambda \beta t^{\beta - 1}$
 λ(t) is decreasing with time for β<1, increasing with time for β>1, and constant for β=1; these cases are appropriate for the infant mortality, wear-out and middle phases, respectively

Reliability and MTTF for Weibull Distribution
 Reliability for Weibull distribution is
$R(t) = e^{-\lambda t^{\beta}}$
 MTTF for Weibull distribution is
$MTTF = \Gamma(1/\beta) / (\beta \lambda^{1/\beta})$
(Γ(x) is the Gamma function)
 The special case β = 1 is the exponential distribution with a constant failure rate λ
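As an illustration, the Weibull expressions above can be evaluated with the standard library's Gamma function. This sketch uses the (λ, β) parametrization of this slide; the parameter values are hypothetical.

```python
import math

def weibull_reliability(t, lam, beta):
    """R(t) = exp(-lam * t**beta)."""
    return math.exp(-lam * t ** beta)

def weibull_hazard(t, lam, beta):
    """lambda(t) = lam * beta * t**(beta - 1); use t > 0 when beta < 1."""
    return lam * beta * t ** (beta - 1.0)

def weibull_mttf(lam, beta):
    """MTTF = Gamma(1/beta) / (beta * lam**(1/beta))."""
    return math.gamma(1.0 / beta) / (beta * lam ** (1.0 / beta))

# beta = 1 reduces to the exponential case: MTTF = 1/lam
print(weibull_mttf(0.01, 1.0))
# beta > 1: hazard grows with time (wear-out); beta < 1: hazard shrinks (infant mortality)
print(weibull_hazard(10.0, 0.01, 2.0), weibull_hazard(100.0, 0.01, 2.0))
print(weibull_hazard(10.0, 0.01, 0.5), weibull_hazard(100.0, 0.01, 0.5))
```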

Mean Time to Repair (MTTR)
 The average time required to repair a system
 Difficult to calculate; usually determined experimentally
 Normally specified in terms of a repair rate µ, which is the average number of repairs that occur per time period (e.g., number of repairs per hour)
$MTTR = \frac{1}{\mu}$

MTTR

 A low MTTR requirement implies high operational cost
− if hardware spares are kept on site and the site is maintained 24 hours a day, MTTR = 30 min
− if the site is maintained 8 hours a day, 5 days a week, MTTR = 3 days
− if the system is remotely located, MTTR = 2 weeks

MTBF
 Reliability computation: mean time between failures (MTBF)
− use heuristic arguments to conclude MTBF = (total time T)/(average number of failures)
− can also argue MTBF = MTTF + MTTR
− Note: often λ << µ and hence MTTF >> MTTR, so the terms MTTF and MTBF are used interchangeably by some practitioners

Single Parameter Measures

MTBF = MTTF + MTTR (system with repair)
[Figure: timeline alternating MTTF and MTTR intervals, with the first and second failures marked]

Combinatorial Modeling

 System is divided into non-overlapping modules
 Each module is assigned either a probability of working, $P_i$, or a probability as a function of time, $R_i(t)$
 The goal is to derive the probability $P_{sys}$, or the function $R_{sys}(t)$: the probability that the system survives until time t
 Assumptions:
− module failures are independent
− once a module has failed, it is always assumed to yield incorrect
results
− System considered failed if it does not contain a minimal set of
functioning modules
− once system enters a failed state, other failures cannot return system
to functional state
 Models typically enumerate all the states of the system that meet or
exceed the requirements for a correctly functioning system
 Combinatorial counting techniques are used to simplify this process

Canonical Structures

 A canonical structure is constructed out of N individual modules
 The basic canonical structures are
− A series system
− A parallel system
− A mixed system
 We will assume statistical independence between failures in the individual modules

Reliability of a Series System
 A series system: a set of modules such that the failure of any one module causes the entire system to fail
 Reliability of a series system, $R_s(t)$, is the product of the reliabilities of its N modules
$R_s(t) = \prod_{i=1}^{N} R_i(t)$
 $R_i(t)$ is the reliability of module i

Series System – Modules Have Constant Failure Rates
 Every module i has a constant failure rate $\lambda_i$
$R_i(t) = e^{-\lambda_i t}$
$R_s(t) = e^{-\lambda_s t} = e^{-\sum_i \lambda_i t}$
 $\lambda_s = \sum_i \lambda_i$ is the constant failure rate of the series system
− Effect is summation of failure rates of components
 Mean Time To Failure of a series system:
$MTTF_s = \frac{1}{\lambda_s} = \frac{1}{\sum_i \lambda_i}$

Reliability of a Parallel System
 A parallel system: a set of modules connected so that all the modules must fail before the system fails
 Reliability of a parallel system, $R_p(t)$:
$R_p(t) = 1 - \prod_{i=1}^{N} [1 - R_i(t)]$
 $R_i(t)$ is the reliability of module i

Parallel System – Modules Have Constant Failure Rates
 Module i has a constant failure rate, $\lambda_i$
$R_i(t) = e^{-\lambda_i t}$
$R_p(t) = 1 - \prod_{i=1}^{N} [1 - e^{-\lambda_i t}]$
 Example: a parallel system with two modules
$R_p(t) = e^{-\lambda_1 t} + e^{-\lambda_2 t} - e^{-(\lambda_1 + \lambda_2) t}$
 MTTF of a parallel system whose N modules all have the same failure rate λ
$MTTF_p = \sum_{i=1}^{N} \frac{1}{i\lambda}$

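The series and parallel formulas can be sketched in a few lines of code. The helper names and the sample reliabilities and rates below are hypothetical, chosen only to illustrate the expressions; independence of module failures is assumed, as in the slides.

```python
def series_reliability(module_reliabilities):
    """R_s = product of module reliabilities (any single failure is fatal)."""
    result = 1.0
    for r in module_reliabilities:
        result *= r
    return result

def parallel_reliability(module_reliabilities):
    """R_p = 1 - product of module unreliabilities (all modules must fail)."""
    all_fail = 1.0
    for r in module_reliabilities:
        all_fail *= 1.0 - r
    return 1.0 - all_fail

def series_mttf(failure_rates):
    """MTTF_s = 1 / sum(lambda_i) for constant failure rates."""
    return 1.0 / sum(failure_rates)

def parallel_mttf_identical(lam, n):
    """MTTF_p = sum_{i=1..n} 1 / (i * lam) for n identical modules."""
    return sum(1.0 / (i * lam) for i in range(1, n + 1))

print(series_reliability([0.9, 0.9]))      # two modules in series
print(parallel_reliability([0.9, 0.9]))    # the same two modules in parallel
print(series_mttf([0.001, 0.002]))
print(parallel_mttf_identical(0.001, 3))
```

Note how the parallel MTTF grows only harmonically with the number of modules: each added module contributes 1/(iλ), a diminishing return.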
Series-Parallel Systems
 Consider combinations of series and parallel
systems
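 Example: two CPUs (a, c) connected to two memories (b, d) in different ways
− a-b in series, in parallel with c-d in series:
$R_{sys} = 1 - (1 - R_a R_b)(1 - R_c R_d)$
− a and c in parallel, in series with b and d in parallel:
$R_{sys} = [1 - (1 - R_a)(1 - R_c)][1 - (1 - R_b)(1 - R_d)]$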

A Simple Example
 Consider a dynamically redundant system with spares (dynamic redundancy)
 As soon as a fault occurs, the faulty component is replaced by a spare
 Up to n−1 spare modules
 $R_{sys} = 1 - (1 - R_1)(1 - R_2) \cdots (1 - R_n)$
[Figure: modules 1, 2, 3, ..., n]
 Consider identical modules with $R_i = 0.9$
 How can you increase $R_{sys}$ to $0.999999 = 1 - 10^{-6}$?
 Prob. of module i to survive = $R_i$
 Number of modules $n = \ln 10^{-6} / \ln(1 - R_i) = 6$
 Hence, 5 spares are needed to reach this system reliability
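The module count in this example can be reproduced with a short search. This is an illustrative sketch; `modules_needed` is a hypothetical helper, and the target is expressed as a maximum unreliability (10^-6 here) to avoid floating-point cancellation.

```python
def modules_needed(r_module, max_unreliability):
    """Smallest n such that (1 - r_module)**n <= max_unreliability.

    Assumes identical, independent modules with perfect switching,
    i.e. R_sys = 1 - (1 - r_module)**n.
    """
    n = 1
    # small relative tolerance so that exact ties are not lost to rounding
    while (1.0 - r_module) ** n > max_unreliability * (1.0 + 1e-12):
        n += 1
    return n

n = modules_needed(0.9, 1e-6)
print(n, "modules, i.e.", n - 1, "spares")
```

With R_i = 0.9 each module cuts the unreliability by a factor of ten, so six modules (five spares) reach 10^-6.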

Non Series/Parallel Systems

 Each path represents a configuration allowing the system to operate successfully, e.g., ADF
 The reliability can be calculated by expanding about a single module i:
$R_{system} = R_i \cdot \mathrm{Prob}\{\text{System works} \mid i \text{ is fault-free}\} + (1 - R_i) \cdot \mathrm{Prob}\{\text{System works} \mid i \text{ is faulty}\}$
 Draw two new diagrams: in (a) module i is operational; in (b) module i is faulty
 Module i is selected so that the two new diagrams are closer to simple series/parallel structures

Expanding about C

[Figure: (a) diagram with C operational; (b) diagram with C faulty]
 The process of expanding can be repeated until the
resulting diagrams are of the series/parallel type
 Figure (a) needs further expansion about E
 Figure (a) should not be viewed as a parallel connection of
A and B, connected serially to D and E in parallel. Such a
diagram will have the path BCDF which is not a valid path

Expanding about C and E

[Figure: diagrams (a) and (b)]

 $R_{system} = R_C \cdot \mathrm{Prob}\{\text{System works} \mid C \text{ is operational}\} + (1 - R_C) R_F [1 - (1 - R_A R_D)(1 - R_B R_E)]$
 Expanding about E yields
 $\mathrm{Prob}\{\text{System works} \mid C \text{ is operational}\} = R_E R_F [1 - (1 - R_A)(1 - R_B)] + (1 - R_E) R_A R_D R_F$
 Substituting results in
 $R_{system} = R_C [R_E R_F (R_A + R_B - R_A R_B) + (1 - R_E) R_A R_D R_F] + (1 - R_C) [R_F (R_A R_D + R_B R_E - R_A R_D R_B R_E)]$
 Example: $R_A = R_B = R_C = R_D = R_E = R_F = R$
$R_{system} = R^3 (R^3 - 3R^2 + R + 2)$

Upper Bound on Reliability
 If structure is too complicated - derive upper and lower
bounds on Rsystem
 An upper bound: $R_{system} \le 1 - \prod_i (1 - R_{path,i})$
− $R_{path,i}$: reliability of the modules in series along path i
− Assuming all paths are in parallel
 Example: the paths are ADF, BEF and ACEF
 $R_{system} \le 1 - (1 - R_A R_D R_F)(1 - R_B R_E R_F)(1 - R_A R_C R_E R_F)$
 If $R_A = R_B = R_C = R_D = R_E = R_F = R$ then
$R_{system} \le R^3 (R^7 - 2R^4 - R^3 + R + 2)$

Lower Bound on Reliability
 A lower bound is calculated based on the minimal cut sets of the system diagram
 A minimal cut set: a minimal list of modules such that the removal (due to a fault) of all of them will cause a working system to fail
 Minimal cut sets: F, AB, AE, DE and BCD
 The lower bound is $R_{system} \ge \prod_i (1 - Q_{cut,i})$
− $Q_{cut,i}$: probability that minimal cut i is faulty (i.e., all its modules are faulty)
 Example: $R_A = R_B = R_C = R_D = R_E = R_F = R$
$R_{system} \ge R^5 (24 - R^5 + 9R^4 - 33R^3 + 62R^2 - 60R)$

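The exact result and both bounds for the six-module example can be cross-checked by brute force. The sketch below assumes the system works exactly when at least one of the listed paths (ADF, BEF, ACEF) is fully operational, and that all modules share the same reliability R; the function names are illustrative.

```python
from itertools import product

MODULES = "ABCDEF"
PATHS = ("ADF", "BEF", "ACEF")              # success paths listed in the text
CUT_SETS = ("F", "AB", "AE", "DE", "BCD")   # minimal cut sets listed in the text

def exact_reliability(R):
    """Sum the probability of every module state in which some path is intact."""
    total = 0.0
    for states in product((True, False), repeat=len(MODULES)):
        up = dict(zip(MODULES, states))
        if any(all(up[m] for m in path) for path in PATHS):
            n_up = sum(states)
            total += R ** n_up * (1.0 - R) ** (len(MODULES) - n_up)
    return total

def closed_form(R):
    """Result of expanding about C and E."""
    return R**3 * (R**3 - 3 * R**2 + R + 2)

def upper_bound(R):
    """1 - prod(1 - R_path): treats the paths as independent parallel branches."""
    bound = 1.0
    for path in PATHS:
        bound *= 1.0 - R ** len(path)
    return 1.0 - bound

def lower_bound(R):
    """prod(1 - Q_cut): treats the minimal cut sets as independent series blocks."""
    bound = 1.0
    for cut in CUT_SETS:
        bound *= 1.0 - (1.0 - R) ** len(cut)
    return bound

for R in (0.5, 0.9, 0.99):
    print(R, lower_bound(R), exact_reliability(R), upper_bound(R))
```

Enumerating all 2^6 states is only practical for small diagrams, which is why the bounds are useful for larger structures.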
Example – Comparison of Bounds

 Example - RA=RB=RC=RD=RE=RF=R
 Lower bound here is a very good estimate for a
high-reliability system

Reliability Block Diagram
[Figures: reliability block diagrams, five slides]

M-out-of-N Systems

 Static or masking redundancy
 Consider a TMR (Triple Modular Redundancy) system (2-out-of-3)
− Out of 3 modules, two need to function for the system to operate correctly
$R_{TMR} = R_m^3 + \binom{3}{2} R_m^2 (1 - R_m)$
[Figure: modules A, B and C feeding a voter V]
 For a general M-out-of-N system
− Out of N modules, M need to function
$R_{MN} = \sum_{i=0}^{N-M} \binom{N}{i} R_m^{N-i} (1 - R_m)^i$
[Figure: state chain N working, N-1 working, N-2 working, ..., N-M working, Failed]
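The M-out-of-N sum translates directly into code. A minimal sketch, assuming identical independent modules and a perfect voter; `m_of_n_reliability` is a hypothetical name.

```python
from math import comb

def m_of_n_reliability(m, n, r_module):
    """Probability that at least m of n identical modules work.

    The index i counts failed modules, from 0 up to n - m.
    """
    return sum(
        comb(n, i) * r_module ** (n - i) * (1.0 - r_module) ** i
        for i in range(n - m + 1)
    )

print(m_of_n_reliability(2, 3, 0.9))  # TMR: R^3 + 3 R^2 (1 - R)
print(m_of_n_reliability(3, 5, 0.9))  # 3-out-of-5
```

Setting m = 1 gives a pure parallel system and m = n a pure series system, so the two canonical structures are the extreme cases of this formula.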

Cascading TMR Systems
 Consider n stages of the original system
 Each stage is replaced by TMR with a voter
 Reliability of the system:
$R_{cascade} = \left[ R_V \left( R_m^3 + \binom{3}{2} R_m^2 (1 - R_m) \right) \right]^n$

Pitfalls Using Single Model
 Compare reliability of simplex and TMR systems

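$R_{simplex}(t) = e^{-\lambda t}$
$MTTF_{simplex} = \int_0^{\infty} e^{-\lambda t}\,dt = 1/\lambda$
$R_{TMR}(t) = e^{-3\lambda t} + \binom{3}{2} e^{-2\lambda t} (1 - e^{-\lambda t})$
$MTTF_{TMR} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$
$MTTF_{simplex} > MTTF_{TMR}$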

Pitfalls Using Single Model (cont.)
[Figure: reliability of TMR and simplex versus λt, for λt from 0 to 5]
$R_{TMR}(t) \ge R(t)$ for $0 \le t \le t_0$
$R_{TMR}(t) \le R(t)$ for $t_0 \le t < \infty$
where $t_0 = \frac{\ln 2}{\lambda} \approx \frac{0.7}{\lambda}$

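The crossover between TMR and simplex can be reproduced numerically. This sketch assumes a perfect voter and a hypothetical failure rate; it evaluates both reliability curves around t0 = ln 2 / λ and compares the two MTTF values.

```python
import math

def r_simplex(t, lam):
    """Single module with constant failure rate."""
    return math.exp(-lam * t)

def r_tmr(t, lam):
    """2-out-of-3 system of identical modules, perfect voter assumed."""
    r = math.exp(-lam * t)
    return r**3 + 3 * r**2 * (1.0 - r)

lam = 1e-3                    # hypothetical failure rate per hour
t0 = math.log(2.0) / lam      # crossover time from the slide
mttf_simplex = 1.0 / lam
mttf_tmr = 5.0 / (6.0 * lam)

for t in (0.1 * t0, t0, 3.0 * t0):
    print(t, r_tmr(t, lam), r_simplex(t, lam))
print(mttf_tmr, mttf_simplex)
```

Before t0 the TMR curve lies above the simplex curve even though its MTTF is smaller, which is why MTTF alone is a misleading figure of merit for short missions.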
Mission Time
 Suppose we wish to have a highly reliable system with reliability r > 0.9999; the mission time MT(r) gives the time at which system reliability falls below r
 R(MT(r)) = r
 MT(R(t)) = t
 For a constant failure rate
$R(t) = e^{-\lambda t} = r$
$MT(r) = \frac{-\ln r}{\lambda}$
 For a nonredundant system with n components
$MT(r) = \frac{-\ln r}{\sum_{i=1}^{n} \lambda_i}$
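A short sketch of the mission-time formula for a nonredundant system follows. The component failure rates are hypothetical values picked for illustration.

```python
import math

def mission_time(r, failure_rates):
    """MT(r) = -ln(r) / sum(lambda_i) for a series (nonredundant) system."""
    return -math.log(r) / sum(failure_rates)

rates = [1e-5, 2e-5, 5e-6]        # hypothetical per-hour failure rates
mt = mission_time(0.9999, rates)
print(mt)                          # hours until reliability drops to 0.9999
print(math.exp(-sum(rates) * mt))  # plugging MT(r) back into R(t) recovers r
```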

Availability
 Availability A(t): the probability that the system is operating at time t
 Steady-state availability: the proportion of total time during which the system is operational
 Steady-state availability with constant λ and µ:
$\frac{MTTF}{MTBF} = \frac{MTTF}{MTTF + MTTR} = \frac{\mu}{\lambda + \mu}$
 $MTTF = \frac{1}{\lambda}$
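The two forms of the steady-state availability can be compared directly. This sketch uses MTTF = 1/λ and MTTR = 1/µ from the earlier slides; the numeric values are hypothetical.

```python
def availability_from_times(mttf, mttr):
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def availability_from_rates(lam, mu):
    """Steady-state availability = mu / (lam + mu)."""
    return mu / (lam + mu)

lam = 1e-3   # hypothetical failure rate: one failure per 1000 hours
mu = 0.5     # hypothetical repair rate: one repair per 2 hours
print(availability_from_times(1.0 / lam, 1.0 / mu))
print(availability_from_rates(lam, mu))
```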

Comparative Measures

 Reliability difference: $R_{new}(t) - R_{old}(t)$
 Reliability gain: $R_{new}(t) / R_{old}(t)$, e.g., 0.9999 / 0.999
 Reliability improvement factor (RIF): $RIF = \frac{1 - R_{old}(t)}{1 - R_{new}(t)}$
 Mission time (MT) improvement: $MT_{new}(r) / MT_{old}(r)$

Pitfalls Using Single Model (cont.)
 Instead of MTTF, look at mission time
 Reliability of M-out-of-N systems is very high in the beginning
− spare components tolerate failures
 Reliability falls sharply toward the end
− the system has exhausted its redundancy, and there is more hardware that can fail
 Such systems are useful in aircraft control
− very high reliability, short time
− 0.99999 over a 10-hour period
 Improvements over "vanilla" TMR: TMR with recovery, TMR-Simplex

Effect of Coverage

 Failure detection is not perfect
 Reconfiguration may not succeed
 Attach a coverage "c"
 One-spare system:
$R_{sys} = R_1 + c (1 - R_1) R_2$
 System with n−1 spares:
$R_{sys} = R_m \sum_{i=0}^{n-1} c^i (1 - R_m)^i$
[Figure: modules 1, 2, 3, ..., n]

Effect of Coverage (cont.)
 If coverage is 100%, then even with low module reliability, system reliability can be increased arbitrarily
 With low coverage, reliability saturates

                  Rm = 0.9   Rm = 0.7   Rm = 0.5
C=0.99, n=2       0.989      0.908      0.748
C=0.99, n=4       0.999      0.988      0.931
C=0.99, n=inf     0.999      0.996      0.990
C=0.8,  n=2       0.972      0.868      0.700
C=0.8,  n=4       0.978      0.918      0.812
C=0.8,  n=inf     0.978      0.921      0.833
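The table entries follow from the coverage formula for a system with n modules (n−1 spares). The sketch below recomputes them; the closed-form limit for n → ∞ is the geometric-series sum, stated here as a derived assumption rather than something given on the slide.

```python
def spare_system_reliability(r_module, coverage, n):
    """R_sys = R_m * sum_{i=0}^{n-1} (c * (1 - R_m))**i for n modules in total."""
    x = coverage * (1.0 - r_module)
    return r_module * sum(x ** i for i in range(n))

def spare_system_limit(r_module, coverage):
    """Limit for n -> infinity: geometric series, R_m / (1 - c * (1 - R_m))."""
    return r_module / (1.0 - coverage * (1.0 - r_module))

for c in (0.99, 0.8):
    for n in (2, 4):
        row = [spare_system_reliability(rm, c, n) for rm in (0.9, 0.7, 0.5)]
        print(f"C={c}, n={n}:", [f"{v:.3f}" for v in row])
    limit_row = [spare_system_limit(rm, c) for rm in (0.9, 0.7, 0.5)]
    print(f"C={c}, n=inf:", [f"{v:.3f}" for v in limit_row])
```

The limit makes the saturation explicit: with imperfect coverage, adding spares can never push the reliability above R_m / (1 − c(1 − R_m)).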

Effect of Voter

 The previous expression for reliability assumed that the voter is 100% reliable
 Assume voter reliability $R_V$
$R_{TMRV} = R_V \left( R_m^3 + \binom{3}{2} R_m^2 (1 - R_m) \right)$

Compensating & Non-overlapping Faults
 Conservative assumption - every failure of voter leads
to an erroneous output and any failure of two modules
is fatal
 Counter Example - one module produces a permanent
logical 1 and a second module has a permanent
logical 0 - TMR will function properly
− These are compensating faults
 A similar situation may arise regarding certain faults
within the voter circuit
 Another example - non-overlapping faults - one
module has a faulty adder and another module has a
faulty multiplier
 If the circuits are disjoint, they are unlikely to generate
wrong outputs simultaneously

Voters
 A voter receives inputs X1, X2,...,XN from an M-of-N
cluster and generates a representative output
 Simplest voter - bit-by-bit comparison of the outputs
producing the majority vote
 This only works when all functional processors
generate outputs that match bit by bit
− Processors must be identical, be synchronized and
use the same software
 Otherwise - two correct outputs can diverge slightly, in
the lower significant bits

Active/Dynamic Redundancy

 In the previous examples, considerable extra hardware was used to instantaneously mask errors
 In many cases, temporary erroneous results may be acceptable if the system can
− detect an error
− replace the faulty module by a fault-free spare
− reconfigure itself
 This is called dynamic (or active) redundancy
[Figure: example of a dynamic redundancy configuration]

Reliability of Dynamic Redundancy - Powered Spares
 All N spare modules are active (powered) and have the same failure rate, resulting in a basic parallel system with N+1 modules
 System reliability is
$R_{dynamic}(t) = R_{dru}(t) [1 - (1 - R(t))^{N+1}]$
− R(t): reliability of a module
− $R_{dru}(t)$: reliability of the Detection & Reconfiguration unit

Dynamic Redundancy with Unpowered (Standby) Spares
 Spare modules are not powered (e.g., to conserve energy) and cannot fail until they become active
 C: coverage factor, the probability that a faulty active module is correctly diagnosed and disconnected, and a good spare successfully connected
 Calculating exact reliability for the general case is complicated
 Reliability for a special case:
− Very large N; constant failure rate λ per active module
− Rate of nonrecoverable faults is (1−C)λ
− Reliability at time t: probability of no nonrecoverable faults up to time t
$R_{dynamic}(t) = R_{dru}(t)\, e^{-(1-C)\lambda t}$

Hybrid Redundancy
 NMR masks permanent and intermittent failures but its
reliability drops below that of a single module for very
long mission times
 Hybrid redundancy overcomes this by adding spare
modules to replace active modules once they become
faulty
 A hybrid system consists of a core of N processors (NMR) and K spares

Hybrid Redundancy - Reliability
 Reliability of a hybrid system with a TMR core and K
spares is
$R_{hybrid}(t) = R_{vot}(t) R_{rec}(t) [1 - m R(t) (1 - R(t))^{m-1} - (1 - R(t))^m]$
− m = K+3: total number of modules
− $R_{vot}(t)$ and $R_{rec}(t)$: reliability of the voter and of the comparison & reconfiguration circuitry
− Assuming: any fault in voter or comparison &
reconfiguration circuit will cause a system fault
 In practice, not all faults in these circuits will be fatal:
the reliability will be higher

Duplex Systems

 Both processors execute the same task
− If the outputs are in agreement, the result is assumed to be correct
− If the results are different, we cannot identify the failed processor
− Higher-level software has to decide how the failure is to be handled
− This can be done using one of several methods

Duplex Reliability
 Two active identical processors with reliability R(t)
 Lifetime of duplex - time until both processors fail
 C - Coverage Factor - probability that a faulty processor
will be correctly diagnosed, identified and disconnected
 Rduplex(t) - the reliability of duplex system:

$R_{duplex}(t) = R_{comp}(t) [R^2(t) + 2C R(t)(1 - R(t))]$
Rcomp(t) - reliability of comparator

Duplex - Constant Failure Rates

 Each processor has a constant failure rate λ
 Ideal comparator: $R_{comp}(t) = 1$
 Duplex reliability:
$R_{duplex}(t) = e^{-2\lambda t} + 2C e^{-\lambda t} (1 - e^{-\lambda t})$
 $MTTF_{duplex} = \frac{1}{2\lambda} + \frac{C}{\lambda}$
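The duplex MTTF can be checked against a numerical integral of the reliability function. This sketch assumes an ideal comparator and uses hypothetical values for λ and C; the trapezoidal integration and its truncation point are illustrative choices.

```python
import math

def duplex_reliability(t, lam, c):
    """R_duplex(t) = e^{-2 lam t} + 2 c e^{-lam t} (1 - e^{-lam t})."""
    r = math.exp(-lam * t)
    return r * r + 2.0 * c * r * (1.0 - r)

def duplex_mttf(lam, c):
    """Closed form: 1 / (2 lam) + c / lam."""
    return 1.0 / (2.0 * lam) + c / lam

def integrate(f, t_max, steps=100_000):
    """Trapezoidal rule on [0, t_max]."""
    h = t_max / steps
    total = 0.5 * (f(0.0) + f(t_max))
    for k in range(1, steps):
        total += f(k * h)
    return total * h

lam, c = 1e-3, 0.95   # hypothetical failure rate and coverage
numeric = integrate(lambda t: duplex_reliability(t, lam, c), t_max=40.0 / lam)
print(numeric, duplex_mttf(lam, c))
```

With C = 0 the duplex MTTF is 1/(2λ), half that of a single processor, and it exceeds 1/λ only when C > 0.5.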

More Complex Systems
 NMR systems in which failing processors are identified
and replaced from an infinite pool of spares - similar
calculation to duplex
 Finite set of spares - the summation in the reliability
derivation is capped at that number of spares, rather
than going to infinity
 Other variations of duplex systems -
− One processor is active while the second is a standby spare
− Processors can be repaired when they become faulty
 Combinatorial arguments may be insufficient for
reliability calculation in more complex systems
 If failure rates are constant, we can use Markov Models
for reliability calculations

Markov Approach

Who was Markov?
 Andrei A. Markov graduated from Saint
Petersburg University in 1878 and subsequently
became a professor there.
 His early work dealt mainly in number theory and
analysis, etc.
 Markov is particularly remembered for his study
of Markov chains.
 These chains are sequences of random variables
in which the future variable is determined by the
present variable but is independent of the way in
which the present state arose from its
predecessors.

Markov Chains - Introduction
 Markov Models provide a structured approach for the
derivation of the reliability of complex systems
 A Markov Chain is a stochastic process X(t) - an infinite
sequence of random variables indexed by time t , with a
special probabilistic structure
 For a stochastic process to be a Markov Chain, its future
behavior must depend only on its present state, and not on
any past state
 X(t+s) depends on X(t), but given X(t), X(t+s) does not
depend on any X(τ) for τ < t
 If X(t)=i - the chain is in state i at time t
 We deal only with Markov Chains with continuous time
(0≤t≤ ∞ ) and discrete state (X(t)=0,1,2,…)

Markov Chain - Probabilistic Interpretation
 Prob{X(t+s)=j | X(t)=i,X(τ)=k} =
Prob{X(t+s)=j | X(t)=i} (τ<t)
 Once the chain moves into state i, it stays there for a
length of time which has an exponential distribution with
parameter λi - it has a constant rate λi of leaving state i
 The probability that when leaving state i the chain will
move to state j (with j≠i) - Pij
 Transition rate from state i to state j is $\lambda_{ij} = P_{ij} \lambda_i$
$\sum_{j \ne i} P_{ij} = 1, \quad \sum_{j \ne i} \lambda_{ij} = \lambda_i$

Markov chains
 Markov chains
− illustrated by state transition diagrams
 idea:
− states
» components working or not
− state transitions
» when components fail or get repaired

Single-component system, no repair
 Only two states
− one operational (state 1) and one failed (state 2)
− if no repair is allowed, there is a single, non-reversible transition between the states (used in reliability analysis)
− label λ corresponds to the failure rate of the component

Single-component system with repair

 If repair is allowed (used in availability analysis)
− then a transition from the failed state back to the operational state is possible
− the label is the repair rate µ

Failed-safe and failed-unsafe
 In safety analysis, we need to distinguish between failed-safe and failed-unsafe states
− let state 2 be a failed-safe state and state 3 be a failed-unsafe state
− the transition between states 1 and 2 depends on the failure rate and on the probability that, if a fault occurs, it is detected and handled appropriately (i.e., the fault coverage C)
− if C is the probability that a fault is detected, 1−C is the probability that a fault is not detected

Two-component system

 Has four possible states (O = operational, F = failed)

Component 1   Component 2   State
O             O             1
F             O             2
O             F             3
F             F             4

 Components are assumed to be independent and non-repairable
 If components are in series
− state 1 is the operational state, states 2, 3, 4 are failed states
 If components are in parallel
− states 1, 2, 3 are operational states, state 4 is the failed state

State transition diagram simplification

 Suppose two components are in parallel
 Suppose λ1 = λ2 = λ
 Then it is not necessary to distinguish between states 2 and 3
− both represent a condition where one component is operational and one has failed
− since the component failures are independent events, the transition rate from state 1 to the merged state 2 is the sum of the two transition rates, 2λ
[Figure: simplified state transition diagram with states 1, 2 and 3]

Model Selection
 There is a wide range of models available; each has its strengths and weaknesses.
 Combinatorial models (reliability block diagrams,
fault trees) are straightforward and easy to
understand.
 However, it is not easy to represent
non-independent behavior using combinatorial
models.

Markov: Pros and Cons

 Markov models provide flexibility for modeling reliability, safety, performance and combined measures.
 The state space grows much faster than the number of components, making model specification and analysis difficult.

What is Markov Analysis?

 Markovian property: given the present state, the future is independent of the past.
 Definition of a Markov process: a stochastic process {X(t) | t ∈ T} is called a Markov process if, for any t0 < t1 < ... < tn < t, the conditional distribution of X(t) for given values of X(t0), X(t1), ..., X(tn) depends only on X(tn).

Markov chain analysis
 The aim is to compute Pi(t), the probability that
the system is in the state i at time t
 Once Pi(t) is known, the reliability, availability or
safety of the system can be computed as a sum
taken over all operating states
 To compute Pi(t), we derive a set of differential
equations, called state transition equations, one
for each state of the system

Transition matrix
 State transition equations are usually
presented in matrix form
 Transition matrix M has entries $m_{ij}$, representing the rates of transition between the states i and j
 index i is the column index
 index j is the row index

Single-component system, no repair

 Transition matrix M has the form:

 entries in each column must sum to 0
 entries $m_{ii}$, corresponding to self-transitions, are computed as −(sum of the other entries in the same column)

Single-component system with repair

 Transition matrix M has the form:

Single-component system, safety analysis

 Transition matrix M has the form:

Two-component parallel system

 Transition matrix M has the form:

Important properties of matrix M

 The sum of the entries in each column is 0
 A positive sign of the ij-th entry indicates that the transition originates from the i-th state
 In reliability analysis, M allows us to distinguish between the operational and failed states
– each failed state i has a zero diagonal element $m_{ii}$ (a failed state cannot be left)

State transition equations

 Let P(t) be a vector whose i-th element is $P_i(t)$, the probability that the system is in state i at time t
 The matrix representation of a system of state transition equations is given by
$\frac{d}{dt} P(t) = M \cdot P(t)$

Two-component parallel system
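 Using the transition matrix derived earlier (three states, with λ1 = λ2 = λ), we get:
$\frac{d}{dt} \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix} = \begin{bmatrix} -2\lambda & 0 & 0 \\ 2\lambda & -\lambda & 0 \\ 0 & \lambda & 0 \end{bmatrix} \cdot \begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix}$
This represents the following system of equations
$\frac{d}{dt} P_1(t) = -2\lambda P_1(t)$
$\frac{d}{dt} P_2(t) = 2\lambda P_1(t) - \lambda P_2(t)$
$\frac{d}{dt} P_3(t) = \lambda P_2(t)$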

Solving state transition equations
 By solving these equations, we get
$P_1(t) = e^{-2\lambda t}$
$P_2(t) = 2e^{-\lambda t} - 2e^{-2\lambda t}$
$P_3(t) = 1 - 2e^{-\lambda t} + e^{-2\lambda t}$
 Since the $P_i(t)$ are known, we can compute the reliability of the system as a sum of probabilities taken over all operating states
$R_{parallel}(t) = P_1(t) + P_2(t) = 2e^{-\lambda t} - e^{-2\lambda t}$
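These closed-form state probabilities can be cross-checked by integrating dP/dt = M·P numerically. The sketch below is an assumption-laden illustration: the 3-state matrix is inferred from the stated solution and the column-sum rule (state 1 leaves at rate 2λ, state 2 at rate λ, state 3 is absorbing), and a hand-written fourth-order Runge-Kutta integrator stands in for whatever solver one would normally use.

```python
import math

def mat_vec(M, p):
    """Row-by-row product M * p (rows are destination states)."""
    return [sum(m * x for m, x in zip(row, p)) for row in M]

def integrate(M, p0, t_end, steps=2000):
    """Classical RK4 integration of dP/dt = M * P from 0 to t_end."""
    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = mat_vec(M, p)
        k2 = mat_vec(M, [x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = mat_vec(M, [x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = mat_vec(M, [x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6.0 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p

lam = 0.01   # hypothetical failure rate
t = 100.0
M = [
    [-2 * lam, 0.0, 0.0],   # state 1: both components working
    [2 * lam, -lam, 0.0],   # state 2: one component working
    [0.0, lam, 0.0],        # state 3: both failed (absorbing)
]
p1, p2, p3 = integrate(M, [1.0, 0.0, 0.0], t)
print(p1, math.exp(-2 * lam * t))
print(p2, 2 * math.exp(-lam * t) - 2 * math.exp(-2 * lam * t))
print(p1 + p2, 2 * math.exp(-lam * t) - math.exp(-2 * lam * t))
```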

Comparison to RBD result

 Since $R = e^{-\lambda t}$, the previous equation can be
written as
\[ R_{\text{parallel}}(t) = 2R - R^2 \]
 which agrees with the expression derived using
RBD
 the two results are the same because we assumed
that the failures of the two components are
independent
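
For reference, the RBD side of this comparison can be written out in one line. This is a sketch assuming the usual parallel-structure rule (the system fails only if both independent components fail); it is not quoted from the slides.

```latex
\[
R_{\text{parallel}}(t) = 1 - (1 - R)^2 = 2R - R^2
  = 2e^{-\lambda t} - e^{-2\lambda t},
\qquad R = e^{-\lambda t}
\]
```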

Dependent component case
 The value of Markov chains becomes evident
when component failures cannot be assumed to
be independent
− load-sharing components
− examples: electrical load, mechanical load, information
load
 If two components share the same load and one
fails, the additional load on the second
component increases its failure rate

Parallel system with load sharing

 As before, we have four states, but after the first
component failure, the failure rate of the remaining
component increases

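Parallel system with load sharing

 State transition equations are:

\[
\frac{d}{dt}
\begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \\ P_4(t) \end{bmatrix}
=
\begin{bmatrix}
-\lambda_1-\lambda_2 & 0 & 0 & 0 \\
\lambda_1 & -\lambda'_2 & 0 & 0 \\
\lambda_2 & 0 & -\lambda'_1 & 0 \\
0 & \lambda'_2 & \lambda'_1 & 0
\end{bmatrix}
\cdot
\begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \\ P_4(t) \end{bmatrix}
\]

\[
\begin{aligned}
\frac{d}{dt} P_1(t) &= (-\lambda_1-\lambda_2)\,P_1(t) \\
\frac{d}{dt} P_2(t) &= \lambda_1 P_1(t) - \lambda'_2 P_2(t) \\
\frac{d}{dt} P_3(t) &= \lambda_2 P_1(t) - \lambda'_1 P_3(t) \\
\frac{d}{dt} P_4(t) &= \lambda'_2 P_2(t) + \lambda'_1 P_3(t)
\end{aligned}
\]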
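
The four equations above can be integrated numerically to see how the increased post-failure rates affect reliability. This is a minimal sketch: the function name `load_sharing_reliability`, the RK4 scheme, and the numeric rates (including the choice of doubling the rate after the first failure) are assumptions for illustration only.

```python
import math


def load_sharing_reliability(l1, l2, l1p, l2p, t, steps=20_000):
    """Integrate the four state equations with RK4; return P1 + P2 + P3."""

    def deriv(p):
        p1, p2, p3, p4 = p
        return [-(l1 + l2) * p1,
                l1 * p1 - l2p * p2,
                l2 * p1 - l1p * p3,
                l2p * p2 + l1p * p3]

    def shifted(p, k, h):
        return [x + h * dx for x, dx in zip(p, k)]

    h = t / steps
    p = [1.0, 0.0, 0.0, 0.0]  # start with both components working
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(shifted(p, k1, h / 2))
        k3 = deriv(shifted(p, k2, h / 2))
        k4 = deriv(shifted(p, k3, h))
        p = [x + h / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p[0] + p[1] + p[2]  # states 1, 2 and 3 are operational


lam, t = 0.01, 100.0  # assumed values, for illustration only
independent = load_sharing_reliability(lam, lam, lam, lam, t)
loaded = load_sharing_reliability(lam, lam, 2 * lam, 2 * lam, t)
print("no extra load :", independent)
print("closed form   :", 2 * math.exp(-lam * t) - math.exp(-2 * lam * t))
print("doubled rates :", loaded)  # lower than the independent case
```

With unchanged rates the result matches the independent-failure formula; with higher post-failure rates the reliability drops.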

Effect of the load

 If λ'1 = λ1 = λ and λ'2 = λ2 = λ, the reliability of the
load-sharing parallel system reduces to the well-known
\[ R_{\text{parallel}}(t) = 2e^{-\lambda t} - e^{-2\lambda t} \]

Availability evaluation
 Difference with reliability analysis:
− in reliability analysis components are allowed to be
repaired as long as the system has not failed
− in availability analysis components can also be repaired
after the system failure

Two-component standby system

 First component is primary


 Second is held in reserve and only brought
to operation if the first component fails
 We assume that
− fault detection unit which detects failure of the
primary component is perfect
− standby component cannot fail while in the
standby mode



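State transition diagram for reliability
analysis with repair

state 1: both OK
state 2: primary failed and replaced by spare
state 3: both failed

Transitions: 1 → 2 at rate λ1, 2 → 1 at rate µ, 2 → 3 at rate λ2

\[
M = \begin{bmatrix}
-\lambda_1 & \mu & 0 \\
\lambda_1 & -\lambda_2-\mu & 0 \\
0 & \lambda_2 & 0
\end{bmatrix}
\]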



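State transition diagram for availability
analysis with repair

States are the same. Repair replaces a broken
component by a working one. Here we assume that there is
only one repair team.

Transitions: 1 → 2 at rate λ1, 2 → 1 at rate µ, 2 → 3 at rate λ2, 3 → 2 at rate µ

\[
M = \begin{bmatrix}
-\lambda_1 & \mu & 0 \\
\lambda_1 & -\lambda_2-\mu & \mu \\
0 & \lambda_2 & -\mu
\end{bmatrix}
\]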



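State transition diagram for availability
analysis with repair

If we assume that there are two independent repair teams,
then µ on the edge from 3 to 2 gets the coefficient 2 (the rate
doubles).

Transitions: 1 → 2 at rate λ1, 2 → 1 at rate µ, 2 → 3 at rate λ2, 3 → 2 at rate 2µ

\[
M = \begin{bmatrix}
-\lambda_1 & \mu & 0 \\
\lambda_1 & -\lambda_2-\mu & 2\mu \\
0 & \lambda_2 & -2\mu
\end{bmatrix}
\]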



Availability analysis

 None of the diagonal elements of M is 0
 By solving the system, we can get Pi(t) and
compute the availability as a sum of probabilities
taken over all operating states
 Usually the steady-state availability, rather than the
time-dependent one, is of interest
 As time approaches infinity, the derivative on the
left-hand side of the equation d/dt P(t) = M · P(t)
vanishes and we get the time-independent
relationship M · P(∞) = 0



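Two-component standby system

 Using the transition matrix derived earlier,
we get the following system of equations
\[
\begin{aligned}
-\lambda_1 P_1(\infty) + \mu P_2(\infty) &= 0 \\
\lambda_1 P_1(\infty) - (\lambda_2+\mu) P_2(\infty) + \mu P_3(\infty) &= 0 \\
\lambda_2 P_2(\infty) - \mu P_3(\infty) &= 0
\end{aligned}
\]
 By solving the equations together with
P1(∞) + P2(∞) + P3(∞) = 1, and taking λ1 = λ2 = λ
with λ ≪ µ, we get
\[ A(\infty) \approx 1 - (\lambda/\mu)^2 \]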
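
The steady-state equations can be solved by hand-substitution, and the exact result compared with the approximation above. The sketch below assumes the single-repair-team model; the function name `standby_availability` and the numeric rates are illustrative assumptions.

```python
def standby_availability(lam1, lam2, mu):
    """Exact steady-state availability of the standby system (one repair team)."""
    # Unnormalised solution of the balance equations, taking P1 = 1.
    p1 = 1.0
    p2 = lam1 / mu * p1   # from -lam1*P1 + mu*P2 = 0
    p3 = lam2 / mu * p2   # from  lam2*P2 - mu*P3 = 0
    total = p1 + p2 + p3  # normalisation: probabilities must sum to 1
    return (p1 + p2) / total  # states 1 and 2 are operational


lam, mu = 0.001, 0.1  # assumed rates, for illustration only
exact = standby_availability(lam, lam, mu)
approx = 1 - (lam / mu) ** 2
print(exact, approx)
```

For λ/µ = 0.01 the two values differ by roughly one part in a million, which is why the approximation is usually acceptable when repair is much faster than failure.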



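Safety evaluation

state 1: operational
state 2: fail-safe
state 3: failed unsafe

Transitions: 1 → 2 at rate λC, 1 → 3 at rate λ(1-C)

 The state transition equations are:

\[
\frac{d}{dt}
\begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix}
=
\begin{bmatrix}
-\lambda & 0 & 0 \\
\lambda C & 0 & 0 \\
\lambda(1-C) & 0 & 0
\end{bmatrix}
\cdot
\begin{bmatrix} P_1(t) \\ P_2(t) \\ P_3(t) \end{bmatrix}
\]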



Safety evaluation

 By solving these equations, we get
\[ P_1(t) = e^{-\lambda t} \]
\[ P_2(t) = C\,(1 - e^{-\lambda t}) \]
\[ P_3(t) = (1-C) - (1-C)\,e^{-\lambda t} \]
 Since the Pi(t) are known, we can compute the safety
of the system as a sum of probabilities of being in the
operational and fail-safe states
\[ S(t) = P_1(t) + P_2(t) = C + (1-C)\,e^{-\lambda t} \]
 At time t=0, the safety is 1. As time approaches infinity,
the safety approaches C
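
A short sketch of the safety formula and its two limits. The function name `safety` and the numeric values of `lam` and `C` are assumptions for illustration.

```python
import math


def safety(lam, c, t):
    """Probability of being operational (state 1) or fail-safe (state 2)."""
    p1 = math.exp(-lam * t)   # still operational
    p2 = c * (1.0 - p1)       # failed, but the failure was covered
    return p1 + p2


lam, C = 0.001, 0.95  # assumed failure rate and coverage, illustration only
print(safety(lam, C, 0.0))    # 1.0
print(safety(lam, C, 1e6))    # approaches C for large t
```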



How to deal with cases of systems
with “k out of n choices”
 Suppose we want to solve the following task:
What is the probability that more than two engines in a 4-
engine airplane will fail during a t-hour flight if the failure
rate of a single engine is λ per hour?
 The probability that more than two engines fail
can be expressed as:
\[ P(\text{more than 2 failed}) = P(\text{exactly 3 failed}) + P(\text{all 4 failed}) \]
Only probabilities of mutually exclusive events
can be summed up like this



“k out of n choices”

 “k out of n choices” can be computed as
\[ \binom{n}{k} = \frac{n!}{(n-k)!\,k!} \]
• For example
\[ \binom{4}{2} = \frac{4!}{(4-2)!\,2!} = 6 \]



Example cont.

So, we get
\[ P(\text{more than 2 failed}) = 4\,P(\text{1 works, 3 failed}) + P(\text{4 failed}) \]
where
\[ P(\text{1 works, 3 failed}) = R\,(1-R)^3, \qquad P(\text{4 failed}) = (1-R)^4 \]
and R is the reliability of a single
engine, computed as $R = e^{-\lambda t}$
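
The engine example can be evaluated directly. The sketch below uses `math.comb` for the binomial coefficient (the factor 4 above is C(4,3)) and adds a general "at least k of n fail" helper as a cross-check; both function names and the numeric values of `lam` and `t` are assumptions for illustration.

```python
import math


def prob_more_than_two_of_four_fail(lam, t):
    """P(3 or 4 engines fail) for four independent engines."""
    r = math.exp(-lam * t)                               # single-engine reliability
    exactly_three = math.comb(4, 3) * r * (1 - r) ** 3   # 4 * R * (1-R)^3
    all_four = (1 - r) ** 4
    return exactly_three + all_four


def prob_at_least_k_fail(n, k, r):
    """P(at least k of n independent components fail), each with reliability r."""
    return sum(math.comb(n, j) * (1 - r) ** j * r ** (n - j)
               for j in range(k, n + 1))


lam, t = 1e-4, 10.0  # assumed failure rate per hour and flight time
print(math.comb(4, 2))  # 6
print(prob_more_than_two_of_four_fail(lam, t))
print(prob_at_least_k_fail(4, 3, math.exp(-lam * t)))
```

The last two printed values should coincide, since "more than two" of four is the same event as "at least three".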



Markov Example (1)

 Example: a continuous-time, discrete-state
Markov process, also called a Markov chain

Pure-birth process (Poisson process if λ0 = λ1 = λ2 = …)

Transitions: 0 → 1 at rate λ0, 1 → 2 at rate λ1, 2 → 3 at rate λ2, …, with rate λn leaving state n



Markov Example (2)

Two non-identical processors, p1 and p2
p1 has failure rate λ1, repair rate µ1
p2 has failure rate λ2, repair rate µ2

States:
1,2  Both p1 and p2 are working
1    p1 is working, p2 is failed
2    p2 is working, p1 is failed
0    Both p1 and p2 are failed

Transitions:
(1,2) → 1 at rate λ2, 1 → (1,2) at rate µ2
(1,2) → 2 at rate λ1, 2 → (1,2) at rate µ1
1 → 0 at rate λ1, 0 → 1 at rate µ1
2 → 0 at rate λ2, 0 → 2 at rate µ2





TMR System
 If we assume that each module in the TMR
system obeys the exponential failure law and
has a constant failure rate of λ, the probability
of a module having failed by time t + ∆t,
given that the module was operational at time
t, is given by $1 - e^{-\lambda \Delta t} \approx \lambda \Delta t$ for small ∆t
 It is possible to reduce the Markov model.



TMR System
\[ p_3(t + \Delta t) = (1 - 3\lambda\Delta t)\,p_3(t) \]
\[ p_2(t + \Delta t) = (3\lambda\Delta t)\,p_3(t) + (1 - 2\lambda\Delta t)\,p_2(t) \]
\[ p_F(t + \Delta t) = (2\lambda\Delta t)\,p_2(t) + p_F(t) \]
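
These difference equations can be iterated directly as a discrete-time simulation. The sketch below does that and compares the result with the standard TMR expression 3R² − 2R³ with R = e^(−λt), which is not stated on this slide and is used here only as a cross-check. The function name `tmr_reliability` and the values of `lam`, `t`, and `dt` are assumptions for illustration.

```python
import math


def tmr_reliability(lam, t, dt):
    """Iterate the three difference equations; return p3 + p2 at time t."""
    p3, p2, pf = 1.0, 0.0, 0.0  # start with all three modules working
    for _ in range(round(t / dt)):
        # The right-hand side uses the old values of p3, p2 and pf.
        p3, p2, pf = ((1 - 3 * lam * dt) * p3,
                      3 * lam * dt * p3 + (1 - 2 * lam * dt) * p2,
                      2 * lam * dt * p2 + pf)
    return p3 + p2  # operational states: three or two modules working


lam, t = 0.01, 100.0  # assumed values, for illustration only
numeric = tmr_reliability(lam, t, dt=0.01)
r = math.exp(-lam * t)
closed = 3 * r ** 2 - 2 * r ** 3
print(numeric, closed)
```

The small remaining gap between the two numbers is the first-order error of the λ∆t approximation and shrinks as `dt` is reduced.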





Hybrid Redundancy Technique

 $p_2(t + \Delta t) = 3\lambda\Delta t\,C\,p_3(t) + (1 - 2\lambda\Delta t)\,p_2(t)$
 The complete set of equations can be derived
in the same way
 The reliability of the system described by the
Markov model is the probability of being in states
3, 2, 1, or UD.
 $R(t) = p_3(t) + p_2(t) + p_1(t) + p_{UD}(t)$



Terminology

 A Markov chain is irreducible if every state can
be reached from every other state.
 Each state of a Markov chain is either transient or
recurrent.
 A state is called absorbing if once the chain
reaches that state, it stays there forever.
 A Markov chain is acyclic if once it leaves any
state, it never returns to that state.
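
Irreducibility can be checked mechanically by testing reachability on the transition matrix. This is a minimal sketch using a depth-first search; the function names `reachable` and `is_irreducible`, the 0-based state indices, and the numeric rates are assumptions for illustration. The two matrices are the standby-system availability and reliability matrices from earlier slides.

```python
def reachable(m, start):
    """Set of states reachable from `start`; m[j][i] > 0 means i -> j exists."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j in range(len(m)):
            if j != i and m[j][i] > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def is_irreducible(m):
    """True if every state can be reached from every other state."""
    n = len(m)
    return all(len(reachable(m, s)) == n for s in range(n))


lam1, lam2, mu = 0.002, 0.003, 0.1  # assumed rates, for illustration only
availability_M = [[-lam1, mu, 0.0],
                  [lam1, -lam2 - mu, mu],
                  [0.0, lam2, -mu]]
reliability_M = [[-lam1, mu, 0.0],
                 [lam1, -lam2 - mu, 0.0],
                 [0.0, lam2, 0.0]]   # last state is absorbing
print(is_irreducible(availability_M))  # True
print(is_irreducible(reliability_M))   # False
```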



What can be solved?

 Basically, the probability of each state.
 The transient solution is the probability at a certain
point of time t.
 The steady-state solution is the steady-state
probability (t → ∞).
 Other measurements.



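Analytical Solution on a 2-state Model

States W (working) and F (failed); W → F at rate λ, F → W at rate µ

\[ \frac{dp_w(t)}{dt} = -\lambda p_w(t) + \mu p_F(t) \]
\[ \frac{dp_F(t)}{dt} = \lambda p_w(t) - \mu p_F(t) \]

Assume the initial conditions are pw(0)=1 and pF(0)=0, and
use Laplace transforms:

\[ sP_w(s) = 1 - \lambda P_w(s) + \mu P_F(s) \qquad (1) \]
\[ sP_F(s) = \lambda P_w(s) - \mu P_F(s) \qquad (2) \]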



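Analytical Solution - cont.

Solving (1) and (2) gives
\[ P_w(s) = \frac{1}{s+(\lambda+\mu)} + \frac{\mu}{s\,(s+(\lambda+\mu))} \]
\[ P_F(s) = \frac{\lambda}{s\,(s+(\lambda+\mu))} \]
which can be rewritten as
\[ P_w(s) = \frac{\mu/(\lambda+\mu)}{s} + \frac{\lambda/(\lambda+\mu)}{s+(\lambda+\mu)} \]
\[ P_F(s) = \frac{\lambda/(\lambda+\mu)}{s} - \frac{\lambda/(\lambda+\mu)}{s+(\lambda+\mu)} \]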
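Analytical Solution - cont.
Using the inverse Laplace transform:

\[ p_w(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\,e^{-(\lambda+\mu)t} \]
\[ p_F(t) = \frac{\lambda}{\lambda+\mu} - \frac{\lambda}{\lambda+\mu}\,e^{-(\lambda+\mu)t} \]

Note that as t goes to infinity the exponential terms vanish, leaving the
steady-state solution pw = µ/(λ+µ), pF = λ/(λ+µ).
Also, if we change the initial conditions to pw(0)=0 and
pF(0)=1, then we have

\[ p_w(t) = \frac{\mu}{\lambda+\mu} - \frac{\mu}{\lambda+\mu}\,e^{-(\lambda+\mu)t} \]
\[ p_F(t) = \frac{\lambda}{\lambda+\mu} + \frac{\mu}{\lambda+\mu}\,e^{-(\lambda+\mu)t} \]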
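
Both sets of formulas are instances of one expression, pw(t) = µ/(λ+µ) + (pw(0) − µ/(λ+µ)) e^(−(λ+µ)t), which the sketch below evaluates. Writing it in this combined form is a convenience of the example, not something stated on the slide; the function name `two_state` and the numeric rates are likewise assumptions.

```python
import math


def two_state(lam, mu, t, pw0=1.0):
    """Return (pw(t), pF(t)) for the 2-state model with pw(0) = pw0."""
    steady_w = mu / (lam + mu)   # steady-state probability of W
    pw = steady_w + (pw0 - steady_w) * math.exp(-(lam + mu) * t)
    return pw, 1.0 - pw          # pF follows from pw + pF = 1


lam, mu = 0.01, 0.5  # assumed failure and repair rates, illustration only
print(two_state(lam, mu, 3.0))            # starting in W
print(two_state(lam, mu, 3.0, pw0=0.0))   # starting in F
print(two_state(lam, mu, 1e6))            # close to the steady state
```

Whatever the starting state, the result converges to the same steady-state pair, which is the point of the remark about t going to infinity.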



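Analytical Steady-state Solution for irreducible
Markov chains

States W and F; W → F at rate λ, F → W at rate µ

\[
\begin{aligned}
\lambda p_w &= \mu p_F \\
p_F + p_w &= 1 \\
p_w &= (\mu/\lambda)\,p_F \\
p_F + (\mu/\lambda)\,p_F &= 1 \\
[(\mu+\lambda)/\lambda]\,p_F &= 1 \\
p_F &= \lambda/(\mu+\lambda) \\
p_w &= \mu/(\mu+\lambda)
\end{aligned}
\]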



Issues to be Considered

State Explosion Problem

For a system with n components, the number of states is $2^n$:

n = 10    $2^n$ = 1,024
n = 20    $2^n$ = 1,048,576
n = 30    $2^n$ = 1,073,741,824



Summary
 Methods for evaluating the reliability, availability
and safety of a system
− RBDs
− Markov chains
