Design For Reliability Basics Introduction
Chapter Objectives
R_S = R_1 × R_2 × ... × R_n
Serial reliability
Series systems are also referred to as weakest
link or chain systems.
System failure is caused by the failure of any
one component.
Therefore, for a series system, the reliability of
the system is the product of the individual
component reliabilities.
More components = less reliability.
serial reliability = ∏(i = 1 to n) x_i
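The series formula above can be sketched numerically (a minimal illustration; the 0.95 component reliabilities are made-up values):

```python
from math import prod

def series_reliability(reliabilities):
    """Reliability of a series system: the product of component reliabilities."""
    return prod(reliabilities)

# Three components in series, each 95% reliable:
print(round(series_reliability([0.95, 0.95, 0.95]), 6))  # 0.857375
```

Note how three fairly reliable components already pull the system below 86%, illustrating "more components = less reliability".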
Parallel Systems
For a parallel system, the probability of failure is the product of the
individual component failure probabilities:
parallel reliability = 1 − ∏(i = 1 to n) (1 − x_i)
[Block diagram: components A, B, C, D with reliabilities R_A, R_B, R_C, R_D]
[Block diagram: the same system with a redundant copy of C added in parallel;
the redundant pair is equivalent to a single block C’ with
R_C’ = 1 − (1 − R_C)(1 − R_C)]
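The parallel formula, and the effect of duplicating component C, can be checked with a short sketch (the 0.90 value for R_C is hypothetical):

```python
from math import prod

def parallel_reliability(reliabilities):
    """Reliability of a parallel (redundant) system:
    1 minus the product of the component failure probabilities."""
    return 1 - prod(1 - r for r in reliabilities)

# Duplicating a 90%-reliable component C gives R_C' = 1 - (1-0.9)(1-0.9):
print(round(parallel_reliability([0.90, 0.90]), 10))  # 0.99
```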
A Simple Example
A system has 4000 components with a failure
rate of 0.02% per 1000 hours. Calculate λ and
MTBF.
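One way to work the example, assuming the 0.02% per 1000 hours rate applies per component and that component failure rates simply add for a series system:

```python
n_components = 4000
rate_per_component = 0.0002            # 0.02% per 1000 hours, as a fraction

lam = n_components * rate_per_component / 1000   # system failures per hour
mtbf = 1 / lam                                   # mean time between failures

print(round(lam, 6))   # 0.0008 failures per hour
print(round(mtbf, 1))  # 1250.0 hours
```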
ADESH 18
An Example
A first generation computer contains 10000 components each
with λ = 0.5%/(1000 hours). What is the period of 99%
reliability?
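A sketch of one solution, assuming a constant failure rate so that R(t) = e^(−λt):

```python
from math import log

n = 10000
lam_component = 0.005 / 1000      # 0.5% per 1000 hours, per hour
lam_system = n * lam_component    # 0.05 failures per hour for the system

# Solve exp(-lam_system * t) = 0.99 for t:
t = -log(0.99) / lam_system
print(round(t, 3))  # 0.201 hours (about 12 minutes)
```

The result shows why first-generation computers, with thousands of unreliable components in series, could only run for minutes between failures.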
Reliability Failure Modes
Failures may be SUDDEN (non-predictable) or GRADUAL
(predictable). They may also be PARTIAL or COMPLETE.
[Figure: failure rate vs. age]
Steps in Designing for
Reliability
1. Develop a Reliability Plan
• Determine Which Reliability Tools are
Needed
2. Analyze Noise Factors
3. Tests for Reliability
4. Track Failures and Determine Corrective
Actions
Develop a Reliability Plan
Planning for reliability is just as important as planning for
design and manufacturing.
Why?
To determine:
• the useful life of the product
• what accelerated life testing is to be used
Reliability must be as close to perfect as possible for the
product’s useful life.
You MUST know where your product's major points of
failure are!
Tools for testing
Stress Analysis
Reliability Predictions (MTBF)
FMEA (Failure Mode and Effects Analysis)
Fault Tree Analysis
Reliability Block Diagrams
Why do Reliability
Calculations?
Reliability calculations help make the product more
reliable, which can be used as a selling feature by the
marketing department. They also add to the company's
reputation and allow comparisons with the competition.
Stress Analysis
Stress analysis establishes the presence of a safety
margin, thus enhancing system life. It provides input
data for reliability prediction and is based on
customer requirements.
Reliability Predictions
(MTBF)
MTBF (Mean Time between Failures) for an existing
product can be found by studying field failure data.
For a new product however, or if significant changes
are made to the design, it may be required to estimate
or calculate MTBF before any field data is available.
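A common point estimate from field data divides the total accumulated operating hours by the number of observed failures; the fleet numbers below are hypothetical:

```python
def mtbf_from_field_data(total_operating_hours, n_failures):
    """Point estimate of MTBF from field data: total accumulated
    operating hours divided by the number of observed failures."""
    if n_failures == 0:
        raise ValueError("no failures observed; cannot estimate MTBF this way")
    return total_operating_hours / n_failures

# Hypothetical fleet: 500 units x 2000 hours each, 8 failures observed
print(mtbf_from_field_data(500 * 2000, 8))  # 125000.0 hours
```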
Failure Modes and Effects
Analysis
Failure modes and effects analysis (FMEA) is a
qualitative technique for understanding the
behaviour of components in an engineered system.
The objective is to determine the influence of
component failure on other components, and on the
system as a whole.
FMEA can also be used as a stand-alone procedure
for relative ranking of failure modes that screens
them according to risk.
Failure mode and effects
analysis (FMEA)
Failure Mode: Consider each component or functional block and
how it can fail.
Determine the Effect of each failure mode, and the severity on
system function.
Determine the likelihood of occurrence and detecting the failure.
Calculate the Risk Priority Number (RPN = Severity X
Occurrence X Detection).
Consider corrective actions (these may reduce severity or
occurrence, or increase the probability of detection).
Start with the higher RPN values (most severe problems) and
work down.
Recalculate the RPN after the corrective actions have been
determined; the aim is to minimize the RPN.
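The RPN arithmetic and ranking described above can be sketched as follows (the failure modes and 1-10 ratings are invented for illustration):

```python
def rpn(severity, occurrence, detection):
    """Risk Priority Number = Severity x Occurrence x Detection,
    each typically rated on a 1-10 scale."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("FMEA ratings are typically on a 1-10 scale")
    return severity * occurrence * detection

# Hypothetical failure modes, ranked highest RPN first:
modes = {
    "connector corrosion": rpn(7, 5, 6),  # 210
    "solder joint crack":  rpn(8, 3, 4),  # 96
    "firmware lockup":     rpn(6, 2, 2),  # 24
}
for name, value in sorted(modes.items(), key=lambda kv: kv[1], reverse=True):
    print(name, value)
```

Corrective work would start with "connector corrosion" (the highest RPN) and proceed down the list, recalculating each RPN afterwards.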
Reliability Block Diagrams
Most systems are defined through a combination of both series and
parallel connections of subsystems
Reliability block diagrams (RBD) represent a system using
interconnected blocks arranged in combinations of series and/or parallel
configurations
They can be used to analyze the reliability of a system quantitatively
Reliability block diagrams can consider active and stand-by states to get
estimates of reliability, and availability (or unavailability) of the system
Reliability block diagrams may be difficult to construct for very complex
systems
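Because series and parallel blocks compose, an RBD can be evaluated quantitatively with two small helpers (the block reliabilities below are hypothetical):

```python
from math import prod

def series(*blocks):
    """Series combination: all blocks must work."""
    return prod(blocks)

def parallel(*blocks):
    """Parallel combination: at least one block must work."""
    return 1 - prod(1 - b for b in blocks)

# Hypothetical RBD: two series components feeding a redundant pair
r_system = series(0.99, 0.98, parallel(0.90, 0.90))
print(round(r_system, 4))  # 0.9605
```

Nested calls mirror the nesting of the diagram itself, which is why RBDs of series/parallel systems are straightforward to evaluate, while heavily cross-linked systems are not.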
CASE STUDY: Network Storage
Evaluations Using
Reliability Calculations
This section uses a case study to introduce concepts
and calculations for systematically comparing
redundancy and reliability factors as they apply to
network storage configurations. We will determine a
reliability figure for three very basic architectures.
The starting point of our study is the network storage
requirements.
Network Storage Requirements
We want networked storage that has access to one server. Later, this storage
will be accessible to other servers. The server is already in place, and has
been designed to sustain single component hardware failures (with dual
host bus adapters (HBAs), for example). Data on this storage must be
mirrored, and the storage access must also stand up to hardware failures.
The cost of the storage system must be reasonable, while still providing
good performance.
Architecture 1
Architecture 1 provides the basic
storage necessities we are looking for
with the following advantages and
disadvantages:
Advantages:
Storage is accessible if one of the
links is down.
Storage A is mirrored onto B.
Other servers can be connected to
the concentrator to access the
storage.
Disadvantages:
If the concentrator fails, we have no
more access to the storage. This
concentrator is a single point of
failure (SPOF).
Architecture 2
Architecture 2 has been improved
to take into account the previous
SPOF. A concentrator has been
added.
Advantages:
If any links or components go
down, storage is still accessible
(resilient to hardware failures).
Data is mirrored (Disk A <-> Disk B).
Other servers can be connected to
both concentrators to access the
storage space.
Architecture 3
The main difference is that Disk A and Disk
B have only one data path. Disk A is still
mirrored to Disk B, as required.
This architecture has all the advantages of
the previous architectures with the
following differences:
Disk A can only be accessed through Link C,
and Disk B only through Link D.
There is no data multipathing software
layer, which results in easier administration
and easier troubleshooting.
Determining Reliability
*(The AFR for each component was calculated from the MTBF as
AFR = 8760/MTBF.) The example MTBF values were taken from real
network storage component statistics. However, such values vary greatly,
and these numbers are given here purely for illustration.
Determining Reliability

Component                        AFR Variable   Sample MTBF (hours)   AFR
HBA 1, HBA 2                     H              800,000               0.011
LINK A, LINK B                   L              400,000               0.022
Concentrator 1, Concentrator 2   C              580,000               0.0151
Disk B                           D              —                     0.0088
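The AFR = 8760/MTBF conversion can be checked against the table values:

```python
def afr(mtbf_hours):
    """Annualized Failure Rate: the fraction of a year (8760 hours)
    represented by one MTBF interval."""
    return 8760 / mtbf_hours

print(round(afr(800_000), 3))  # 0.011  (HBA)
print(round(afr(400_000), 3))  # 0.022  (link)
print(round(afr(580_000), 4))  # 0.0151 (concentrator)
```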
Determining Reliability
Having the rate of failure of each individual
component, we can obtain the system's annual failure
rate (AFR) and, consequently, the system reliability (R)
and system MTBF. The AFR of a redundant group is
raised to a power equal to the number of redundant
components; the AFR of a non-redundant component
is multiplied by the number of such components in
series.
Calculation
In the case of Architecture 1, the concentrator (C) is the
only non-redundant component.
AFR1 = (H + L)² + C + L² + D²
AFR1 = (0.011 + 0.022)² + 0.0151 + (0.022)² + (0.0088)² = 0.0167
R1 = 1 − AFR1 = 1 − 0.0167 = 0.9833, or 98.33%
MTBF1 = 8760/AFR1 = 8760/0.0167 = 524,551 hours
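As a numeric check of the Architecture 1 calculation (note the slide rounds the AFR to 0.0167 before computing R and MTBF, which is why it reports 524,551 hours rather than the unrounded value of roughly 523,000):

```python
# AFR values taken from the component table
H, L, C, D = 0.011, 0.022, 0.0151, 0.0088

# Redundant pairs contribute the product of their AFRs (squared, for pairs);
# the single concentrator C contributes its AFR directly.
afr1 = (H + L) ** 2 + C + L ** 2 + D ** 2
r1 = 1 - afr1
mtbf1 = 8760 / afr1

print(round(afr1, 4))  # 0.0168 (the slide rounds down to 0.0167)
print(round(r1, 4))    # system reliability, about 0.9833
print(round(mtbf1))    # system MTBF in hours, about 523,000
```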
Design Against Creep Failure
◦ Deformation depends on
1. The applied load.
2. The duration for which the load is applied.
3. Elevated temperature.
Design Against Corrosion-Induced
Failure
◦ Two Types
◦ Corrosion
◦ Intermetallic Diffusion
What is Chemical Corrosion?
◦ The chemical or
electrochemical reaction
between a material, usually a
metal, and its environment that
produces a deterioration of the
material and its properties.
Design Guidelines to Reduce
Corrosion
Metals with a high oxidation potential tend to
corrode faster.
Feedback
Redesign
Improved fabrication
Verification of redesign