Design For Reliability Basics Introduction

The document discusses the importance of design for reliability (DFR) in engineering, outlining its definitions, objectives, and various reliability terms. It covers the causes of reliability failures, failure modes, and the significance of reliability calculations and testing methods. Additionally, it presents a case study on network storage architectures to illustrate reliability assessments and comparisons.

DESIGN FOR RELIABILITY

Chapter Objectives

Introduce the need for design for reliability
List the main causes of reliability failures
Explain how failures relate to their mechanisms
Describe each failure
Propose design guidelines against each failure
What is Reliability?
Reliability is:
• The ability of an item to perform its required
function under defined customer operating
conditions for a stated period of time.
• The probability that no (system) failure will occur
in a given time interval.
In research, the term reliability means "repeatability"
or "consistency". A measure is considered reliable if
it would give us the same result over and over again.
Other Names of DFR

DFR has many aliases:

Design for Durability
Design for Robustness
Design for Useful Life
What do Reliability Engineers Do?
Implement Reliability Engineering Programs across
all functions:
◦ Engineering
◦ Research
◦ Manufacturing
◦ Testing
◦ Packaging
◦ Field service
What is Probability?
Probability is:
• A measure that describes the chance or
likelihood that an event will occur.
• The probability that event (A) occurs is
represented by a number between 0 (zero) and
1.
• When P(A) = 0, the event cannot occur.
• When P(A) = 1, the event is certain to occur.
• When P(A) = 0.5, the event is as likely to
occur as it is not.
Why Design for Reliability?
Reliability can make or break the long-term
success of a product.
◦ Reliability that is too high makes the product
too expensive.
◦ Reliability that is too low drives up warranty and
repair costs, and market share is lost as a result.
Cost-Reliability Functions
What are Noise Factors?

Noise Factors are sources of disturbing
influences that can disrupt the ideal
function, causing error states which lead to
quality problems.
Reliability Terms
Mean Time To Failure (MTTF), for non-repairable systems
Mean Time Between Failures (MTBF), for repairable systems
Reliability (survival probability), R(t)
Failure probability (cumulative distribution function), F(t) = 1 - R(t)
Failure probability density, f(t)
Failure rate (hazard rate), λ(t)
MTBF & MTTF
Mean Time Between Failures – Applies to
repairable items.

Mean Time To Failure – Applies to non-repairable
items.

Both of these terms indicate the average time an
item is expected to function before failure.
Reliability Function
Probability density function of failures, for a constant failure rate λ:
$f(t) = \lambda e^{-\lambda t}$ for $t > 0$
Probability of failure in the interval from 0 to T:
$F(T) = 1 - e^{-\lambda T}$
Reliability function:
$R(T) = 1 - F(T) = e^{-\lambda T}$
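As a rough illustration of these three functions, the sketch below assumes a constant failure rate λ (an exponential lifetime model). The function names and the sample failure rate are hypothetical, chosen only for this example.

```python
import math


def failure_density(t: float, lam: float) -> float:
    """f(t) = lam * exp(-lam * t): failure probability density at time t."""
    return lam * math.exp(-lam * t)


def failure_probability(t: float, lam: float) -> float:
    """F(t) = 1 - exp(-lam * t): probability of failure in the interval [0, t]."""
    return 1.0 - math.exp(-lam * t)


def reliability(t: float, lam: float) -> float:
    """R(t) = 1 - F(t) = exp(-lam * t): probability of surviving to time t."""
    return math.exp(-lam * t)


# Assumed values, for illustration only.
lam = 1e-4   # failures per hour
t = 1000.0   # hours

print(f"f(t) = {failure_density(t, lam):.6e} per hour")
print(f"F(t) = {failure_probability(t, lam):.4f}")
print(f"R(t) = {reliability(t, lam):.4f}")
```

Under this model the hazard rate f(t)/R(t) equals λ at every t, which is what "constant failure rate" means.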
Series Systems

[Figure: blocks 1, 2, ..., n connected in series]

$R_S = R_1 R_2 \cdots R_n$
Serial reliability
Series systems are also referred to as weakest
link or chain systems.
System failure is caused by the failure of any
one component.
Therefore, for a series system, the reliability of
the system is the product of the individual
component reliabilities
More components = less reliability
$\text{serial reliability} = \prod_{i=1}^{n} x_i$
Parallel Systems
[Figure: blocks 1, 2, ..., n connected in parallel]

$R_S = 1 - (1 - R_1)(1 - R_2) \cdots (1 - R_n)$
Parallel reliability
◦ Parallel systems are also referred to as
redundant.
◦ The system fails only if all of the components
fail.
◦ Therefore, for a parallel system, the system
probability of failure is the product of the
individual component failure probabilities.

$\text{parallel reliability} = 1 - \prod_{i=1}^{n} (1 - x_i)$
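The series and parallel rules can be sketched in a few lines of Python. The function names and the sample component reliability of 0.9 are assumptions for illustration, and the components are assumed to fail independently.

```python
import math
from typing import Sequence


def series_reliability(reliabilities: Sequence[float]) -> float:
    """A series system works only if every component works: product of R_i."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities: Sequence[float]) -> float:
    """A parallel system fails only if every component fails: 1 - product of (1 - R_i)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Three identical components, each with an assumed reliability of 0.9.
components = [0.9, 0.9, 0.9]
print(f"series:   {series_reliability(components):.4f}")
print(f"parallel: {parallel_reliability(components):.4f}")
```

With the same three components, the series arrangement is less reliable than any single component, and the parallel arrangement is more reliable than any single component.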


Series-Parallel Systems
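[Figure: blocks A, B, and D in series, with two parallel C blocks between B and D; block reliabilities $R_A$, $R_B$, $R_C$, $R_D$]

Convert to equivalent series system

[Figure: blocks A, B, C', D in series]

$R_{C'} = 1 - (1 - R_C)(1 - R_C)$

The system reliability is then the series product $R_S = R_A R_B R_{C'} R_D$.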
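A minimal sketch of this reduction follows. The block reliabilities are hypothetical values picked only to show the effect of duplicating block C, and independent failures are assumed.

```python
# Hypothetical block reliabilities (not from the slides).
r_a, r_b, r_c, r_d = 0.95, 0.98, 0.90, 0.99

# Step 1: replace the two parallel C blocks with one equivalent block C'.
r_c_equiv = 1.0 - (1.0 - r_c) * (1.0 - r_c)

# Step 2: the system is now a pure series chain A -> B -> C' -> D.
r_system = r_a * r_b * r_c_equiv * r_d

# For comparison: the same chain with a single, non-redundant C block.
r_system_single_c = r_a * r_b * r_c * r_d

print(f"R_C'                = {r_c_equiv:.4f}")
print(f"R_S with redundancy = {r_system:.4f}")
print(f"R_S without         = {r_system_single_c:.4f}")
```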
A Simple Example
A system has 4000 components with a failure
rate of 0.02% per 1000 hours. Calculate λ and
MTBF.

$\lambda = (0.02 / 100) \times (1 / 1000) \times 4000 = 8 \times 10^{-4}$ failures/hour

$\text{MTBF} = 1 / (8 \times 10^{-4}) = 1250$ hours
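The same arithmetic as a short Python check. The variable names are illustrative, and the components are treated as a series system with constant failure rates, so the rates add.

```python
n_components = 4000
rate_percent_per_1000h = 0.02   # 0.02% per 1000 hours, per component

# Convert the percentage per 1000 hours into failures per hour.
lambda_component = rate_percent_per_1000h / 100 / 1000

# In a series system with constant failure rates, the rates add.
lambda_system = n_components * lambda_component

# For a constant failure rate, MTBF is the reciprocal of the rate.
mtbf_hours = 1 / lambda_system

print(f"lambda = {lambda_system:.1e} failures/hour")
print(f"MTBF   = {mtbf_hours:.0f} hours")
```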
An Example
A first generation computer contains 10000 components each
with λ = 0.5%/(1000 hours). What is the period of 99%
reliability?

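For $t \ll \text{MTBF}$, $R(t) = e^{-t/\text{MTBF}} \approx 1 - t/\text{MTBF}$, so

$\text{MTBF} \approx t / (1 - R(t)) = t / (1 - 0.99)$

◦ $t = \text{MTBF} \times 0.01 = 0.01 / \lambda_{av}$
◦ where $\lambda_{av}$ is the failure rate of the whole system
◦ N = number of components = 10000
◦ λ = failure rate of a component
◦ = 0.5% / (1000 hours) = 0.005 / 1000 = $5 \times 10^{-6}$ per hour

◦ Therefore, $\lambda_{av} = N \lambda = 10000 \times 5 \times 10^{-6} = 5 \times 10^{-2}$ per hour

◦ Therefore, $t = 0.01 / (5 \times 10^{-2}) = 0.2$ hours = 12 minutes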
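The calculation above uses the small-t approximation R(t) ≈ 1 − λt. As a sketch, assuming the exponential model R(t) = exp(−λt), the exact period can be computed alongside it. Variable names are illustrative.

```python
import math

n_components = 10_000
lambda_component = 0.005 / 1000          # 0.5% per 1000 hours, in failures/hour
lambda_system = n_components * lambda_component

target_reliability = 0.99

# Linear approximation used on the slide: 1 - R = lambda * t.
t_approx_hours = (1 - target_reliability) / lambda_system

# Exact solution of R = exp(-lambda * t) for t.
t_exact_hours = -math.log(target_reliability) / lambda_system

t_approx_minutes = t_approx_hours * 60
t_exact_minutes = t_exact_hours * 60

print(f"approximate: {t_approx_minutes:.2f} minutes")
print(f"exact:       {t_exact_minutes:.2f} minutes")
```

The two results differ by well under a minute, because λt is only about 0.01 here.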
Reliability Failure Modes
Failures may be SUDDEN (non-predictable) or GRADUAL
(predictable). They may also be PARTIAL or COMPLETE.

A Catastrophic failure is both sudden and complete.

A Degradation failure is both gradual and partial.

Two root causes:


1. lack of robustness
2. mistakes
Causes of Failure
Misuse – Failures attributable to the application of
stresses beyond the stated capabilities of the item.

Inherent Weakness – Failures attributable to
weakness inherent in the item itself when subjected
to stresses within the stated capabilities of the item.
Classifications of Reliability Failure
Early-stage failures – Caused by inadequate design, poor
manufacturing, and inappropriate usage. These can be
catastrophic to human life.

Overstress mechanisms – Occur due to an insufficient safety
factor in design, higher-than-expected random loads,
human error, or misapplication.

Wearout mechanisms – Occur late in life and then increase with
age. Causes include corrosion, material fatigue, poor
maintenance, creep, and degradation in strength.
Common Measures of Unreliability
• % Failure - % of failures in a total population

• MTTF (Mean Time To Failure) - the average time of
operation to first failure.

• MTBF (Mean Time Between Failures) - the average time
between product failures.

• Repairs Per Thousand (R/1000)

• Bq Life – Life at which q% of the population will fail
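To make the Bq life concrete, the sketch below computes it under one specific assumption that the slides do not impose: a constant failure rate, so that F(t) = 1 − exp(−λt). The function names and the sample rate are hypothetical.

```python
import math


def failed_fraction(t_hours: float, lam: float) -> float:
    """Fraction of the population failed by time t, assuming F(t) = 1 - exp(-lam * t)."""
    return 1.0 - math.exp(-lam * t_hours)


def bq_life(q_percent: float, lam: float) -> float:
    """Time at which q% of the population has failed, under the same assumption."""
    if not 0.0 < q_percent < 100.0:
        raise ValueError("q_percent must be strictly between 0 and 100")
    return -math.log(1.0 - q_percent / 100.0) / lam


lam = 8e-4                 # assumed failure rate, failures per hour
b10 = bq_life(10, lam)     # B10 life: 10% of units failed
b50 = bq_life(50, lam)     # B50 life: half of the units failed

print(f"B10 life = {b10:.1f} hours")
print(f"B50 life = {b50:.1f} hours")
```

For other lifetime distributions (Weibull, lognormal) the same idea applies, but the inverse of F(t) takes a different form.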


Cumulative Failure Rate Curve
The Bathtub Curve
Reliability specialists often describe the lifetime of
a population of products using a graphical
representation called the bathtub curve. The bathtub
curve consists of three periods: an infant mortality
period with a decreasing failure rate followed by a
normal life period (also known as "useful life") with
a low, relatively constant failure rate and concluding
with a wear-out period that exhibits an increasing
failure rate.
[Figure: bathtub-shaped human mortality curve. x-axis: age (0 to 86 years); y-axis: probability of dying in the next year (deaths/1000). From the Statistical Bulletin 79, no. 1, Jan-Mar 1998]
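One common way to reproduce the three regions of the bathtub curve numerically is the Weibull hazard rate, whose shape parameter selects a decreasing, constant, or increasing failure rate. The Weibull model is an assumption made for this sketch and is not stated in the slides. The parameter values are purely illustrative.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


scale = 1000.0  # characteristic life in hours (illustrative)

# shape < 1: infant mortality; shape = 1: useful life; shape > 1: wear-out.
regions = {"infant mortality": 0.5, "useful life": 1.0, "wear-out": 3.0}

for name, shape in regions.items():
    early = weibull_hazard(100.0, shape, scale)
    late = weibull_hazard(200.0, shape, scale)
    if late < early:
        trend = "decreasing"
    elif late > early:
        trend = "increasing"
    else:
        trend = "constant"
    print(f"{name:17s} shape={shape}: failure rate is {trend}")
```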
Steps in Designing for Reliability
1. Develop a Reliability Plan
• Determine Which Reliability Tools are
Needed
2. Analyze Noise Factors
3. Test for Reliability
4. Track Failures and Determine Corrective
Actions
Develop a Reliability Plan
Planning for reliability is just as important as planning for
design and manufacturing.
Why?
To determine:
• the useful life of the product
• which accelerated life tests to use
Reliability must be as close to perfect as possible for the
product’s useful life.
You MUST know where your product's major points of
failure are!
Tools for testing

Stress Analysis
Reliability Predictions (MTBF)
FMEA (Failure Mode and Effects Analysis)
Fault Tree Analysis
Reliability Block Diagrams
Why do Reliability Calculations?
Reliability calculations show where a design is
weak, so that the product can be made more reliable.
The resulting figures can be used as a selling feature
by the marketing department, add to the company's
reputation, and allow comparisons with the competition.
Stress Analysis
It establishes the presence of a safety margin
thus enhancing system life. Stress analysis
provides input data for reliability prediction. It
is based on customer requirements.
Reliability Predictions (MTBF)
MTBF (Mean Time between Failures) for an existing
product can be found by studying field failure data.
For a new product however, or if significant changes
are made to the design, it may be required to estimate
or calculate MTBF before any field data is available.
Failure Modes and Effects Analysis
Failure modes and effects analysis (FMEA) is a
qualitative technique for understanding the
behaviour of components in an engineered system.
The objective is to determine the influence of
component failure on other components, and on the
system as a whole.
FMEA can also be used as a stand-alone procedure
for relative ranking of failure modes that screens
them according to risk.
Failure mode and effects analysis (FMEA)
Failure Mode: Consider each component or functional block and
how it can fail.
Determine the Effect of each failure mode, and its severity for
system function.
Determine the likelihood of occurrence and of detecting the failure.
Calculate the Risk Priority Number (RPN = Severity × Occurrence ×
Detection).
Consider corrective actions (these may reduce severity or occurrence,
or increase the probability of detection).
Start with the highest RPN values (the highest-risk problems) and
work down.
Recalculate the RPN after the corrective actions have been
determined; the aim is to minimize RPN.
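The RPN ranking step can be sketched as below. The failure modes and their scores are invented for illustration. The 1 to 10 scoring scale, with a higher detection score meaning the failure is harder to detect, is a widely used convention but is an assumption here, since the slides do not state a scale.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (hazardous), assumed scale
    occurrence: int  # 1 (rare) to 10 (almost certain), assumed scale
    detection: int   # 1 (always detected) to 10 (undetectable), assumed scale

    @property
    def rpn(self) -> int:
        """Risk Priority Number = Severity x Occurrence x Detection."""
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes, for illustration only.
modes = [
    FailureMode("Fan bearing seizure", severity=7, occurrence=4, detection=3),
    FailureMode("Connector corrosion", severity=5, occurrence=6, detection=6),
    FailureMode("Capacitor short", severity=9, occurrence=2, detection=5),
]

# Work from the highest RPN down.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.name}")
```

After a corrective action, the affected scores are revised and the RPN recomputed to confirm that it has dropped.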
Reliability Block Diagrams
Most systems are defined through a combination of both series and
parallel connections of subsystems.
Reliability block diagrams (RBD) represent a system using
interconnected blocks arranged in combinations of series and/or parallel
configurations.
They can be used to analyze the reliability of a system quantitatively.
Reliability block diagrams can consider active and stand-by states to get
estimates of reliability, and availability (or unavailability) of the system.
Reliability block diagrams may be difficult to construct for very complex
systems.
CASE STUDY: Network Storage Evaluations Using Reliability Calculations
This section uses a case study to introduce concepts
and calculations for systematically comparing
redundancy and reliability factors as they apply to
network storage configurations. We will determine a
reliability figure on three very basic architectures.
The starting point of our study is the network storage
requirements.
Network Storage Requirements
We want networked storage that has access to one server. Later, this storage
will be accessible to other servers. The server is already in place, and has
been designed to sustain single component hardware failures (with dual
host bus adapters (HBAs), for example). Data on this storage must be
mirrored, and the storage access must also stand up to hardware failures.
The cost of the storage system must be reasonable, while still providing
good performance.
Architecture 1
Architecture 1 provides the basic
storage necessities we are looking for
with the following advantages and
disadvantages:
Advantages:
Storage is accessible if one of the
links is down.
Storage A is mirrored onto B.
Other servers can be connected to
the concentrator to access the
storage.
Disadvantages:
If the concentrator fails, we have no
more access to the storage. This
concentrator is a single point of
failure (SPOF).
Architecture 2
Architecture 2 has been improved
to take into account the previous
SPOF. A concentrator has been
added.
Advantages:
If any links or components go
down, storage is still accessible
(resilient to hardware failures).
Data is mirrored (Disk A <-> Disk B).
Other servers can be connected to
both concentrators to access the
storage space.
Architecture 3
The main difference is that Disk A and Disk
B have only one data path. Disk A is still
mirrored to Disk B, as required.
This architecture has all the advantages of
the previous architectures with the
following differences:
Disk A can only be accessed through Link C,
and Disk B only through Link D.
There is no data multipathing software
layer, which results in easier administration
and easier troubleshooting.
Determining Reliability

Using the reliability formulas, we can determine which
architecture has the highest reliability value. For the
purpose of this article, we will use sample MTBF
values (as obtained from the manufacturer) and
AFR* (Annual Failure Rate) values shown in the table
below:

*(The AFR for each component was calculated from the MTBF as
AFR = 8760/MTBF, where 8760 is the number of hours in a year.) The example
MTBF values were taken from real network storage component statistics.
However, such values vary greatly, and these numbers are given here purely
for illustration.
Determining Reliability
Component                       AFR Variable   Sample MTBF (hours)   AFR
HBA 1, HBA 2                    H              800,000               0.011
Link A, Link B                  L              400,000               0.022
Concentrator 1, Concentrator 2  C              580,000               0.0151
Link C, Link D                  L              400,000               0.022
Disk A, Disk B                  D              1,000,000             0.0088
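The AFR column can be reproduced from the MTBF column with the relation AFR = 8760/MTBF. The sketch below uses the sample MTBF values from the table. The function and dictionary names are illustrative.

```python
HOURS_PER_YEAR = 8760


def annual_failure_rate(mtbf_hours: float) -> float:
    """AFR = hours in a year / MTBF in hours."""
    return HOURS_PER_YEAR / mtbf_hours


# Sample MTBF values (hours) from the table, keyed by the AFR variable.
sample_mtbf = {"H": 800_000, "L": 400_000, "C": 580_000, "D": 1_000_000}

afr = {var: annual_failure_rate(mtbf) for var, mtbf in sample_mtbf.items()}
for var, value in afr.items():
    print(f"{var}: MTBF = {sample_mtbf[var]:>9,} h   AFR = {value:.4f}")
```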
Determining Reliability
Having the rate of failure of each individual
component, we can obtain the system's annual failure
rate (AFR) and consequently the system reliability (R)
and system MTBF values. The combined AFR of a group of
redundant paths is raised to the power equal to the
number of redundant paths. The AFR values of
non-redundant components in series are added together,
which for identical components amounts to multiplying
the AFR by the number of those components.
Calculation
In the case of Architecture 1, the concentrator (C) is the
only non-redundant component.
$AFR_1 = (H+L)^2 + C + L^2 + D^2$
$AFR_1 = (0.011+0.022)^2 + 0.0151 + (0.022)^2 + (0.0088)^2 = 0.0167$
$R_1 = 1 - AFR_1 = 1 - 0.0167 = 0.9833$, or 98.33%
$MTBF_1 = 8760/AFR_1 = 8760/0.0167 = 524{,}551$ hours.
Calculation

Architecture 2 has a different configuration with no
non-redundant components.
$AFR_2 = (H+L+C+L)^2 + D^2$
$AFR_2 = (0.011+0.022+0.0151+0.022)^2 + (0.0088)^2 = 0.0050$
$R_2 = 1 - AFR_2 = 1 - 0.0050 = 0.9950$, or 99.50%
$MTBF_2 = 8760/AFR_2 = 8760/0.0050 = 1{,}752{,}000$ hours.
Calculation

Architecture 3 has yet another configuration and has
no non-redundant components.
$AFR_3 = (H+L+C+L+D)^2$
$AFR_3 = (0.011+0.022+0.0151+0.022+0.0088)^2 = 0.0062$
$R_3 = 1 - AFR_3 = 1 - 0.0062 = 0.9938$, or 99.38%
$MTBF_3 = 8760/AFR_3 = 8760/0.0062 = 1{,}412{,}903$ hours.
Conclusion
When the calculations are complete, we compare the data:
Architecture 1: R = 98.33%, system MTBF = 524,551 hours
Architecture 2: R = 99.50%, system MTBF = 1,752,000 hours
Architecture 3: R = 99.38%, system MTBF = 1,412,903 hours
The MTBF figures are the most revealing, and indicate that architecture
2 is statistically the most reliable of all.
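The comparison can be recomputed with a short script that applies the case study's AFR formulas to the sample component values. Variable names are illustrative. Because the script keeps full precision instead of rounding each AFR to four decimal places first, the MTBF figures it prints differ slightly from the rounded ones quoted above, but the ranking is the same.

```python
HOURS_PER_YEAR = 8760

# Sample component AFR values from the case study table.
H, L, C, D = 0.011, 0.022, 0.0151, 0.0088

# System AFR per architecture, following the case study's formulas.
system_afr = {
    "Architecture 1": (H + L) ** 2 + C + L ** 2 + D ** 2,
    "Architecture 2": (H + L + C + L) ** 2 + D ** 2,
    "Architecture 3": (H + L + C + L + D) ** 2,
}

for name, afr in system_afr.items():
    reliability = 1 - afr
    mtbf_hours = HOURS_PER_YEAR / afr
    print(f"{name}: AFR = {afr:.4f}  R = {reliability:.2%}  MTBF = {mtbf_hours:,.0f} hours")

# The most reliable architecture is the one with the lowest system AFR.
most_reliable = min(system_afr, key=system_afr.get)
print(f"Most reliable: {most_reliable}")
```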
Failure Effects (What the customer experiences)
Noise
Inoperability
Instability
Intermittent operation
Roughness
Excessive effort requirements
Unpleasant or unusual odor
Poor appearance
Factors Affecting Reliability

Installation & Environmental
Temperature
Humidity
Vibration
Chemical Attack

Design & Manufacture
Pre-Production Design
Control of Production
Working Tolerances
Material Quality
Interconnections
Component Quality
Component Stress
Design against failure
It is important to understand the failure (why, where, how
long, application, etc.).

Two methods for design against failure:
1. By reducing the stresses that cause the failure.
2. By increasing the strength of the component.

Either one can be achieved by:
◦ Selecting materials
◦ Changing the package geometry
◦ Changing the dimensions
◦ Protection
Fatigue Failure?
Fatigue is the most common mechanism of failure
and is responsible for 90% of all structural and
electrical failures.

Occurs in metals, polymers, and ceramics.

Metal paper clip example:
◦ Bend in both directions
◦ Repeat the process
Design Against Fatigue Failure

Increase fatigue strength.

Reduce the amplitude of cyclic loading.

Avoid regions of stress concentration.
Design Against Brittle Fracture
Brittle fracture is an overstress failure mechanism
that occurs rapidly with little or no warning when the
induced stress in the component exceeds the
fracture strength of the material.

Occurs in brittle materials (ceramics, glasses and
silicon).

Applied stress and work could break the atomic
bonds.
Design Guidelines to Reduce Brittle Fracture
Create designs with materials and processing
conditions that produce the least stress
in brittle materials.

Polish the brittle material to
remove surface flaws and enhance reliability.
Design Against Creep Failure
What is Creep?
◦ A time-dependent deformation process under
load.

◦ Thermally-activated process: the rate of
deformation for a given stress level increases
significantly with temperature.

◦ Deformation depends on
1. The applied load
2. The duration for which the load is applied
3. The temperature
Design Against Creep Failure

Creep can occur at any stress level.

Creep is most important at elevated
temperatures.
Design Guidelines to Reduce Creep-Induced Failure
Use materials with a high melting point if the
application calls for harsh temperature conditions.

Reduction of mechanical stress will reduce creep
deformation.

Creep is a time-controlled phenomenon.


Design Against Plastic Deformation
What is Plastic Deformation?
◦ Occurs when the applied mechanical stress exceeds the
elastic limit or yield point of a material.
◦ It is permanent.

Excessive deformation and continued accumulation
of plastic strain due to cyclic loading will eventually
lead to cracking of the component and make it
unusable.
Design Guidelines Against Plastic Deformation
Limit the design stresses in the packaging structure
to below the yield strength of the materials used. If
possible, use materials that have high yield strength.

Design and control the local plastic deformation at
regions of stress concentration.
Chemically Induced Failures

What are Chemically Induced Failures?
◦ Chemical processes such as electrochemical
reactions can result in cracking of components,
leading to electrical failures.

◦ Two types:
◦ Corrosion
◦ Intermetallic Diffusion
Design Against Corrosion-Induced Failure
What is Chemical Corrosion?
◦ The chemical or
electrochemical reaction
between a material, usually a
metal, and its environment that
produces a deterioration of the
material and its properties.
Design Guidelines to Reduce Corrosion
Metals with a high oxidation potential tend to
corrode faster.

Use hermetic packages to prevent moisture
absorption.

Ensure there is no trapped moisture or
contaminants during the processing and assembly of
the packages.
Design Against Intermetallic Diffusion
What is Intermetallic Diffusion?

◦ During wirebonding and solder reflow, the joining
process generates intermetallic layers, which are
byproducts of the joining process.
Design Guidelines Against Intermetallic Diffusion
Limit the process temperatures and control the time
exposed to high temperatures during the joining
process.

Control the temperature range and cycles of exposure at
the high temperature period.

Apply a nickel/gold coating on the bare copper
pad surfaces.
Achieving reliability growth
Detect failure causes

Feedback

Redesign

Improved fabrication

Verification of redesign
References
A.D.S. Carter, “Mechanical Reliability and Design”

Charles O. Smith, “Introduction to Reliability in Design”

http://www.reliabilityanalysislab.com/ReliabilityServices.asp

http://pms401.pd9.ford.com:8080/arr/concept.htm
