8
Getting Failure Rate Data: Random Failures Versus Systematic Failures
Introduction
When ISA84.01 (Ref. 1) was first released in 1996, many commented, "No
one has good failure rate data." This led some to believe that the whole
idea behind probabilistic failure calculations was impractical.
In the early years of the functional safety standards, industry failure
databases could provide failure rate information. While this failure data
was not product specific or application specific, it helped designers
recognize problems in their designs. One such problem was the "weak
link" design (Ref. 2). Such a design included a high quality SIL3 safety
PLC connected to a switch and a solenoid. Many engineers thought they
had a SIL3 design until they did the safety verification calculations. Such
a design may not even meet SIL1, depending on proof test effectiveness
and the manual proof test interval!
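To see the effect numerically, consider a minimal sketch using the common 1oo1 approximation PFDavg ≈ λDU × TI / 2 for each element in series. The dangerous undetected failure rates below are assumed values chosen only to illustrate the "weak link" point; they are not taken from any product data.

```python
# "Weak link" illustration: a SIL3 logic solver in series with a simple
# switch and a solenoid valve. All lambda_DU values (failures per hour)
# are assumed, for illustration only.

TI = 8760.0  # manual proof test interval in hours (one year)

lambda_du = {
    "SIL3 safety PLC": 5.0e-9,  # assumed
    "process switch":  2.0e-6,  # assumed
    "solenoid valve":  1.0e-6,  # assumed
}

# Simplified 1oo1 approximation: PFDavg ~= lambda_DU * TI / 2
pfd = {name: rate * TI / 2.0 for name, rate in lambda_du.items()}

for name, value in pfd.items():
    print(f"{name:16s} PFDavg = {value:.2e}")
print(f"{'SIF total':16s} PFDavg = {sum(pfd.values()):.2e}")
```

With these numbers the field devices dominate: the total PFDavg of about 1.3e-2 falls in the SIL1 band even though the logic solver by itself would support a much higher SIL, and imperfect proof testing would push the number higher still.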
Even with approximate data, the methods began to show how designers
could achieve higher levels of safety while optimizing costs. The safety
verification calculations required by the new functional safety standards
have shown designers how to create much more balanced designs and to
do a better job overall. But failure rate and failure mode data for random
failures of the chosen equipment are required.
To perform these calculations, failure rate data for random failures during
the useful life of a product is required. One source of such data is a
company's own maintenance and failure records.
Several problems exist with this method of getting failure rate data. Often,
needed information about a failure is not collected, including total time in
operation, failure confirmation, technology class, failure cause, and stress
conditions. The result is usually a significantly higher failure rate than the
number needed for probabilistic SIF verification. This is due to:
1. Lack of distinction between random failures and wear out failures,
2. Lack of distinction between random failures and systematic failures,
3. Multiple reported failures for a single actual failure,
4. Mixing of different technology classes, and
5. Other issues.
When total time in operation is not recorded, failures due to wear out
cannot be distinguished from random failures during the useful life. If
these failures are grouped together, the data analyst cannot distinguish
between them and will typically assume that all failures are random. The
resulting failure rate number is too high. In addition, the opportunity to
establish the useful life period is also lost.
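As a rough numeric sketch of this effect (all counts and operating hours below are hypothetical), compare the useful-life estimate with the naive estimate that lumps the wear out failures in:

```python
# Hypothetical field-return record for one device population.
cumulative_hours = 5.0e7   # assumed total operating time across all units
random_failures  = 4       # failures confirmed during the useful life
wearout_failures = 6       # end-of-life failures (e.g., worn seats)

# Useful-life estimate: only random failures belong in the constant rate.
lambda_random = random_failures / cumulative_hours

# Naive estimate when records cannot separate the two failure populations.
lambda_lumped = (random_failures + wearout_failures) / cumulative_hours

print(f"useful-life rate: {lambda_random:.1e} failures/hour")
print(f"lumped rate:      {lambda_lumped:.1e} failures/hour "
      f"({lambda_lumped / lambda_random:.1f}x too high)")
```

Here the lumped estimate is 2.5 times too high, and without time-in-operation records there is no way to detect the error.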
When details about failure cause are not collected, failures due to
maintenance errors, calibration errors, and other systematic faults cannot
be distinguished from random failures. The result, again, is a failure rate
that is too high.
When failure confirmation is not done, multiple instruments are sometimes
replaced during system restoration. When the exact cause of a failure is not
identified, multiple "failures" are reported because the maintenance
technician replaces several items in an effort to find the one that has
actually failed. During a period of unexpected downtime, the emphasis is
clearly on system restoration, and often no time is allocated for failure
identification. This is understandable given that many restoration
situations involve a harsh environment, little time, and a lack of test
equipment. The result of recording multiple failures when only one exists
is a failure rate that is too high.
In some databases technology classes are mixed. The authors have seen
equipment more than fifty years old in operation in industrial processes.
Some of this equipment with vacuum tube technology has a significantly
different failure rate than solid state integrated circuit technology. When
failures from these different technology classes are mixed, the resulting
failure rate data is often too high, dominated by the older, less reliable
equipment.
Available Databases
One of the most popular failure rate databases is the OREDA database
(Ref. 4). OREDA stands for “Offshore Reliability Data.” This book presents
detailed statistical analysis of many types of process equipment. Many
engineers use it as a source of failure rate data to perform safety
verification calculations. It is an excellent reference for all who do data
analysis.
Other industry failure database sources include:
1. FMD-97: Failure Mode / Mechanism Distributions. Reliability
Analysis Center, 1997 (Ref. 5)
Many companies have an internal expert who has studied these sources,
as well as their own internal failure records, and maintains the company
failure rate database. Some use failure data compilations found on the
Internet. While the data in industry databases is not product specific or
application specific, it does provide useful failure rate information for
specific industries (nuclear, offshore, etc.), and a comparison of the data
provides information about failure rates versus stress factors.
Failure rate data alone is not enough to do a good job with probabilistic
safety verification. A probability of fail-danger calculation for safety
verification purposes requires failure mode data. For each piece of
equipment, one must know the failure modes (safe versus dangerous) and
the effectiveness of any automatic diagnostics (the diagnostics coverage
factor). This information is included only in rough form, if at all, in
industry databases, so many engineers doing safety verification
calculations provide an educated, conservative estimate. For most
electronic equipment, the safe percentage is set to 50%. Relays have a
higher percentage of safe failures, with many picking a value of 70% or
80%. Mechanical components like solenoids might be more like 40% safe,
with many failure modes causing stuck-in-place failures that end up being
dangerous in a safety protection application.
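The sketch below shows how such estimates feed a verification calculation: a total failure rate is split into safe and dangerous portions, and a diagnostic coverage factor then splits the dangerous portion into detected and undetected parts. The 50% safe fraction follows the electronic-equipment rule of thumb above; the total rate and the coverage value are assumed for illustration.

```python
def split_failure_rate(lambda_total, safe_fraction, coverage):
    """Split a total random failure rate (failures/hour) into safe,
    dangerous detected, and dangerous undetected portions."""
    lambda_s = safe_fraction * lambda_total
    lambda_d = (1.0 - safe_fraction) * lambda_total
    return {
        "lambda_S":  lambda_s,                     # safe failures
        "lambda_DD": coverage * lambda_d,          # dangerous, detected
        "lambda_DU": (1.0 - coverage) * lambda_d,  # dangerous, undetected
    }

# Electronic equipment: 50% safe per the rule of thumb; the total rate of
# 1.0e-6 per hour and 60% diagnostic coverage are assumed values.
rates = split_failure_rate(1.0e-6, safe_fraction=0.5, coverage=0.6)
for mode, value in rates.items():
    print(f"{mode}: {value:.2e} per hour")
```

The λDU term that comes out of this split dominates most probability of fail-danger calculations, which is why rough failure mode data deserves a conservative choice.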
Generally, less specific data turns out to be more conservative, and that is
appropriate for safety verification purposes, following the rule that "the
less one knows, the more conservative one must be." Remember that
industry databases may include systematic failures, multiple technology
classes, wear out failures, and possibly multiple reports per failure. These
issues naturally cause the numbers from such sources to be high.
Figure 8-1. SERH FMEDA Based Data Page (Ref. 9) (reprinted with permission of exida)
An industry group has defined data gathering techniques (Ref. 11) and
failure taxonomies for various types of process equipment. The important
data that must be collected for a failure event has been defined. Operating
companies from chemical, petrochemical, industrial gases, and other
industries become members and are working to set up inspection and
failure reporting. They have created data collection software that members
use to report field failures to a central database. There is potential that this
effort will eventually yield high quality, industry-specific failure rate data.
Figure 8-2. SERH Generic Failure Data – Switch (Ref. 9) (reprinted with permission of exida)
Figure 8-3. SERH Generic Failure Data – Transmitter (Ref. 9) (reprinted with permission of exida)
Exercises
8-1. What are the key objectives and differences between FMEA and
FMEDA analysis?
8-2. Explain why the failure rate data for SIS components obtained
from field data can be considerably different from that obtained by
means of FMEDA analysis for the components.
8-3. A valve manufacturer has issued a certificate for their valve stating
that according to IEC 61508 the Mean Time between Failures
(MTBF) for the valve was found to be 12000 years. No additional
performance data was listed on the certificate. Can this data be
used for SIL verification calculations?
8-4. The failure rate for an electronic component is 15.6 FITS. What is
the equivalent failure rate in units of failures per hour?
8-5. A FMEDA report has been issued for a solenoid valve based on
IEC 61508 requirements. What parts of the solenoid valve would
have been included in the FMEDA analysis?
8-6. List some of the limitations in using industry database data for SIL
verification calculations.
8-7. Does FMEDA based failure rate data include failure rates due to
expected human error?
10. Arner, D. C. and W. C. Angstadt. “Where For (sic) Art Thou Failure
Rate Data.” ISA Technology Update, 2001.
11. Guidelines for Improving Plant Reliability Through Data Collection and
Analysis. Center for Chemical Process Safety, American Institute of
Chemical Engineers, 1998.