Real Time Fault Monitoring of Industrial Processes
International Series on
MICROPROCESSOR-BASED AND
INTELLIGENT SYSTEMS ENGINEERING
VOLUME 12
Editor
Professor S. G. Tzafestas, National Technical University, Athens, Greece
A. D. POULIEZOS
Technical University of Crete,
Department of Production Engineering and Management,
Chania, Greece
and
G. S. STAVRAKAKIS
Technical University of Crete,
Electronic Engineering and Computer Science Department,
Chania, Greece
Pouliezos, A. D., 1951-
Real time fault monitoring of industrial processes / by A.D.
Pouliezos and G.S. Stavrakakis.
p. cm. -- (International series on microprocessor-based and
intelligent systems engineering; v. 12)
Includes bibliographical references and indexes.
ISBN 978-90481-4374-0 ISBN 978-94-015-8300-8 (eBook)
DOI 10.1007/978-94-015-8300-8
1. Fault location (Engineering) 2. Process control. 3. Quality
control. I. Stavrakakis, G. S., 1958- II. Series.
TA189.8.P88 1994
870.42--dc20 94-2137
ISBN 978-90-481-4374-0
CHAPTER 1
FAULT DETECTION AND DIAGNOSIS METHODS IN THE
ABSENCE OF PROCESS MODEL
1.1 Introduction .................................................................................................. 1
1.2 Statistical aids for fault occurrence decision making ....................................... 2
1.2.1 Tests on the statistical properties of process characteristic
quantities ............................................................................................ 2
1.2.1.1 Limit checking fault monitoring in electrical drives ................ 20
1.2.1.2 Steady-state and drift testing in a grinding-classification
circuit.................................................................................... 21
1.2.1.3 Conclusions ........................................................................... 24
1.2.2 Process Control Charts .......................................................................................... 26
1.2.2.1 An application example for Statistical Process Control
(SPC) ................................................................................... 40
1.2.2.2 Conclusions ........................................................................... 42
1.3 Fault diagnosis based on signal analysis instrumentation .............................. 43
1.3.1 Machine health monitoring methods ................................................... 43
1.3.2 Vibration and noise analysis application examples ............................. 64
1.3.3 Conclusions ....................................................................................... 77
References ..................................................................................................... 78
Appendix 1.A .................................................................................................. 82
Appendix 1.B ................................................................................................... 87
CHAPTER 2
ANALYTICAL REDUNDANCY METHODS
2.1 Introduction ................................................................................................ 93
2.2 Plant and failure models .............................................................................. 94
2.3 Design requirements ................................................................................... 97
2.4 Methods of solution ..................................................................................... 98
2.5 Stochastic modeling methods ..................................................................... 102
2.5.1 Simple tests ..................................................................................... 103
2.5.1.1 Tests of mean ...................................................................... 104
2.5.1.2 Tests of covariance .............................................................. 105
CHAPTER 3
PARAMETER ESTIMATION METHODS FOR FAULT MONITORING
3.1 Introduction .............................................................................................. 179
3.2 Process modeling for fault detection ........................................................... 182
3.3 Parameter estimation for fault detection ..................................................... 186
3.3.1 Recursive least squares algorithms ................................................... 187
3.3.2 Forgetting factors ............................................................................ 191
3.3.3 Implementation issues ...................................................................... 196
3.3.3.1 Covariance instability ......................................................................... 196
3.3.3.2 Covariance singularity......................................................................... 200
3.3.3.3 Speed - Fast algorithms ..................................................................... 202
3.3.3.4 Data weights selection ........................................................................ 205
3.3.5 Robustness issues ............................................................................ 211
3.4 Decision rules ........................................................................................... 218
3.5 Practical examples .................................................................................... 224
3.5.1 Evaporator fault detection................................................................ 224
3.5.2 Gas turbine fault detection and diagnosis ......................................... 228
3.5.3 Fault detection for electromotor driven centrifugal pumps ................ 231
3.5.4 Fault detection in power substations ................................................. 237
3.5.5 Fault diagnosis in robotic systems .................................................... 242
3.6 Additional references ................................................................................. 246
Appendix 3.A.................................................................................................. 247
Appendix 3.B .................................................................................................. 249
References ........................................................................................................ 250
CHAPTER 4
AUTOMATIC EXPERT PROCESS FAULT DIAGNOSIS AND
SUPERVISION
4.1 Introduction .............................................................................................. 256
4.2 Nature of automatic expert diagnostic and supervision systems .................. 257
4.2.1 Expert systems for automatic process fault diagnosis ....................... 257
4.2.1.1 The terminology of knowledge engineering ................................. 257
4.2.1.2 Techniques for knowledge acquisition ........................................... 261
4.2.1.3 Expert system approaches for automatic process fault
diagnosis ..................................................................................................... 271
4.2.1.4 High-speed implementations of rule-based diagnostic
systems ....................................................................................................... 277
4.2.1.5 Validating expert systems ................................................................. 283
4.2.2 Event-based architecture for real-time fault diagnosis ....................... 284
4.2.3 Curve analysis techniques for real-time fault diagnosis ..................... 287
4.2.4 Real-time fault detection using Petri nets .......................................... 291
4.2.5 Fuzzy logic theory in real-time process fault diagnosis ..................... 297
4.3 Application examples ................................................................................ 301
4.3.1 Automatic expert diagnostic systems for nuclear power plant
(NPP) safety................................................................................... 301
4.3.1.1 Diagnostic expert systems for NPP safety .................................... 301
4.3.1.2 Fuzzy reasoning diagnosis for NPP safety ..................................... 305
4.3.2 Automatic expert fault diagnosis incorporated in a process
SCADA system ............................................................................... 311
4.3.3 Expert systems for quick fault diagnosis in the mechanical and
electrical systems domains ............................................................... 328
4.3.4 Automatic expert fault diagnosis for machine tools, robots and
CIM systems ................................................................................... 335
4.4 Conclusions .............................................................................................. 343
References ...................................................................................................... 346
Appendix 4.A A generic hybrid reasoning expert diagnosis model .................. 352
Appendix 4.B Basic definitions of place/transition Petri nets and their use
for on-line process failure diagnosis ......................................... 360
Appendix 4.C Analytical expression for exception using fuzzy logic and its
utilization for on-line exceptional events diagnosis ................... 364
CHAPTER 5
FAULT DIAGNOSIS USING ARTIFICIAL NEURAL NETWORKS
(ANNs)
5.1 Introduction .............................................................................................. 369
5.2 Introduction to neural networks ................................................................. 372
5.3 Characteristics of Artificial Neural Networks ............................................ 374
5.4 ANN topologies and learning strategies ..................................................... 378
5.4.1 Supervised learning ANNs ............................................................... 378
CHAPTER 6
IN-TIME FAILURE PROGNOSIS AND FATIGUE LIFE PREDICTION
OF STRUCTURES
6.1 Introduction .............................................................................................. 430
6.2 Recent non-destructive testing (NDT) and evaluation methods with
applications ............................................................................................... 431
6.2.1 Introduction..................................................................................... 431
6.2.2 The main non-destructive testing methods ........................................ 435
6.2.2.1 Liquid penetrant inspection................................................... 435
6.2.2.2 Magnetic particle inspection ................................................. 436
6.2.2.3 Electrical test methods (eddy current testing (ECT )) ............ 438
6.2.2.4 Ultrasonic testing ................................................................. 440
6.2.2.4 Radiography ........................................................................ 449
6.2.2.5 Acoustic emission (AE) ........................................................ 451
6.2.2.6 Other non-destructive inspection techniques .......................... 452
6.2.3 Signal processing (SP) for NDT ..................................................... .456
6.2.4 Applications of SP in automated NDT ............................................ 459
6.2.5 Conclusions ..................................................................................... 461
6.3 Real-time structural damage assessment and fatigue life prediction
methods .................................................................................................... 463
6.3.1 Introduction ..................................................................................... 463
6.3.2 Phenomenological approach for fatigue failure prognosis ................. 464
6.3.3 Probabilistic fracture mechanics approach for FCG life
estimation ........................................................................................ 467
6.3.4 Stochastic process approach for FCG life prediction ........................ 478
6.3.5 Time series analysis approach for FCG prediction ............................ 482
6.3.6 Intelligent systems for in-time structural damage assessment ............ 488
6.4 Application examples ........................................................................... ..... 506
6.4.1 Nuclear reactor safety assessment using the probabilistic fracture
mechanics method ...................................................... ..................... 506
6.4.2 Marine structures safety assessment using the probabilistic
fracture mechanics method .............................................................. 509
6.4.3 Structural damage assessment using a causal network ...................... 519
References ......................................................................................... ............. 523
Author index ............................................................................................................ 529
Subject index ............................................................................................................ 535
Preface
This book is basically concerned with approaches for improving safety in man-made
systems. We call these approaches, collectively, fault monitoring, since they are
concerned primarily with detecting faults occurring in the components of such systems,
be they sensors, actuators, controlled plants or entire structures. The common feature of
these approaches is the intention to detect an abrupt change in some characteristic
property of the considered object, by monitoring the behavior of the system. This
change may be a slow-evolving effect or a complete breakdown.
In this sense, fault monitoring touches upon, and occasionally overlaps with, other areas
of control engineering such as adaptive control, robust controller design, reliability and
safety engineering, ergonomics and man-machine interfacing, etc. In fact, a system
safety problem could be attacked from any of the above angles of view. In this book,
we don't touch upon these areas, unless there is a strong relationship between the fault
monitoring approaches discussed and the aforementioned fields.
When we set out to write this book, our aim was to include as much material as possible
in a most rigorous, unified and concise format. This would include state-of-the-art
methods as well as more classical techniques still in use today. As we proceeded in
gathering material, however, it soon became apparent that these were contradicting
design criteria and a trade-off had to be made. We believe that the completeness vs.
compactness compromise that we made is optimal in the sense that we have covered the
majority of available methodologies in such a way as to give to the researching engineer
in academia or the professional engineer in industry a starting point for the solution
to his/her fault detection problem. Specifically, this book may be of value to workers in
the following fields:
• Automatic process control and supervision.
• Statistical process control.
• Applied statistics.
• Quality control.
• Computer-assisted predictive maintenance and plant monitoring.
• Structural reliability and safety.
The book is structured according to the main categories of fault monitoring methods, as
considered by the authors: classical techniques, model-based and parameter estimation
methods, knowledge- and rule-based methods, techniques based on artificial neural
networks, plus a special chapter on safety of structures, as a result of our involvement in
this related field. The various methods are complemented with specific applications from
industrial fields, thus justifying the title of the book. Wherever appropriate, additional
references are summarized, for the sake of completeness. Consequently, it can also be
used as a textbook in a postgraduate course on industrial process fault diagnosis.
We would like at this point, firstly, to cite our distinguished colleagues who have before
us attempted a similar task, and have in this way guided us in the writing of this book:
Anderson T. and P.A. Lee (1981). Fault tolerance: Principles and practice. Prentice-
Hall International.
Basseville M. and A. Benveniste, Eds. (1986). Detection of abrupt changes in signals
and dynamical systems, Springer-Verlag.
Basseville M. and I. Nikiforov (1993). Detection of abrupt changes: Theory and
application. Prentice Hall, NJ.
Brunet J., Jaume D., Labarrère M., Rault A. and M. Verge (1990). Détection et
diagnostic de pannes: approche par modélisation. Hermes Press.
Himmelblau D.M. (1978). Fault detection and diagnosis in chemical and petrochemical
processes. Elsevier Press, Amsterdam.
Patton R.J., Frank P.M. and R.N. Clark, Eds. (1989). Fault diagnosis in dynamic
systems: theory and application, Prentice-Hall.
Pau L.F. (1981). Failure diagnosis and performance monitoring. Control and Systems
Theory Series of Monographs and Textbooks, Dekker, New York.
Telksnys L., Ed. (1987). Detection of changes in random processes. Optimization
Software Inc., Publications Division, New York.
Tzafestas, S. (1989). Knowledge-based system diagnosis, supervision and control.
Plenum Press, London.
Viswanadham N., Sarma V.V.S. and M.G. Singh (1987). Reliability of computer and
control systems. Systems and Control Series, vol.8, North-Holland, Amsterdam.
Secondly, we would like to cite some very important survey papers that provided us
with useful insights:
Basseville M. (1988). Detecting changes in signals and systems - A survey. Automatica,
24, 309-326.
Frank P.M. (1990). Fault diagnosis in dynamic systems using analytical and knowledge-
based redundancy - A survey and some new results. Automatica, 26, 459-474.
Gertler J.J. (1988). Survey of model-based failure detection and isolation in complex
plants. IEEE Control Systems Magazine, 8, 3-11.
Isermann R. (1984). Process fault detection based on modeling and estimation methods:
A survey. Automatica, 20, 387-404.
Mironovskii L.A. (1980). Functional diagnosis of dynamic systems - A survey.
Automation and remote control, 41, 1122-1143.
Willsky A.S. (1976). A survey of design methods for failure detection in dynamic
systems. Automatica, 12, 601-611.
Thirdly, we would like to note some important international congresses devoted to fault
monitoring, which show the great importance that this field has recently acquired:
1st European Workshop on Fault Diagnostics, Reliability and related Knowledge-based
approaches. Rhodes, Greece, August 31-September 3, 1986. Proceedings appeared in
Tzafestas S., M. Singh and G. Schmidt, Eds. System fault diagnostics and related
knowledge-based approaches, D. Reidel, Dordrecht, 1987.
1st IFAC Workshop on fault detection and safety in chemical plants, Kyoto, Japan,
September 28th-October 1st, 1986.
2nd European Workshop on Fault Diagnostics, Reliability and related Knowledge-based
approaches. UMIST, Manchester, England, April 6-8, 1987. Proceedings appeared in
M. Singh, K.S. Hindi, G. Schmidt and S.G. Tzafestas (Eds.). Fault Detection and
Reliability: Knowledge-based and other approaches, Pergamon Press, 1987.
IFAC-IMACS Symposium SAFEPROCESS '91, Baden-Baden, Germany, September
10-13, 1991.
International Conference on Fault Diagnosis TOOLDIAG '93, Toulouse, France, April 5-
7, 1993.
IFAC Symposium SAFEPROCESS '94, Espoo, Finland, June 13-15, 1994.
Next, we would like to express our sincerest thanks to all those who helped us in this
effort: our secretaries Stella Mountogiannaki, Irini Marentaki, Dora Mavrakaki and
Vicky Grigoraki, our postgraduate students George Tselentis, Michalis Hadjikiriakos and
Eleftheria Sergaki, and our wives Olga and Aithra who bore with us through the
writing of this book.
Lastly, we would like to deeply thank Professor S. Tzafestas, not only because, as the
Editor of this series, he showed trust in us, but also because he has been constantly
encouraging and helping us in our careers so far.
A.D. Pouliezos
G.S. Stavrakakis
December 1993,
Chania, Greece.
List of figures
Figure 1.1 Grinding-classification circuit. ..................................................................... 22
Figure 1.2 Test of steady state applied on Q6 .............................................................. 24
Figure 1.3 Drift test applied on Q9 .............................................................................. 24
Figure 1.4 Standard deviation test applied on Q9 .......................................................... 25
Figure 1.5 Shewhart control chart ................................................................................ 26
Figure 1.6 Flowchart for computer operated control chart ............................................. 27
Figure 1.7 Three variable polyplot ................................................................................ 39
Figure 1.8 Five variable polyplot .................................................................................. 39
Figure 1.9 Seventeen variable polyplot ......................................................................... 39
Figure 1.10 Six variable polyplot with Hotelling's T² of production data, 2
observations per glyph ................................................................................. 41
Figure 1.11 Frequency analyzed results give earlier warning .......................................... .45
Figure 1.12 Vibration Criterion Chart (from VDI 2056) ................................................ .48
Figure 1.13 Benefits of frequency analysis for fault detection ......................................... 49
Figure 1.14 Typical machine "signature" ........................................................................ 50
Figure 1.15 Effect of misalignment in gearbox ................................................................ 51
Figure 1.16 Electric motor vibration signature ................................................................ 52
Figure 1.17 Mechanical levers ........................................................................................ 53
Figure 1.18 Proximity probe .......................................................................................... 53
Figure 1.19 Accelerometer ............................................................................................. 53
Figure 1.20 Extraction fan control surface ..................................................................... 56
Figure 1.21 System analysis measurements .................................................................... 60
Figure 1.22 Differences between H₁ and H₂ measurements ............................................. 64
Figure 1.23a Effect of tooth deflection ............................................................................. 65
Figure 1.23b Effect of wear ............................................................................................. 65
Figure 1.24 Gear toothmeshing harmonics ...................................................................... 66
Figure 1.25 The use of the cepstrum for fault detection and diagnosis of a gearbox ......... 67
Figure 1.26 Faults in rolling element bearings ................................................................ 68
Figure 1.27 Faults in ball and roller bearings .................................................................. 68
Figure 1.27a Block diagram representation of the on-line bearing monitoring
system ......................................................................................................... 69
Figure 1.28 Reciprocating machine fault detection .......................................................... 72
Figure 1.29 Basic steps used in the analysis for collecting spectra ................................... 72
Figure 1.30 Simplified logic tree and complementary interrogatory diagnosis .................. 73
Figure 1.31 Flow chart of the automated spectral pattern fault diagnosis method
for gas turbines ........................................................................................... 74
Figure 1.A.1 Operating characteristic curves for the sample mean test, Pf = 0.01 ............. 82
Figure 1.A.2 Operating characteristic curves for the sample mean test, Pf = 0.05 .............. 82
Figure 1.A.3 Power curves for the two-tailed χ²-test at the 5% level of
significance ................................................................................................. 83
Figure 1.B.1 Derivation of the power cepstrum ................................................................ 92
Figure 2.1 General architecture of FDI based on analytical redundancy ......................... 98
Figure 3.18 Change of pump packing box friction by tightening and loosening of
the cap screws ............................................................................................. 236
Figure 3.19 Detailed one-line diagram of a typical high voltage substation ...................... 242
Figure 3.20 Four processor real-time computer implementation of DC-drive fault
detection algorithm ...................................................................................... 246
Figure 4.1 Relationship between terms in knowledge engineering .................................. 259
Figure 4.2 Event based diagnostic architecture and messages ........................................ 286
Figure 4.3 Curve analysis based diagnosis combining digital signal processing
and rule-based reasoning ............................................................................. 290
Figure 4.4 Diagnosis of sensors .................................................................................... 292
Figure 4.5 Diagnosis of sensors .................................................................................... 293
Figure 4.6 Different states in the Petri net based monitoring concept ............................. 293
Figure 4.7 Concept of the mechanism which handles the rules in Petri net based
fault diagnosis ............................................................................................. 295
Figure 4.8 Representation of the fuzzy function ............................................................ 298
Figure 4.9 Determination of the maximum ordinate of intersection between A and
A* ............................................................................................................... 299
Figure 4.10 The expert system diagnostic process for NPP safety ................................... 303
Figure 4.11 Diagram of boiling water reactor cooling system .......................................... 305
Figure 4.12 Flow of failure diagnosis with implication and exception .............................. 309
Figure 4.13 Example of fuzzy fault diagnosis by CRT terminal ...................................... 310
Figure 4.14 The general appended KBAP configuration .................................................. 313
Figure 4.15 Metalevel control in a KBAP ....................................................................... 315
Figure 4.16 The internal organization of the metalevel control rule node ...................... 316
Figure 4.17 General object level rule examples for a low voltage bus .............................. 318
Figure 4.18 Partial decision tree diagram and corresponding PRL rules for a
motor pump fault diagnosis ......................................................................... 323
Figure 4.19 Circuit to detect transient faults in a microcomputer system ......................... 327
Figure 4.20 A production system workstation monitoring system.................................... 336
Figure 4.21 CIM system layout ...................................................................................... 339
Figure 4.22 CIM system example diagnosis .................................................................... 341
Figure 4.23 Updated probabilities in deep KB for CIM system diagnosis ........................ 343
Figure 4.24 D-S (deep-shallow) type of expert hybrid reasoning ..................................... 352
Figure 4.25 Functional hierarchy as deep knowledge base ............................................... 353
Figure 4.26 Rule hierarchy as shallow knowledge base ................................................... 354
Figure 4.27 Schematic diagram for diagnostic strategy ................................................... 355
Figure 4.28 Sets of relation between failure and symptom ............................................... 365
Figure 4.29 Linguistic truth value of failure derived from exception ................................ 368
Figure 5.1 Feedforward and CAM/AM Neural Network structure ................................. 373
Figure 5.2 Features of artificial neurons ....................................................................... 375
Figure 5.3 Neuron activation characteristics ................................................................. 376
Figure 5.4 Neuron output function characteristics ......................................................... 377
Figure 5.5 Structure of multiple-layer feedforward neural network ................................ 380
Figure 5.6 Structure of ART network ........................................................................... 386
Figure 5.7 Expanded view of ART networks ................................................................ 387
Figure 5.8 Topological map configurations ................................................................... 390
Introduction
The writing of this book has been motivated by the fact that a very large amount of
knowledge regarding fault monitoring has been accumulated. This accumulation is the
result of two factors: firstly, man has always been interested in preventing catastrophes,
whether a consequence of his works or of natural causes, and secondly, as technology
advances, it seems that, unfortunately, the risk of catastrophic events occurring
increases.
The latter fact is the result of bigger and more complicated plants, which make it
impossible for human operators to manage or control them. Thus the need for automatic
or operator-aiding fault monitoring systems. Fortunately, results from research into
man-made system safety are equally applicable to protection from undesirable natural
phenomena, such as earthquake prediction or meteorological forecasting. Even more, the
same results find applications in many diverse scientific disciplines, such as
bioengineering (e.g. arrhythmia detection), speech processing, traffic control (incident
detection) and in any other area where dynamic phenomena with possibly time-varying
parameters occur. In this sense, the meaning of the term failure can be extended to
mean change. Thus the following definition can be made:
A change is any discrepancy between an assumed value of a monitored parameter of an
object and its measured, estimated or predicted value. This change may be the result of a
natural operation, assuming many operational modes, or the result of a malfunction.
Since this book is about fault monitoring in industrial, thus man-made, systems, let us
concentrate henceforth on this area. Malfunctions can occur in sensors, actuators,
controller hardware or software, the process itself and in structures (vessels, pipes,
beams etc.). The terms fault and failure are used interchangeably, but a subtle difference
does exist: a fault occurs when the item in question operates incorrectly in a permanent
or intermittent manner; failure on the other hand denotes a complete operational
breakdown. To avoid confusion these two terms will be used with the same meaning
throughout this book. Related to this terminology is the notion of fault tolerance,
signifying the ability of a system to withstand malfunctions whilst still maintaining
tolerable performance. It is obvious, however, that fault tolerance includes fault
monitoring and diagnosis and the ability of the system to reorganize or restructure itself,
following fault identification.
A fault monitoring system should perform the following tasks:
• Fault detection and isolation (FDI).
• Diagnosis of effect, cause and severity of faults in the components of a system.
• Reconfiguration or restructuring of appropriate control laws, to effect tolerable
operation of the system if possible. If not, issue of shutdown or other emergency
advice (e.g. abandon aircraft).
Additionally, the performance of the above tasks, should meet certain requirements.
Stated informally, these are:
• As many as possible true faults should be detected, while as few as possible false
alarms should be triggered.
• The delay time between a fault occurrence and a fault declaration should be small.
• The accuracy of the estimated fault parameters (location, size, occurrence time etc.)
must be high.
• The employed method must be insensitive (robust) to model inaccuracies (if a
mathematical model is used) such as simplification errors resulting from linearization
or unmodeled, usually non-linear components, e.g. friction, and external phenomena
such as noise, load variation etc.
It is obvious, even to the uninitiated, that simultaneous satisfaction of the above
requirements leads to contradiction, which is usually resolved by trade-off methods.
Incorporating a fault monitoring system into an industrial process results in improved
reliability, maintainability and survivability. These terms are defined as:
• Reliability deals with the ability to complete a task satisfactorily and the period of
time over which that ability is retained.
• Maintainability concerns the need for repair and the ease with which repairs can be
made, with no premium placed on performance.
• Survivability relates to the likelihood of conducting an operation safely (without
danger to human operators or the system) whether or not the task is completed.
Furthermore increased system autonomy is achieved.
The main types of failures and errors which endanger the safe operation of a technical
process are:
• System component failures caused by physical or chemical faults.
• Energy supply failures, caused for example by power supply faults.
• Environmental disturbances and external interference.
• Human operator errors.
• Maintenance errors and failures caused by wrong repair actions.
• Control system failures.
Since "a chain is not stronger than its weakest link", the task of achieving a high system
availability must be performed with a total system availability attitude. This concept,
however is not easy to be realised in practice. The key point in this practical
implementation is the concept of lije-cycle maintenance. This is defined as those actions
that are appropriate for maintaining a facility in a proper condition, so that its required
function·s are performed throughout its life-cycle.
Two major problems have to be considered in life-cycle maintenance. One is how to
cope with unexpected deteriorations and failures. The other is how the maintenance
activities are properly adjusted to the various changes inside and outside the facility.
[Figure: prediction of deterioration in a computer-aided predictive maintenance system (CAPMS), combining modification of the deterioration model with knowledge from other plants and research, and allowing for unexpected deteriorations and failures.]
To carry out this prediction, the strategy planning subsystem needs to refer to specific data about the facility in question. This data is
contained in a facility data base which consists of a facility model, an environment model
and an operation and maintenance record.
The maintenance management subsystem manages and controls the actual maintenance
actions based on the strategy selected in the strategy planning subsystem. From the
results of maintenance actions, the deteriorations and failures detected in the facility are
analyzed. If the deteriorations and failures correspond to those predicted in the strategy
planning subsystem, the maintenance management subsystem keeps the same strategy
and makes a plan for the next maintenance cycle. On the other hand, occurrences of
unpredicted deteriorations or failures indicate improper predictions in the strategic
planning subsystem. In this case, the information is fed back to the strategy planning
subsystem, and the prediction of deteriorations is carried out over again for revising the
maintenance strategy plan.
To sum up, there are two feedback loops in a CAPMS. One is the routine feedback loop
to provide the information gathered during maintenance actions to the next maintenance
plan. The second is the strategic feedback loop which becomes active when the actual
data is recognized as inconsistent with the assumed scenario of the maintenance
strategy.
To realize a CAPMS, the following major items have to be studied:
• Prediction of deterioration. As mentioned already, the prediction of deterioration is
an essential function of a CAPMS.
• Deterioration effect evaluation. The deteriorations that propagate through the
facility cause functional degradations and failures. To evaluate these effects with a
computer, one needs functional models of the facility. A significant amount of work
on this subject has been done in conjunction with diagnostic expert systems.
• Monitoring and diagnosis. Although a number of technologies have been developed
for monitoring and diagnosis, one still needs techniques for detecting the progress of
various deteriorations at their early stages.
• Selection of maintenance strategies. The selection of the optimum maintenance
strategy is the key function of a CAPMS.
• Common data base and facility data base. A CAPMS requires sophisticated data
bases. It is necessary to study the structure of the common data base so that it is
effective in deterioration prediction. With regard to the facility data base, the product
model should be used as the foundation.
The subject of the present book is the detailed exposition of the first three items.
It is well known that the use of traditional techniques to build a desired control system
requires a huge effort. Thus, there is also a significant need for better ways to create such a
system. If the basic tasks of process automation, the feed-forward and feedback control,
are dedicated to a first automation level, the various tasks with supervisory functions can
be considered as forming a second level. These supervisory functions serve to indicate
undesired or unpermitted process states and to take appropriate actions in order to avoid
damage to the process or an accident involving human beings.
It is assumed that faults affect the technical process and its control. As mentioned earlier,
a fault is to be understood as a nonpermitted deviation of a characteristic property of the
process itself, the actuators, the sensors and controllers. If these deviations influence the
measurable variables of the process, they may be detected by an appropriate signal
evaluation. The corresponding fault detection and isolation (FDI) functions, called
monitoring, consist of checking the measurable variables with regard to a certain
tolerance of the normal values (limit or trend checking) and trigger alarms if the tolerances
are exceeded. Based on these alarms the operator takes appropriate actions. In cases
where the limit value violation signifies a dangerous process state, an appropriate action
can be initiated automatically. This is called automatic protection. Both supervisory
functions may be applied directly to the measured signals or to the results of a following
signal analysis, as in the case of frequency spectra of vibrations for rotating machines.
These classical ways of limit value checking of some important measurable variables are
appropriate for the overall supervision of the processes. However, developing internal
process faults are only detected at a rather late stage and the available information does
not allow an in-depth fault diagnosis. This is one of the reasons that process operators
are still required for the supervision of important processes. These human operators use
their own sensors, data records, own reasoning and long term experience to obtain the
required information on process changes and its diagnosis.
If the supervision is going to be improved and automated, a natural first step consists of
adding more sensors, and a second step of transferring the operators' knowledge into
computers. Here it is usually desirable to add sensors which directly indicate faults.
However, as the number of sensors, transmitters and cables increases, the cost goes up and
the overall reliability is not necessarily improved. Furthermore, many faults cannot be
detected directly by available sensor technology.
In practice, the most frequently used diagnostic approach is the limit checking of
individual plant variables. While very simple, this approach has serious drawbacks,
namely:
• Since the plant variables may vary widely due to input variations, the test thresholds
have to be set quite conservatively;
• Since a single component fault may cause many plant variables to exceed their limits,
fault isolation is very difficult (multiple symptoms of a single fault appear as multiple
"faults").
Consistency checks for groups of plant variables eliminate the above problems; the price
to be paid is the need for an accurate mathematical model. Model-based FDI consists of
two stages: residual generation and decision making based on these residuals.
In the first stage, outputs and inputs of the system are processed by an appropriate
algorithm (a processor) to generate residual signals which are nominally near zero and
which deviate from zero in characteristic ways when particular faults occur. The
techniques used to generate residuals differ markedly from method to method.
In the second (decision making) stage, the residuals are examined for the likelihood of
faults. Decision functions or statistics are calculated using the residuals, and a decision
rule is then applied to determine if any fault has occurred. A decision process may consist
of a simple threshold test on the instantaneous values or moving averages of the
residuals, or it may be based directly on methods of statistical decision theory, e.g.
sequential probability ratio testing.
For a static system, the residual generator is also static; it is simply a rearranged form of
an input-output model, e.g. a set of geometric relationships or of material balance
equations. For a dynamic system, the residual generator is dynamic as well. It may be
constructed by a number of different techniques. These include parity equations or
consistency relations, obtained by the direct conversion of the input-output or state-
space model of the system, diagnostic observers and Kalman filters.
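As a minimal sketch of these two stages for a static system, assume a hypothetical tank with one inlet and two outlet flow meters. The material-balance parity relation says the flows should sum to zero; the residual is that sum, and the decision rule is a simple threshold test on its moving average. All numbers below are illustrative assumptions.

import statistics

# Stage 1: residual generation from a material-balance parity relation.
# For a hypothetical tank in steady state: inlet flow = outlet1 + outlet2.
def residual(q_in: float, q_out1: float, q_out2: float) -> float:
    return q_in - (q_out1 + q_out2)      # nominally near zero

# Stage 2: decision making - threshold test on a moving average of residuals.
def fault_declared(residuals: list, threshold: float = 0.5) -> bool:
    return abs(statistics.mean(residuals)) > threshold

# Illustrative data: a bias appears on the inlet flow meter halfway through.
window = [residual(10.0 + (0.0 if k < 5 else 1.2), 6.0, 4.0) for k in range(10)]
print(fault_declared(window))            # True: the averaged residual exceeds the threshold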
While a single residual is sufficient to detect a fault, a set of residuals is required for fault
isolation. To facilitate isolation, residual sets are usually enhanced in one of the
following ways:
• In response to a single fault, only a fault-specific subset of the residuals becomes
nonzero (structured residuals).
• In response to a single fault, the residual vector is confined to a fault-specific
direction (fixed direction residuals).
Also, to simplify statistical testing in a noisy system, it is useful if the residuals are
"white", that is, uncorrelated in time. Residuals need to be insensitive to some
disturbance variables. This may be addressed as an explicit disturbance decoupling
problem or handled as a special case of structured residuals.
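A minimal illustration of isolation with structured residuals, assuming three residuals and three possible faults; the incidence matrix of fault signatures shown in the comments is hypothetical and chosen only so that every fault produces a distinct fired/not-fired pattern.

# Hypothetical fault signatures: entry 1 means the residual responds to that fault.
#                      r1  r2  r3
SIGNATURES = {
    "sensor_1_fault":  (1, 0, 1),
    "sensor_2_fault":  (0, 1, 1),
    "actuator_fault":  (1, 1, 0),
}

def isolate(residuals, threshold=0.5):
    """Compare the fired/not-fired pattern of the residuals with the known signatures."""
    pattern = tuple(int(abs(r) > threshold) for r in residuals)
    return [fault for fault, sig in SIGNATURES.items() if sig == pattern]

print(isolate([0.9, 0.1, 0.8]))   # ['sensor_1_fault']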
A fundamental issue in the generation of residuals is their robustness (insensitivity)
relative to unavoidable modeling errors. Robustness concerns have plagued
implementation of detection filters (as well as other failure detection methods) since their
introduction. False alarms and incorrect identification of faults due to noise,
disturbances, plant parameter uncertainties and unmodelled system dynamics have led to
the design of robust detection filters and to determining appropriate thresholds for a
given detection filter. Various techniques have been proposed to make the failure
detection process more robust. Design methods are proposed with the goal of making
the detection filter very much more sensitive to one fault than others. These methods
have been shown to be specific cases of the unknown input observer approach. In this
approach, noise, disturbances, parameter uncertainties and unmodelled dynamics are
modeled as "fault events" of the system, along with the fault events arising from actual
system failures. An observer is then designed to be sensitive to a fault event of interest,
while insensitive to as many other real and pseudo-fault events as possible.
Intelligent process supervision spans a range of functions, from the low-level "execution
level" to the high-end "supervision and planning level". The execution level includes the use of
techniques such as "fuzzy control" or "neural control" for closed loop control. Fuzzy
logic is used to express and manipulate ill-defined qualitative terms like "large", "small",
"very small", etc. in a well defined mathematical way to mimic the human operator's
manual control strategy. Qualitative rules are used to express how the control signal
should be chosen in different situations. "Neural control" refers to the use of neural
networks in developing process models which are then used to implement robust, model-
predictive controllers.
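A minimal sketch of how such qualitative terms can be given a well-defined mathematical form: triangular membership functions for "small" and "large" control error, combined by a simple zero-order Sugeno-style weighted average of the rule outputs. The membership shapes and the crisp action values are illustrative assumptions only.

# Triangular membership functions for the qualitative terms (illustrative shapes).
def mu_small(e):   # fully "small" at |e| = 0, no longer "small" beyond |e| = 1
    return max(0.0, 1.0 - abs(e))

def mu_large(e):   # fully "large" at |e| >= 2
    return min(1.0, max(0.0, abs(e) - 1.0))

# Two qualitative rules: IF error is small THEN action is mild;
#                        IF error is large THEN action is strong.
def control_action(error):
    w_small, w_large = mu_small(error), mu_large(error)
    outputs = {"mild": 0.5, "strong": 5.0}        # hypothetical crisp actions
    num = w_small * outputs["mild"] + w_large * outputs["strong"]
    den = w_small + w_large
    return num / den if den > 0 else 0.0          # weighted-average defuzzification

print(control_action(0.2), control_action(1.8))   # mild action vs. strong action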
The high-end "supervisory level", on the other hand, seeks to extend the range of
conventional control algorithms through the use of knowledge-based systems (KBSs) for
tuning controllers, performing fault diagnosis and on-line reconfiguration of control
systems and process operation.
For diagnosis, these knowledge-driven techniques involve the interpretation of sensor
readings and other process observations, detection of abnormal operating conditions,
generation and testing of malfunction hypotheses that can explain the observed
symptoms and finally resolution of any interactions between hypotheses. Fundamentally,
diagnosis is viewed as a decision-making activity that is not numeric in nature. While the
governing elements are symbolic, numeric computations still play an important role in
providing certain kinds of information for making decisions and drawing diagnostic
conclusions.
Neural nets are expected to improve today's automated supervising systems, because
complex classifiers can be designed with neural nets. Artificial neural networks, even in
their simplest form, are good pattern recognizers. Input vectors are introduced into the
network and via supervised or unsupervised learning the weights on the connections of
the network are adjusted to achieve certain goals: matching targets for supervised
learning or forming clusters with unsupervised learning. Subsequent input vectors of
similar types can be classified properly, but, of course, novel input patterns that the
network was not trained to recognize, cannot be classified successfully any more than a
clustering code will correctly classify a new cluster on which it was not trained.
As the pattern vectors get to be large and complex, conventional numerical algorithms
may not be able to properly handle the task of recognition promptly (with a computer of
reasonable cost). For example, in the analysis of faults in rotating equipment, several
sensors could be used to collect measurements of the vibrations in the x, y, and z axes.
Ultra high-speed data sampling would be applied to get the complex waveforms
involved, followed by Fourier analysis to extract the frequency components. Then
statistical reduction might be applied to isolate patterns of condition from which a
decision could be made as to whether the equipment was operating normally, or not. But
all of these calculations take time and computer power. An ANN, once trained, can reach
the decision state far more rapidly.
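The pipeline just described can be sketched as follows, assuming NumPy is available. The synthetic signal, sampling rate and stored condition patterns are made up purely for illustration, and a nearest-reference-pattern rule stands in for the trained ANN classifier.

import numpy as np

fs = 10_000                                   # assumed sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 50 * t) + 0.3 * np.sin(2 * np.pi * 730 * t)   # synthetic vibration signal

# Fourier analysis: magnitude spectrum, then statistical reduction to band energies.
spectrum = np.abs(np.fft.rfft(x))
bands = np.array_split(spectrum, 8)
features = np.array([np.sum(b**2) for b in bands])
features /= features.sum()                    # normalized pattern of condition

# Stand-in for the trained classifier: nearest stored pattern (patterns are hypothetical).
patterns = {"normal":        np.array([.8, .1, .05, .02, .01, .01, .005, .005]),
            "bearing_fault": np.array([.4, .1, .05, .3, .1, .02, .02, .01])}
condition = min(patterns, key=lambda k: np.linalg.norm(features - patterns[k]))
print(condition)                              # prints the condition closest to the extracted features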
Another desirable feature of ANNs is that good models are not required to reach the
decision stage. In a typical operation in a chemical plant, the process model may be only
approximate and the critical measurements may all be correlated with each other and
include non-normally distributed noise. Thus, the assumptions underlying the usual
statistical analysis for faults are violated to some unknown extent. An ANN seems to be
able to internally map the functional relations that represent the process, filter out the
noise, and handle the correlations as well.
Petri nets are suitable for the description of discrete events or processes. An important
property of these nets is their capability to model and describe concurrent and
asynchronous processes. Petri nets are supported by a rich mathematical theory, enabling
one to simulate and analyze them. Thanks to this characteristic they can be used for processing
in computers and controllers as well as in tool automation.
Since the introduction of Petri nets in 1962, a range of net classes has been defined.
These classes are distinguished according to the amount of inherent information they carry. In
Petri net-based diagnosis, place/transition nets are used on the area level and
condition/event nets on the component level. These net classes are, on the one hand,
sufficient for modeling events and processes in machines or plants and, on the other
hand, they provide high performance during their processing in a computer or a PLC.
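A minimal sketch of a place/transition net as used for such monitoring, assuming a toy two-place net (pump "stopped"/"running") with "start" and "stop" transitions; marking, enabling and firing are the standard Petri-net notions, while the net itself and the event names are hypothetical.

# Places hold tokens (the marking); transitions consume and produce tokens.
marking = {"pump_stopped": 1, "pump_running": 0}

TRANSITIONS = {                      # transition: (input places, output places)
    "start_pump": (["pump_stopped"], ["pump_running"]),
    "stop_pump":  (["pump_running"], ["pump_stopped"]),
}

def enabled(t):
    inputs, _ = TRANSITIONS[t]
    return all(marking[p] >= 1 for p in inputs)

def fire(t):
    if not enabled(t):
        # An observed event whose transition is not enabled contradicts the model:
        # in Petri-net-based monitoring this inconsistency is reported as a fault symptom.
        raise RuntimeError(f"event '{t}' not allowed in the current state")
    inputs, outputs = TRANSITIONS[t]
    for p in inputs:  marking[p] -= 1
    for p in outputs: marking[p] += 1

fire("start_pump")                   # consistent event sequence
fire("stop_pump")
try:
    fire("stop_pump")                # inconsistent: the model says the pump is already stopped
except RuntimeError as err:
    print("fault symptom:", err)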
Ageing plant (nuclear, conventional power, chemical, offshore, etc.) life management,
life-cycle optimization and in-time safety assessment are the integral elements in safe
operation and maintenance practices. Life management relies on accurate condition
assessment and this can be achieved by integration of on-line monitoring and off-line
inspections. The essential role of the maintenance activities is to provide plant operators
with all the functions needed for safe and economical plant operation.
A systematic approach to life management has three steps:
• Data management and selection of critical and/or important areas.
• Life assessment in the critical/important areas or condition assessment.
• Control of life and life extension, if needed, including possible refurbishment or
upgrading.
During the first step, for example, the economical benefits of the process, the life-limiting
factors of components, fault history and safety factors affect the decision of criticality.
The second step includes conventional life assessment work. The third step is the actual
process of managing plant life by operation, maintenance, refurbishment, training and
cost control.
Tools are needed for organizing, analyzing and transferring knowledge between the
people involved. Tools must be closely related to the strategies of the company and they
must fit the tasks of the users. Therefore, the strategies and the operation models have to
be defined before the system definition.
The data to be used in the analyses has to be systematically gathered and saved.
Inspection results, maintenance history and fault history are all important when trying to
assess the life of a certain component. Data management is a necessity. Before doing
any life assessment or action plan, the critical components have to be defined. They are
the components that for some reason are suspected as critical and require further
analysis.
The criticality of a component can be defined with several methods. Theoretical
criticality consists of analysis of economical aspects, like the operational effects of a
damage in the component, the delivery time of a component, the statistical fault
frequency of the component and safety factors analysis. One way of determining
criticality is to find out the components with the lowest remaining life time based on
stress calculations and the international standards for life assessment. The effect of creep
can be calculated based on estimated (static) temperatures and pressures or actual
temperature and pressure history. The calculation gives as a result an estimated
remaining life in hours. However, the calculations are very conservative and therefore the
results cannot be taken as accurate facts. Instead, they offer quite a good picture of the
most critical parts of a critical installation, e.g. piping. The components with the lowest
remaining life time, or the components whose usage factor exceeds a certain level, are
classified as critical. On the basis of these calculations the first plans for more accurate methods
like non-destructive tests can be made. Operational history affects the life of
components. Also, authorities' demands have to be taken into account when planning the
components to be checked. NDT test results and fault history give quite an accurate
estimate about the condition of a component. There are several kinds of standards,
material expert knowledge and company specific directions for determining the
criticality.
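As a rough illustration of such a conservative remaining-life calculation (not the specific procedure referenced here), one common route uses the Larson-Miller parameter together with a linear (Robinson) damage summation; the material constant, the Larson-Miller value at the operating stress and the temperature history below are entirely hypothetical.

import math

C = 20.0                     # Larson-Miller constant, a typical assumed value
LMP_AT_STRESS = 20_250.0     # hypothetical Larson-Miller parameter at the operating stress

def rupture_life_hours(temp_K: float) -> float:
    # Larson-Miller relation LMP = T * (C + log10(t_r)), solved for t_r.
    return 10 ** (LMP_AT_STRESS / temp_K - C)

# Linear damage summation over a made-up (temperature K, hours) operating history.
history = [(810.0, 30_000.0), (825.0, 12_000.0)]
usage_factor = sum(hours / rupture_life_hours(T) for T, hours in history)

# Simplification: project the remaining life at the most recent operating temperature.
remaining_hours = (1.0 - usage_factor) * rupture_life_hours(history[-1][0])
print(round(usage_factor, 3), round(remaining_hours))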
The planned life time of the utility controls all life management decisions. The decisions
made at the operational level have to follow the strategy of the company. A schedule of
reinspections and other maintenance operations is needed for planning, for instance, the
overhauls and budgeting. It is also needed for the authorities, who must be confident of
the safe operation of the plant. The schedule gets tighter as the utility gets older.
Also, even if a plan for a longer period is needed, it changes over time according to the
incoming information.
There are several kinds of decisions concerning life management:
• Reinspection intervals and reinspection methods to be used have to be frequently
determined.
• It may be necessary to decide if a component should be repaired or changed
(investment costs have to be taken into account).
• If the strategy of the company is to extend the life of the plant, change of the
operation parameters of a critical component can be considered.
The following modules are examples of the domain knowledge that is needed for life
management:
• Components: the component classification (pipe, bend, weld, etc.) and the relations
between the components have to be described.
1.1 Introduction
this chapter because no process model is needed for their application and important
signal processing is needed for their performance. Illustrative examples from practical
fault diagnosis cases for all the above methods are presented.
The diagnosis of the working state of an installation is a treatment lying between the
acquisition phase of the information and the control phase of this installation.
The treatment is necessary to have representative information about the process state in
order to act on the process operation. This representativeness implies that the information
must be sufficient (observability concepts), that it must be collected from judiciously
positioned sensors and that it has to be free from errors detrimental to the
interpretation of the phenomena represented by this information.
It is also a delicate treatment, because there are various types of information (logical,
analogical, deterministic, statistical, fuzzy, ...) and because they are relative to different
subsets (the process itself, the actuators, the regulation systems, the measurement
acquisition chains, ...). This treatment comprises two aspects:
• The first concerns the detection and localization of failures that may affect the
process and also the instrumentation set-up, except the sensors. A "failure" may be an
alteration in the operation of a process element, such as sensor bias, actuator
locking or choking up, or a significant deviation of the process state variables from the
no-fault operation limits. Measurement noise is not considered a failure.
• The second aspect forms part of the data validation. It consists of detecting failures
in sensors and in correcting the doubtful measurement if the case arises. This aspect
will be dealt with in Chapter 2.
In this section some of the basic statistical tools used in failure decision making will be
described. The strategy proposed is based on exploitation of the information given by
sensors and detectors at local level. The results may be synthesized by a "knowledge
based system" proposing a diagnosis. Knowledge based diagnosis systems are examined
in detail in Chapter 4.
It is first assumed that the means μ₀ and μ₁ before and after the jump are known. Several
possible solutions in the (real) situation where μ₀ is known (possibly via a recursive
parameter identification) and μ₁ is unknown are indicated in Chapters 2 and 3.
General comments on hypothesis testing. To test any hypothesis on the basis of a
random sample of observations, the sample space Ω (i.e. all the possible sets of
observations) is divided into two regions. If the observed point, say r, falls into one of these
regions, say ω, the hypothesis is rejected in favour of an alternative hypothesis; if r falls
into the complementary region Ω−ω the hypothesis is accepted. ω is known as the critical
region of the test and Ω−ω is called the acceptance region.
When making statistical hypothesis tests, the possibility of erroneous inference exists.
This falls into two categories for the case where a null hypothesis (H₀) is tested against
an alternative hypothesis (H₁):
Type I: (H₀) is rejected when it is true.
Type II: (H₀) is accepted when it is false.
The probability of a type I error is equal to the size of the critical region used, termed the
significance level of the test and denoted by α. Thus,
$$P[r \in \omega \mid H_0] = \alpha$$
In the present context α will be defined as the probability, P_f, of a false alarm. Hence,
$$P[r \in \omega \mid H_0] = P_f \qquad (1.3)$$
The probability of a type II error is a function of the alternative hypothesis (H₁), termed
the operating characteristic (OC) of the test and denoted by β. Hence,
$$P[r \in \Omega-\omega \mid H_1] = \beta, \qquad P[r \in \omega \mid H_1] = 1-\beta$$
The complementary probability 1−β is called the power of the test and in the present
context it will be defined as the probability, P_d, of correct fault detection. Thus,
$$P[r \in \omega \mid H_1] = P_d$$
For a given P_f, solution of (1.3) will generally yield an infinity of subregions all obeying
(1.3). In this case ω is chosen so that P_d is maximum. This is a fundamental principle in
statistical decision theory first expressed by J. Neyman and E. S. Pearson (Kendall, 1982).
A critical region whose power is no smaller than that of any other region of the same size
for testing a hypothesis H₀ against an alternative H₁ is called a best critical region
(BCR), and a test based on a BCR is called a most powerful (MP) test.
When testing a hypothesis (H₀) against a class of alternatives, i.e. a composite hypothesis
(for example, when testing for a zero mean against a non-zero mean), an MP test could be
found for each of the different members of (H₁) (an infinity for the aforementioned example). If
there exists a BCR which is best for every member of (H₁), then this region is called
uniformly most powerful (UMP) and the test based on it a UMP test.
Fault diagnosis. Any kind of fault occurrence makes the standardized scalar residual
stochastic process γ(k) = y(k) − ŷ(k) depart from its zero-mean, σ²-variance and/or
whiteness properties. Therefore it is useful to perform the following four statistical tests:
a. Sample mean (parametric test).
The test statistic commonly used for testing
(H₀): γ(k) = 0
against (H₁): γ(k) = γ₁(k) ≠ 0; k = i, ..., j
is the sample mean defined by:
$$\bar{\gamma} = \frac{1}{n}\sum_{k=i}^{j} \gamma(k) \qquad (1.4)$$
Under the null hypothesis, the sample mean is normally distributed with zero mean and
variance c/n, where c is the variance of the observations sequence (i.e. σ², or the
calculated sample variance if σ² is unknown) and n is the size of the sample.
The probabilities P_f and P_d are respectively given by
$$P_f = P\!\left[\,|\bar{\gamma}| > \sqrt{c/n}\; z_{P_f/2}\right] \qquad (1.5)$$
and by the corresponding expression (1.6) evaluated under (H₁), where Φ denotes the
standard normal distribution function,
$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-y^2/2}\, dy$$
P_f, P_d and n are functionally related through the two equations defining P_f and P_d. P_d also
depends on the unknown γ₁(k). Typical values for P_f are 0.1 and 0.05, though this will of
course depend on the specific application requirements. Having fixed P_f, then P_d, n and
the critical region can be chosen using equations (1.5) and (1.6).
The UCL and LCL values (defined in Section 1.2.2) are given by:
$$UCL = \sqrt{c/n}\; z_{P_f/2}, \qquad LCL = -\sqrt{c/n}\; z_{P_f/2}$$
The graph of 1−P_d, called the operating characteristic (OC) curve, is shown in Appendix
1.A (Figs. 1.A.1 and 1.A.2) for different values of the sample size n and
for P_f = 0.01 and 0.05 respectively. As can be seen from the graphs, increasing the
sample size increases P_d, but at the expense of an increase in the detection delay time,
since by averaging a larger number of residuals the effect of a fault is smoothed out.
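A minimal sketch of this sample-mean test is given below. It only assumes approximately normal, independent residuals with known variance c; the function names and the example data are illustrative, not from the book.

```python
# Sketch of the sample-mean test and its control limits.
import numpy as np
from scipy.stats import norm

def sample_mean_limits(c, n, p_f=0.05):
    """UCL/LCL for the windowed sample mean at false-alarm probability p_f."""
    z = norm.ppf(1.0 - p_f / 2.0)          # two-sided standard-normal quantile
    ucl = np.sqrt(c / n) * z
    return ucl, -ucl

def sample_mean_test(residuals, c, p_f=0.05):
    """Return True if the window's sample mean falls outside the control limits."""
    r = np.asarray(residuals, dtype=float)
    ucl, lcl = sample_mean_limits(c, len(r), p_f)
    m = r.mean()
    return (m > ucl) or (m < lcl)

# Example: a window of 25 residuals with unit variance
rng = np.random.default_rng(0)
print(sample_mean_test(rng.normal(0.0, 1.0, 25), c=1.0, p_f=0.05))
```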
In the case of a sliding window, the sample mean γ̄ given by equation (1.4) can be
calculated iteratively, thus reducing the amount of computation in on-line operations.
Define the window sample mean using a new notation as follows:
$$\bar{\gamma}_{i,j} = \frac{1}{n}\sum_{k=i}^{j} \gamma(k)$$
Then, in the case of a sliding window,
$$\bar{\gamma}_{i+1,j+1} = \frac{1}{n}\sum_{k=i+1}^{j+1} \gamma(k)
= \frac{1}{n}\left\{\sum_{k=i}^{j}\gamma(k) + \gamma(j+1) - \gamma(i)\right\}
= \bar{\gamma}_{i,j} + \frac{\gamma(j+1)-\gamma(i)}{n}$$
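The recursion above is trivial to implement on-line. The short sketch below assumes a fixed window length n held in a deque; all names are illustrative.

```python
# Sketch of the recursive sliding-window sample mean.
from collections import deque

def update_window_mean(mean_prev, r_new, r_old, n):
    """gamma_bar(i+1, j+1) = gamma_bar(i, j) + (r(j+1) - r(i)) / n."""
    return mean_prev + (r_new - r_old) / n

n = 5
window = deque([0.2, -0.1, 0.05, 0.3, -0.25], maxlen=n)
mean = sum(window) / n
new_residual = 0.4
oldest = window[0]
window.append(new_residual)               # deque drops the oldest value automatically
mean = update_window_mean(mean, new_residual, oldest, n)
print(mean, sum(window) / n)              # the two values should agree
```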
If the residuals are correlated, the sample mean test may still be used, but its control limits
have to be modified accordingly. Statistical tests for the mean in the presence of
correlated measurements do not appear to exist in the statistical literature. This means
that in such cases the robustness of the appropriate tests must be examined when the
assumption of independence is violated.
To calculate the effect on the sample mean control limits, consider the variance of the
residual sample mean, which is now calculated using the formula (Kendall, 1982):
$$\operatorname{var}[\bar{\gamma}] = \frac{c}{n} + \frac{2c}{n^2}\sum_{k=1}^{n-1}(n-k)\rho_k
\approx \frac{c}{n}\left\{1 + 2\sum_{k=1}^{n-1}\rho_k\right\}$$
If the serial correlation decays as ρ_k = ρ^k, then Σ_{k=1}^{n−1} ρ^k ≈ ρ/(1−ρ). Consequently,
$$\operatorname{var}[\bar{\gamma}] \approx \frac{c}{n}\left\{1 + \frac{2\rho}{1-\rho}\right\}
= \frac{c}{n}\,\frac{1+\rho}{1-\rho} \qquad (1.7)$$
This result implies that in the case of correlated measurements, the limits of the control
chart (see Section 1.2.2) for the sample mean have to be modified according to (1.7). If
the correlation is negative the limits have to be decreased, whereas if the correlation is
positive the limits have to be increased, since
$$\frac{1+\rho}{1-\rho} \;\text{ is }\; \begin{cases} <1 & \text{if } \rho<0 \\ >1 & \text{if } \rho>0 \end{cases}$$
In the first stage of the fault monitoring process the correlation is not known; therefore, if
the fault that has occurred induces a large ρ, the mean test will give erroneous results.
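A small sketch of the correction (1.7) follows; ρ is assumed known or estimated separately, and the 3-sigma limit width is only an example.

```python
# Sketch of the control-limit correction (1.7) for first-order correlated residuals.
import numpy as np

def corrected_variance(c, n, rho):
    """var[gamma_bar] ~ (c / n) * (1 + rho) / (1 - rho), from equation (1.7)."""
    return (c / n) * (1.0 + rho) / (1.0 - rho)

def corrected_limits(c, n, rho, z=3.0):
    """Widened (rho > 0) or narrowed (rho < 0) +/- z-sigma control limits."""
    s = np.sqrt(corrected_variance(c, n, rho))
    return z * s, -z * s

print(corrected_limits(c=1.0, n=25, rho=0.3))   # positive correlation -> wider limits
print(corrected_limits(c=1.0, n=25, rho=-0.3))  # negative correlation -> narrower limits
```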
b. Sign test (non-parametric test).
This is a non-parametric test used to test hypotheses on the value of the median of a
population. Since the residuals are normal under all hypotheses, the median is equal to the
mean and therefore this test can be applied to test for zero mean.
The sign test procedure is as follows: the number of positive residuals in a batch is
calculated and compared to two thresholds which depend on the sample size n and the
significance level α. Thus, if
n₁ < (number of positive residuals) < n₂ : accept (H₀)
otherwise: (H₀) is rejected.
Table 1.A.1 of Appendix 1.A is a table of the percentage points of the symmetric
binomial distribution for different sample sizes and significance levels. It is shown in
Bennett and Franklin (1954) that it may be used for the sign test as follows:
i. Count the number of values above and below zero, say n⁺ and n⁻.
ii. Choose the smallest of the two values, say n⁺.
iii. Compare n⁺ with the table entry for the chosen n and α, say n_a.
iv. If n⁺ < n_a, reject (H₀); otherwise accept it.
The entries in Table 1.A.1 of Appendix 1.A may be modified to indicate percentage
points for the number of positive residuals. If for a sample size n the table entry is n_a, it
follows that the number of positive residuals can vary from n_a to n−n_a.
The thresholds n₁ and n₂ are then chosen to satisfy the prescribed significance level.
Hence n₁ and n₂ represent the LCL and UCL respectively for the sign test.
Dixon gave Tables 1.A.2 and 1.A.3 of Appendix 1.A, where values of 1−P_d for P_f
equal to 0.05 and 0.01 respectively are shown.
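In place of the tabulated percentage points, the thresholds can be obtained from the symmetric binomial distribution, as in the sketch below. The quantile convention used here is an assumption and may differ slightly from the table entries.

```python
# Sketch of sign-test thresholds from Binomial(n, 1/2) under H0.
from scipy.stats import binom

def sign_test_thresholds(n, alpha=0.05):
    """Approximate lower/upper thresholds (n1, n2) for the count of positive residuals."""
    n1 = binom.ppf(alpha / 2.0, n, 0.5)        # lower tail
    n2 = binom.ppf(1.0 - alpha / 2.0, n, 0.5)  # upper tail
    return int(n1), int(n2)

def sign_test(residuals, alpha=0.05):
    r = [x for x in residuals if x != 0.0]     # discard exact zeros, reduce the sample
    n_plus = sum(1 for x in r if x > 0.0)
    n1, n2 = sign_test_thresholds(len(r), alpha)
    return n1 < n_plus < n2                    # True -> accept H0

print(sign_test([0.3, -0.2, 0.5, 0.1, -0.4, 0.2, 0.6, -0.1, 0.15, 0.05]))
```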
In the case of a sliding window of observations, the number of positive residuals can also
be calculated iteratively. Let n⁺_{i,j} be the number of positive residuals in the residual
vector γ_{i,j} = [γ(i) γ(i+1) ... γ(j)]ᵀ and
$$n_i = \begin{cases} 1 & \text{if } \gamma(i) > 0 \\ 0 & \text{if } \gamma(i) < 0 \end{cases}$$
The best procedure for residual values that are equal to zero is to disregard them and
reduce the sample size by their number. This is also intuitively appealing, since a zero
value contributes equally to both negative and positive values. Then,
$$n^{+}_{i+1,j+1} = n^{+}_{i,j} + n_{j+1} - n_i$$
The robustness of the sign test in the case of correlated residuals can be investigated
similarly to the case of the sample mean test. Let
$$n_i = \begin{cases} 1 & \text{if } \gamma(i) > 0 \\ -1 & \text{if } \gamma(i) < 0 \end{cases}$$
Then,
$$E[n_i] = 0, \qquad \operatorname{var}[n_i] = 1$$
The random variable n_i may be associated with the positive and negative residuals.
Hence, if the γ(i) are correlated,
$$E[n_i\, n_{i+j}] = \frac{2}{\pi}\sin^{-1}\rho_j$$
where
$$\rho_j = E[\gamma(i)\,\gamma(i+j)]$$
If ρ_j = ρ^j, then
$$E[n_i\, n_{i+j}] = \frac{2}{\pi}\sin^{-1}\rho^{\,j} = \rho_j^{(s)}$$
The variance of the sign test statistic will then be given by
$$\operatorname{var}\!\left[\sum_i n_i\right] = n + 2\sum_{j=1}^{n-1}(n-j)\,\frac{2}{\pi}\sin^{-1}\rho^{\,j}$$
Expanding the inverse sine (i.e. sin⁻¹(·)) in a Taylor series about 0,
where γ̄_{i,j} is the sample mean. For small sample sizes (<20) more accurate forms may
be used (Kendall, 1982).
$$r' = 1 - \frac{6\sum_{m=i}^{j}\left\{m-i+1-r'(m)\right\}^2}{n(n^2-1)}$$
where r'(m) is the rank of γ(m) among the γ(i)'s. The calculated value of r' is then
compared to its LCL and UCL values, which are found from Table 1.A.4 of Appendix
1.A for different P_f.
d. Testing for variance.
The residual population variance is calculated from the sample by the formula
$$s^2 = \frac{1}{n}\sum_{m=i}^{j}\left(\gamma(m) - \bar{\gamma}\right)^2$$
The control limits for testing
(H₀): c = σ²
against (H₁): c ≠ σ²; c = σ₁²
are found using the fact that the quantity
$$\chi^2_{n-1} = \frac{(n-1)s^2}{\sigma^2} \qquad (1.9)$$
is distributed as χ² with (n−1) degrees of freedom. It then follows that the relation
$$\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}} < c < \frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}$$
will have a probability of (1−α) of being correct (Bennett and Franklin, 1954).
Equivalently,
$$\frac{\chi^2_{n-1,\,1-\alpha/2}\,c}{n-1} < s^2 < \frac{\chi^2_{n-1,\,\alpha/2}\,c}{n-1}$$
represents the confidence limits on s² with a probability of error α. Hence, the UCL and
LCL are given by
$$UCL = \frac{\chi^2_{n-1,\,\alpha/2}\,c}{n-1} \qquad (1.10)$$
$$LCL = \frac{\chi^2_{n-1,\,1-\alpha/2}\,c}{n-1} \qquad (1.11)$$
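The limits (1.10)-(1.11) are straightforward to compute numerically, as sketched below. The convention assumed here is that χ²_{n−1,α/2} denotes the upper α/2 percentage point; names and data are illustrative.

```python
# Sketch of the variance-test control limits (1.10)-(1.11); c is the null variance.
import numpy as np
from scipy.stats import chi2

def variance_limits(c, n, alpha=0.05):
    ucl = chi2.ppf(1.0 - alpha / 2.0, n - 1) * c / (n - 1)   # upper alpha/2 point
    lcl = chi2.ppf(alpha / 2.0, n - 1) * c / (n - 1)         # lower alpha/2 point
    return ucl, lcl

def variance_test(residuals, c, alpha=0.05):
    r = np.asarray(residuals, dtype=float)
    s2 = np.mean((r - r.mean()) ** 2)          # sample variance of the window
    ucl, lcl = variance_limits(c, len(r), alpha)
    return lcl < s2 < ucl                      # True -> variance consistent with c

print(variance_test([0.2, -0.3, 0.1, 0.4, -0.2, 0.05, -0.15, 0.3], c=0.06))
```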
The power of the test is obtained from the distribution of (1.9) evaluated under (H₁).
Figure 1.A.3 of Appendix 1.A shows some power curves for P_f = 0.05 and for n = 3, 10
and 30.
For real-time applications, the variance and the first-order serial correlation, as indeed
correlations of higher order, can be calculated iteratively. The equations describing the
evolution of the correlations are developed by Pouliezos (1980). These are:
$$c_0^{\,j} = c_0^{\,j-1} + \frac{1}{n}\left\{\gamma^2(j) - \gamma^2(i-1)\right\} - a_j\left(a_j + 2\bar{\gamma}_{i-1,j-1}\right)$$
$$\bar{\gamma}_{i,j} = \bar{\gamma}_{i-1,j-1} + a_j$$
$$a_j = \frac{1}{n}\left\{\gamma(j) - \gamma(i-1)\right\}$$
and c_m^j can be calculated from c_m^{j−1} using a recursion of the same form, where
$$k_1 = j-n+m+2$$
$$q_m^{\,j} = q_m^{\,j-1} - \gamma(j-n-m)$$
$$p_m^{\,j} = p_m^{\,j-1} - \gamma(j-n+1)$$
$$p^{\,t} = q^{\,t} = n\,\bar{\gamma}_{i,j}$$
and c_m^j denotes the sample serial correlation of lag m calculated from the residual
sample γ_{i,j}. Specifically, the variance, s² = c₀^j, and the first-order serial correlation,
r₁ = c₁^j, can be calculated iteratively using the above formulae.
Both the tests of mean and variance assume that the residuals sequence is white (Kendall,
1982; Mehra, 1971). Therefore, it is important to test the residuals sequence for
whiteness first, especially using tests which are invariant with respect to the mean and
variance of the distribution, such as those presented in paragraph (c) above.
A3. Test of steady state
The aim of this test is to determine whether the examined variables are in a static or a
dynamic state, so that appropriate treatments can be applied afterwards. When the
distribution law of the measurements is postulated to be normal, the test of the mean square
of the successive differences is utilized (Commissariat à l'Energie Atomique, 1978).
Suppose the measurements x_i represent a gaussian sequence; then the variable r,
where x̄ is the mean of the sequence and n the window's length, has a probable value
equal to 1. If n is bigger than 25, it can be assumed that
$$u = (1-r)\sqrt{(n-1)(n+1)/2}$$
follows a zero-mean standardized gaussian law. The decision rule used is:
u > l : dynamic state case
u ≤ l : static state case
The confidence limits l for this hypothesis test can be found in a similar way to that used for
the first-order serial correlation test r₁ (see paragraph (c) above).
When the distribution law is not normal, the test can be substituted by the serial
correlation test.
A4. Drift test
This test is aimed at detecting slow variations. It is based on the exploitation of the
results obtained by the previous test, applied with the smallest and the largest window.
The small window is considered first; if the steady-state test utilizing the small window
detects a steady state for a duration equal to the large window, and if for the same
duration the steady-state test utilizing the large window detects a dynamic state, then a
drift is present.
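A compact sketch of this drift logic is given below. It assumes a steady-state test is available as a boolean function `is_steady(window)` (e.g. the test of paragraph A3); the function name and window lengths are illustrative only.

```python
# Sketch of the drift-test logic of paragraph A4.
def drift_detected(samples, is_steady, n_small=50, n_large=300):
    """Drift: the small-window test reports 'steady' throughout the large window,
    while the large-window test reports 'dynamic' over the same span."""
    if len(samples) < n_large:
        return False
    recent = samples[-n_large:]
    small_all_steady = all(
        is_steady(recent[k:k + n_small])
        for k in range(0, n_large - n_small + 1, n_small)
    )
    large_dynamic = not is_steady(recent)
    return small_all_steady and large_dynamic
```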
A5. Robust univariate signal detector
In the previous paragraphs, the case in which the signal is disturbed by white noise
was considered. However, in many situations one may have only inexact knowledge of
the underlying univariate noise distribution. In these situations, it is desirable to employ a
robust detection scheme which offers some degree of protection against deviations from
the assumed noise distribution.
The discrete-time detection problem under consideration reduces to a hypothesis test of
the following form:
H₀: Yᵢ = Nᵢ
H₁: Yᵢ = Nᵢ + s
for i = 1, ..., n, where n denotes the number of samples, the positive signal s is assumed to
be known and the Nᵢ are i.i.d. random variables. Based on the realizations {yᵢ}ᵢ₌₁ⁿ of the
random variables {Yᵢ}ᵢ₌₁ⁿ, the detector attempts to decide between hypotheses H₀ and H₁.
In practical situations one would not likely know the precise distribution of the noise.
Accordingly, one assumes that the noise distribution lies within an appropriately
defined neighbourhood of the nominal Laplace distribution and employs a robust test. The
nominal noise has a probability density function (pdf) given by
$$p_0(z) = \frac{r}{2}\, e^{-r|z|}$$
where r is a positive parameter. The Laplace noise model exhibits the "heavy tail"
behavior which typically characterizes impulsive noise, and it appears in many
engineering investigations. Examples where the Laplace pdf is used as a noise model
include undersea, atmospheric and speech processing applications.
Let B(R) denote the σ-algebra of Borel sets on the real line R, and let M denote the
class of all probability measures on (R, B(R)). Let P₀ and P₁ denote the nominal probability
measures on (R, B(R)). The admissible noise families under H₀ and H₁ are given by 𝒫₀
and 𝒫₁, respectively:
$$\mathcal{P}_0 = \{\,Q \in M : Q((-\infty,z)) \ge (1-\varepsilon_0)\,P_0((-\infty,z)) - \delta_0 \ \text{for all } z \in R\,\}$$
and
$$\mathcal{P}_1 = \{\,Q \in M : Q((z,\infty)) \ge (1-\varepsilon_1)\,P_1((z,\infty)) - \delta_1 \ \text{for all } z \in R\,\}$$
where the non-negative numbers ε₀, ε₁, δ₀ and δ₁ are sufficiently small to ensure the
disjointness of 𝒫₀ and 𝒫₁. The designer must consider the definitions of the
aforementioned admissible noise families when assigning values to these parameters, as
this will determine the breadth of 𝒫₀ and/or 𝒫₁. Smaller choices of the ε and δ parameters
lead to a correspondingly smaller class of admissible distributions. It should be noted that
several popularly used noise classes, such as ε-contamination, total variation,
Kolmogorov distance, Levy distance and Prohorov distance, are subsets of 𝒫₀ and/or 𝒫₁
(see Thompson, 1991).
Loosely speaking, the saddlepoint approach is one which takes the "least favorable"
distribution from the class of admissible distributions and then specifies a detector to
maximize performance for this "least favorable" distribution. Without specifying all of the
details, the robust detector, which is the solution of a specific saddlepoint criterion, has
the canonical detector structure of a nonlinearity, followed by an accumulator, followed
by a threshold comparator. The difference between the Neyman-Pearson optimal
detector for Laplace noise and the robust detector for Laplace noise lies in the
nonlinearities. Both the Neyman-Pearson (g_NP) and robust (g_k) detector nonlinearities are
illustrated in the figure below.
The Neyman-Pearson and robust nonlinearities. The robust nonlinearity is obtained
from the Neyman-Pearson nonlinearity by censoring at prescribed vertical heights. The
method for determining the censoring height can be found in Thompson (1991) and,
because of space considerations, cannot be included here. Note that by putting k = r·s, the
Neyman-Pearson nonlinearity is obtained from the robust nonlinearity. It follows that the
test statistic for the robust detector is given by
$$T_k = \sum_{i=1}^{n} g_k(Y_i)$$
The distribution function of the test statistic T_k is defined and given in closed form by
Thompson (1991).
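The sketch below illustrates the canonical structure (nonlinearity, accumulator, threshold comparator). The explicit log-likelihood-ratio form of g_NP for Laplace noise and the censoring height k are stated here as assumptions, consistent with the k = r·s relation above; Thompson (1991) gives the actual procedure for choosing k and the threshold.

```python
# Sketch of the robust detector: censored nonlinearity -> accumulator -> threshold.
import numpy as np

def g_np(y, s, r):
    """Assumed NP nonlinearity for a known positive shift s in Laplace(r) noise."""
    return r * (np.abs(y) - np.abs(y - s))

def g_robust(y, s, r, k):
    """Robust nonlinearity: the NP nonlinearity censored (clipped) at +/- k."""
    return np.clip(g_np(y, s, r), -k, k)

def robust_detect(samples, s, r, k, threshold):
    t_k = np.sum(g_robust(np.asarray(samples, dtype=float), s, r, k))
    return t_k > threshold                    # True -> decide H1 (signal present)
```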
For relatively small sample sizes, the evaluation of this detector revealed that a significant
loss in detection probability can occur when an overly cautious approach toward
robustness is taken.
For larger sample sizes, the closed-form test statistic distribution function proposed by
Thompson (1991) is computationally difficult to use. For this reason, simulation
techniques can be used for larger sample sizes. A modified Monte Carlo simulation
technique known as improved importance sampling has been suggested by Thompson
and has been shown to significantly reduce the number of simulation runs required to
estimate small false alarm probabilities.
continuous residual population with a p-vector mean θ. One wishes to test (H₀): θ = 0
versus (H₁): θ ≠ 0. Here 0 is used without loss of generality, since (H₀): θ = θ₀ can be
tested by subtracting θ₀ from each observation vector and testing whether these
differences are located at 0.
The classical procedure for this problem is Hotelling's T² (Anderson, 1958; Randles,
1989), which assumes that the underlying population is p-variate normal, that is, an
N_p(θ, Σ) distribution with mean vector θ and variance-covariance matrix Σ (p×p). It rejects
(H₀) in favour of (H₁) if
$$T^2 = n\,\bar{X}^{T} S^{-1} \bar{X} \ge \frac{(n-1)\,p}{n-p}\,F_{\alpha}(p,\, n-p)$$
where X̄ is the vector of sample means, S is the unbiased estimator of the variance-covariance
matrix and F_α(n₁, n₂) is the upper αth quantile of an F distribution with n₁ and n₂ degrees of
freedom. This test is quite effective and has many nice properties, including the intuitive
property that it is invariant under all nonsingular linear transformations of the data. That is,
if Y_j = D X_j; j = 1, ..., n, where D is any nonsingular p×p matrix, then T²(Y₁, ..., Y_n) = T²(X₁, ..., X_n).
Thus, if the data points are rotated, or if they are reflected around a (p−1)-dimensional
hyperplane, or if the scales of measurement are altered, the value of T² stays the same.
This property is intuitively appealing, and it also ensures that the performance of T² is
the same for any variance-covariance matrix Σ. Many nonparametric competitors to
Hotelling's T² have been proposed which concentrate on sign statistics, that is, ones that
use the direction of the observations from 0 rather than the distances from 0. The most
popular such statistic is the component sign test, which uses a sign statistic for each
component of the vectors and combines them in a quadratic form. Let
Sᵀ = (S₁ ... S_p), where
$$S_i = \sum_{j=1}^{n} \operatorname{sgn}(X_{ij}), \qquad \operatorname{sgn}(t) = 1,\,(0,\,-1) \ \text{as}\ t > (=,<)\, 0$$
by conditioning on the observed vectors, giving equal probability to each data vector
being the observed one or a point on the opposite side of 0. The performance properties
of the component sign statistic vary depending on Σ and the direction of the shift from 0. It has
been demonstrated by Randles (1989) that it may not perform well when there are substantial
correlations among the variates in the vectors. In an effort to stabilize the performance, this
test can be performed on transformed data. This creates invariant (or asymptotically invariant)
procedures that, for any Σ, have significance levels and power comparable with those of
the component sign test when Σ = I (Randles, 1989).
Distribution-free tests are also investigated by Randles (1989) for the one-sample
multivariate location problem. Counts, called interdirections, which measure the angular
distance between two observation vectors relative to the positions of the other
observations, are introduced there. These counts are invariant under nonsingular linear
transformations and have a small-sample distribution-free property over a broad class of
population models, called distributions with elliptical directions, which include all
elliptically symmetric populations and many skewed populations. A sign test based on
interdirections is described, including, as special cases, the two-sided univariate sign test
and Blumen's bivariate sign test (Blumen, 1958). The statistic is shown to have a limiting
χ²_p null distribution and, because it is based on interdirections, it is also seen to be
invariant and to have a small-sample distribution-free property. Pitman asymptotic
relative efficiencies and a Monte Carlo study have shown the test to perform well
compared with Hotelling's T², particularly when the underlying population is heavy-tailed
or skewed. In addition, it consistently outperforms the component sign test, which
is often recommended in the nonparametric literature (Anderson, 1958; MacNeill, 1974;
Randles, 1989).
B2. Test of covariance
The unbiased covariance of the p-dimensional residual population sequence is estimated
as
$$\hat{S}_0 = \frac{1}{n-1}\sum_{j=1}^{n}\left(X_j - \bar{X}\right)\left(X_j - \bar{X}\right)^{T}$$
Under the null hypothesis, S₀ has a Wishart distribution (Kendall, 1982; Anderson, 1958).
The trace of S₀ has a chi-square distribution with (n−1)p degrees of freedom. Thus S₀
can be tested for its null-hypothesis covariance, equal to an identity matrix for the case of
the standardized residual sequence.
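A minimal sketch of this trace test follows. The scaling used here is an assumption: for standardized residuals with identity null covariance, (n−1)·trace(S₀) equals the sum of squared deviations and is chi-square with (n−1)p degrees of freedom; conventions for absorbing the (n−1) factor may differ from the book's.

```python
# Sketch of the covariance (trace) test for standardized p-variate residuals.
import numpy as np
from scipy.stats import chi2

def covariance_trace_test(X, alpha=0.05):
    """X: (n, p) array of standardized residual vectors. True -> accept the null covariance."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    S0 = np.cov(X, rowvar=False)              # unbiased sample covariance (divides by n-1)
    stat = (n - 1) * np.trace(np.atleast_2d(S0))
    lo = chi2.ppf(alpha / 2.0, (n - 1) * p)
    hi = chi2.ppf(1.0 - alpha / 2.0, (n - 1) * p)
    return lo < stat < hi

rng = np.random.default_rng(1)
print(covariance_trace_test(rng.standard_normal((30, 3))))
```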
Here too, the parametric tests of mean and covariance assume that the residual sequence
is white.
where ȳᵢ = Σ_{t=1}^{n} y_{it}/n; i = 1, 2, ..., m, and e_{it} = y_{it} − ȳᵢ; i = 1, 2, ..., m; t = 1, 2, ..., n. Then,
for k = 0, 1, 2, ..., n−1,
$$c_k(i,j) = \frac{1}{n}\sum_{t=1}^{n-k} e_{it}\, e_{j(t+k)}$$
and
$$r_k(i,j) = \frac{c_k(i,j)}{\sqrt{c_0(i,i)\,c_0(j,j)}}; \qquad k = 1, 2, \ldots, (n-1); \; i,j = 1, 2, \ldots, m.$$
A test for lag-k (k = 1, 2, ...) autocorrelation and cross-correlation, ρ_k(i,j), can be based
on the statistic r_k(i,j). The null hypothesis of this test is that y_t; t = 1, 2, ..., n, is a random
sample. The alternative is that ρ_k(i,j) ≠ 0.
$$Q_k = n\sum_{i,j} r_k^2(i,j)$$
An analogous test, based on the following statistic, is suggested by Ali (1989):
$$R_k = C_k C_0^{-1}$$
Under the null hypothesis of randomness, following Chitturi (1976), the r_k(i,j) are
asymptotically normal and both Q_k and QS_k are distributed as chi-squared variables with m²
degrees of freedom. Unfortunately, the distribution of these statistics, r_k(i,j), Q_k and
QS_k, is unknown in small samples.
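A short sketch of the lag-k cross-correlations and the Q_k statistic defined above is given below; the input layout (rows = residual series) and the 5% critical value are illustrative choices.

```python
# Sketch of c_k(i,j), r_k(i,j) and Q_k = n * sum_{i,j} r_k(i,j)^2 for an (m, n) array Y.
import numpy as np
from scipy.stats import chi2

def q_k_statistic(Y, k):
    Y = np.asarray(Y, dtype=float)
    m, n = Y.shape
    E = Y - Y.mean(axis=1, keepdims=True)          # e_it = y_it - ybar_i
    C0 = E @ E.T / n                               # lag-0 covariances c_0(i, j)
    Ck = E[:, : n - k] @ E[:, k:].T / n            # lag-k covariances c_k(i, j)
    Rk = Ck / np.sqrt(np.outer(np.diag(C0), np.diag(C0)))
    Qk = n * np.sum(Rk ** 2)
    return Qk, chi2.ppf(0.95, m * m)               # statistic and approximate 5% critical value

rng = np.random.default_rng(2)
print(q_k_statistic(rng.standard_normal((3, 100)), k=1))
```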
There is a long history of investigation of the distribution of the sample lag
autocorrelation, r_k(i,i) (Kendall, 1982; Anderson, 1971). Except for some partial
successes, there has been no practical solution to this problem. Only recently, Ali (1989)
observed that for relatively small sample sizes the null distribution of r_k(i,i) can be
well approximated by a normal distribution matching the first two exact moments. Thus,
it is suggested that the null distribution of r_k(i,j), i ≠ j, be approximated by a normal
distribution matching the first two moments. Alternatively, the statistic r_k(i,j) is
modified to,
where Ê(vec C_k) is the estimated exact first moment of (vec C_k), i.e. E(vec C_k), and B̂_k is
the estimated exact covariance matrix for (vec C_k), i.e. B_k. Both E(vec C_k) and B_k involve
nuisance parameters ρ₀(i,j), which are replaced by their consistent estimates r₀(i,j) to
obtain Ê(vec C_k) and B̂_k. The exact mean vector and covariance matrix for (vec C_k)
are derived by Ali (1989), assuming y_t to be normally distributed. QA_k can be shown to
be asymptotically equivalent to QS_k. The null distribution of QA_k will be approximated
by a chi-squared distribution with m² degrees of freedom.
The statistic QA_k is a modification of QS_k in that the asymptotic mean and covariance
matrix of (vec C_k) in QS_k are replaced by their exact counterparts. Thus QA_k is the mean-
and covariance-corrected QS_k. Similarly, one could modify QS_k by correcting only the
mean of (vec C_k) or only the covariance matrix of (vec C_k). These modified statistics are,
respectively, QAM_k (the mean-corrected statistic) and
$$QAV_k = \left(\operatorname{vec}C_k\right)^{T}\hat{B}_k^{-1}\left(\operatorname{vec}C_k\right)$$
One may also apply a correction to QAM_k to obtain another statistic,
$$QAMH_k = \frac{n}{n-k}\, QAM_k$$
Each of the statistics QAM_k, QAV_k and QAMH_k is asymptotically distributed as a chi-squared
variable with m² degrees of freedom. It is expected that this asymptotic
distribution will provide an adequate approximation to the null distributions of these
statistics in small samples.
The statistic QS_k may also be modified to QH_k = (n/(n−k)) QS_k and to QM_k = QS_k + m²/n.
The adequacy of approximating the null distributions of QH_k and QM_k by a chi-squared
distribution with m² degrees of freedom was also examined by Ali (1989).
In all, the statistics
r_k(i,j), Q_k, QA_k, QAM_k, QAV_k, QAMH_k, QS_k, QH_k and QM_k
may be considered. The null distribution of r_k(i,j) is approximated by the standard
normal distribution, and the null distribution of each of the remaining eight statistics is
approximated by a chi-squared distribution with m² degrees of freedom. These eight
statistics are referred to as Q statistics.
Several devices have been developed for fault detection in industrial drives, performing
limit checking tests.
Limiting states for the healthy or faulty operation of a Switched Reluctance Motor
(SRM) drive are given in the table below:
One obvious fault detection device is a simple overcurrent detector operating from the
current sensor signal, with a comparator whose threshold is set above the normal
operating range of the phase currents. This detector is easy to implement but is not fast
acting, since a fault indication is not given until the current is already very high. Since the
detector operates from the current sensor, it will also not detect all kinds of faults. The
detectors must be able to operate quickly enough to interrupt a fault in progress before
damage to the inverter power switches occurs.
Differential detectors are devices utilizing comparators to provide a logic-signal
indication of a winding fault, and generally they do not exhibit the previously mentioned
drawbacks. Current-differential, flux-differential and rate-of-rise detectors are the most
widely used limit checking devices. Stephens (1991) describes an SRM drive with an
implemented limit-checking fault detection and management system, which should be an
ideal selection for reliability-premium drive systems in aerospace, industrial and
automotive applications.
Spee and Wallace (1990) propose failure diagnostics and remedial operating strategies
for brushless dc drives, based on correlations between the predictions of a simulation
program and test measurements. Comparison of normal operation with the performance
that occurs at the onset of faults has been shown to be capable of predicting post-fault
performance with a good level of confidence. The reader is referred to the latter paper for
details.
minerals. The role of this circuit is to reduce a mineral to a size small enough to free the
useful mineral from its gangue (Cecchin, 1985).
• The value of the variable "result of the tests". This variable has three states:
It is zero when the variable Q9 is in a static state, the process operation being normal.
It is negative when the variable Q9 is in a dynamic state.
It is positive when the variable Q9 is in a static state and the standard deviation
test has detected an anomaly.
Figure 1.2 Test of steady state applied on Q6 (test variable u, with windows of 50 and 300 points, and the resulting test output). Figure 1.3 Drift test applied on Q9.
1.2.1.3 Conclusions
The algorithms described in this section may be used to provide all the basic functions of the
fault monitoring strategy.
The efficiency of the localization and the precision of the diagnosis depend on the structure
of the installation, on the knowledge of its operation, and on the quality and quantity of the
available information.
An extension of the diagnosis strategy concerns the study of decision-making, which can
be considered from different aspects:
• Proceed to a thorough study of the behaviour of suspect elements.
• Correct defective measurements (in the case of a sensor failure).
• Compensate control signals (in the case of actuator failures).
• Reconfigure the control structure (in the case of loss of essential organs) and proceed
with the substitution and repair of organs.
• Put the installation in an "emergency state" (in the case of an unsatisfactory reconfiguration).
• Shut down production.
The knowledge-based aspects of the fault diagnosis problem will be investigated in detail
in Chapter 4.
Control charts are used in monitoring the statistical state of a process whose measurements
are available sequentially in time. Some statistic w (sample mean, sample range,
etc.) is computed from successive samples of size n and plotted on a graph containing
lower and upper limits corresponding to the critical region of the hypothesis on w under
test. If the statistic w is distributed normally with mean m_w and standard deviation s_w, where m_w
and s_w are calculated a priori, then typical limits are m_w + 3s_w for the Upper Control
Limit (UCL) and m_w − 3s_w for the Lower Control Limit (LCL). Such charts are usually
referred to as Shewhart charts (Fig. 1.5); they originated in 1931.
Figure 1.5 Shewhart control chart: the plotted statistic with its UCL and LCL.
Control charts were more recently used by Himmelblau (1978) to monitor dynamical systems
and detect malfunctions in plant equipment. An extensive bibliography on
control charts is given by Vance (1983).
Univariate Shewhart control chart techniques.
Univariate control charts can be used for the first stage of fault monitoring (as in the
cases of Section 1.2.1, paragraph A), as follows: given successive samples of residuals
γ_{i,j}, γ_{i+1,j+1}, ..., γ_{i+m,j+m}, where γ_{j,k} = [γ(j) γ(j+1) ... γ(k)]ᵀ, an appropriate
statistic is calculated and plotted on a corresponding control chart with the precomputed
UCL and LCL. A decision that a fault has occurred is made when the statistic falls
outside its normal operation level for a specified number of consecutive times. This
procedure decreases the probability of type I errors (see Section 1.2.1, paragraph A).
A logical flowchart for a computer-operated control chart is shown in Fig. 1.6. The
calculations of the central line, UCL and LCL for the appropriate test statistic of
univariate Shewhart charts are summarized in Himmelblau (1978), where the reader is
referred for further details. The Shewhart chart essentially treats each sample separately.
However, practical experience has shown that by taking into account the information
from all collected samples, it is possible to determine in a better way whether the process
is in control or out of control.
Figure 1.6 Flowchart for a computer-operated control chart: compute the statistic w from each sample of size n and compare it with the upper and lower limits.
Many efforts to use Shewhart charts in industry have been of limited success. This is not
because of a technical shortcoming of the method, but typically because of one or more
of the following reasons:
1. The formulas used to calculate the limits are incorrect.
2. The sampling and subgrouping plans used to supply and group data for the charts are
poorly chosen.
3. The improvement work demanded by the charts is so radical in the context of the
organization's culture that the organization is unable to respond properly.
To one not familiar with industrial environments, the three preceding items may seem
uninteresting or even trivial, but they are the critical criteria for successful application.
Item 1 seems on the surface to be beneath one's attention. This problem is so widespread,
however, that when consulting one must regularly question clients on how they did their
calculations. The root cause in this case is a lack of understanding that "the essential
statistical power of control charts comes from using variation within the subgroups (or, in
the case of individuals charts, between neighboring measurements) to calculate the width
of the control limits". Those who have missed this point have recommended, for
example, that limits on a control chart for individuals be based on the sample standard
deviation of all of the data thrown together, instead of the average moving range.
Similarly, some statistical software allows the user to calculate control limits using the
sample standard deviation of the subgroup averages or of all of the individuals. Charts
with limits calculated in this fashion are nearly doomed to failure, because special causes
occurring between subgroups will inflate the limits, making it harder to detect such
causes, which is the original purpose of the chart.
Item 2, the problem of poor sampling plans, is probably the least understood of the issues.
The ability of a control chart to signal trouble depends primarily on the sampling
plan used to supply the chart with data. Indeed, a cause being identified as special or common is
not due to a property of the cause itself but rather to the way in which the control chart
works as a window or filter through which the process is observed. By carefully selecting
the sampling and subgrouping, one can control which sources of variation will show up as
special causes. This makes control charts very versatile tools, since one can select several
filters with which to view the same process.
Proper response to the charts, item 3, is the most critical issue in successful application.
This requires prompt action by those closest to the process. In a manufacturing environment,
this is typically the operator.
Everyone should agree that Automatic Process Control (APC) is the most effective
means of maintaining a setpoint with minimal variation. If one deals only with statistical
theory and ignores practical issues, however, one can make APC seem like a panacea.
Since APC systems are continually making physical adjustments to a process, there can
be increased wear, especially in a typical industrial environment. This, together with the
maintenance requirements of the control equipment itself, can substantially increase
maintenance costs.
Experts estimate that at any point in time, from 25% to 35% of the world's advanced
automatic control systems are on manual. Lack of operator confidence is one critical reason
for this. Operators, like most people, have an inherent distrust of a "black box" that
makes decisions on a basis that they do not understand. This is not an inherent fault of
APC but rather of the way in which it may have been misapplied.
Box and Kramer (1992) discuss rationales for process monitoring using some of the
control chart techniques of Statistical Process Control (SPC) for feedback adjustment.
Minimum-cost feedback schemes are discussed for some practically interesting models.
The critical question is how to integrate APC and SPC for total system improvement.
Hoerl and Palm (1992) provided a more basic discussion of this topic. APC should be
applied to the critical variables that have a direct control knob, in order to minimize variation,
manage setpoint changes and ensure safety. The time and money required should be
invested. SPC should be applied to this system to monitor its long-term effectiveness,
detect special causes, and monitor the on-line measurement system. Experimental design
could be used to tune the control system. SPC or algorithmic statistical process control
(Vander Wiel et al., 1992) can be used as a substitute when the cost of APC cannot be
justified.
Cumulative sum (CUSUM) control charts are very effective in detecting special causes.
The CUSUM chart is usually maintained by taking samples at fixed time intervals and
plotting on the chart a cumulative sum of the differences between the sample means and the
target value, ordered in time. The process mean is considered to be on target as long as
the CUSUM statistic computed from the samples does not fall into the signal region of
the chart. A value of the CUSUM statistic in the signal region should be taken as an
indication that the process mean has changed and that the possible causes of the change
should be investigated.
The number of observations taken before an out-of-control signal is triggered is called
the run length. The performance of a CUSUM chart is commonly measured by the
average run length (ARL) corresponding to various possible choices of the mask
parameters.
Cumulative sum (CUSUM) charts are often used instead of standard Shewhart charts
when detection of small changes in a process parameter is important. For comparable
average run lengths (ARLs) when the process is on target, CUSUM charts can be designed
to give shorter ARLs than Shewhart charts for detecting certain small changes in
process parameters. The superiority of the CUSUM chart over the Shewhart chart also
holds when the Shewhart chart is augmented with runs rules. Thus, it is only natural to
investigate whether the shorter ARLs for the univariate case can be extended to the
multivariate case (Hawkins, 1992; Pignatiello, 1990; Blazek, 1987).
The exponentially weighted moving average control chart. A control chart technique
that may be of value to both manufacturing and continuous-process quality control
engineers is the exponentially weighted moving average (EWMA) control chart (Hunter,
1986; Lucas and Saccucci, 1990). The EWMA has its origins in the early work of
econometricians, and although its use in quality control has been recognized, it remains a
largely neglected tool. The EWMA chart is easy to plot, easy to interpret, and its control
limits are easy to obtain. Further, the EWMA leads naturally to an empirical dynamic
control equation.
The exponentially weighted moving average (EWMA) is a statistic with the characteristic
that it gives less and less weight to data as they get older. A plotted point on
an EWMA chart can be given a long memory, thus providing a chart similar to the ordinary
CUSUM chart, or it can be given a short memory and provide a chart analogous to
a Shewhart chart.
The EWMA is very easily plotted and may be graphed simultaneously with the data appearing
on a Shewhart chart. The EWMA is best plotted one time position ahead of the
most recent observation, since it may be viewed as the forecast for the next observation.
The immediate purpose is only to plot the statistic. The EWMA is equal to the present
predicted value plus λ times the present observed error of prediction. Thus,
$$\text{EWMA} = \hat{y}_{t+1} = \hat{y}_t + \lambda e_t = \hat{y}_t + \lambda(y_t - \hat{y}_t) = \lambda y_t + (1-\lambda)\hat{y}_t$$
where ŷ_{t+1} is the predicted value at time t+1 (the new EWMA), y_t is the observed
value at time t, ŷ_t is the predicted value at time t (the old EWMA), e_t = y_t − ŷ_t is the
observed error at time t and λ is a constant (0 < λ < 1) that determines the depth of
memory of the EWMA. As shown in Hunter (1986), the EWMA can be written as
$$\hat{y}_{t+1} = \sum_{i=0}^{t} w_i\, y_i$$
where the w_i are weights defined by
$$w_i = \lambda(1-\lambda)^{t-i}$$
with sum Σ_{i=0}^{t} w_i ≈ 1. The constant λ determines the "memory" of the EWMA statistic.
That is, λ determines the rate of decay of the weights and hence the amount of
information secured from the historical data. Note that as λ → 1, w_t → 1 and ŷ_{t+1}
practically equals the most recent observation y_t.
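A minimal sketch of the EWMA recursion and chart is given below. It uses the asymptotic limits from Var(EWMA) = [λ/(2−λ)]σ² (stated further below); the starting value, λ and the example data are illustrative assumptions.

```python
# Sketch of the EWMA recursion yhat_{t+1} = lambda*y_t + (1 - lambda)*yhat_t and its
# 3-sigma control limits.
import numpy as np

def ewma_chart(y, lam=0.2, target=0.0, sigma=1.0):
    """Return the EWMA sequence and a boolean out-of-control flag for each point."""
    y = np.asarray(y, dtype=float)
    z = np.empty_like(y)
    prev = target                              # start the EWMA at the target value
    for t, yt in enumerate(y):
        prev = lam * yt + (1.0 - lam) * prev
        z[t] = prev
    limit = 3.0 * sigma * np.sqrt(lam / (2.0 - lam))
    out = np.abs(z - target) > limit
    return z, out

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0, 1, 40), rng.normal(1.0, 1, 20)])   # mean shift at t=40
z, out = ewma_chart(data, lam=0.2)
print(out.any(), int(np.argmax(out)) if out.any() else -1)
```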
When the process is under control and λ = 1, points plotted on the classical Shewhart
chart and those on an EWMA chart are therefore almost equal in their ability to detect
signals of departures from assumptions. As λ → 0, the most recent observation has small
weight, and previous observations have nearly equal (though lower) weights. Thus, as λ → 0, the
EWMA takes on the appearance of the CUSUM. The EWMA control chart for values of
0 < λ < 1 stands between the Shewhart and CUSUM control charts in its use of the historical
data.
The choice of λ can be left to the judgment of the quality control analyst or can be
estimated using an iterative least squares procedure. The analyst, by considering the data
as new data arriving sequentially, could, for different values of λ, compute the
corresponding sequential set of predicted values ŷ based on the EWMA. The value of λ
which provides the smallest error sum of squares is preferred, based upon this admittedly
limited evidence.
As shown in Hunter (1986), the variance of the EWMA is
$$\operatorname{Var}(\text{EWMA}) = \frac{\lambda}{2-\lambda}\,\sigma^2$$
from which the control limits of the EWMA chart follow directly. EWMA control schemes have
average run length properties similar to those for cumulative sum control schemes.
The EWMA can be used as a dynamic process control tool. When the process mean η is
on target, all three charting procedures, the Shewhart, CUSUM and EWMA, are roughly
equivalent in their ability to monitor departures from target.
However, the EWMA provides a forecast of where the process will be in the next instance
of time. It thus provides a mechanism for dynamic process control.
To control a process it is convenient to forecast where the process will be in the next instance
of time. Then, if the forecast shows a future deviation from target that is too
large, some electro-mechanical control system or process operator can take remedial action
to compel the forecast to equal the target. In modern manufacturing, and particularly
where an observation is recorded on every piece manufactured or assembled, a forecast
based on the unfolding historical record can be used to initiate a feedback control loop to
adjust the process.
Of course, if an operator is part of a feedback control loop he/she must know what
corrective action to perform, and care must be taken to avoid inflating the variability of
the process by making changes too often. But control engineers long ago learned how to
close the feedback loop linking forecast and adjustment to target. The same information
feedback loop exists in many situations in which only the operator can control the process. The
EWMA chart not only provides the operator with a forecast, but also with control limits to
indicate when the forecast is statistically significantly distant from the target. Thus, when an
EWMA signal is obtained, appropriate corrective action based on the size of the forecast
can often be devised.
The EWMA can be modified to enhance its ability to forecast. In situations where the
process mean steadily trends away from target, the EWMA can be improved by adding a
second term to the EWMA prediction equation. That is,
$$\text{modified EWMA:}\quad \hat{y}_{t+1} = \hat{y}_t + \lambda_1 e_t + \lambda_2 \sum e_t$$
where λ₁ and λ₂ are constants that weight the error at time t and the sum of the errors
accumulated to time t. The coefficients λ₁ and λ₂ can be estimated from historical data
by an iterative least squares procedure similar to that mentioned earlier for the estimation
of the constant λ.
A third term can be added to the EWMA prediction equation to give the empirical control
equation,
$$\hat{y}_{t+1} = \hat{y}_t + \lambda_1 e_t + \lambda_2 \sum e_t + \lambda_3 \nabla e_t$$
where the symbol ∇ means the first difference of the errors e_t; that is, ∇e_t = e_t − e_{t−1}.
Now observe that the forecast ŷ_{t+1} equals the present predicted value (zero if the process
has been adjusted to the target) plus three quantities: one proportional to e_t, the second
a function of the sum of the e_t, and the third a function of the first difference of the
e_t. These terms are sometimes called the "proportional", "integral" and "differential"
terms in the process control engineer's basic proportional, integral, differential (PID)
control equation. The parameters λ₁, λ₂ and λ₃ weight the historical data to give the best
forecast.
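A short sketch of this three-term forecast follows; the weights are illustrative placeholders rather than estimated values, which the text obtains by least squares.

```python
# Sketch of the empirical (PID-like) control equation: proportional, integral and
# differential terms in the prediction errors.
def empirical_forecast(y_pred, errors, lam1, lam2, lam3):
    """errors: list of observed prediction errors e_t, most recent last."""
    e_t = errors[-1]
    e_sum = sum(errors)
    e_diff = errors[-1] - errors[-2] if len(errors) > 1 else 0.0
    return y_pred + lam1 * e_t + lam2 * e_sum + lam3 * e_diff

print(empirical_forecast(0.0, [0.2, 0.1, 0.3], lam1=0.4, lam2=0.05, lam3=0.1))
```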
The EWMA may thus be viewed as more than an alternative to either the Shewhart or
the CUSUM control charts. The EWMA may also be viewed as a dynamic control
mechanism that helps keep a process mean on target whenever discrete data on manufactured
items or on the process operation are sequentially available.
Multivariate control charts.
The quality of the output of a production process is often measured by the joint level of
several correlated characteristics. For example, a chemical process may be a function of
temperature and pressure, both of which need to be monitored carefully; a particular
grade of lumber might depend on correlated characteristics such as stiffness and bending
strength. In a geochemical process in coal mining, each observation consists of 14
correlated characteristics. In these types of situations, separate univariate control charts
for each characteristic are often utilized to detect changes in the inherent variability of
the process. When these characteristics are mutually correlated, however, the univariate
charts are not as sensitive as multivariate methods that capitalize on the correlation.
One common method of constructing multivariate control charts is based on Hotelling's
T² statistic. Currently, when a process is in the start-up stage and only individual observations
are available, approximate F and chi-square distributions are used to construct
the necessary multivariate control limits. These approximations are conservative in this
situation. Tracy et al. (1992) present an exact method, based on the beta distribution, for
constructing multivariate control limits at the start-up stage. An example from the
chemical industry illustrates that this procedure is an improvement over the approximate
techniques.
Note that in the following, monitoring of the mean of a multivariate normal process is
required. The term "on-target" is used to indicate that the process is in-control with
respect to its mean. Likewise, the term "off-target" is used to indicate that the mean of
the multivariate normal process has shifted.
For successive samples, multivariate control chart techniques used for controlling the
mean of the multivariate normal process can be interpreted as repeated tests of significance
of the form
H₀: μ = μ₀
H₁: μ ≠ μ₀
where μ represents the multivariate normal process mean, whose true value is unknown,
and μ₀ is the target value for the parameter vector. For simplicity, it will be assumed that
μ₀ = 0, since the general case can be handled easily by translation.
The chart statistic χ² is computed from each sample using the known covariance matrix Σ,
and χ²_{p;α} is the upper 100α percentage point of the χ² distribution with p degrees of
freedom. The noncentrality parameter associated with χ² is
$$\lambda^2(\mu) = (\mu - \mu_0)^{T}\,\Sigma^{-1}(\mu - \mu_0)$$
Note that λ(μ), the square root of the noncentrality parameter, is often used to represent
a measure of the distance of μ from μ₀. This measure of distance is also called the
Mahalanobis distance or the statistical distance. Note that the straight-line or Euclidean
distance assumes an identity covariance matrix instead. Henceforth, the word "distance"
will be used to mean the square root of the noncentrality parameter defined above.
A χ² control chart operates by plotting χ² on a chart with an appropriate UCL. If a point
plots above the upper control limit, the process mean is deemed to be out of control and
the assignable causes of the variation are sought. The average run length (ARL) of this
control scheme can be calculated as 1/P, where P denotes the probability that χ² exceeds
the UCL. The on-target value of P is determined from the probability that χ² exceeds the
UCL under the central χ²_p distribution, while the off-target value of P is the probability
that χ² exceeds the UCL under the non-central χ²_p distribution.
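The ARL calculation described above is easy to reproduce numerically, as sketched below; the significance level α and the example noncentrality value are illustrative assumptions.

```python
# Sketch of the chi-square control chart: UCL = chi2_{p, alpha} and ARL = 1 / P(exceedance),
# with P taken under the central (on-target) or non-central (off-target) distribution.
from scipy.stats import chi2, ncx2

def chi2_chart_ucl(p, alpha=0.005):
    return chi2.ppf(1.0 - alpha, p)

def chi2_chart_arl(p, alpha=0.005, noncentrality=0.0):
    ucl = chi2_chart_ucl(p, alpha)
    if noncentrality == 0.0:
        prob = chi2.sf(ucl, p)                 # on-target exceedance probability
    else:
        prob = ncx2.sf(ucl, p, noncentrality)  # off-target (non-central chi-square)
    return 1.0 / prob

print(chi2_chart_arl(p=3))                     # on-target ARL ~ 1/alpha
print(chi2_chart_arl(p=3, noncentrality=4.0))  # shorter ARL for a shifted mean
```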
The multiple univariate CUSUM scheme.
A p-dimensional multivariate normal process can be monitored by using p two-sided univariate
CUSUM charts. The j-th two-sided univariate CUSUM is operated by forming
the cumulative sums
$$S_{j,t} = \max\left(0,\; S_{j,t-1} + X_{jt} - k_j\right)$$
The multiple univariate CUSUM scheme signals an off-target condition when any of the p
two-sided schemes produces an off-target signal. Therefore, the on-target average run
length of the multiple univariate scheme is less than the average run length of any one of
the univariate CUSUM charts.
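Below is a minimal sketch of one such two-sided chart. The upper-sided recursion is the one given above; the lower-sided recursion and the decision interval h are stated here as the usual mirror-image assumptions, not taken from the book.

```python
# Sketch of a two-sided univariate CUSUM (one of the p charts in the multiple scheme).
def two_sided_cusum(x, k, h):
    """x: observations of one variable; k: reference value; h: decision interval.
    Returns the index of the first off-target signal, or None."""
    s_hi = s_lo = 0.0
    for t, xt in enumerate(x):
        s_hi = max(0.0, s_hi + xt - k)         # upper-sided sum, as in S_{j,t}
        s_lo = max(0.0, s_lo - xt - k)         # lower-sided sum (assumed mirror form)
        if s_hi > h or s_lo > h:
            return t
    return None

print(two_sided_cusum([0.1, -0.2, 0.3, 0.8, 1.1, 0.9, 1.2, 1.0], k=0.5, h=2.0))
```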
Two new multivariate CUSUM charting procedures are proposed by Pignatiello (1990).
These procedures make direct use of the covariance matrix and are based on quadratic
forms of the mean vector. The ARL performance of these charts has been compared by
Pignatiello (1990) with the ARL of the classical CUSUM charts and found to be superior.
Overall, these charts appear to be a good control charting procedure for
detecting a variety of shifts in the mean of a multivariate normal process.
To introduce the first multivariate CUSUM scheme, the multivariate sum
$$C_t = \sum_{i=t-n_t+1}^{t}\left(X_i - \mu_0\right)$$
is considered, where n_t can be interpreted as the number of subgroups since the most
recent renewal (i.e. zero value) of the CUSUM, and is formally defined below. Since
(1/n_t)C_t may be written as the difference between the average of the most recent n_t
subgroups and μ₀, the vector C_t/n_t represents the difference between the accumulated
sample average and the target value for the mean. Consequently, at time t, the multivariate
process mean can be estimated to be (C_t/n_t) + μ₀. The norm of C_t,
$$\|C_t\| = \sqrt{C_t^{T}\,\Sigma^{-1} C_t}$$
is seen as a measure of the distance of our estimate of the mean of the process from the
target mean for the process. A multivariate control chart can then be constructed by
defining MC1 as
$$MC1_t = \max\left(0,\; \|C_t\| - k\,n_t\right)$$
and
$$n_t = \begin{cases} n_{t-1} + 1 & \text{if } MC1_{t-1} > 0 \\ 1 & \text{otherwise} \end{cases}$$
where the choice of the reference value k > 0 is discussed below. The MC1 chart operates
by plotting MC1_t on a control chart with an upper control limit UCL1. If MC1_t exceeds
UCL1, then the process is deemed to be off-target. Rather than basing a multivariate
CUSUM statistic on the square of the distance of the accumulated sample average
from μ₀, one could consider the square of the distance of each sample mean from μ₀.
chart methodology by utilizing the time series analysis approach and by introducing
dependence via a second-order autoregressive process (AR(2) model). Curves of the
modified auxiliary quality control factors are presented, showing the substantial effect of
dependence on the classical quality control factors.
Yashchin (1993) discusses a situation in which one is interested in evaluating the run-length
characteristics of a cumulative sum control scheme when the underlying data
show the presence of serial correlation. In practical applications, situations of this type
are common in problems associated with monitoring such characteristics of the data as
forecasting errors, measures of model adequacy, and variance components. The
problem is also relevant in situations in which data transformations are used to
reduce the magnitude of serial correlation. The basic idea of the analysis involves replacing
the sequence of serially correlated observations by a sequence of independent and
identically distributed observations for which the run-length characteristics of interest are
roughly the same. Applications of this method to several classes of processes arising in
the area of statistical process control are discussed in detail, and it is shown that it leads
to approximations that can be considered acceptable in many practical situations. The
reader is referred to Vasilopoulos (1978) and Yashchin (1993) for details.
Statistical process control (SPC) by displaying multivariate data.
Control charts are more valuable in practice especially when used as simple graphical
aids that let the process operator, who is untrained in statistical techniques, get a mental
picture of the process history and interpret whether or not the quality of the process operation
is at a satisfactory level.
Displaying multivariate data is probably the most popular topic of current statistical
graphics research (Blazek, 1987). Particularly in the area of process control, where micro-
and mini-computers collect, analyze, and store thousands of observations on each
phase of a process each day, effective graphical display of these data is a necessity. In
one plant the data collected by process control computers can amount to millions of bits
of information. Even after narrowing down the variables to Juran's "significant few" and using
computers to help process the data, graphics is the only means available to present this
plethora of data to the operators and supervisors for rapid interpretation of results.
This problem has been most keenly felt with the increasing use of statistical process control
(SPC) methods. Although well versed in basic techniques and SPC charting, operators
and supervisors feel overwhelmed when they realize that their complex, interdependent
processes produce ten to twenty significant variables to be monitored simultaneously
instead of just two or three. Although summary measures, like Hotelling's T² presented
in the previous section, can quantify the overall status of the system, these people
also need information on the individual contributors as well. This is especially true in
facilities where modernization has produced new measurements for which the component
interdependencies are unknown.
A graphic which displays the individual and collective relationships sequentially, called
the polyplot, will be presented in the following. The polyplot is a glyph in the shape of a
polygon with rays emanating from the vertices. A glyph can represent a single observation
or, as is often the case in SPC applications, a composite of several observations.
Each side is of equal length, and the number of sides of the polygon corresponds to the
number of variables being studied. Each vertex and corresponding ray is associated with
a specific variable. The rays of the polygon are oriented along invisible line segments
passing from the center of the polygon through the appropriate vertex. The length of a
ray varies with the value of each observation. The vertex corresponds to the mean of
each variable X_i, and the distance from the mean is transformed into standard error units.
Thus one can plot
$$\frac{\bar{X}_i - \mu_i}{s/\sqrt{n}}$$
which indicates the number of standard error units the ith data average is from its process
mean (X̄ᵢ is the ith average of n data points; s is the estimate of the process standard
deviation and n is the sample size). Values less than the mean are mapped by rays starting
at the vertex and going toward the center. Rays starting at the vertex and going away
from the center correspond to observed values greater than the mean. The distance between
the vertex and the center is four standard error units. To ensure that the rays of
one glyph do not overwrite a nearby glyph nor extend past the center, rays greater than
four standard error units are truncated at that limit.
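A small sketch of this ray-length computation and its line-style encoding (described next) follows; variable names and the example values are illustrative.

```python
# Sketch of the polyplot ray-length computation in standard error units, truncated at 4.
import numpy as np

def ray_lengths(subgroup_means, process_means, s, n):
    """Signed ray lengths (positive = away from the centre) for one glyph."""
    z = (np.asarray(subgroup_means) - np.asarray(process_means)) / (s / np.sqrt(n))
    return np.clip(z, -4.0, 4.0)

def ray_style(z):
    """Line style encoding the significance of the deviation."""
    a = abs(z)
    return "dotted" if a < 2 else ("dashed" if a < 3 else "solid")

rays = ray_lengths([10.2, 9.7, 11.5], [10.0, 10.0, 10.0], s=0.8, n=4)
print(rays)
print([ray_style(z) for z in rays])
```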
The use of different line styles or colors for the rays encodes the level of significance of
the deviation from the estimated mean. For example, the ray is dotted if the deviation is
within two standard error units, dashed between two and three standard error units, and
solid beyond three standard error units. If colors can be used, it is suggested to use blue
for rays less than two standard error units, green between two and three, and red to
signify rays beyond three standard error units.
Figs. 1.7, 1.8 and 1.9 demonstrate the appearance of the glyph for three, five and seventeen
variables, respectively. The glyph displays the relative number of variables dramatically
as the polygon changes from a triangle to a pentagon to a virtual circle with
seventeen variables. Yet the figure still allows the user to identify individual variables
easily. The use of different line styles or colors is also effective in this setting. The pattern
of blue, green and red rays, in addition to the varying ray length, lends itself to effective
pattern recognition among related variables. For black and white presentation, it is
suggested that the use of varying line styles is a good substitute for color.
Because the initial application of the polyplot was as an SPC tool, the characteristics of
the glyph were designed to relate multivariate quality control information. The sequencing
of the glyphs displays the production narrative over time, while the number of vertices
informs the reader of the number of variables being tracked. The length of each variable's
arm away from the polygon is significant information. That is, when a variable is
"in control" near its mean value, the arm length is very short.
Figure 1.7 Three-variable polyplot. Figure 1.8 Five-variable polyplot.
The dotted line style of the ray, or a1ternately the blue color, is unobtrusive. However,
when the variable is "out of control", the arm lengthens and the line becomes solid or
changes color, thus calling attention to itself visually. The glyphs are arranged in a time
sequence in a left to right and top to bottom order, the natural reading sequence in the
western world. Also, the use of differing line styles or colors emphasizes the points that
SPC personnel are most interested in studying. The shift from dotted to dashed (or blue
to green) occurs when the variable is approaching an "out-of-control" condition; the
change to a solid (or red) line occurs when that condition is reached. The experience has
been that the line pattern or color, in combination with the different arm lengths, identifies
many relationships which could otherwise be overlooked. The use of patterned lines or
color therefore efficiently relates the scale without using "non-data" ink.
Polyplots are very useful in an SPC environment, especially when multivariate control
charting is appropriate. Univariate control charts can be misleading even when the corre-
lation between variables is moderate. Bivariate control charts limit the study to pairs of
variables and lack time sequence information (Himmelblau, 1978). In these SPC applica-
tions, the number of individual observations used to calculate the standard error and av-
erage for each glyph are under user control. The rays of each glyph correspond to uni-
variate X-bar chart entries. Each polyplot, then, represents the condition of the process
for a given unit of time. By reading consecutive glyphs in normal order, left to right and
top to bottom, the sequential history of a process is narrated. Finally, the bar to the left
of each glyph represents the relative value of Hotelling's T² for the data displayed by
each glyph. Charting Hotelling's T² provides an assessment of multivariate control in
time sequence, but usually one would like to combine the multivariate control
information with information about the univariate behavior. By including T² information
in a polyplot, one can compactly convey both the multivariate and univariate control
information (Tracy et al., 1992). The following production quality control example, taken
from Blazek (1987), illustrates this idea.
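As a minimal sketch of the statistic charted beside each glyph, the following Python fragment computes Hotelling's T² for a subgroup mean vector; the covariance estimate S, the subgroup size and all names are illustrative assumptions rather than values from the example:

    import numpy as np

    def hotelling_t2(xbar, mean, S, n):
        """T^2 for a subgroup mean vector xbar against the process mean vector."""
        d = np.asarray(xbar) - np.asarray(mean)
        return n * d @ np.linalg.solve(S, d)     # n * d' S^-1 d

    # Example: two correlated variables, subgroups of two coils per glyph
    S = np.array([[1.0, 0.6],
                  [0.6, 1.5]])
    print(hotelling_t2([1.2, -0.4], [0.0, 0.0], S, n=2))

The resulting value would be compared with the critical value indicated by the hash mark beside the bar.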
In Fig. 1.10 the values of T² associated with vector subgroup averages are represented
by the height of a vertical bar plotted to the left of each glyph. The critical value for T² is
the hash mark above the T² bar. When the T² value becomes significant and crosses the
hash mark, the bar changes line style from dots to solid. Therefore, when T² exceeds a
control limit, one not only knows that the process is out of control, but also which com-
bination of univariate signals led to the event. Note that this implementation efficiently
displays multivariate and univariate control information simultaneously. Figure 1.10
displays six measurements from 50 coils produced at a particular plant. The
measurements are identified as Variable 1 through Variable 6. Each glyph represents the
composite statistic on two consecutive coils.
The most interesting variable is Variable 1. After producing four good coils (the first two
glyphs), Variable 1 went out of control when the control system lost its calibration. As a
result the next 13 coils were out of control in a positive direction. During this time
Variables 2 and 4 were also affected and went out of control at several points. The con-
trol system was tuned between coils 17 and 18 (and again after coil 20, as it was overcor-
rected the first time), and good coils were produced until coil 36 (glyph 18). Then,
Variable 1 went out of control in a negative direction until the system was repaired
again. During this time Variables 2, 4 and 5 also showed out-of-control values. Normal
processing then resumed through the end of these records.
[Figure 1.10: polyplot chart of the six variables (legend: 1 VARIABLE-1 through 6 VARIABLE-6) for the 50 coils, with a Hotelling T² bar and critical-value hash mark beside each glyph.]
With this technique, both the ray and Hotelling "thermometer" visually indicated when
the system went out of control and why. The polyplot also demonstrates the process
relationship between Variables 2, 4 and 5 and Variable 1. The nature of the process is
such that only certain types of Variable 1 problems also manifest themselves in the other
variables. Therefore, a conventional analysis of the data would probably not correlate the
two. With this graphical hint and some process knowledge, conjectures about the actual
relationship can be developed.
1.2.2.2 Conclusions
The success of a condition monitoring (CM) system is as dependent on its planning and
design as on the sensors and signal analysis techniques used. Before the sensor fit and
techniques can be practically decided, various features should be considered by the op-
erator to define the requirements:
• Which items of machinery should be monitored?
• What sort of faults should the system detect?
• What level of diagnostic/prognostic information is required?
Generally the answers indicate that the monitoring strategy should aim at reducing the
number and severity of failure incidents between overhauls (which have high consequen-
tial costs in terms of damage and loss of availability) and increasing the prognostic ca-
pability so that maintenance can be planned effectively. The high plant availability and
low maintenance cost requirements demanded in the current economic climate necessi-
tate efficient, cost-effective monitoring systems. Their performance can be assessed by
the criteria:
[Figure: vibration level spectrum with the overall level indicated, plotted against frequency.]
The vibration from a rotating machine varies as parts wear out and are replaced.
However, this variation occurs over such a long period that the signal can usually be regarded
as being stationary. Truly stationary signals do not really exist in nature. Non-stationary
signals can exhibit a short-term or a long-term variation. For instance, vibration from a
reciprocating machine is stationary when regarded over several cycles, but over a single
cycle, which consists of several transients, it is non-stationary. Vibration from a machine
which is running up or down in speed, however, is non-stationary on a long-term basis.
Stationary deterministic signals show the well-known line spectra. When the spectral
lines have a harmonic relationship, the signal is described as being periodic. An example
of a periodic signal is vibration from a rotating shaft. Where no harmonic relationship
exists, the signal is described as being quasi-periodic. An example of a quasi-periodic
signal is vibration from a turbojet engine, where the vibration signals from the two or
more shafts rotating at different frequencies produce different harmonic series bearing no
relationship to each other (Lyon, 1987; Randall).
The well-known Fourier Transform (Lyon, 1987) gives the mathematical connection be-
tween time and frequency, and vice versa, and given a time signal allows calculation of
the spectrum. The Fast Fourier Transform (FFT), see Appendix 1.B, is merely an effi-
cient means of calculating the discrete form of the Fourier Transform (DFT).
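As an illustration (not part of the original text), the spectrum of a sampled signal can be obtained with a few lines of Python; the sampling frequency and the test signal are arbitrary assumptions:

    import numpy as np

    fs = 1024.0                                   # sampling frequency, Hz
    t = np.arange(0, 1.0, 1.0 / fs)
    x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 170 * t)

    X = np.fft.rfft(x)                            # DFT computed via the FFT
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    amplitude = 2.0 * np.abs(X) / len(x)          # single-sided amplitude spectrum

    print(freqs[np.argsort(amplitude)[-2:]])      # the two dominant lines (50 and 170 Hz)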
A deterministic signal can be analyzed by stepping or sweeping a filter across the fre-
quency span of interest and measuring the power transmitted in each frequency band. A
random signal is a continuous signal whose properties can only be described using statis-
tical parameters. Examples of random signals are cavitation and turbulence. Random sig-
nals produce continuous spectra. Since random signals have continuous spectra, the
amount of power transmitted by the analyzing filter will depend on the filter bandwidth.
In all frequency analysis, there is a bandwidth-time limitation. When using a filter, it
shows up as the response time of the filter. A filter having a bandwidth of B (Hz) will take
approximately 1/B seconds to respond to a signal applied to its input. If the analyzing
filter is B (Hz) wide, one has to wait at least 1/B seconds for a measurement. After
filtration, the filter output must be detected. One can detect the peak level passed by the
filter, the average level, the mean square level, or the root mean square level. Mean
square or root mean square detection is used, since it relates to the energy or power
content of the signal independent of the phase relationships. Peak detection is relevant
when maximum excursions are important. Mean square and root mean square detection
require that the output of the analyzing filter be squared and averaged. The period over
which the square of the filter output is averaged is called the averaging time, TA.
With random signals, averaging is used to reduce the standard deviation, σ, of the meas-
ured estimate. For a mean square measurement, then:

σ = 1 / √(B·TA)

where B is the analyzing filter bandwidth and TA is the averaging time. For a root mean
square measurement:

σ = 1 / (2·√(B·TA))
The above assumes that B·TA ≥ 10. When FFT analyzers are used, B·TA will usually be equal
to nd, the number of averages. However, this will depend on the overlap conditions set.
Overlap is where overlapping time records are analyzed and averaged; 0% overlap means
that only results from statistically independent records are averaged.
As a consequence of the Central Limit Theorem, it can be assumed that any narrow-band
filtered random signal follows a Gaussian distribution. Hence, from the properties of the
Gaussian distribution, there is a 68.3% chance of being within ±σ, a 95.5% chance of
being within ±2σ, and a 99.7% chance of being within ±3σ of the true mean value of the
signal.
Many FFT analyzers can also average signals in the time domain. Time domain averaging
can be used with repetitive signals, for instance, repeated transients or vibration from
rotating machines, to suppress extraneous noise. However, there must be a trigger signal
synchronous with the signal being averaged. The amount of noise suppression (for ran-
dom noise) which can be achieved with time domain averaging is equal to 1/√nd,
where nd is the number of time domain records averaged. Time domain averaging is also
called signal enhancement, or synchronous averaging.
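A rough numerical check of the 1/√nd noise reduction can be written in a few lines of Python; the cycle length, noise level and trigger handling are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    cycle_len, nd = 256, 100                      # samples per cycle, number of averages
    t = np.arange(cycle_len)
    periodic = np.sin(2 * np.pi * 4 * t / cycle_len)   # repetitive, trigger-synchronous part

    records = periodic + rng.normal(scale=1.0, size=(nd, cycle_len))
    enhanced = records.mean(axis=0)               # synchronous (time domain) average

    print(np.std(records[0] - periodic))          # raw noise level, about 1.0
    print(np.std(enhanced - periodic))            # about 1/sqrt(100) = 0.1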
In vibration measurements external measurements of internal effects must be made.
However, the transmission path characteristics from the source of vibration to the meas-
uring point will vary from machine to machine, even if the machines are of the same de-
sign and construction. This is due to differences in castings, welds, tightness of bolts,
etc. Even in a single machine, the transmission path characteristics will vary with fre-
quency.
For this reason, changes in vibration level are usually expressed as relative changes in decibels:

Change in dB = 20 log10(A1/A2)

where A1 is the present level and A2 the previous level. The same relative change will
give the same change in dB, independent of the absolute levels measured. The absolute
levels themselves, however, will depend on the transmission path characteristics. This
can be extended to making all measurements in dB refer to a common reference.
Changes in vibration levels can then be conveniently plotted simply by subtracting the
previous vibration level in dB from the present level. The two most commonly used
methods of presenting data in the frequency domain are constant bandwidth on a linear
frequency scale, and constant relative bandwidth on a logarithmic frequency scale. The
two methods have their own different applications. The former gives equal resolution
along the frequency axis, making it easier to identify such things as families of harmonics
and sidebands. Its limitation is that it can only be used across a frequency range of about
1 and 1/2 decades. The latter can be used across a broader frequency range (3 or 4
decades is typical). Its drawback is that the resolution gets progressively worse at higher
frequencies.
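A small Python sketch of the dB-referenced trend comparison described above follows; the 20·log10 form for amplitude quantities and the reference value are assumptions, since the text does not state them explicitly:

    import numpy as np

    def level_db(amplitude, reference=1e-9):      # reference value is illustrative
        return 20.0 * np.log10(amplitude / reference)

    previous, present = 2.0e-3, 4.0e-3            # e.g. velocity amplitudes
    change = level_db(present) - level_db(previous)
    print(round(change, 1))                       # doubling the level gives about 6 dB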
In fault detection, it is necessary to use a broad frequency range in order that all machine
faults can be detected. Typically this requires a range of under half the slowest shaft
speed to more than three times the highest toothmeshing frequency. Also, the possibility
of easy speed compensation is desirable, since machine speeds will vary from measure-
ment to measurement. Both these requirements are fulfilled by constant relative band-
width analysis on a logarithmic frequency axis. Once again, the spectrum should be plot-
ted with logarithmic amplitude.
The most basic level of vibration measurement is to measure the overall vibration level
on a broadband basis in a range of, for example, 10-1000 Hz or 10-10000 Hz. Such
measurements are also relevant with displacement measurements from proximity probes,
where the frequency band of interest is usually from about 30% of the running speed up
to about the 4th harmonic. An increasing vibration level is an indicator of deteriorating
machine condition. Trend analysis involves plotting the vibration level as a function of
time, and using this to predict when the machine must be shut down for repair. Another
way of using the measurements is to compare them with published vibration criteria.
One example of a published vibration criterion chart is the General Machinery Criterion
Chart, which is for displacement or velocity measurements at the bearing cap. In this
chart there is always a factor of two involved in movement from one class to the next,
that is a constant interval of 6 dB, and logarithmic axes are employed (Mitchell, 1981).
Another example is VDI 2056, from Germany, shown in fig. 1.12. It is for measurements
at the bearings of the machine of interest in a frequency range of 10 Hz to 1000 Hz.
[Figure 1.12: VDI 2056 vibration severity chart: logarithmic velocity scale in mm/s (20 dB corresponding to a factor of 10), with ratings from allowable through just tolerable to not permissible, for machine classes up to 15 kW, 15-75 kW (300 kW) and above 75 kW.]
This chart differentiates between vibration classes according to machine size. Note again
the logarithmic velocity axis and the constant width of the allowable and just tolerable regions.
[Figure: trend curves of the measured overall level and the measured spectrum compared against limit values, used to estimate the date of shut-down.]
The objective of frequency analysis is to break down the vibration signal into its compo-
nents at various frequencies. It is used in machine health monitoring because a machine
running in good condition has a stable vibration spectrum. As parts wear and faults de-
velop, however, the vibration spectrum changes. Since each component in the vibration
spectrum can be related to a specific source inside the machine (e.g. unbalanced masses,
toothmeshing frequency, blade pass frequency resonances), this then allows diagnosis of
the fault.
The basis of fault diagnosis is that different faults in a machine will manifest themselves
at different frequencies in the vibration spectrum, as can be seen in fig. 1.14. The fre-
quency domain information can then be related to periodic events in gears, bearings, etc.
Note that fault diagnosis depends on having a knowledge of the machine in question, that
is the shaft frequencies, toothmeshing frequencies, number of teeth on gears, bearing ge-
ometries, etc.
Two of the most common faults associated with rotating shafts are unbalance and mis-
alignment. Unbalance produces a component at the rotational frequency of the shaft,
mainly in the radial direction. A misaligned coupling, however, will produce a component
at the rotational frequency, plus usually its lower harmonics, both in the axial and radial
directions. Misaligned bearings produce a similar symptom, except that the higher har-
monics also tend to be excited. A bent shaft is just another form of misalignment, and
will produce vibration at the rotation frequency and usually its lower harmonics. Finally,
a cracked shaft produces an increase in the vibration at the rotational frequency and the
second harmonic (Cue, 1990; Mitchell, 1981).
Fig. 1.15, taken from Randall, shows an example of the effect of misalignment in a
gearbox. Both the low speed (50 Hz) and high speed (85 Hz) shafts are originally mis-
aligned. After repair, the 50, 85 and 170 Hz components are considerably reduced. The
100 Hz component, however, remains more or less at the same level, which might appear
strange until it is realised that it is not only the second harmonic of the shaft speed, but
also the second harmonic of the mains frequency (2-pole synchronous motor). This is a
common electromagnetic source of vibration. Note that the higher noise level in the up-
per spectrum is because it was originally recorded as acceleration and integrated to ve-
locity on playback.
Magnetically induced vibration is an important source of vibration in electrical machines.
One source is the rotating magnetic field, which causes alternating forces in the stator.
Since there are symmetrical conditions for a north or south pole, this gives rise to vibra-
tion at twice the mains frequency, or the "pole passing frequency" . Note that in electrical
machines, the force is proportional to the current squared, that is the vibration is highly
load dependent. In induction motors, the rotational frequency will usually be slightly less
than the synchronous frequency. For instance, fig. 1.16 shows the vibration spectra for
an induction motor. The lower of the two is a detailed analysis obtained by non-destruc-
tive zoom, and shows that the high 100 Hz component is electromagnetic in origin rather
than from misalignment.
[Figure 1.15: gearbox vibration spectra before and after repair, with components marked at 50 Hz, 85 Hz, 100 Hz and 170 Hz.]
velocity. Where the velocity spectrum is flat, the displacement spectrum will show a -6
dB/octave slope, and the acceleration spectrum a +6 dB/octave slope. A brief presenta-
tion of the most commonly used vibration measuring transducers is given in the follow-
ing.
[Figure 1.16: induction motor vibration spectrum (0-800 Hz) with a zoomed spectrum around 100 Hz, separating the 99.6 Hz and 100.0 Hz components.]
Mechanical levers measure displacement, see fig. 1.17. They are inexpensive and self-gen-
erating but limited to low frequencies only, sensitive to orientation and prone to wear.
Eddy current (or proximity) probes measure displacement, see fig. 1.18. There are no
moving parts or contacts, resulting in no wear, but variations in the magnetic properties
of the shaft give erroneous signal components.
When a force is applied to a piezoelectric material in the direction of its polarisation, an
electric charge is developed between its surfaces, giving rise to a potential difference on
the output terminals. The charge (and voltage) is proportional to the force applied. The
same phenomenon will occur if the force is applied to the material in the shear mode. Both
modes are used in practical accelerometer design.
Accelerometers (compression type or shear type) measure acceleration, see fig. 1.19.
Usually they have no moving parts, so there is no wear, and they have a very large dynamic
range and wide frequency range, making them suitable for most applications.
Noise analysis.
Up to now it has been discussed how vibration is transmitted through a machine to its outer
surfaces. In the following it is considered how that vibration is converted into sound.
Sound radiation is inherently a complicated process (Lyon, 1987). It turns out, however,
that some fairly simple geometrical and dynamical parameters control sound radiation.
These parameters allow reasonably good estimates of sound radiation to be made. More
specifically, Lyon (1987) has shown that the sound power radiated by a vibrating
machine structure is proportional to the space-time mean square vibration velocity.
An example of a sound source is the noise produced by an air jet when it impinges on a
rigid obstacle such as a fan blade. When the turbulent flow produces forces on an obsta-
cle, then, by Newton's law of reaction, the obstacle puts forces back on the fluid in the
form of fluctuating lift and drag, resulting in sound radiation. Large-scale motions asso-
ciated with structural vibration are usually much more efficient in radiating sound.
Impacting forces also produce a broad spectrum of vibration in the machine, and this
represents another source of sound radiation. Generally, the sound energy produced by
vibration will be greater, particularly for large machines that are resonant. For example,
although there is direct sound radiation due to the deceleration of the impacting elements
in a punch press, the major amount of sound usually comes from the impact-induced vi-
bration and its subsequent radiation.
The ability of multi-channel FFT analyzers (see Appendix 1.B) and other analyzers using
digital filters to quickly and accurately compute the cross spectrum between microphone
signals has been the basis for the very rapid growth in using acoustical intensity meas-
urements to determine the sound power radiated by machines. The usual measurement
procedure is to surround the machine with a fixed array of microphone probes or a trav-
erse setup that sweeps over an area surrounding the machine.
Identification and ranking of the noise sources is essential for both new and existing in-
stallations. Only the sources which are contributing to the excessive noise levels need to
be treated.
Frequently a trial-and-error approach is used. Dominant sources are identified from far-
field sound pressure measurements by comparing far-field noise spectra with near-field
spectra of probable sources. It is very difficult however to distinguish between spectra of
sources when many sources exist in the near-field. It is also difficult to know how much
to silence a source and to know whether all the important sources are identified. Often a
major source is treated, lowering the near-field noise but reducing the far-field levels only
marginally because other sources start to dominate. It is also important that suppliers
provide suitable noise data on their products.
The use of sound intensity techniques will provide better sound power information by
determining sound power levels of individual sources without subjecting bulky equip-
ment to the confines of anechoic or reverberant chambers. These sound power levels can
be calculated from intensity measurements taken in situ in the presence of many sources.
Using sound powers and correcting for directivity, distance and excess environmental
attenuation, a mathematical model can be generated to determine the effect of the major
sources on far-field sound pressure levels. This model allows the ranking of the sources
in order of importance and provides a means to predict the impact of a noise abatement
programme. The results will be used to predict sources at other similar plants and to
provide information to suppliers to enable them to improve machine package design
(Laws, 1987; Cue, 1990).
Sound power can be reliably calculated from sound pressure levels in a controlled envi-
ronment, or in the free-field where sources do not interfere with one another. If ambient
noise levels are high and the sound field is reactive, however, only sound intensity meas-
urements will enable calculation of accurate sound power levels.
Sound intensity is the sound energy flux, a vector quantity describing the magnitude and
direction of the net flow of acoustic energy. The dimensions commonly used
for sound intensity are therefore W/m². By taking ten times the logarithm of the ratio of the sound in-
tensity to a reference value (10⁻¹² W/m²), the sound intensity level can be expressed in
decibels.
The integral of the sound intensity over a surface is the sound power passing through the
surface. The sound intensity and sound power levels can be expressed in terms of octave
bands, third-octave bands, or overall noise level over any frequency range.
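The two relations just described can be written down directly; in this hedged Python sketch the intensities, areas and the box-shaped measurement surface are invented for illustration:

    import numpy as np

    I_REF = 1e-12                                        # W/m^2, intensity reference

    def intensity_level_db(intensity):
        return 10.0 * np.log10(intensity / I_REF)

    # Box technique: mean normal intensity (W/m^2) and area (m^2) of each open side
    side_intensity = np.array([2.0e-4, 1.5e-4, 0.8e-4, 1.1e-4, 0.6e-4])
    side_area = np.array([1.2, 1.2, 0.9, 0.9, 1.5])

    sound_power = np.sum(side_intensity * side_area)     # discrete form of the surface integral
    power_level_db = 10.0 * np.log10(sound_power / 1e-12)
    print(intensity_level_db(side_intensity).round(1), round(power_level_db, 1))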
There are different instrumentation packages on the market that can measure sound in-
tensity levels. The instrumentation usually consists of a pair of microphones in conjunc-
tion with either a dual channel Fast Fourier Transform (FFT) signal analyser or a real-
time sound intensity analyser (see Randall). For continuous level noise sources such as
gas-turbines, both types of analyser will give similar results. For the study of an unsteady
source such as a jack hammer, a real-time instrument should be used to capture the peak
levels.
Sound intensity techniques have a number of inherent limitations, such as bias errors re-
sulting from the finite pressure difference approximation for particle velocity, phase mis-
match errors due to phase differences in the microphones and analyser channels, and re-
activity errors resulting from phase mismatch of both the equipment and the measure-
ment surface. The bias errors limit accuracy at the higher frequencies, while the phase
mismatch errors limit the lower frequency capabilities. Reactivity errors could result at
any frequency, depending on the location and sound power levels of extraneous sources
and the distance between the microphones. The microphone spacing should be selected
correctly for the frequency range of interest, in order to minimize the amount of error.
These errors are fully discussed in the literature (Randall; Mitchell, 1981).
In addition to sound power levels, an important factor in determining the effect of a
source on the far-field is the directivity of the source. There are two components of
directivity which can be described as directivity factors: the directivity which a source
would exhibit if it was operating in an anechoic chamber or in the air without any reflec-
tive surfaces (QA), and the directivity effect upon a source due to reflective surfaces,
which can be termed spatial directivity (QS). Such sources as exhaust ducts, air-intake
ducts and vents radiate sound non-uniformly even if there are no reflective surfaces. The
spatial directivity factor accounts for reflections from such items as the ground and walls.
The spatial directivity factors for spherical, hemispherical and quarter-spherical propaga-
tion are one, two and four respectively. The total directivity factor Qθ is defined as the
product of QA and QS, and this total directivity factor is translated into a directivity index
(DIθ), expressed in decibels, by the following equation:

DIθ = 10 log Qθ = 10 log QA + 10 log QS
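A one-line computation makes the relation concrete (the source and ground-plane values below are illustrative assumptions):

    import numpy as np

    def directivity_index_db(Q_A, Q_S):
        return 10.0 * np.log10(Q_A * Q_S)         # = 10 log QA + 10 log QS

    # Example: a mildly directive source (QA = 2) over a hard reflecting plane (QS = 2)
    print(directivity_index_db(2.0, 2.0))         # about 6 dB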
Sound intensity measurements for any particular source will use one of three types of
control surface: conformal (conforming to the shape of the object), hemisphere (a hemi-
spherical "cover" placed over the source), box (a box-shaped "cover" placed over the
source), or a combination of these. The control surfaces are determined using coordi-
nates relative to the object of interest. The physical size of the majority of the sources
examined in the case study of fig. 1.20 dictated the use of a box technique. This tech-
nique is best explained by describing sound intensity measurements over one of the
sources investigated - an inertial air filter extraction fan. A box shape was constructed
over the fan as shown in fig. 1.20, and the area of each of the five open sides was deter-
mined. The sixth side of the box was covered by the steel plate of the filter house. Since
the fan and duct did not radiate sound uniformly through each side of the box, the sound
intensity needed to be measured for each of the five sides. Before taking readings of the
average sound intensity for the box, the choice was made between the use of a grid or a
sweeping technique.
The grid technique involves constructing a real or imaginary grid of equal-area shapes
over a surface and taking sound intensity measurements at the centres of each of these
shapes. The grid size should be small enough so that the intensity does not vary greatly
throughout the shape. An identical grid is set up within the sound intensity computer
program (Ikeuchi, 1988) and measurements are then taken systematically.
Figure 1.20 Extraction fan control surface.
The operator ensures that the probe is in the correct location in the centre of the shape
and perpendicular to the box surface. Once all measurement points are stored, the
computer program will display the sound intensity results in tabular or graphical form.
The sound power for each of the grid areas, as well as the total sound power of the
complete surface, is calculated by the computer.
The sweeping technique is a space-averaging process taken across a complete side. The
probe is kept at right angles to the surface and swept uniformly across the surface while
the analyser averages the sound intensity.
The sweeping technique was used for each of the five sides of the imaginary box over the
extraction fan, using 250 averages over each of the sides or sectors. As described before,
the frequency range over which measurements are taken results in various values of bias,
phase mismatch and reactivity errors. These errors can be minimized by selection of the
proper combination of frequency range and space between the probe microphones (see
the case studies below).
Fault signature extraction.
A new general approach to the statistical development of diagnostic models is the use of
nonparametric pattern classification techniques, so as not to require knowledge of the
probabilistic structure of the system. Recently, Chin and Danai (1991) introduced a
nonparametric pattern classification method with a fast learning algorithm based on
diagnostic error feedback that enables it to estimate its diagnostic model from a small
number of measurement-fault data.
This method utilizes a multi-valued influence matrix (MVIM) as its diagnostic model
and relies on a simple diagnostic strategy ideally suited to on-line diagnosis. The MVIM
method can also assess the diagnosability of the system and variability of fault signatures
which can be used as the basis for sensor selection and optimization.
[Figure: block diagram of the fault signature extraction scheme, with processed measurements P entering the flagging unit.]
The figure above illustrates the various stages of the fault signature extraction scheme
used for improved diagnosis. In the flagging unit, the processed measurements are first flagged by
thresholds and then filtered by a single-layer network. A sample batch of measurement-
fault vectors is used to tune the flagging unit through iterative learning using a
nonparametric pattern classification method. Once all the measurement vectors in the
sample batch are flagged, the MVIM is estimated to provide the indices for fault
signature variability and system diagnosability. These indices, along with the number of
false alarms and undetected faults, are then fed back to the unit's adaptation algorithm to
tune the unit's parameters in its next adaptation iteration. The parameters of the flagging
unit are tuned iteratively until its performance indices are extremized.
The effectiveness of this scheme is demonstrated by simulation in Chin and Danai
(1991), where the reader is referred for the detailed mathematical analysis and
implementation features of the method. The method is also suitable for automatic tool
breakage detection in machining.
System Analysis.
The distinction between signal analysis and system analysis is often made depending on
what can be measured.
In practical analysis situations, there is either no measurable input but a measurable output,
or both a measurable input and output. In the first case, it is only possible to make a signal
analysis, while in the second, because of the presence of information about both the input
and the output, it is possible to make an analysis of both the signals and the system.
In signal analysis the input to the system is usually not measured. This can be due to any
of three reasons. The first is that the input might be inaccessible. A good example of this
is machine health monitoring, where external measurements must be used to monitor in-
ternal effects. The second reason is that it might be impossible to define an individual in-
put, as, for instance, in many environmental noise measurements. The third reason is that
the output might be the only item of interest, as, for instance, in noise dose or whole
body vibration measurements.
In system analysis, measurements are made of both the input and the output of the sys-
tem. It is best to measure the input and output simultaneously (while at the same time
taking account of any system delays), so as to maintain the phase relationships, although
some limited system analysis (measurement of the magnitude of the frequency response
function) is possible using sequential measurements. With system analysis one can obtain
the system properties, which can then be used to predict how the system will behave un-
der various excitations.
System analysis is mostly used as a design tool. However, it also has applications when
systems are installed. It can, for instance, be used to monitor structures such as oil pro-
duction platforms, machine foundations, etc., for faults. It can also be used for determi-
nation of signal sources and signal paths when, for instance, it is necessary to isolate part
of a system from vibration. The classical method of system analysis is swept sine
testing. The system is excited with a sine wave, and feedback from the output is used to
hold the input amplitude constant as the sine wave is swept up or down in frequency.
Hence, the amplitude of the sine wave at the output of the system gives the magnitude of
the frequency response, and the phase difference between the input and output gives its phase. The
advantages of swept sine testing are the high signal-to-noise ratio and the possibility of
studying non-linearities. The disadvantage is that it is slow. However, the speed limita-
tion has been largely removed by Time Delay Spectrometry (TDS), where very fast sine
sweeps are used to give results almost in real time.
Dual-channel digital filtering is rarely used for system analysis although it is potentially a
very powerful tool, since true real-time measurements can be made. It is a very powerful
means of acoustic and vibration intensity measurement.
Dual-channel FFT analysis forms a very powerful and widely used means of system
analysis (Randall and Ikeuchi, 1988). Both the input and output of the system are
measured simultaneously (taking account of system delays). The basic measured data are
the autospectra at the input and output and the cross spectrum between the input and
output, from which many other functions can be calculated. The phase information is
maintained, and the effects of noise can be reduced.
Some advantages of dual-channel FFT analysis are its flexibility and the fact that it is easy to
use. Also, because the input signal to the system need not be controlled, naturally occur-
ring excitations can be used. Finally, since it is a digital form of analysis, the results can
be easily entered into a computer, for example, to carry out a modal analysis.
Dual-channel FFT analyzers are easy to use, but there are several pitfalls that it is neces-
sary to be aware of. Three of them are leakage, the assumption of a linear system,
and compensation for system delays.
Leakage is an effect which occurs because FFT analyzers (both single and dual-channel)
operate on a time-limited signal. The rectangular weighting introduced produces a
(sinx)/x filter characteristic, and power "leaks" from the main lobe to the sidelobes,
meaning that measured peaks can be too low and measured valleys too high. Leakage
can be combated by using higher resolution (zoom), introducing an artificial time
window, or where the excitation can be controlled, choosing the right excitation (see
Randall).
Linearisation can also be considered an advantage. However, it is important to remember
that the dual-channel FFT analyzers impose linearity, even if the system being measured
is non-linear.
All physical systems exhibit a propagation delay. When a propagation delay becomes
significant, as can frequently happen in mechanical and acoustical (and electrical) sys-
tems, it becomes necessary to compensate for it when making a 2-channel FFT analysis,
otherwise bias errors will be introduced into the results.
For instance, suppose a system has a propagation delay of τ seconds, and the analyzer
processes data blocks T seconds long. If the analyzer processes simultaneous data blocks
at the input and output, the measured frequency response H̃ will be lower than the true
response H by a factor (1 − τ/T). Likewise, the measured coherence γ² (see below for an
exact definition) will be low by a factor (1 − τ/T)². System analysis measurements are
usually based on the Fourier Transforms, see fig. 1.21, of the input and output time sig-
nals a(t) and b(t). The input and output spectra produced (SA and SB) are two-sided, that
is, they exist for both positive and negative frequencies. However, since the time func-
tions are real, SA and SB will both be conjugate even (that is, symmetrical in amplitude
about f = 0, but with opposite phase), so it is usual to combine the positive and negative
frequency halves to form the single-sided spectra GA and GB, which are zero for negative
frequency.
The basic functions usually used for system analysis are the input and output autospectra
(formerly called the input and output power spectra), and the cross (power) spectrum,
i.e.
• input autospectrum: GAA(f) = GA*(f)·GA(f), averaged
• output autospectrum: GBB(f) = GB*(f)·GB(f), averaged
• cross spectrum: GAB(f) = GA*(f)·GB(f), averaged
[Figure content: a(t) → H(f) → b(t); SA = F[a(t)], SB = F[b(t)]; GA(f) = SA(f) for f = 0 and 2·SA(f) for f > 0, and similarly for GB(f).]
Figure 1.21 System analysis measurements.
The input and output autospectra are the squared and averaged input and output spectra.
Note that they contain no phase information. The cross spectrum is the product of the
coherent amplitudes at the input and output and the phase difference between the input
and output. The cross spectrum is the most important function in system analysis, since it
contains the phase information, and since uncorrelated noise at the input and output will
be averaged out in the cross spectrum.
Given the three basic functions, many input/output relationships can be calculated by
taking various combinations of the three and by using Fourier Transforms. The most im-
portant are the system frequency response H(f) and the system impulse response h(τ). The
impulse response is the well-known time response of a system to a delta function, and it
can be calculated by taking the inverse Fourier Transform of the system frequency re-
sponse (see Appendix 1.B). Cross correlation shows whether the input and output
signals are correlated and at what time delays. It can be calculated by taking the inverse
Fourier Transform of the cross spectrum.
Three different methods can be used to measure a frequency response function. The first
method is based on |Ha|², the ratio of the output to the input autospectrum, as it would
be measured using a single-channel analyzer.
Two other methods can be used in a dual-channel analyzer. These are the traditional
method, H1, which is the ratio of the cross spectrum to the input autospectrum, and a
newer method, H2, which is the ratio of the output autospectrum to the inverse cross
spectrum, i.e.,

H(f) = B(f)/A(f)
|Ha(f)|² = GBB(f)/GAA(f)
H1(f) = GAB(f)/GAA(f)
H2(f) = GBB(f)/GBA(f)
The three methods behave differently according to whether there is noise at the in-
put, noise at the output, or noise at both the input and the output. Noise at the output pro-
duces an error in |Ha|² and H2. On the other hand, H1 will be unaffected, since it is a
function of the input autospectrum GAA, which is noise free, and the cross spectrum,
GAB, where the noise can be averaged out. Hence H1 will give the correct result. An ex-
ample where there will be noise at the output is where there are other, unknown inputs to
the system. The effects of these other inputs will show up as noise at the output, as
shown in the figure below:
shows the figure below:
D(t) --4~
b(t)
|Ha|² = GBB/GAA = |H|²·[1 + GNN/GVV]
H1 = GAB/GAA = H
H2 = GBB/GBA = H·[1 + GNN/GVV]
Noise at the input produces an error in |Ha|² and H1. This time, H2 will give the correct
result. An example where there will be noise at the input is where a specimen is being
excited with random noise on a shaker. At a resonance of the specimen, the shaker is
effectively trying to drive a mechanical short circuit, which drives the input signal down
towards the noise floor of the measuring instrumentation. Hence the input signal-to-noise
ratio will be low. The output signal-to-noise ratio will be high, however, because of the resonance
of the specimen. This situation is shown in the figure below:
h("t)
u(t) - - -.....- -...... ~----...b(t)
H(i)
m(t)--"':
H1 = GAB/GAA = H · 1/(1 + GMM/GUU)
H2 = GBB/GBA = H
Use of H 2 for measurement at resonance peaks when using broad band random noise
excitation was first proposed by Mitchell (1981).
The situation when noise is present both at the input and the output is shown in the next
figure:
Noise at input and output: with Ei = GMM/GUU and Eo = GNN/GVV one obtains:

γ² = H1/H2
|Ha|² = |H1|·|H2|
|H1| ≤ |H| ≤ |H2|
Note that the true value of the frequency response function will always lie between H1
and H2, and that while H1 tends to give a low estimate, H2 tends to give a high estimate.
The user can choose H1 or H2 (after measurement); H1 is a lower bound while H2 is an upper
bound. H2 reduces bias errors at resonance peaks with random excitation. The coher-
ence function indicates how much of the measured output signal is linearly related to the
measured input signal, i.e.
γ²AB = |GAB|² / (GAA·GBB),   0 ≤ γ²AB ≤ 1
Low coherence can be due to, amongst other things, noise, non-linearities, or leakage.
Note that where low coherence is due to noise, it is still often possible to make good
measurements, since the effects of the noise can be averaged out in the cross spectrum.
Low coherence due to leakage can be combated by increasing the resolution. Here, it is
also important to remember that although the coherence will be the same for H1 and H2,
H2 will converge on a resonance peak faster than H1. There is nothing which can be done
to combat low coherence due to non-linearities.
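A hedged Python sketch of the H1 and H2 estimators and the coherence, using averaged auto- and cross spectra, is given below; the simulated system, the noise level and the analysis parameters are assumptions chosen only to reproduce the output-noise case discussed above:

    import numpy as np
    from scipy import signal

    fs = 2048.0
    rng = np.random.default_rng(1)
    a = rng.normal(size=200 * 1024)                       # broadband input signal a(t)
    v = signal.lfilter(*signal.butter(2, 0.1), a)         # true output of a linear system
    b = v + 0.3 * rng.normal(size=a.size)                 # noise added at the output

    f, Gaa = signal.welch(a, fs=fs, nperseg=1024)
    _, Gbb = signal.welch(b, fs=fs, nperseg=1024)
    _, Gab = signal.csd(a, b, fs=fs, nperseg=1024)

    H1 = Gab / Gaa                                        # cross spectrum / input autospectrum
    H2 = Gbb / np.conj(Gab)                               # output autospectrum / GBA
    coherence = np.abs(Gab) ** 2 / (Gaa * Gbb)            # 0 <= gamma^2 <= 1

    # With noise only at the output, |H1| stays close to the true response,
    # |H2| is biased high, and the coherence drops below one.
    print(np.abs(H1[:4]), np.abs(H2[:4]), coherence[:4].round(2))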
Figure 1.22 shows the differences obtained for a measurement of H1 and H2 on a cantile-
ver bar mounted on a shaker and excited with random noise. The resonance peaks in H1
are about 10 dB lower than in H2.
[Figure 1.22: H1 and H2 estimates for the cantilever bar; the main resonance peak reads about 34.2 dB in H1 and about 44.8 dB in H2.]
Another important tool in signal analysis is the power cepstrum (see Appendix 1.B for
details). The cepstrum is a sort of "spectrum of a spectrum". The distinctive feature of the
cepstrum is the logarithmic conversion of the spectrum. The power cepstrum can be
applied to the detection of periodic structure in the spectrum (harmonics, sidebands,
echoes, reflections) and to the separation of source and transmission path effects. The
power cepstrum is a sensitive measure of the growth of a harmonic/sideband family (it can be
used to separate different families) and it is insensitive to measurement point, phase
combination, amplitude and frequency modulation, and loading. An illustration of the use
of the cepstrum for both detection and diagnosis of a gearbox fault is given in fig. 1.25
of the next section.
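A minimal power cepstrum computation, in the spirit of the definition above, can be sketched as follows; the gear-like test signal (a 400 Hz line with sidebands spaced 10.4 Hz) and all parameters are assumptions:

    import numpy as np

    fs = 8192.0
    t = np.arange(0, 2.0, 1.0 / fs)
    # A family of sidebands spaced 10.4 Hz around a 400 Hz toothmeshing line
    x = sum(np.cos(2 * np.pi * (400 + k * 10.4) * t) for k in range(-3, 4))
    x *= np.hanning(t.size)                                # window to reduce leakage

    log_spectrum = np.log(np.abs(np.fft.rfft(x)) ** 2 + 1e-12)   # logarithmic conversion
    cepstrum = np.abs(np.fft.irfft(log_spectrum)) ** 2           # power cepstrum
    quefrency = np.arange(cepstrum.size) / fs                    # seconds

    lo, hi = int(0.05 * fs), int(0.15 * fs)                # search the 50-150 ms range
    peak = quefrency[lo + np.argmax(cepstrum[lo:hi])]
    print(round(peak * 1e3, 1), "ms")                      # expected near 96 ms = 1/10.4 Hz

A single cepstral peak at the sideband spacing is what makes monitoring one component sufficient, as in the gearbox example of fig. 1.25.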
Components also occur at the toothmeshing harmonics due to mean deviations from the
ideal profile. These may be a result of initial machining errors, but will eventually be
dominated by the effects of uniform wear. Wear tends to be greater on either side of the
pitch circle, as illustrated, because of the greater sliding velocity there (with pure rolling
at the pitch circle), see fig. 1.23.b. The effects of such geometrical errors are much less
load sensitive.
Fig. 1.24 illustrates typical increases in toothmeshing harmonics due to uniform wear.
The effect of wear is often first seen in the second harmonic, but usually spreads to the
higher harmonics as the profile deteriorates. It is advisable to monitor at least 3 harmon-
ics, as the signal at the first harmonic must first exceed the effects of tooth deflection to
be noticeable. Measurements must be made at constant load for comparisons to be
meaningful.
Fig. 1.25 is an illustration of the use of the cepstrum for both detection and diagnosis of
a gearbox fault. A sideband family can be clearly seen in the spectrum, but in the cepstrum it can
be detected by monitoring only one component, at 95.9 ms (detection). The measured period
(95.9 ms) and corresponding frequency (10.4 Hz) are measured accurately enough to elimi-
nate the second harmonic of the output shaft speed (5.4 Hz) as a possible source. The source was
traced to the rotational speed of the second gear, even though this was unloaded because first
gear was engaged (diagnosis).
[Figure content: log velocity versus frequency; ① toothmeshing frequency, ②③ higher harmonics.]
Figure 1.24 Gear toothmeshing harmonics.
Figure 1.25 The use of the cepstrum for fault detection and diagnosis of a gearbox.
On-line bearing fault monitoring implies automatic data processing without human inter-
vention. Vibrations, picked up by sensors, are transmitted to a monitoring system where
they are processed for information extraction. The on-line system comprises a data ac-
quisition stage, where analog bearing signals are converted into digital form and a data
processing stage, where modular software algorithms are employed to perform the de-
signed algorithm under the guidance of a supervisor program.
The data acquisition stage comprises (1) an accelerometer, (2) a charge amplifier, (3) a
band pass filter, and (4) an analog-to-digital converter. The data processing stage consists
of three functional units: supervisor, defect detection/diagnosis unit, and data base.
The block diagram which illustrates the organization of the complete system is shown in
fig. 1.27.a. The supervisor is responsible for: (1) the proper logic sequence of system op-
eration, (2) the data flow control between the defect detection/diagnosis unit and the
global data base, and (3) global data base management.
A global data base is constructed to hold all the information to be relayed among the data
acquisition stage, the functional units of the data processing stage, and the system's human-
machine interface. It comprises 3 data files:
(1) a general purpose data file, (2) a raw bearing signal data file, and (3) a pattern vector data
file.
One important reason for having a global data base is that the external data files will pre-
serve important data just prior to any unforeseen shutdown of the monitored system or of the
bearing monitoring system itself.
Figure 1.26 Faults in rolling element bearings. Figure 1.27 Faults in ball and roller bearings (n = number of balls or rollers; fr = relative rev/s between inner and outer races).
The interface is responsible for the human-monitoring system communication. The nec-
essary input information consists of (a) the bearing geometry, (b) the bearing rotational speed,
and (c) the sampling rate. The output quantities through the interface consist of alarms and
diagnoses.
Among others, the short-time energy function, the short-time average zero-crossing rate, and
median smoothing are employed by the proposed scheme. The definition of the short-
time energy function is

En = Σ x²(m)·w(n − m),  summed over all m,

where x(n) is the sampled signal and w(n) = 1 if 0 ≤ n ≤ N−1, and zero otherwise (N is the
width of the window).
In the context of discrete-time signals, a zero-crossing is said to occur if successive
samples have different algebraic signs. The rate at which zero-crossings occur is a simple
measure of the frequency content of a signal. Its definition is

Zn = Σ |sgn[x(m)] − sgn[x(m−1)]|·w(n − m),  summed over all m,

where sgn[x(n)] = 1 if x(n) ≥ 0 and −1 if x(n) < 0, and w(n) = 1/(2N) if 0 ≤ n ≤ N−1, and 0
otherwise. However, the second equation makes the computation of Zn appear more
complex than it really is. All that is required is to check samples in pairs to determine
where the zero-crossings occur, with the average being computed over N consecutive
samples.
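The two quantities can be computed directly from their definitions; in this Python sketch the rectangular window, the test signal and the burst location are illustrative assumptions:

    import numpy as np

    def short_time_energy(x, N):
        w = np.ones(N)                                   # w(n) = 1 for 0 <= n <= N-1
        return np.convolve(x ** 2, w, mode="same")

    def short_time_zcr(x, N):
        signs = np.where(x >= 0, 1, -1)                  # sgn[x(n)]
        crossings = np.abs(np.diff(signs, prepend=signs[0]))
        w = np.ones(N) / (2 * N)                         # w(n) = 1/(2N) for 0 <= n <= N-1
        return np.convolve(crossings, w, mode="same")

    # Example: a low-frequency signal with a short, impulsive high-frequency burst
    fs, N = 10000, 200
    t = np.arange(0, 1.0, 1.0 / fs)
    x = np.sin(2 * np.pi * 30 * t)
    x[5000:5100] += 0.8 * np.sin(2 * np.pi * 2000 * t[5000:5100])

    print(short_time_energy(x, N)[5050], short_time_zcr(x, N)[5050])   # inside the burst
    print(short_time_energy(x, N)[1000], short_time_zcr(x, N)[1000])   # background

Both measures rise sharply over the burst, which is what makes the defect-related impulses prominent.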
[Figure content: accelerometer → charge amplifier → bandpass filter → A/D converter, feeding the supervisor and the data processing units.]
Figure 1.27.a Block diagram representation of the on-line bearing monitoring system.
A bearing signal dominated by bearing defect sensitive resonances reveals the occurrence
of damage-related impulses in the fluctuations of its amplitude. These amplitude fluc-
tuations are made more prominent by computing their short-time energy functions.
On the other hand, the fluctuations of those defect-excited resonances that have a much
higher frequency than the vibration generated by other machine elements will introduce
variations in the frequency content of bearing signals. Variations of this type may be easily
revealed by the computation of the average zero-crossing rate of bearing signals.
After all defect related vibration bursts have been made more prominent in both the
short-time energy function and short-time average zero-crossing rate, their rate of occur-
rence is estimated by computing the autocorrelation functions.
It is possible to incorporate a pattern recognition based monitoring scheme which em-
ploys short-time signal processing techniques to extract useful features from bearing vi-
bration signals. These features can be used by a pattern classifier to detect and diagnose
bearing defects (Li and Wu, 1989).
Short-time signal processing techniques use windowed segments of the bearing signals to
facilitate the estimation of the rate and the strength of impulsive vibrations, which may be
the result of a localized defect. If the estimated impulse generating rate is close to any
one of the characteristic defect frequencies and the strength of the impulse train is sig-
nificant, the designed pattern classifier will classify the bearing into the damaged cate-
gory. Due to the uniqueness of each characteristic defect frequency, a diagnosis
regarding the location of the defect is also provided through the proposed scheme.
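The characteristic defect frequencies themselves are not reproduced in the text (the original fig. 1.27 tabulates them); the following sketch uses the standard textbook formulas with purely illustrative bearing parameters:

    import numpy as np

    def defect_frequencies(n, d, D, phi, fr):
        """n: balls/rollers, d: rolling-element diameter, D: pitch diameter,
        phi: contact angle (rad), fr: relative rev/s between inner and outer races."""
        c = (d / D) * np.cos(phi)
        return {
            "outer race (BPFO)": 0.5 * n * fr * (1 - c),
            "inner race (BPFI)": 0.5 * n * fr * (1 + c),
            "rolling element (BSF)": 0.5 * (D / d) * fr * (1 - c ** 2),
            "cage (FTF)": 0.5 * fr * (1 - c),
        }

    print(defect_frequencies(n=9, d=0.008, D=0.04, phi=0.0, fr=25.0))

An estimated impulse rate close to one of these values, combined with a significant impulse strength, would place the bearing in the damaged category.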
System operation logic. The actual sampling rate of the system may be varied; it is
set by a programmable clock which is an integral part of the A/D converter. The
supervisor first reads in the necessary inputs from the human operator. Then the data
acquisition stage activates the A/D converter according to this information. Once the
converter is initiated, it supplies discrete data elements to the system at the selected rate
until a specified number of samples has been generated. The data are formed into
records and written onto the peripheral disk memory (data file no. 2).
The data will be processed by the defect detection/diagnosis unit. The loop of data
measuring and processing will be carried out continuously until any kind of localized de-
fect is detected and diagnosed. In the case of such an event, the pattern vector, estimated
impulse generating rate, and classifier output will be displayed on the CRT to alert the
operator.
Details of the on-line implementation of the described technique can be found in Li and
Wu (1989).
C. Reciprocating machine and gas turbine fault detection
Vibration and noise signals from reciprocating machinery come, typically, from events
which occur at different phases of the machine cycle, see fig. 1.28. This amounts to a
signal which, though repeated every cycle, varies during one cycle. Continuous averaging
lumps all these signals together, and track is lost of this variation, which from everyday
experience is important in judging the condition of the machine. An FFT analyzer makes
it possible to pick out a short sample length of the signal which is associated with one
particular event and analyze it separately. This is done by triggering the analyzer from a
tacho pulse every cycle and using the variable time delay of the analyzer to choose the
phase of the signal to be analyzed (Randall).
The signal and the tacho pulse are recorded on an instrumentation tape recorder. They
are then played back into the FFT analyzer, which is set to trigger on the tacho pulse af-
ter a specified delay. A series of spectra for various trigger settings, representative of
various phases of the machine cycle, are recorded on the digital cassette recorder.
Comparison and data display are handled by a programmable calculator and, optionally, a
digital plotter (see Randall).
The basic steps used in the analysis for collecting spectra are presented in fig. 1.29,
where:
(a) Represents a typical impulsive signal from a four cylinder diesel engine.
(b) Represents the once per cycle tacho signal.
(c) Represents the positions of the Hanning time window after 2 specified trigger
delays, Trig 1 and Trig 2, set up in the FFT Analyzer.
(d) Represents a series of spectra which are used to obtain a representative average
spectrum for the phase of the engine cycle corresponding to Trig 1.
(e) Represents a series of spectra obtained by repeating the process for a new trigger
delay setting, Trig 2.
These average spectra can then be stored on the digital cassette recorder. The signals in-
volved vary somewhat from cycle to cycle, and a considerable amount of averaging may
be necessary to obtain a reliable result.
A logic diagnosis for the whole engine can be incorporated in the computer memory, such
as a binary interrogation of the type shown in fig. 1.30, which produces additional prog-
nostic conclusions (see also chapter 4).
By comparing actual values with base values determined for the system theoretically and
values measured when the plant was commissioned, the relevant trends can be reported.
Additionally, values are theoretically determined for the plant when operating with
defects and programmed into the computer; thus when measured conditions fall within
those for which the defect applies, the "status" is reported and the defect notified.
Automated fault identification for gas turbines based on spectral features of measure-
ments of various dynamic quantities, such as internal pressure, casing acceleration and
acoustic data, is presented and applied by Loukis et al. (1992). The examined faults were
rotor fouling (fault of all the blades), individual rotor blade fouling (fault of 2 blades of the
stage 1 rotor), an individual rotor blade twisted, and stator blade restaggering. The difference
pattern (used as a fault index) derived from the measured signal of an instrument is defined
by the expression:
Figure 1.29 Basic steps used in the analysis for collecting spectra.
[Figure 1.30: binary interrogation logic, with branches such as "check indicated components" and "check oil system".]
Inspection of the fault indices calculated for the different measuring instruments has
shown that the presence of one of the examined faults results in the appearance of differ-
entiations mainly at multiples of the shaft rotational frequency. It was therefore decided
to filter out values of the indices at frequencies other than the shaft harmonics, since the
most useful diagnostic information is contained at these harmonics.
This is done by a filter defined by the following equation:

h(f) = 1 if f is a rotational harmonic
h(f) = 0 if f is not a rotational harmonic
The pattern resulting from filtering the difference pattern with the above filter will be referred to as the reduced pattern P_r(f).
Two discriminant functions were selected, one expressing the quantitative similarity (influenced by both the shape and the amplitude of the compared patterns), the other expressing the shape similarity (influenced only by the shape of the compared patterns).
The first discriminant function is the usual Euclidean distance between reduced patterns, when they are viewed as points in an N-dimensional space.
The second discriminant function is the normalized cross-correlation coefficient. For the two discriminants to be used for fault classification, reference patterns for each fault must be available.
If such reference patterns are available, then the two reduced-pattern discriminants can be produced for any measured signal from the monitored gas turbine and, depending on their values, the fault corresponding to that signal (if any) can be decided.
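As an illustration of this decision step, a minimal sketch in Python is given below. It assumes the difference pattern and its frequency axis are available as arrays; the names (reduced_pattern, shaft_freq, reference_reduced) and the harmonic tolerance tol are illustrative choices made here, not part of Loukis et al. (1992).

    import numpy as np

    def reduced_pattern(diff_pattern, freqs, shaft_freq, tol=0.5):
        # Filter h(f): keep the difference pattern only near multiples of the
        # shaft rotational frequency (the reduced pattern P_r(f) of the text).
        ratio = freqs / shaft_freq
        harmonic = np.abs(ratio - np.round(ratio)) * shaft_freq <= tol
        return np.where(harmonic, diff_pattern, 0.0)

    def euclidean_distance(p, q):
        # First discriminant: quantitative similarity (shape and amplitude).
        return float(np.linalg.norm(p - q))

    def cross_correlation(p, q):
        # Second discriminant: normalized cross-correlation (shape similarity only).
        p0, q0 = p - p.mean(), q - q.mean()
        return float(np.dot(p0, q0) / (np.linalg.norm(p0) * np.linalg.norm(q0) + 1e-12))

    def classify(measured_reduced, reference_reduced):
        # Assign the fault whose reference reduced pattern is closest to the
        # measured one; reference_reduced is a dict {fault_name: pattern}.
        distances = {name: euclidean_distance(measured_reduced, ref)
                     for name, ref in reference_reduced.items()}
        return min(distances, key=distances.get), distances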
The flow chart of an automated gas turbine fault diagnosis scheme based on spectral pattern analysis is shown in fig. 1.31.
Figure 1.31. Flow chart of the automated spectral pattern fault diagnosis method for gas
turbines.
The technique briefly presented above has been developed on the basis of the data available from experiments on an industrial gas turbine with specific implanted faults (Loukis et al., 1992). From this point of view, part of the findings can be considered of general validity, while others will be particular to the specific engine.
D. Induction machine broken bars detection.
The desire to improve the reliability of industrial drive systems has led to concerted research and development activities in several countries to evaluate the causes and consequences of various fault conditions. In particular, ongoing research work is being focused on rotor bar faults and on the development of diagnostic techniques.
Broken bars were shown to produce high localized airgap fields and to degrade mechanical performance. The field perturbation associated with broken bars, which are deliberately disconnected from the endrings by machining, produces low-frequency components and harmonics in the search coil-induced voltages and gives rise to an oscillatory torque that produces noise and mechanical vibration.
Different techniques for the detection of broken bars were tested and evaluated by
Elkasabgy et al., (1992).
A. Search Coil Induced Voltage Detection Techniques. This technique involves an inspection of the time and frequency domain of voltages induced in internal (stator tooth tip and yoke) search coils, or an inspection of the time and frequency domain of voltages induced in an external search coil placed against the frame of the motor.
Consider first the voltage induced in an internal stator tooth tip coil for a machine with no rotor faults. The dominant and fundamental frequency will be at the motor excitation frequency of 60 Hz. Higher frequency components will appear due to the periodicity of rotor bars. The nonsinusoidal stator emf distribution, i.e., space harmonics, may also induce time-harmonic voltages.
Consider now the voltage in the same search coil for a machine with one or more adjacent broken bars. An anomalously high local airgap field rotates at rotor speed. This field pulsates at slip frequency and can be considered to be the resultant of two fields, counter-rotating at s × (synchronous speed), which are rapidly attenuated away from the fault location. The field associated with the broken bars will, therefore, modulate the coil-induced voltage at a characteristic frequency f_fault, given by
f_fault = (2f/p)(1 − s) ± sf Hz
where f is the excitation frequency, s is the slip, and p is the number of poles of the induction motor.
Similar frequency components are anticipated in the yoke and external search coil voltages. These voltages can be captured by a fast data acquisition system and printed out to illustrate their time dependence. Their frequency spectra can also be analyzed.
Comparing the tooth-tip search and yoke search coil induced voltage frequency
spectrum for a fault-free rotor with that for the broken-bar rotor, broken bars can be
adequately detected.
Perhaps surprisingly, the external search coil is just as effective as the internal coils in detecting the broken bars. It therefore appears unnecessary to incorporate internal coils to take advantage of this diagnostic technique; an external coil placed against the casing of the machine is entirely adequate.
The use of an external coil placed against the frame of the machine is considered particu-
larly useful in an industrial environment because the motor need not be modified in any
way (by the installation of stator search coils) or be taken out of service temporarily. All
that is needed is a coil of 10-20 turns with a length equivalent to the active axial length of the machine and of width equal to perhaps half a pole, a power-frequency oscilloscope or
low-frequency spectrum analyzer, and the experience of an observant operator.
B. Stator-current detection technique. Each individual rotor bar can be considered
to form a short-pitched single-turn single-phase winding. The airgap field produced by a
slip-frequency current flowing in a rotor bar will have a fundamental component rotating
at slip speed in the forward direction with respect to the rotor, and one of equal
amplitude that rotates at the same speed in the backward direction. With a symmetrical
rotor, the backward components will sum to zero. For a broken-bar rotor, however, the
resultant is nonzero. The field, which rotates at slip frequency backward with respect to
the rotor, will induce EMFs in the stator side that modulate the mains-frequency
component at twice slip frequency.
Under sinusoidal voltage excitation, this effect produces twice slip frequency (2.67 Hz at 1760 r/min) sidebands in the spectrum of the phase current, which indicate the existence of the fault.
Thus, the examination of the machine current spectrum provides an important method
for detecting rotor-bar faults.
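A minimal sketch of this check is given below, assuming one phase-current record sampled at a known rate and a known (or estimated) slip; the function names and the peak-search bandwidth bw are hypothetical conveniences introduced here, not part of Elkasabgy et al. (1992).

    import numpy as np

    def sideband_frequencies(f_supply, slip):
        # Broken-bar sidebands appear at twice slip frequency around the supply
        # component, i.e. at (1 - 2s)f and (1 + 2s)f.
        return (1.0 - 2.0 * slip) * f_supply, (1.0 + 2.0 * slip) * f_supply

    def sideband_amplitudes(current, fs, f_supply, slip, bw=0.5):
        # Crude amplitude estimate of the two sidebands from one phase current
        # record sampled at fs Hz; bw is the search bandwidth around each sideband.
        window = np.hanning(len(current))
        spectrum = np.abs(np.fft.rfft(current * window)) / len(current)
        freqs = np.fft.rfftfreq(len(current), d=1.0 / fs)
        lower, upper = sideband_frequencies(f_supply, slip)
        def peak(f0):
            band = spectrum[np.abs(freqs - f0) <= bw]
            return float(band.max()) if band.size else 0.0
        return peak(lower), peak(upper)

    # Example: a 60 Hz, 4-pole machine at 1760 r/min has slip s = 40/1800 ~ 0.022,
    # so the sidebands sit roughly 2sf ~ 2.67 Hz on either side of 60 Hz.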
C. Torque-harmonics detection technique. In a balanced three-phase induction machine
with no rotor faults, the forward-rotating field interacts with the slip frequency induced
rotor currents to produce a steady output torque. For a machine with a rotor fault, a
backward rotating field is developed as discussed. This backward rotating field interacts
with the rotor currents, induced by the forward rotating field, to produce a torque
variation at twice-slip frequency, which is superimposed on the steady output torque.
Rotor faults therefore lead to low-frequency torque harmonics, which result in increased
noise and vibration. The torque oscillations can be measured by means of a shaft torque
transducer using a data acquisition system while the motor is running on-line at various
load conditions. The frequency of the torque oscillation increases as the machine is
loaded. The dominant frequency of oscillation corresponds to twice the slip frequency of
the operating condition.
1.3.3 Conclusions
References
Chin H. and K. Danai (1991). A method of fault signature extraction for improved diagnosis. ASME Transactions of Dynamic Systems, Measurement and Control, 113, p. 635.
Chitturi V. (1976). Distribution of multivariate white noise autocorrelations. Journal of the American Statistical Association, 71, 353, p. 223.
Commissariat à l'Energie Atomique (1978). Statistique appliquée à l'exploitation des mesures. Editions Masson, Paris.
Cue R.W. and D.E. Muir (1990). Engine performance monitoring and troubleshooting techniques for the CF-18 aircraft. Proceedings, Gas Turbine and Aeroengine Congress, Brussels, June 11-14, 1990.
Dixon W.J. Power functions of the sign test and power efficiency against normal alternatives. Annals of Mathematical Statistics, 24, p. 467.
Elkasabgy N.M., Eastham A.R. and G.E. Dawson (1992). Detection of broken bars in the cage rotor of an induction machine. IEEE Transactions on Industry Applications, 28, 1, p. 165.
Hawkins D.M. (1992). A fast accurate approximation for average run lengths of CUSUM control charts. Journal of Quality Technology, 24, 1, p. 37.
Himmelblau D.M. (1978). Fault detection and diagnosis in chemical and petrochemical processes. Elsevier, Amsterdam.
Hoerl R.W. and A.C. Palm (1992). Discussion: Integrating SPC and APC. Technometrics, 34, 3, p. 268.
Hunter J.S. (1986). The Exponentially Weighted Moving Average. Journal of Quality Technology, 18, 4, p. 203.
Ikeuchi T., Shirai M., Nakamachi K., Tanabe S., Ishino K. and T. Fujishima (1988). Computer-assisted noise and vibration analysis system - CANVAS. Noise and Vibration Control Worldwide, February, p. 58.
Johns W.D. and R.H. Porter (1988). Ranking of compressor station noise sources using sound intensity techniques. Noise and Vibration Control Worldwide, February, p. 70.
Kendall M., Stuart A. and J.K. Ord (1982). The advanced theory of statistics. Vols. 2, 3. Charles Griffin Ltd., London.
Laws W.C. and A. Muszynska (1987). Periodic and continuous vibration monitoring for preventive/predictive maintenance of rotating machinery. ASME Journal of Engineering for Gas Turbines and Power, 109, April, p. 159.
Li C.J. and S.M. Wu (1989). On-line detection of localized defects in bearings by pattern recognition analysis. ASME Journal of Engineering in Industry, 111, November, p. 331.
Liggett W., Jr. (1977). A test for serial correlation in multivariate data. The Annals of Statistics, 5, 2, p. 408.
Loukis E., Mathioudakis K. and K. Papailiou (1992). A procedure for automated gas turbine blade fault identification based on spectral pattern analysis. ASME Journal of Engineering for Gas Turbines and Power, 114, April, p. 201.
Ljung L. and T. Söderström (1987). Theory and practice of recursive identification. The MIT Press, London.
Lucas J.M. and M.S. Saccucci (1990). Exponentially Weighted Moving Average Control Schemes: Properties and Enhancements. Technometrics, 32, 1, p. 1.
Lyon R. (1987). Machinery noise and diagnostics. Butterworths, London.
MacNeill I.B. (1974). Tests for periodic components in multiple time series. Biometrika, 61, 1, p. 57.
Mehra R.K. and J. Peschon (1971). An innovations approach to fault detection and diagnosis in dynamic systems. Automatica, 7, p. 637.
Mitchell J.S. (1981). Machinery analysis and monitoring. PennWell, London.
Pnegelly B.W. and G.E. Ast (1988). A computer-based multipoint vibration system for process plant rotating equipment. IEEE Transactions on Industry Applications, 24, 6, p. 1062.
Pignatiello J. and C. Runger (1990). Comparisons of multivariate CUSUM charts. Journal of Quality Technology, 22, 3, p. 173.
Pouliezos A. (1980). An iterative method for calculating sample serial correlation coefficients. IEEE Transactions on Automatic Control, AC-25, 4, p. 834.
Proceedings, 15th Symposium "Aircraft Integrated Monitoring Systems", Aachen, Germany, September 12-14, 1989.
Randall R.B. Efficient Machine Monitoring. Bruel and Kjaer publication, code no. 18-212.
Randles R. (1989). A distribution-free multivariate sign test based on interdirections. Journal of the American Statistical Association, 84, 408.
Robert P., Cleroux R. and N. Ranger (1985). Some results on vector correlation. Computational Statistics and Data Analysis, 3, p. 25.
Spee R. and A.K. Wallace (1990). Remedial strategies for brushless DC drive failures. IEEE Transactions on Industry Applications, 26, 2, p. 259.
Stephens C.M. (1991). Fault detection and management system for fault-tolerant switched reluctance motor drives. IEEE Transactions on Industry Applications, 27, 6, p. 1098.
Appendix 1.A
Figure 1.A.1 Operating characteristic curves for the sample mean test, P_f = 0.01.
Figure 1.A.2 Operating characteristic curves for the sample mean test, P_f = 0.05.
Figure 1.A.3 Power curves for the two-tailed t-test at the 5% level of significance.
Table 1.A.1
Values of k such that Pr(y ≤ k − 1) < α/2, where y has the binomial distribution with p = 0.5.
n      α = 0.1   0.05   0.02   0.01
5      1    -    -    -
6      1    1    -    -
7      1    1    1    -
8      2    1    1    1
9      2    2    1    1
10     2    2    1    1
11     3    2    2    1
12     3    3    2    2
13     4    3    2    2
14     4    3    3    2
15     4    4    3    3
16     5    4    4    3
17     5    5    4    3
18     6    5    4    4
19     6    5    5    4
20     6    6    5    4
21     7    6    5    5
22     7    6    6    5
23     8    7    6    5
24     8    7    6    6
25     8    8    7    6
30     11   10   9    8
35     13   12   11   10
40     15   14   13   12
45     17   16   15   14
50     19   18   17   16
Values of 1 − P_d for the sign test
n       r       1 − P_d
60      20      .02734
70      26      .04139
80      30      .03299
90      35      .04460
100     39      .0352
Table 1.A.4
n      α = 0.1   0.05   0.02   0.01
5 0.9 - - -
6 0.829 0.886 0.943 -
7 0.714 0.786 0.893 -
8 0.643 0.738 0.833 0.881
9 0.6 0.683 0.783 0.833
10 0.564 0.648 0.745 0.794
11 0.523 0.623 0.736 0.818
12 0.497 0.591 0.703 0.78
13 0.475 0.566 0.673 0.745
14 0.457 0.545 0.646 0.716
15 0.441 0.525 0.623 0.689
16 0.425 0.507 0.601 0.666
17 0.412 0.49 0.582 0.645
18 0.399 0.476 0.564 0.625
19 0.388 0.462 0.549 0.608
20 0.377 0.45 0.534 0.591
21 0.368 0.438 0.521 0.576
22 0.359 0.428 0.508 0.562
23 0.351 0.418 0.496 0.549
24 0.343 0.409 0.485 0.537
25 0.336 0.4 0.475 0.526
26 0.329 0.392 0.465 0.515
27 0.323 0.385 0.456 0.505
28 0.317 0.377 0.448 0.496
29 0.311 0.37 0.44 0.487
30 0.305 0.364 0.432 0.478
Appendix 1.B
The relationship of [X_m] to [x_k] is discussed later; here it is simply noted that [X_m] is the DFT of [x_k] and that the index m designates the frequency of each component X_m. Also, the DFT is complex, so each X_m can be represented in polar form as
X_m = |X_m| e^(jθ_m)
In this notation, |X_m| is the amplitude of X_m, and a plot of |X_m| versus the frequency index m is called the amplitude spectrum of [x_k]. Similarly, a plot of θ_m versus m is called the phase spectrum of [x_k].
The relationship implemented by the forward transform between [x_k] and [X_m] can be expressed as,
X_m = Σ_{k=0}^{N−1} x_k e^(−j2πmk/N),  m = 0, 1, ..., N/2
Again it is assumed here for convenience that N, the number of data samples, is even. In this formula for X_m, the exponential function exp(−j2πmk/N) is a complex sinusoid and is periodic. If one thinks of exp(−j2πmk/N) as a function of k, the time index, then its period is seen to be N/m; that is, when k goes through a range of N/m, exp(−j2πmk/N) goes through one cycle. One can see this even more clearly by separating the real and imaginary parts of the above relation:
X_m = Σ_{k=0}^{N−1} x_k cos(2πmk/N) − j Σ_{k=0}^{N−1} x_k sin(2πmk/N),  m = 0, 1, ..., N/2
Thus, each part (real or imaginary) of each DFT component X_m is a correlation (summed product) of the data sequence [x_k] with a cosine or sine sequence having a period of N/m data samples.
The periodicity of the DFT just discussed is a very important property. One can see that adding N to the index of X_m does not change the value of X_m, that is,
X_{m+N} = Σ_{k=0}^{N−1} x_k e^(−j2π(m+N)k/N) = Σ_{k=0}^{N−1} x_k e^(−j2πmk/N) e^(−j2πk) = X_m
Furthermore, this rule holds whether [x_k] is real or complex; X_0 through X_{N−1} is a complete set of DFT components.
Secondly, when [x_k] is real, one can also show that X_m and X_{N−m} are complex conjugates. One can take:
X_{N−m} = Σ_{k=0}^{N−1} x_k e^(−j2π(N−m)k/N) = Σ_{k=0}^{N−1} x_k e^(+j2πmk/N) e^(−j2πk) = X_m*
The star (*) denotes the complex conjugate, resulting from the change of sign in the exponential function. Note that if [x_k] is complex, the above relation does not apply.
Thus, when [x_k] is a sequence of real data, all values of X_m outside of the set X_0, X_1, ..., X_{N/2} are redundant. When [x_k] is a sequence of complex data, all values outside of the set X_0, X_1, ..., X_{N−1} are redundant.
The inverse transform.
The reverse (or inverse) DFT is used to obtain a data sequence [x_k] from its complex spectrum [X_m]. The formula for the inverse DFT is
x_k = (1/N) Σ_{m=0}^{N−1} X_m e^(j2πmk/N),  k = 0, 1, ..., N−1
Thus the inverse DFT is the same as the forward DFT except for the sign of the exponential and the scaling factor 1/N.
It can be seen that N values of X_m, not just (N/2)+1 values, are required for the inverse transformation. Thus, given X_0 through X_{N/2}, one could generate X_{(N/2)+1} through X_{N−1},
using the formula for X_{N−m} before applying the inverse formula. Alternately, the inverse formula can be modified as follows, assuming [x_k] is real and using the above formula for X_{N−m}:
x_k = (1/N) [ X_0 + Σ_{m=1}^{N/2−1} X_m e^(j2πmk/N) + X_{N/2} e^(jπk) + Σ_{m=N/2+1}^{N−1} X*_{N−m} e^(j2πmk/N) ]
The fast Fourier transform (FFT) is not a new kind of transform different from the DFT. Instead, it is simply an algorithm for computing the DFT, and its output is precisely the same set of complex values X_m. The FFT algorithm eliminates most of the repeated complex products in the DFT, however, so its execution time is much shorter. Specifically, the ratio of computing times is approximately
FFT computing time / DFT computing time ≈ (1/(2N)) log_2 N
Using the FFT, one can also do the computation "in place", so that [X_m] replaces [x_k], with only a limited amount of auxiliary storage needed for work space.
On the other hand, the FFT algorithm is more complicated than the DFT and becomes lengthy when N, the number of data samples, is not a power of two. Thus, in many applications it is simpler and preferable to use a simple DFT algorithm instead of an FFT.
One final but important point about the inverse transform is that most computer routines, in order to preserve symmetry, omit the factor 1/N. Thus, the sequence [x_k] is scaled to N times its correct amplitude by most inverse DFT and FFT algorithms.
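The properties above can be verified numerically with a short script (Python with numpy assumed). Note that numpy's inverse FFT does apply the 1/N factor, so the scaling convention of whatever routine is actually used should always be checked.

    import numpy as np

    def dft(x):
        # Direct evaluation of X_m = sum_k x_k exp(-j 2 pi m k / N), m = 0..N-1.
        x = np.asarray(x, dtype=complex)
        N = len(x)
        k = np.arange(N)
        return np.array([np.sum(x * np.exp(-2j * np.pi * m * k / N)) for m in range(N)])

    x = np.random.randn(16)                       # a real data sequence [x_k]
    X = dft(x)

    print(np.allclose(X, np.fft.fft(x)))          # the FFT returns the same X_m
    print(np.allclose(X[1:], np.conj(X[:0:-1])))  # X_{N-m} = conj(X_m) for real data
    print(np.allclose(x, np.fft.ifft(X).real))    # this ifft applies the 1/N factor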
Power spectrum and cepstrum.
The power spectrum of an ergodic signal is derived from the Fourier transform as,
Clearly, the logs of the magnitudes of the input and transfer function add to produce the log of the output magnitude. The phase of the output is a linear sum of the phases of the input and of the transfer function. One now has a situation in which the transfer function and the input add their properties to produce an output. This frequency domain process may not result in an effective way to separate the source and transfer function; but if one takes the inverse transform into the time domain, the cepstrum is generated, which may provide a way in some cases to separate the individual effects of source and propagation path on the output cepstrum, at least to a degree that is useful for diagnostic purposes.
Since the log magnitude is an even function of frequency and the phase is an odd function of frequency, their inverse transforms are real functions, so the complex cepstrum C_y(t) is a real function of time. The Fourier transform of the log magnitude is called the power or real cepstrum, and in situations where the phase is unknown or ignored it may be a useful way to separate source and path effects. The inverse transform of the phase is the phase cepstrum, and the sum of the magnitude and phase cepstra is the complete complex cepstrum of the signal.
It is noted that if one is able to determine the input cepstrum C_x(t), then by a Fourier transformation one could construct the log Fourier transform of x and by exponentiation recreate the Fourier transform itself. Having obtained the Fourier transform of x, both in magnitude and phase, one could use the inverse transform again and get back to x(t). Thus, there is a unique and recoverable relationship between the complex cepstrum and the variable from which it is derived. However, it is not possible to recover the initial waveform from the power or real cepstrum, because the inverse transform of the real cepstrum only allows computation of the magnitude of the Fourier transform or the power spectrum. Inverse time transformation of the power spectrum reproduces the correlation function, not the initial waveform, because the phase of the signal has been lost.
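A minimal sketch of both cepstra follows, assuming Python with numpy; the small offset added before taking the logarithm is a numerical guard introduced here, not part of the definitions above.

    import numpy as np

    def real_cepstrum(x):
        # Power (real) cepstrum: inverse transform of the log magnitude spectrum.
        # The phase is discarded, so x cannot be recovered from this cepstrum.
        log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-12)   # offset guards against log(0)
        return np.fft.ifft(log_mag).real

    def complex_cepstrum(x):
        # Complex cepstrum: inverse transform of log|X| + j*(unwrapped phase).
        # Magnitude and phase are both retained, so x can in principle be recovered.
        X = np.fft.fft(x)
        log_X = np.log(np.abs(X) + 1e-12) + 1j * np.unwrap(np.angle(X))
        return np.fft.ifft(log_X)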
Summary of the time domain / frequency domain relationships (FT denotes the Fourier transform):
Variable:  y(t) = x(t) * h(t)  <->  Y(ω) = X(ω)H(ω),  |Y| = |X|·|H|,  φ_y = φ_x + φ_h
Correlation / power spectrum:  R_y(τ) = R_x(τ) * R_h(τ)  <->  |Y|² = |X|²·|H|²
Complex cepstrum / log transform:  C_y(τ) = C_x(τ) + C_h(τ)  <->  log Y = log|Y| + jφ_y = log|X| + log|H| + j(φ_x + φ_h)
Power (real) cepstrum / log magnitude:  C_y(τ) = C_x(τ) + C_h(τ)  <->  log|Y| = log|X| + log|H|
Phase cepstrum / phase:  C_φy(τ) = C_φx(τ) + C_φh(τ)  <->  φ_y = φ_x + φ_h
CHAPTER 2
ANALYTICAL REDUNDANCY METHODS
2.1 Introduction
Quantitative model-based failure detection and isolation (FDI) methods rely on the comparison of a system's available measurements with a-priori information represented by the system's mathematical model. The term quantitative is used here as contrary to the term qualitative, denoting cruder system descriptions. There are two main trends of this approach, namely analytical redundancy or residual-generation methods and parameter estimation. This distinction is not universally adopted and some researchers consider these two approaches as belonging to the same category. However, for reasons of clearer presentation the distinction is adopted here, and parameter estimation methods will be presented in Chapter 3.
The term analytical redundancy arises from the use of analytical relationships describing the dynamical interconnection between various system components. In contrast, physical or hardware redundancy relies on replication of hardware components (sensors, actuators, computers), thus increasing the reliability of the overall system.
Chow and Willsky (1984) may be considered the inventors of this terminology, but it is certain that similar ideas were tried before them. Since then, this approach has been extended and evolved into what is currently termed robust FDI. Though this term will be elaborated later on, it generally means FDI schemes that are robust with respect to modeling errors and unknown (unmeasured) disturbances. Two main streams of research along this path have been followed on the two sides of the Atlantic. In the USA, parity space methods have been used in many applications, while in Europe, observer-based techniques were developed. However, it has recently been seen that both approaches are formally equivalent and just use different mathematical tools to achieve the same goal in robustness (Gertler, 1991).
The common denominator of all the approaches that will be presented in this chapter is that the decision on whether a specific fault has occurred or not is made according to the values of characteristic quantities called residuals. These are generated from the observed input-output history of the system, and the way in which they are generated distinguishes each different method.
While residuals are zero in ideal situations, in practice this is seldom the case. Their deviation from zero is the combined result of noise and faults. If the noise is negligible, residuals can be analyzed directly. With any significant noise present, statistical analysis (statistical testing) is necessary. In either case, a logical pattern is generated, showing which residuals can be considered normal and which ones indicate a fault. Such a pattern is called the signature of the failure.
As is common in other fields of science, there exist a variety of possible classifications of methods. With respect to the different sectors of a system where faults can occur, one may distinguish between instrument fault detection (IFD), actuator fault detection (AFD) and component fault detection (CFD). While early attempts were not concentrated on any one of these categories separately, the increased complexity of robustness requirements has forced researchers to attack each of these in isolation from the others. It is also true to say that most work has been directed towards IFD, since sensor information is crucial to the safe operation of any system. With respect to noise modeling, analytical redundancy methods may be classified into stochastic and deterministic. In the former, explicit modeling of the noise is present, while in the latter noise is taken into account without any distribution assumptions. Early attempts tend to fall into the first category, while current robust techniques employ the second method.
The main technological areas where the methods of this Chapter have been applied are: aerospace engineering, automotive engineering, machining applications, nuclear engineering, chemical/petrochemical engineering, power plant and power transmission applications.
The structure of this Chapter is as follows: the main concepts of analytical redundancy methods are presented first. This includes modeling and performance criteria. Next, methods of stochastic modeling are briefly considered, mainly for introductory purposes. This is followed by current techniques, aiming at robust FDI, based on deterministic models. Finally, specific examples from industrial applications are presented, which illustrate the various methods which have been considered.
2.2 Plant and failure models
Most model-based failure detection and isolation methods rely on linear discrete-time state-space models. Since most diagnostic computations are performed on sampled data, this represents a reasonable form. This implies that for non-linear plants, any non-linearity is linearized around some operating point. Note, however, that some methodologies can be extended to explicitly non-linear models, especially if they can be decomposed into static nonlinearities and linear dynamics (Gertler et al., 1991). Also, continuous-time plants are represented by their discretized model. It must be emphasized, however, that the type of model used serves the proposed solution method, this being the cause of the many different representations one sees in the fault detection literature. Thus, frequency or z-domain representations have lately been considered, which exploit the additional information carried by the spectrum of the process.
Plant parameters may be varying with time. "Normal" variations are usually small and
slow compared to the dynamics of the plant. Such variations will be neglected here for
the sake of simplicity. Abrupt and/or significant changes, on the other hand, may and
should be considered as multiplicative process faults. In addition, additive faults, e.g.
biases on the different parts of the system are taken into account.
The state-space model relates the state vector x(k) to the input vector u(k) and output
vector y(k) using known system matrices A, B, and C. The well-known state equations
describing the nominal (fault-free) system are:
x(k + 1) = Ax(k) + Bu(k) (2.1)
y(k) = Cx(k) (2.2)
The dimensions of the state, input and output vectors are n, r and m respectively. An equivalent input-output model may be presented in shift-operator form, with matrices G(z) and H(z) consisting of elements that are polynomials in the shift operator z and H being a diagonal matrix:
H(z)y(k) = G(z)u(k)    (2.3)
The matrices of the input-output model are related to those of the state model by,
Here the matrices L and M are obtained in accordance with (2.4), with B replaced by P and Q, respectively. Note that the presence of the new terms in (2.5) may influence H(z) and G(z), since L(z) and M(z) interfere with the simplification of the equations.
Introduce now f_u(k) and f_y(k) for the additive measurement faults (biases) on the input u(k) and output y(k), and w_u(k) and w_y(k) for the respective measurement noise. With these, the measured input ũ(k) and output ỹ(k) are,
ũ(k) = u(k) + f_u(k) + w_u(k)
ỹ(k) = y(k) + f_y(k) + w_y(k)    (2.7)
For controlled inputs, there is no sensory measurement; instead, u(k) is the control signal and ũ(k) its implementation by the actuators, with f_u(k) representing any actuator malfunction and w_u(k) the actuator noise. Alternatively, inputs may be classified into three groups: measured inputs u_m, controlled inputs u_c and disturbance inputs u_d.
Finally, introduce ΔA(k), ΔB(k), and ΔC(k) for the discrepancies between the model matrices A, B and C and the true system matrices Ā, B̄ and C̄:
Ā = A + ΔA(k)
B̄ = B + ΔB(k)    (2.8)
C̄ = C + ΔC(k)
Such discrepancies may account for multiplicative plant faults. To obtain a complete description of the system with all the possible faults and noises taken into account, the true variables u(k) and y(k) expressed from (2.7) and the true matrices from (2.8) are to be substituted into (2.5) and (2.2). The model becomes,
x(k + 1) = (A + ΔA(k))x(k) + (B + ΔB(k))(u(k) + f_u(k) + w_u(k)) + Pf_p(k) + Qw_p(k)
y(k) = (C + ΔC(k))x(k) + f_y(k) + w_y(k)    (2.9)
or,
x(k + 1) = Ax(k) + Bu(k) + ΔA(k)x(k) + ΔB(k)u(k) + Bf_u(k) + Bw_u(k) + ΔB(k)f_u(k) + ΔB(k)w_u(k) + Pf_p(k) + Qw_p(k)    (2.10)
Now if the various terms are lumped into similar groups one obtains the system,
x(k + 1) = Ax(k) + Bu(k) + F_1(k)f_1(k) + D_1 d_1(k)
y(k) = Cx(k) + F_2(k)f_2(k) + D_2 d_2(k)    (2.11)
where the meaning of the various terms is clear from comparison of (2.11) and (2.10).
The main point to note is that uncertainty (faults and disturbances) is generally split into
two categories:
• structured uncertainty, acting upon the system as additive faults and disturbances and represented by unknown time-functions multiplied by known distribution matrices, and
• unstructured uncertainty, which describes multiplicative (parametric) faults and modeling errors and is represented by unknown matrices multiplying known (observed) variables.
Further specialization of (2.11) is possible in specific situations, where it is clarified what is considered a fault to be detected and what is considered a disturbance to be ignored.
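As a simple illustration of the additive-fault part of this model, the following sketch (Python with numpy, with arbitrary illustrative matrices) simulates the nominal plant (2.1)-(2.2) and returns the measured output of (2.7) with a sensor bias appearing at a chosen time step.

    import numpy as np

    # Illustrative 2-state plant in the form of (2.1)-(2.2); the numerical values
    # are arbitrary and chosen only so that the example runs.
    A = np.array([[0.9, 0.1],
                  [0.0, 0.8]])
    B = np.array([[0.0],
                  [1.0]])
    C = np.array([[1.0, 0.0]])

    def simulate_with_sensor_fault(n_steps=120, fault_step=60, bias=0.5,
                                   noise_std=0.01, seed=0):
        # Propagate x(k+1) = A x(k) + B u(k) and return the measured output
        # y~(k) = y(k) + f_y(k) + w_y(k) of (2.7), with an additive sensor bias
        # f_y appearing at fault_step.
        rng = np.random.default_rng(seed)
        x = np.zeros((2, 1))
        y_meas = np.zeros(n_steps)
        for k in range(n_steps):
            u = np.array([[np.sin(0.1 * k)]])      # arbitrary test input
            f_y = bias if k >= fault_step else 0.0
            w_y = noise_std * rng.standard_normal()
            y_meas[k] = (C @ x).item() + f_y + w_y
            x = A @ x + B @ u
        return y_meas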
2.3 Design requirements
Model-based fault diagnosis can be defined as the detection, isolation and characterization of faults in system components from the comparison of available measurements. These three desired functions are stated in order of increasing difficulty. Detection is performed by all traditional methods, while isolation and characterization (size and possible time of occurrence of the fault) are achieved using more sophisticated algorithms. A fault detection method should usually possess the following characteristics:
• Low detection delay time (t_d): this is usually minimised for a fixed false alarm rate.
• High rate of correct detections (P_d).
• Low rate of false alarms (P_f).
• Isolability: the ability to distinguish (isolate) faults; it depends on the statistical tests employed and the structure of the system matrices.
• Sensitivity: characterizes the size of faults that can be isolated under certain conditions. It depends on the size of the respective matrices and the noise properties and is closely related to the detection delay time. Some researchers use this term in a different context which is related to the robustness requirement, but we prefer this definition since it defines a distinct aspect of an FDI algorithm.
• Robustness: the ability to isolate faults in the presence of modeling errors and/or unknown disturbances. This is a most serious requirement, since such errors are practically inevitable and therefore greatly affect the previous properties. Since modeling errors appear as multiplicative faults, false alarms are triggered if the robustness issue is not taken into account in the design process. It has by now become clear that the most essential requirement for a model-based FDI algorithm is robustness to disturbances as well as to model-system mismatches. This is not at all a straightforward problem and, as will be seen in the following sections, approximate solutions are found in practice.
Performance measures for the above requirements vary, and will be cited with the specific methods.
2.4 Methods of solution
The procedure of evaluation of the redundancy given by the mathematical model of the system, described by any of the models of Section 2.2, can be roughly divided into the following two steps:
1. Residual generation.
2. Residual analysis: decision and isolation of the faults (time, location, sometimes also type, size, and source).
The analytical redundancy approach requires that the residual generator performs some kind of validation of the nominal relationships of the system, using the actual input u and measured output y. The redundancy relations to be evaluated can simply be interpreted as input-output relations of the dynamics. If a fault occurs, the redundancy relations are no longer satisfied and a residual, r, occurs. The residual is then used to form appropriate decision functions. They are evaluated in the fault decision logic in order to monitor both the time of occurrence and the location of the fault.
A more detailed structural diagram of the overall FDI procedure is depicted in fig. 2.1. Note that for the residual generation three kinds of models are required: nominal, actual (observed) and that of the faulty system. In order to achieve a high performance of fault detection with a low false alarm rate, the nominal model should be tracked and updated by the observation model.
Figure 2.1 Structural diagram of the overall FDI procedure.
Basically, there are three different ways of generating fault-accentuated signals using analytical redundancy: parity checks, observer schemes and detection filters, all of them using state estimation techniques. The resulting signals are used to form decision functions as, for example, norms of likelihood functions. The basis for the decision on the occurrence of a fault is the fault signature, i.e. a signal that is obtained from some kind of faulty system model defining the effects associated with a fault.
Deterministic methods for FDI use deterministic state variable methods to generate residual quantities. The detection, isolation and further diagnosis of faults is achieved using these residual quantities. Careful design of the residual can facilitate the use of tighter bounds in the form of threshold levels for detection and isolation.
If the dynamical system with a number of possible faults can be described by the input-output relation,
y(s) = G_u(s)u(s) + G_f(s)f(s)
where y(s), u(s), f(s) are the Laplace transformed output, input and fault vectors respectively, then each component of the residual vector r(t) generated by means of a deterministic model should satisfy the condition:
r(t) = 0 if and only if f(t) = 0    (2.12)
where f(t) is considered to act upon the dynamics of the process in an additive manner. From a practical point of view it is reasonable not to make further assumptions about the fault vector f(t), except that it is an unknown time function.
The general structure for all deterministic residual generators, based upon the concept above, is shown in fig. 2.2. This structure is expressed mathematically in the frequency domain as:
r(s) = H_u(s)u(s) + H_y(s)y(s)    (2.13)
Figure 2.2 General structure of a residual generator.
The transfer matrices H_u(s) and H_y(s) are realizable using stable linear systems. In order to make the residual r(s) become zero in the fault-free case (i.e. to achieve the requirement in Eq. (2.12)), H_u(s) and H_y(s) must satisfy the null condition:
H_u(s) + H_y(s)G_u(s) = 0
Eq. (2.13) is a generalized representation of all residual generators. The design of the residual generator results simply in the choice of the transfer function matrices H_u(s) and H_y(s), which must satisfy the null condition. The various ways of generating residuals correspond to different parameterizations of H_u(s) and H_y(s). One can obtain different residual generators using different forms for H_u(s) and H_y(s), and using the design freedom, the desired performance of the residual can be achieved.
A fault can be detected by comparing the residual with a decision or threshold function D_F(r) according to the test,
D_F(r) ≤ T(t) for f(t) = 0
D_F(r) > T(t) for f(t) ≠ 0
If this test is positive (i.e. the decision function of the residual exceeds the threshold), a likely fault is hypothesized. There may also be a likelihood that the threshold will be exceeded even if there is no fault. This would lead to a false alarm in detection as a consequence of modeling errors (in the determination of the residual or in the determination of the decision function) or unknown (i.e. unexpected) disturbances affecting the residual. The simplest method of deciding whether or not there is a fault is to use a fixed threshold applied to the residual signal r(t).
There is a rich variety of methods available for quantitative model-based residual generation, including:
Observer approaches. The underlying idea is the estimation of the system outputs from the measurements (or a subset of measurements) by using either full-order or reduced-order state observers. A suitable weighting of the output estimation error is then defined as a residual, according to the general structure given in Eq. (2.13) and fig. 2.2. Methods for the selection of the observer gains include:
(a) The Unknown Input Observer (UIO) method (Watanabe and Himmelblau, (1982), Massoumnia, (1986), Wünnenberg and Frank, (1987)).
(b) Eigenstructure assignment to give disturbance decoupling (Patton and Kangethe, (1989), Patton and Chen, (1991)).
The state observer approach has become popular due to the flexibility of design, the relative ease in achieving robustness in fault detection and fault isolation, the algorithmic and software simplicity, and the speed of response in detecting and isolating faults. However, it does not provide fault size information.
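The following sketch illustrates the general idea with a plain full-order observer used as a residual generator; it is not the UIO or eigenstructure-assignment design cited above, and the gain L is assumed to have been chosen so that A − LC is stable.

    import numpy as np

    def observer_residuals(A, B, C, L, u_seq, y_seq):
        # Full-order observer used as a residual generator:
        #   x^(k+1) = A x^(k) + B u(k) + L r(k),   r(k) = y(k) - C x^(k).
        # With an accurate model and no fault, r(k) stays near zero; an additive
        # sensor or actuator fault drives it away from zero.
        n = A.shape[0]
        x_hat = np.zeros((n, 1))
        residuals = []
        for u, y in zip(u_seq, y_seq):
            u = np.reshape(np.asarray(u, dtype=float), (-1, 1))
            y = np.reshape(np.asarray(y, dtype=float), (-1, 1))
            r = y - C @ x_hat                  # residual: output estimation error
            residuals.append(r.ravel().copy())
            x_hat = A @ x_hat + B @ u + L @ r  # observer state update
        return np.array(residuals)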
The parity relations approach is based either on a technique of direct redundancy, making use of the static algebraic relations between sensor and actuator signals, or alternatively upon temporal redundancy, when dynamic relations between inputs and outputs are used (differential or difference equations). The term "parity" was first used in computer systems to enable "parity checks" to be performed for error checking. In the FDI field, it has a similar meaning in the context of providing an indicator for the presence of a fault (or error) in system components. The key idea is to check the parity (consistency) of the mathematical equations of the system (analytical redundancy relations) by using the actual measurements. A fault is declared to have occurred once preassigned error bounds are surpassed. This method does not provide fault size information either.
In the early developments, parity space methods were applied to parallel redundancy schemes (Potter and Suman, (1977), Desai and Ray, (1981), Chow and Willsky, (1984)). For such system configurations, the number of measurements is greater than the number of variables to be sensed and the residuals can be obtained directly from the redundant measurements. Inconsistency in the measurement data is then a metric that can be used initially for detecting faults and, subsequently, for fault diagnosis. Mironovskii (1980) independently derived similar relations in the Soviet Union.
Stochastic modeling methods for fault diagnosis are based on statistical testing of the innovations (i.e. the residuals) of Kalman filters or other filters and can be used for both fault detection and isolation, by means of hypothesis testing.
Whilst using a similar structure to the observer, approaches based on the Kalman filter comprise a residual generation mechanism derived by means of a stochastic model of the dynamical system. In normal operation the Kalman filter residual (or innovation) vector (the difference between the measurements and their Kalman filter estimates) is a zero-mean white noise process with known covariance matrix. Mehra and Peschon (1971) proposed the use of different statistical tests on the innovation to detect a fault of the system. The idea which is common to all these approaches is to test, amongst all possible hypotheses, whether the system has a fault or is fault-free. As each fault type has its own signature, a set of hypotheses can be used and checked for the likelihood that a particular fault has occurred.
Main methods include:
(i) Chi-squared testing (Mehra and Peschon (1971), Willsky et al. (1974a, 1975); Watanabe et al. (1979, 1981)). This is just an alarm procedure, i.e. it does not provide fault location or size information.
(ii) Sequential Probability Ratio Testing (SPRT): The purpose of this test is to check the zero-mean property of the innovations. The decision is based on the value of the likelihood ratio of the p.d.f. of the innovations under the null and alternative hypotheses. Usually the decision space is divided into three regions: fault, no fault and repeat. In this sense the test is sequential. Chien and Adams (1976), Deckert et al. (1977), Yoshimura et al. (1979), Bonivento and Tonielli (1984) and Uosaki (1985) are the main contributors to this approach. This is also an alarm procedure.
(iii) Generalized Likelihood Ratio (GLR) testing. The philosophy of this approach is as follows: a Kalman-Bucy filter is implemented on the assumption of no abrupt system changes, while a secondary system monitors the measurement residuals of the filter to determine if a change has occurred and adjusts the filter accordingly. Decision of fault occurrence is based on the value of the generalised likelihood ratio of the no-fault and fault hypotheses (Willsky and Jones (1976), Ono et al. (1984), Kumamaru (1984), Pouliezos and Stavrakakis (1987) and Tanaka and Müller (1990)). This method performs all required tasks, i.e. fault detection, isolation and estimation; thus it is possible to perform automatic system reorganization in the case of soft failures. These qualities are however offset by its low robustness.
(iv) Multiple Model Adaptive Filters (MMAFs). In this early development a bank of linear filters based on different hypotheses concerning the underlying system behaviour is constructed. The innovations of the various filters are monitored and the conditional probability that each system model is correct is computed. The system with the highest probability is declared to be the correct one (Lainiotis (1971), Athans and Willner (1973), Willsky et al. (1974b)). The same comments as for the GLR method apply here.
2.5 Stochastic modeling methods
This section considers the problem of detecting changes in linear, possibly time-varying, stochastic dynamical systems described by,
x(k + 1) = A(k)x(k) + B(k)u(k) + w(k)    (2.14)
y(k) = C(k)x(k) + D(k)u(k) + v(k)    (2.15)
where w and v are zero-mean, independent, white Gaussian sequences with covariances defined by,
E{w(k)w^T(j)} = Q(k)δ_kj,  E{v(k)v^T(j)} = R(k)δ_kj    (2.16)
where δ_kj is the Kronecker delta. Eqs. (2.14)-(2.16) describe the "normal operation" or "no failure" model of the system of interest. If no failures occur, the optimal state estimator is given by the discrete Kalman filter equations,
Here P(i|j) is the estimation error covariance of the estimate x̂(i|j), and V(k) is the covariance of r(k). Eqs. (2.17)-(2.23) are referred to as the "normal mode filter" in the sequel.
In addition to the above estimator, one may also have a closed-loop control law, such as the linear law
u(k) = G(k)x̂(k|k)    (2.24)
Since the statistics of the innovations sequence are completely known under normal conditions, a number of tests can be devised to check if the observed statistics are the expected ones. The relevant properties are: zero mean, independence and known covariance. These tests can be used as fault alarms, in situations where this is desirable, or as first-level fault detectors in more sophisticated algorithms. Both approaches will be examined in the following.
Mehra and Peschon (1971) were among the first to propose several statistical tests for detection of changes in the system (2.14)-(2.15). Since then various other tests have been proposed and tested in various real applications. It is obvious that the relevant theory is very extensive and cannot be described in detail here. The interested reader is referred to the excellent book of Basseville and Nikiforov (1993), in which the statistical change detection theory is presented in detail.
Some of the results presented in this section are specialised versions of general algorithms presented in Chapter 1. However, for reasons of clarity, they will be reintroduced here for the Kalman filter's innovation sequence.
For hypothesis testing purposes, it is more convenient to consider the standardized innovation sequence defined by,
η(k) = (C(k)P(k|k−1)C^T(k) + R(k))^(−1/2) r(k)    (2.25)
where (·)^(−1/2) denotes the square root of the inverse of a matrix. Then,
E{η(k)} = 0,  E{η(k)η^T(j)} = I δ_kj    (2.26)
It is also usual in fault detection applications to use moving windows of data. This results in lower detection delays, since the estimators do not have infinite memory. For this purpose, define,
η^T(k, n_w) = [η^T(k − n_w + 1)  η^T(k − n_w + 2)  η^T(k − n_w + 3)  ...  η^T(k)]    (2.27)
to denote a collection of n_w residuals ending at time k.
These tests check whether the observed innovation sequence is zero mean or not. The
mean of the innovation sequence is estimated as,
1 N
=-
A
where the subscript is used to signify the dependence on the sampie size N, and iiN de-
notes the true mean. Under the null (no-failure) hypothesis, qN has a Gaussian distribu-
tion with zero mean and covariance,
(2.29)
Therefore at the 5 per cent significance level, the null hypothesis is rejected whenever,
~ I 1.96 I
I 71N > .JN (2.30)
The above test suffers from the fact that the covariance of η(k) is assumed known. A better test is the T²-test, which uses the T²-statistic,
T² = N η̂_N^T Ĉ_N^(−1) η̂_N    (2.31)
where Ĉ_N is the sample covariance calculated by (2.35) (follows). This test is uniformly most powerful among all the tests for zero mean which are invariant with respect to scaling (or covariance); see Anderson (1958).
If a sliding window is used, recursive calculation of the window mean is also possible, using,
ζ(k) = η(k − n_w) − η(k)    (2.32)
η̄_w(k) = η̄_w(k − 1) − ζ(k)/n_w    (2.33)
where k denotes the current time and n_w the window length.
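A compact sketch of the mean and T² tests of (2.28)-(2.31) follows, assuming the standardized innovations are collected row-wise in an array; the sample covariance uses the (N − 1) divisor, consistent with the remark on unbiased estimates made further below.

    import numpy as np

    def mean_test(eta, level=1.96):
        # Componentwise zero-mean test (2.28)-(2.30): flag a component whose
        # sample mean exceeds 1.96/sqrt(N) (5 per cent significance level).
        eta = np.atleast_2d(np.asarray(eta, dtype=float))   # shape (N, m)
        N = eta.shape[0]
        eta_hat = eta.mean(axis=0)
        return np.abs(eta_hat) > level / np.sqrt(N), eta_hat

    def t2_statistic(eta):
        # T^2 statistic of (2.31): N * mean^T * C^-1 * mean, using the sample
        # covariance with the (N-1) divisor.
        eta = np.atleast_2d(np.asarray(eta, dtype=float))
        N = eta.shape[0]
        eta_hat = eta.mean(axis=0)
        C_hat = np.atleast_2d(np.cov(eta, rowvar=False, ddof=1))
        return float(N * eta_hat @ np.linalg.solve(C_hat, eta_hat))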
The above tests assume residual independence, a condition that may be violated if certain (non-additive) faults have occurred. In this situation, non-parametric tests offer greater robustness. Such a test is the multivariable component sign test (Bickel, 1965). This test uses a sign statistic for each component of the vectors and combines them in a quadratic form. Define,
S_{n_w}(k) = [S_{1,n_w}(k)  S_{2,n_w}(k)  ...  S_{m,n_w}(k)]^T
where,
S_{j,n_w}(k) = Σ_{t=k−n_w+1}^{k} sgn η_j(t)    (2.34)
and the sgn function is defined as,
sgn(z) = 1 for z > 0,  0 for z = 0,  −1 for z < 0
Now form,
S*_{n_w}(k) = S^T_{n_w}(k) W^(−1)_{n_w}(k) S_{n_w}(k)
where
[W_{n_w}(k)]_{iℓ} = Σ_{t=k−n_w+1}^{k} sgn η_i(t) sgn η_ℓ(t)  for 1 ≤ i ≤ m and 1 ≤ ℓ ≤ m
The test rejects H_0 (detects a change) for values of S*_{n_w}(k) greater than the chosen critical value of the χ² distribution with m degrees of freedom.
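A sketch of this sign test for one window of standardized innovations is given below; the use of a pseudo-inverse for W is an implementation convenience introduced here, not part of Bickel (1965).

    import numpy as np

    def component_sign_statistic(eta_window):
        # Multivariable component sign test: sum the signs of each component over
        # the window (2.34), combine them in the quadratic form S^T W^-1 S, and
        # compare the result with a chi-square critical value with m degrees of
        # freedom.
        signs = np.sign(np.atleast_2d(np.asarray(eta_window, dtype=float)))  # (n_w, m)
        S = signs.sum(axis=0)
        W = signs.T @ signs
        return float(S @ np.linalg.pinv(W) @ S)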
Division by (N − 1) instead of N produces unbiased estimates for small samples.
If moving windows are used, relevant recursive expressions for scalar signals are available (Pouliezos, 1980a).
The SPRT decision boundaries are,
A* = log((1 − β)/α),  B* = log(β/(1 − α))    (2.37)
If the LLR exceeds the boundary A* or falls below the boundary B*, the observation is terminated with acceptance of the hypothesis H_1 (failure mode) or the hypothesis H_0 (normal mode), respectively. Otherwise, the observation is continued and the decision is deferred. It is noted that the LLR can be computed recursively by,
λ_n = λ_{n−1} + log[p(η_n|H_1)/p(η_n|H_0)] = λ_{n−1} + ((Δ² − 1)/(2Δ²)) η_n² − log Δ    (2.38)
and the test can be performed recursively as new observations come in.
A failure detection system based on the conventional Wald SPRT formulation given above minimizes, on the average, the time to reach a decision for specified error probabilities if the system is either in the failure mode or in the normal mode from the beginning of the test. However, the failure process considered here is characterized by the system being initially operated in the normal mode, with a transition to the failure mode occurring at a random instant during the observations. The LLR (2.36) will show, on the average, a negative drift when the system is in the normal mode and a positive drift when it is in the failure mode, and thus the detection system suffers an extra time delay in compensating for the negative quantity accumulated under the normal mode before the transition to the failure mode (fig. 2.3). To improve on the performance of the conventional SPRT, Uosaki and Kawagoe (1988) proposed a modified version called the backward SPRT.
Taking into account the change time θ, the two hypotheses can be restated as:
H_0 (normal mode): E{η_{n−i+1}} = 0, Var{η_{n−i+1}} = 1
H_1 (failure mode): E{η_{n−i+1}} = 0, Var{η_{n−i+1}} = Δ² > 1,  i = 1, 2, ..., n − θ + 1    (2.39)
Define a backward LLR (the logarithm of the likelihood ratio function computed in reverse, from the current observation to the past ones) by,
λ*_{n,k} = ((Δ² − 1)/(2Δ²)) Σ_{i=n−k+1}^{n} ( η_i² − (2Δ²/(Δ² − 1)) log Δ ),  k = 1, 2, ..., n    (2.40)
When the backward LLR is applied to test the hypotheses (2.39) as in the SPRT (called the backward SPRT), no extra time delay will be introduced, since in the backward SPRT the variance of the normalized innovation is not unity from the beginning for hypothesis H_1 corresponding to the failure mode. Furthermore, the negative quantity accumulated under the initial normal mode in the conventional SPRT scarcely appears (fig. 2.3). The backward LLR can also be written as,
λ*_{n,k} = Σ_{i=1}^{n} ( ((Δ² − 1)/(2Δ²)) η_i² − log Δ ) − Σ_{i=1}^{n−k} ( ((Δ² − 1)/(2Δ²)) η_i² − log Δ ) = λ_n − λ_{n−k}
If the backward LLR statistic λ*_{n,k} > K for some k = 1, 2, ..., n, where K is a suitable constant, the observation is terminated with acceptance of the hypothesis that the system is in the failure mode. Otherwise, the observation is continued, as the system is not likely to be in the failure mode.
Figure 2.3 Backward SPRT failure detection system and trajectory of the backward LLR.
The alternative expression of the backward LLR given by (2.40) leads to the decision rule for acceptance of the hypothesis that the system is in the failure mode:
λ_n − λ_{n−k} = λ*_{n,k} > K for some k = 1, 2, ..., n
or,
λ_n − min_{1≤k≤n} λ_k > K    (2.41)
It is easy to show that this decision rule is equivalent to the following: the system is declared to be in the failure mode if
S_n = max[ 0, ((Δ² − 1)/(2Δ²)) η_n² − log Δ + S_{n−1} ] > K    (2.42)
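The recursion (2.42) is straightforward to implement; the sketch below returns the time indices at which the statistic exceeds the boundary K for a given reference value Δ.

    import numpy as np

    def backward_sprt_alarms(eta, delta, K):
        # Recursive statistic of (2.42): S_n = max(0, z*eta_n^2 - log(delta) + S_{n-1})
        # with z = (delta^2 - 1)/(2*delta^2); a fault (innovation variance increased
        # to delta^2 > 1) is declared whenever S_n exceeds the boundary K.
        z = (delta ** 2 - 1.0) / (2.0 * delta ** 2)
        h = np.log(delta)
        S = 0.0
        alarms = []
        for n, e in enumerate(np.ravel(np.asarray(eta, dtype=float))):
            S = max(0.0, z * e ** 2 - h + S)
            if S > K:
                alarms.append(n)
        return alarms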
The mean detection time (MDT) of this procedure can be evaluated by approximating S_n by a discrete-state Markov chain with states E_1 = (−∞, 0], E_i = ((i − 2)W, (i − 1)W], i = 2, ..., N−1, and E_N = (K, ∞). It should be noted that the states E_1 and E_N are reflecting and absorbing, respectively. Let P be the transition probability matrix of this Markov chain with components p_ij = Pr(S_n ∈ E_j | S_{n−1} ∈ E_i), i, j = 1, ..., N, which are given by,
p_11 = Pr{zη² − h ≤ 0} = F(Dh)
p_1j = Pr{(j − 2)W ≤ zη² − h < (j − 1)W},  j = 2, ..., N−1
p_i1 = Pr{zη² − h ≤ (−i + 3/2)W} = F(D((−i + 3/2)W + h)),  i = 2, ..., N−1
p_ij = Pr{(j − i − 1/2)W ≤ zη² − h < (j − i + 1/2)W}    (2.43)
where
z = (Δ² − 1)/(2Δ²),  h = log Δ,  D = 2Δ²/((Δ² − 1)Δ_t²)    (2.44)
and F(x) is the cumulative distribution function of the χ² distribution with one degree of freedom. Partitioning P as,
P = [ R  ρ ; 0  1 ]    (2.45)
the mean absorption time vector starting from state E_i is given by,
μ = (I − R)^(−1) 1,  where 1 = [1 1 ... 1]^T
Since the system starts from the normal mode, the first component of μ gives the MDT of an N-discrete-state Markov chain approximation. The MDT can be obtained by simple extrapolation, as the number of states of the discrete state approximation, N, goes to infinity. It is assumed that the MDT for a large number of states N is expressed by,
μ(N) = μ(∞) + A/N
and μ(∞) is determined by a least squares method from μ(N) for several values of N. To determine the reference value Δ and decision boundary K, it is possible, for example, to determine the reference value as the greatest tolerable innovation variance change, and then find the decision boundary K referring to the value of the MDT of the normal mode (Δ_t = 1), which corresponds to the inverse of the probability of false alarm.
The sample autocorrelation of the standardized innovations,
Ĉ_N(τ) = (1/N) Σ_{k=τ+1}^{N} η(k) η^T(k − τ)    (2.47)
can be used to test whiteness. Ĉ_N(τ) is an asymptotically unbiased and consistent estimate of C_N(τ) (Jenkins and Watts, 1968). Under the null hypothesis the Ĉ_N(τ), τ = 1, 2, ..., are asymptotically independent and normal with zero mean and covariance 1/N. Thus they can be regarded as samples from the same normal distribution and must lie in the band ±1.96/√N more than 95 per cent of the time under the null hypothesis.
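A sketch of this whiteness check for a scalar standardized innovation sequence follows; the normalization by the lag-zero term is an implementation choice consistent with unit-variance innovations.

    import numpy as np

    def whiteness_check(eta, max_lag=20):
        # Normalized sample autocorrelations of a scalar standardized innovation
        # sequence; under the no-fault hypothesis roughly 95% of them should lie
        # inside the band +/- 1.96/sqrt(N).
        eta = np.ravel(np.asarray(eta, dtype=float))
        eta = eta - eta.mean()
        N = len(eta)
        c0 = np.dot(eta, eta) / N
        corr = np.array([np.dot(eta[tau:], eta[:N - tau]) / (N * c0)
                         for tau in range(1, max_lag + 1)])
        band = 1.96 / np.sqrt(N)
        return corr, np.abs(corr) > band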
Another statistic that can be used for testing independence between the components of the innovation vector is the sample correlation coefficient defined as,
ρ̂^{αβ} = Σ_{i=1}^{N} (η^α(i) − η̄^α)(η^β(i) − η̄^β) / [ Σ_{i=1}^{N} (η^α(i) − η̄^α)² Σ_{i=1}^{N} (η^β(i) − η̄^β)² ]^(1/2)    (2.48)
where the superscripts α and β indicate the components of the vector η. Anderson (1958) shows that under the null hypothesis the distribution of ρ̂^{αβ} is,
p(ρ) = [ Γ((N − 1)/2) / ( Γ((N − 2)/2) Γ(1/2) ) ] (1 − ρ²)^((N − 4)/2),  |ρ| ≤ 1
where Γ[·] denotes the gamma function. This statistic is invariant with respect to the mean and the covariance of η(k). It is particularly useful in the present case since the true mean and covariance of η(k) are unknown. This statistic can also be used for testing whiteness by defining ρ̂^{αβ} for different lags.
The previously presented simple tests provide fault alarms and are useful when more sophisticated algorithms cannot be used, because of speed or other limitations. An alternative procedure in such situations would be to implement a two-stage method (Pouliezos and Stavrakakis, 1991). In this approach partial fault isolation is achieved by performing a combination of simple tests, and then using a fault table to limit the possible faults. Such a fault table is the following:
S = Ĉ_∞ − V − CAΣ(CA)^T
Σ = (I − KC)AΣ[(I − KC)A]^T + K(Ĉ_∞ − V)K^T − KCAΣ(KCA)^T
where V and K denote the steady-state values of the residual covariance and filter gain respectively, as given by (2.21), (2.22), and Ĉ_∞ denotes an estimate of the residual covariance obtained from a large number of samples.
This is an algebraic equation in the elements of Σ, which can be solved directly for its distinct n(n + 1)/2 elements. Let,
σ = [σ_11 σ_12 ... σ_1n σ_21 ... σ_2n ... σ_nn]^T
be the vector of the unknown elements of Σ. Then, using brute force algebra,
σ = (I − T)^(−1) f
where T is the matrix of the coefficients t_ij^{xy} arranged conformably with σ,
f = [f_11 f_12 ... f_1n f_21 ... f_2n ... f_nn]^T,  f_ij = L_ij
t_ij^{xy} = A_ix A_jy − A_ix Σ_{k=1}^{n} M_jk A_ky − A_jy Σ_{k=1}^{n} M_ik A_kx
L = K(Ĉ_∞ − V)K^T,  M = KC
(suffixes denote the respective matrix elements). In particular, if the system is scalar, the solution is,
σ = k²(ĉ_∞ − v) (1 − [(1 − kc)a]² + (cak)²)^(−1)
The MM method was originally developed for problems of system identification and adaptive control (Athans and Willner (1973), Lainiotis (1971)).
The basic MM method deals with the following problem: the inputs u(k), k = 0, 1, 2, ..., and outputs y(k), k = 1, 2, ..., of a system which is assumed to obey one of a given finite set of linear, possibly time-varying, stochastic models, indexed by i = 1, ..., N, are observed:
x_i(k + 1) = A_i(k)x_i(k) + B_i(k)u(k) + w_i(k) + g_i(k)
y(k) = C_i(k)x_i(k) + v_i(k) + b_i(k)
where w_i(k) and v_i(k) are independent, zero-mean Gaussian white noise processes with covariances Q_i(k) and R_i(k) respectively. As usual, the initial state x_i(0) is assumed to be Gaussian, independent of w_i and v_i, with mean x̂_i(0|0) and covariance P_i(0|0). The matrices A_i(k), B_i(k), C_i(k), Q_i(k), and R_i(k) are assumed to be known. Also, b_i(k) and g_i(k) are given deterministic functions of time (corresponding to biases, linearizations about different operating points, etc.). In addition, the state vectors x_i(k) may be of different dimensions for different values of i (corresponding to assuming that the different hypothesized models represent different orders for the dynamics of the real system). Note that this is a discrete-time formulation of the MM method. Continuous-time versions can be found in the literature (Greene, 1978), and they differ from their discrete-time counterparts only in a technical and not in a conceptual or structural manner.
Assuming that one of these N models is correct, we now have a standard multiple hypothesis testing problem. That is, let H_i denote the hypothesis that the real system corresponds to the ith model, and let p_i(0) denote the a-priori probability that H_i is true. Similarly, let p_i(k) denote the probability that H_i is true based on measurements through the kth measurement, i.e. given I_k = {u(0), ..., u(k−1), y(1), ..., y(k)}. Then Bayes' rule yields the following recursive formula for the p_i(k):

p_i(k+1) = p(y(k+1)|H_i, I_k, u(k)) p_i(k) / Σ_{j=1}^N p(y(k+1)|H_j, I_k, u(k)) p_j(k)     (2.49)

Thus, the quantities that must be produced at each time are the conditional probability densities p(y(k+1)|H_i, I_k, u(k)); i = 1, ..., N. However, conditioned on H_i, this probability density is precisely the one-step prediction density produced by a Kalman filter based on the ith model.
That is, let x̂_i(k+1|k) be the one-step predicted estimate of x_i(k+1) based on I_k and u(k), assuming that H_i is true. Also let x̂_i(k+1|k+1) denote the filtered estimate of x_i(k+1) based on I_{k+1} = {I_k, u(k), y(k+1)} and the ith model. Then these quantities are computed sequentially from the following equations:

x̂_i(k+1|k) = A_i(k)x̂_i(k|k) + B_i(k)u(k) + g_i(k)     (2.50)

Here P_i(k+1|k) denotes the estimation error covariance in the estimate x̂_i(k+1|k) (assuming H_i to be true), and P_i(k+1|k+1) is the covariance of the error x_i(k+1) − x̂_i(k+1|k+1), again based on H_i. Also, under hypothesis H_i, r_i(k+1) is zero mean with covariance V_i(k+1), and it is normally distributed (since we have assumed that all noises are Gaussian). Furthermore, conditioned on H_i, I_k, and u(k), y(k+1)
is Gaussian, has mean C_i(k+1)x̂_i(k+1|k) and covariance V_i(k+1). Thus, it is deduced that,

p(y(k+1)|H_i, I_k, u(k)) = (2π)^{−m/2} |V_i(k+1)|^{−1/2} exp{−½ r_i^T(k+1) V_i^{−1}(k+1) r_i(k+1)}     (2.56)

(recall m is the dimension of y).
Equations (2.49)-(2.51) and (2.56) define the MM algorithm. The inputs to the procedure are the y(k) and u(k), and the outputs are the p_i(k). The implementation of the algorithm can be viewed as consisting of a bank of N Kalman filters, one based on each of the N possible models. The outputs of these Kalman filters are the innovations sequences r_i(k+1), which effectively measure how well each of the filters can track and predict the behavior of the observed data. Specifically, if the ith model is correct, then the one-step prediction error r_i(k) should be a white sequence, resulting only from the intrinsic uncertainty in the ith model. However, if the ith model is not correct, then r_i(k) will not be white and will include errors due to the fact that the prediction is based on an erroneous model. Thus the probability calculations (2.49), (2.56) basically provide a quantitative way in which to assess which model is most likely to be correct by comparing the performance of predictors based on these models.
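A minimal numerical sketch of such a bank of filters is given below (Python with NumPy; the two hypothesized models and all numerical values are illustrative, not from the text). Each filter supplies its innovation and innovation covariance, from which the Gaussian likelihood and the recursion (2.49) update the model probabilities.

import numpy as np

def kf_step(x, P, u, y, m):
    """One predict/update step of the Kalman filter for model m; returns the
    updated estimate together with the innovation r and its covariance V."""
    A, B, C, Q, R = m["A"], m["B"], m["C"], m["Q"], m["R"]
    x_pred = A @ x + B @ u
    P_pred = A @ P @ A.T + Q
    V = C @ P_pred @ C.T + R
    r = y - C @ x_pred
    K = P_pred @ C.T @ np.linalg.inv(V)
    return x_pred + K @ r, (np.eye(len(x)) - K @ C) @ P_pred, r, V

def mm_probabilities(models, u_seq, y_seq, p0):
    """Bank of Kalman filters: update the model probabilities p_i(k) with
    Bayes' rule (2.49), using the Gaussian one-step prediction density (2.56)
    as the conditional likelihood."""
    p = np.array(p0, dtype=float)
    states = [(m["x0"].copy(), m["P0"].copy()) for m in models]
    for u, y in zip(u_seq, y_seq):
        lik = np.empty(len(models))
        for i, m in enumerate(models):
            x, P, r, V = kf_step(*states[i], u, y, m)
            states[i] = (x, P)
            lik[i] = np.exp(-0.5 * r @ np.linalg.solve(V, r)) / \
                     np.sqrt((2 * np.pi) ** len(r) * np.linalg.det(V))
        p = lik * p
        p /= p.sum()
    return p

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    def model(a):                            # scalar model with hypothesized pole a
        return {"A": np.array([[a]]), "B": np.array([[1.0]]),
                "C": np.array([[1.0]]), "Q": np.array([[0.01]]),
                "R": np.array([[0.1]]), "x0": np.zeros(1), "P0": np.eye(1)}
    models = [model(0.5), model(0.9)]
    x, us, ys = np.zeros(1), [], []
    for k in range(200):                     # data generated by the second model
        u = np.array([0.1])
        x = 0.9 * x + u + 0.1 * rng.standard_normal(1)
        us.append(u)
        ys.append(x + np.sqrt(0.1) * rng.standard_normal(1))
    print(mm_probabilities(models, us, ys, [0.5, 0.5]))   # p_2(k) approaches 1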
Several questions arise in understanding how the MM algorithm should be used. Clearly a very important question concerns the use of MM in problems in which the real system is nonlinear and/or the noises are non-Gaussian. The answer to this problem is application-dependent. The Gaussian assumption is basically used in one place, i.e. in the evaluation of p(y(k+1)|H_i, I_k, u(k)) in (2.56). It may turn out that using this formula, even when r_i(k+1) is non-Gaussian, causes essentially no performance degradation.
As far as the nonlinearity of the real system is concerned, an obvious approach is to linearize the system about a number of operating points for each possible model and use these linearized models to design extended Kalman filters which would be used in place of the Kalman filters in the MM algorithm. Again the utility of this approach depends very much on the particular application. Essentially the issue is whether the tracking error from the extended Kalman filter corresponding to the linearized model closest to the true, nonlinear system is markedly smaller than the errors from filters based on less correct models. This is basically a signal-to-noise ratio problem, similar to that seen in the idealized MM algorithm in which everything is linear. In that case the noise is measured by the V_i(k+1). The larger these are, the harder it will be to distinguish the models. In the nonlinear case, the inaccuracies of the extended Kalman filters effectively increase the V_i(k+1), thus reducing their tracking capabilities and making it more difficult to distinguish among them. Therefore, the performance of MM in this case will depend upon how far apart the different models are, as compared to how well each of the trackers tracks.
The further apart the models are, the more signal we have; the poorer the tracking performance is, the more difficult it is to distinguish among the hypotheses.
Even if the true system is linear, there is clearly the question of the utility of MM given the inevitability of discrepancies between the actual system and any of the N hypothesized models. Again this is a question of signal-to-noise ratio, but in the linear case a number of results and approaches have been developed for dealing with this problem. For example, Baram (1976) has developed a precise mathematical procedure for calculating the distance between different linear models, and he has shown that the MM procedure will converge to the model closest to the real model (i.e. p_i(k) → 1 for the model nearest the true system). This can be viewed as a technique for testing the robustness of MM or as a tool that enables us to decide what models to choose. That is, if the real system is in some set of models that may be infinite or may in fact represent a continuum of models (corresponding to the precise values of certain parameters), then Baram's results can be used to decide upon a finite set of these models that span the original set and that are far enough apart so that MM can distinguish among them. Willsky (1986) further elaborates on implementation issues of the MM algorithm.
The starting point for the GLR method is a model describing normal operation of the observed signals or of the system which generated them. Abrupt changes are then modeled as additive or multiplicative disturbances to this model that begin at unknown times. Additive disturbances typify biases, while multiplicative disturbances model parameter changes. As just discussed for MM, the case of a single such change will be considered, the assumption being that abrupt changes are sufficiently separated to allow for individual detection and compensation. The solution to the problem just described and applications of the method can be found, amongst others, in Willsky and Jones (1976), Pouliezos and Stavrakakis (1987), Tanaka and Müller (1990).
If the signature is a vector whose components are all zero except for the jth one, which equals 1 for k ≥ θ, then this corresponds to the onset of a bias in the jth component of y. Finally, the scalar ν denotes the magnitude of the failure (e.g. the size of a bias), which we can model as known (as in MM and as in what is called simplified GLR (SGLR)) or unknown.
Assume that a Kalman filter based on normal operation is designed, i.e. by neglecting f_i and g_i. This filter is given by,

x̂(k+1|k) = A(k)x̂(k|k) + B(k)u(k)     (2.58)

where K, P, and V are calculated as in (2.52)-(2.55). Suppose now that a type i change of size ν occurs at time θ. Then, because of the linearity of (2.57)-(2.60),

x(k) = x_N(k) + α_i(k,θ)ν     (2.61)

The quantity a(k,θ,i) has the interpretation of the amount of information present in y(θ), ..., y(k) about a type i change occurring at time θ.
The on-line GLR calculations consist of the calculation of,
d(k,θ,i) = Σ_{j=θ}^k ρ_i^T(j,θ) V^{−1}(j) γ(j)     (2.68)

which are essentially correlations of the observed residuals with the abrupt change signatures ρ_i(j,θ) for different hypothesized types i and onset times θ. If ν is known (the SGLR case), then the likelihood of a type i change having occurred at time θ given data y(1), ..., y(k) is given by (2.69). The distinguishability of a type i change at two candidate onset times θ_1 and θ_2 is measured by,

Σ_{j=max(θ_1,θ_2)}^k ρ_i^T(j,θ_1) V^{−1}(j) ρ_i(j,θ_2)     (2.72)

and the relative distinguishability of type i and m changes at times θ_1 and θ_2 similarly:

Σ_{j=max(θ_1,θ_2)}^k ρ_i^T(j,θ_1) V^{−1}(j) ρ_m(j,θ_2)
These quantities provide extremely useful information. For example, in some applications the estimation of the time θ at which the change occurs is critical, and the above equations provide information about how well one can resolve the onset time. In failure detection applications these quantities directly provide information about how system redundancy is used to detect and distinguish failures, and can be used in deciding whether additional redundancy (e.g. more sensors) is needed. Also, they directly give the statistics of the likelihood measure (2.69). For the SGLR case of (2.69), ℓ_s is Gaussian, and its mean under no failure is −ν²a(k,θ,i), while if a type m failure occurs at time φ, its mean is,
In the case of full GLR, under no failure ℓ(k,θ,i) is a χ² random variable with 1 degree of freedom, while if a failure (m,φ) of size ν occurs, ℓ(k,θ,i) is a non-central χ² with mean,
These quantities can be very useful in evaluating the performance of GLR detection algorithms and for determining decision rules based on the GLR outputs. If one were to follow the precise GLR philosophy (Van Trees, 1968), the decision rule one would use is to choose at each time k the largest of the ℓ(k,θ,i) over all possible change types i and onset times θ. This largest value would then be compared to a threshold for change detection, and if the threshold is exceeded the corresponding maximizing values of θ and i are taken as the estimates of change type and time. For greater confidence, persistence tests (i.e. ℓ must exceed the threshold over some time period) are often used to cut down on false alarms due to spurious and unmodeled events. Basseville (1981) further discusses the threshold selection procedure.
A final issue to be mentioned is the pruning of the tree of possibilities. As in the MM case, in principle a growing number of calculations have to be performed, as d(k,θ,i) must be calculated for every possible fault case and all possible change times up to the present, i.e. θ = 1, ..., k. As discussed previously, an appropriate procedure is to look only over a sliding window of possible times:

k − M1 ≤ θ ≤ k − M2
where M1 and M2 are chosen based on the a's (eq. 2.69), i.e. on detectability and distinguishability considerations. Basically, after M2 time steps from the onset of change enough information is collected, so that a detection may be made with a reasonable amount of accuracy. Further, after M1 time steps a sufficient amount of information will be collected, so that detection performance is as good as it can be (i.e. there is no point in waiting any longer). Clearly, M1, M2 must be large to allow for maximum information collection, but also small enough for fast response and for computational simplicity. This is a typical tradeoff that arises in all change detection problems.
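As an illustration of the on-line calculations over such a sliding window, the following sketch (Python with NumPy) evaluates the correlations d(k,θ) of (2.68) and the corresponding information a(k,θ) for candidate onset times k−M1 ≤ θ ≤ k−M2 and returns the maximizing onset time. The fault signature ρ(j,θ) is assumed to be supplied externally (here a crude unit-step placeholder), and the statistic is taken in the standard form d²/(2a) obtained by maximizing over an unknown ν, which may differ in detail from (2.69).

import numpy as np

def glr_sliding_window(residuals, V, signature, k, M1, M2):
    """Evaluate the correlations d(k,theta) (eq. 2.68) and the information
    a(k,theta) for candidate onset times k-M1 <= theta <= k-M2 and return
    the maximizing onset time, the statistic and the estimated fault size.
    residuals[j]: innovation at time j; V[j]: its covariance;
    signature(j, theta): assumed fault signature rho(j, theta)."""
    best = (None, -np.inf, 0.0)
    for theta in range(max(0, k - M1), k - M2 + 1):
        d = a = 0.0
        for j in range(theta, k + 1):
            rho = signature(j, theta)
            d += rho @ np.linalg.solve(V[j], residuals[j])
            a += rho @ np.linalg.solve(V[j], rho)
        stat = d * d / (2.0 * a)             # GLR statistic for unknown nu
        if stat > best[1]:
            best = (theta, stat, d / a)      # d/a is the most likely size
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    T, theta_true = 120, 80
    V = [np.eye(1) for _ in range(T)]
    step = lambda j, theta: np.ones(1) * (j >= theta)   # crude bias signature
    res = [rng.standard_normal(1) + 0.8 * (j >= theta_true) for j in range(T)]
    print(glr_sliding_window(res, V, step, k=T - 1, M1=60, M2=5))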
GLR has been successfully applied to a wide variety of applications, such as geophysical signal analysis (Basseville and Benveniste, 1983), detecting arrhythmias in electrocardiograms (Gustaffson et al., 1978a and 1978b), freeway incident detection (Willsky et al., 1980), and manoeuvre detection (Dowdle et al., 1983). Recall that the model used in (2.57) for such changes is an additive model. Thus it appears on the surface that the types of abrupt changes that can be detected by GLR are a special subset of those that can be detected by MM, since in that method parametric changes are allowed (in A, B, C, Q, R) as well as additive ones. However, a GLR system based on the detection of additive effects can often also detect parameter failures. For example, a gain change in a sensor does look like a sensor bias, albeit one that is modulated by the value of the variable being sensed. That is, any detectable change will exhibit a systematic deviation between what is observed and what is predicted to be observed. Obviously, the ability of GLR to detect a parametric change when it is looking for additive ones is again a question of robustness. If the effect of the parametric change is "close enough" to that of the additive one, the system will work. For example, Deckert et al. (1977) describe an additive-failure-based design that does extremely well in detecting gain changes in sensors. Note of course that in this mode GLR is essentially only indicating an alarm, but in detection problems where the primary interest is in simply identifying which of several types of changes has occurred, this is acceptable.
Direct application of the GLR methodology to parametric changes has also been reported by some researchers. In this context, the model is described in any of the following modes:

(a)  x(k+1) = [A(k) + ΔA σ_{k+1,θ}] x(k) + B(k)u(k) + w(k)
     y(k) = C(k)x(k) + v(k)

(b)  x(k+1) = A(k)x(k) + B(k)u(k) + w(k)
     y(k) = [C(k) + ΔH σ_{k+1,θ}] x(k) + v(k)

(c)  x(k+1) = A(k)x(k) + B(k)u(k) + w(k) + ζ_x(k) σ_{k+1,θ}
     y(k) = C(k)x(k) + v(k)

(d)  x(k+1) = A(k)x(k) + B(k)u(k) + w(k)
     y(k) = C(k)x(k) + v(k) + ζ_y(k) σ_{k+1,θ}

where ζ_x(k), ζ_y(k) are additional noises of covariances S_x, S_y, independent of the plant and sensor noise sequences, and σ_{k+1,θ} denotes the unit step at the change time θ.
Model (a) is used in cases where a transition matrix change is to be detected, model (b) is used when a measurement matrix change is to be detected, while models (c) and (d) are used whenever additional noise in the state or measurements must be monitored. For these models, appropriate equations for GLR-based fault detection are (Pouliezos and Stavrakakis, 1987):
For model (a):

γ(k) = γ_N(k) + Σ_{i=θ}^k G_a(k,i,ΔA) ΔA x_N(i−1)

G_a(k,θ,ΔA) = C(k)[ Σ_q C_q^i Φ(θ+q, θ)(ΔA)^q − A(k−1)F_a(k−1,θ,ΔA) ]
F_a(k,θ,ΔA) = K(k)G_a(k,θ,ΔA) + A(k−1)F_a(k−1,θ,ΔA);  k ≥ θ
G_a(k,θ,ΔA) = F_a(k,θ,ΔA) = 0;  k < θ

C_q^i is the binomial coefficient, Φ(i,j) is the state transition matrix and the subscript N denotes normal operation quantities defined as in (2.61)-(2.64). As seen, these equations have a similar form to Eqs. (2.64), (2.65)-(2.68), but the nice linear structure is lost, since there is modulation of the residuals by the unknown ΔA.
Similarly for model (b),

γ(k) = γ_N(k) + Σ_{i=θ}^k G_b(k,i) ΔH x_N(i)

and for model (c),

G_c(k,θ) = −C(k)[Φ(k,θ) − A(k−1)F_c(k−1,θ)]
F_c(k,θ) = K(k)G_c(k,θ) + A(k−1)F_c(k−1,θ);  k ≥ θ
G_c(k,θ) = F_c(k,θ) = 0;  k < θ

Finally, for model (d),

γ(k) = γ_N(k) + Σ_{i=θ}^k G_d(k,i) ζ_y(i)

where,

G_d(k,θ) = −C(k)A(k−1)F_d(k−1,θ);  k > θ
F_d(k,θ) = K(k)G_d(k,θ) + A(k−1)F_d(k−1,θ);  k ≥ θ
G_d(θ,θ) = I
G_d(k,θ) = F_d(k,θ) = 0;  k < θ
These equations can be used to derive GLR-based detection algorithms. However, their implementation is not at all straightforward, because of their complexity. Liu (1977) considered the case of additional plant noise and proposed some approximate solutions to this problem. Pouliezos et al. (1993) considered the problem of additional sensor noise and solved it for the time-invariant case. Their approach utilised the steady-state effect additional sensor noise has on the filter innovations. Tanaka and Müller (1990) have used a pattern recognition approach to detect parametric changes in the state and measurement matrices, using a GLR statistic. Their method is robust and relies on the recognition of the pattern of the curve of the maximum GLR calculated by the conventional step-hypothesised GLR.
The basic idea in the observer-based approach is to reconstruct the states (and the outputs) of the system with the aid of observers. Observers are dynamic systems that are aimed at reconstructing the state x of a state-space model on the basis of the measured inputs u and outputs y. The state estimation error is then used as the residual for the detection of the faults.
The particular type of observer used depends on the particular application's needs, i.e. what kind of failures need to be detected (component, sensor, actuator), what criteria must be set (robustness, isolability etc.) and on the system structure (observability, controllability). To introduce the concept, consider the linear, time-invariant, controllable and observable system described by equations (2.1), (2.2) in its continuous time version:
ẋ(t) = Ax(t) + Bu(t)     (2.73)
y(t) = Cx(t)

It can easily be seen that this mathematical model, with the assumptions made, contains all the existing relations among the measured variables y_i (i.e., the "redundancy relations") in the form of differential equations. To reconstruct the states or measured variables from other measured variables and inputs one can use the linear full-order estimator,

x̂̇(t) = Ax̂(t) + Bu(t) + L(y(t) − Cx̂(t))     (2.74a)
ŷ(t) = Cx̂(t)     (2.74b)

where x̂(t) and ŷ(t) denote the estimated state and output vector and L the gain matrix, by the choice of which the desired dynamics of the estimator can be achieved.
Let us consider a simple, continuous-time version of equations (2.11):

ẋ(t) = Ax(t) + Bu(t) + f(t) + w(t)     (2.75)
y(t) = Cx(t) + ΔC(t) + v(t)     (2.76)

where f(t) denotes faults in the process, ΔC(t) sensor faults, and w(t), v(t) process and sensor noise respectively.
Subtracting (2.74a) from (2.75) and defining the state estimation error,

ε(t) = x(t) − x̂(t)

the equations for the output estimation error,

e(t) = y(t) − ŷ(t)

become,

ε̇(t) = (A − LC)ε(t) + w(t) + f(t)     (2.77)
e(t) = Cε(t) + ΔC(t) + v(t)     (2.78)

It can be seen that if all the disturbances f(t), ΔC(t), w(t), v(t) are zero and the matrix A − LC has its eigenvalues in the left half plane, the estimation error e(t) goes to zero after any initial condition tracking error has died out. If, however, at least one of these disturbances occurs, e(t) is driven by it. Thus its effect on e(t) can be evaluated to discover the disturbance. Therefore e(t) can be used as a residual for FDI.
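As a simple illustration, the following sketch (Python with NumPy/SciPy) implements a discrete-time analogue of the estimator (2.74) and uses the output estimation error as the residual; the plant, the injected fault and the observer poles are illustrative choices, not taken from the text.

import numpy as np
from scipy.signal import place_poles

def observer_residuals(A, B, C, L, u_seq, y_seq, x0_hat):
    """Full-order estimator x_hat(k+1) = A x_hat + B u + L (y - C x_hat);
    the output estimation error e(k) = y(k) - C x_hat(k) is the residual."""
    x_hat, res = x0_hat.copy(), []
    for u, y in zip(u_seq, y_seq):
        e = y - C @ x_hat
        res.append(e)
        x_hat = A @ x_hat + B @ u + L @ e
    return np.array(res)

if __name__ == "__main__":
    A = np.array([[0.9, 0.1], [0.0, 0.8]])               # illustrative plant
    B = np.array([[0.0], [1.0]])
    C = np.array([[1.0, 0.0]])
    L = place_poles(A.T, C.T, [0.2, 0.3]).gain_matrix.T  # observer gain
    rng = np.random.default_rng(3)
    x, us, ys = np.zeros(2), [], []
    for k in range(100):
        u = np.array([np.sin(0.1 * k)])
        f = np.array([0.0, 0.5]) if k >= 60 else np.zeros(2)   # process fault f(k)
        x = A @ x + B @ u + f
        us.append(u)
        ys.append(C @ x + 0.01 * rng.standard_normal(1))
    r = observer_residuals(A, B, C, L, us, ys, np.zeros(2))
    print("mean |e| before fault:", np.abs(r[:60]).mean(),
          " after fault:", np.abs(r[60:]).mean())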
Since specific designs depend on design requirements, let us now consider three distinct cases:
• instrument fault detection (IFD)
• component fault detection (CFD)
• actuator fault detection (AFD)
A. Instrument fault detection
The goal of IFD is to detect and locate faults of the sensors of the process with sufficient robustness to system parameter changes and noise. There are two extremes of application, depending on the purpose of failure detection and the type of failures. One extreme is the detection of hard failures as, for example, in safety relevant systems. Typically, the failures are then large and the admissible detection time is small (e.g. seconds or fractions of seconds). The other extreme is the detection of soft failures as in the case of signal validation, such as repeated (periodic) checking of instrumentation in a nuclear reactor. In this case the failures to be discovered are typically small (a few percent) but a large detection time is available (days or weeks). Hence, in the first case, one can get along with the deterministic approach using observers, since noise (and parameter uncertainties) do not play a large role. In the second case noise cannot be neglected and hence Kalman filters and stochastic decision procedures including correlation techniques are indicated. In a concrete situation one has to decide for either (or, in poorly conditioned situations, for neither) of the two approaches.
Over the years a number of different estimator schemes for deterministic IFD have been proposed in the literature. They differ in the number of estimators and the number of measured variables.
In the Dedicated Observer Scheme (DOS), (Frank, 1987), it is assumed that the full state or a subset of it can be observed from each measured variable of the process (fig. 2.4). Each sensor (instrument) output, y_i, is used to drive a dedicated observer of full or reduced order to observe as many of the remaining measured variables as possible. If the process is completely observable, an m-fold redundancy of the measurement vectors is achieved. In the non-fault case,

ŷ_ik = ŷ_ij = y_i;  i = 1, 2, ..., n;  k, j = 1, 2, ..., m     (2.79)

If instrument q is faulty, (2.79) holds only for k, j ≠ q, i.e., the observer that is driven by the faulty instrument provides a wrong estimate of all measured variables, whereas the estimates of all the other observers match the corresponding y_i except that of the erroneous instrument. The output estimation error,

(2.81)

represents now the residual vector which, in principle, allows a unique FDI even in the case of several faulty instruments at the same time.
If the sensor s that drives the observer fails, all estimates are erroneous, so that ŷ_i − y_i ≠ 0 for all i except i = s. An advantage of this scheme is that it applies to processes with very limited observability and that it is easy to implement. However, the price to pay is reduced redundancy and a loss of detection reliability.
At the other end, if the DOS is generalized in that each observer is driven with more than one measured variable, say with overlapping subsets of y, the Generalized Observer Scheme (GOS) is arrived at, see fig. 2.6 (Frank, 1987). It is evident that the decision logic is of the same type as in the case of DOS or SOS. However, this scheme provides more degrees of freedom for the observer design and can be used for the increase of robustness to parameter variations.
Figure 2.6 Generalized observer scheme (GOS): the process and its set of instruments drive a bank of observers (or Kalman filters), each fed with a subset of the measured outputs y; the observer outputs are compared in the decision logic, which raises the alarm.
Even if the order of the overall system is rather high, the order of each subsystem, and consequently the order of the corresponding observers, may be rather low. Furthermore, only local observability is required. The problem with this approach lies, however, in the interaction between the subsystems. If the interactions are weak, a malfunction in any of the components affects only the estimate of the corresponding local observer. It is thus possible to identify the faulty component uniquely. However, if the interactions are large and not measurable, a fault in one component propagates to observers of other components. The observer scheme would then fail to identify the faulty component. What it could achieve is to identify a fault in the complex of components that are largely interconnected. This is the basis for the Hierarchical Observer Scheme (HOS) introduced by Janssen and Frank (1984).
In the HOS, shown in fig. 2.7, the overall system is divided into two levels of components: an upper level including all components with measurable couplings and a lower level of components with unknown couplings. For each of the resulting configurations in both levels a scheme of local observers is designed, which includes the reconstruction of the measured outputs of the components. The observer scheme used for the upper level is called ASCOS (Available-State Coupled Observer Scheme) and the one for the lower level ESCOS (Estimated-State Coupled Observer Scheme). The difference between ASCOS and ESCOS is the way of performing the couplings in the observer part of fig. 2.7.
Figure 2.7 Local observer scheme of an HOS (after Frank, 1987)
For a brief development of the main ideas of ASCOS and ESCOS, consider a system represented by equations (2.1), (2.2) in component form:

ẋ_i(t) = A_ii x_i(t) + Σ_{j=1, j≠i}^N A_ij x_j + B_i u_i     (2.82)
where x_i, y_i, u_i are the states, outputs and inputs, respectively, of the ith component and A_ij, B_i, C_i are matrices of appropriate dimensions. Note that A_ij characterizes the coupling gains with other components.
Assuming that the coupling terms Σ A_ij x_j in (2.82) are measurable, the state equations of the corresponding ith observer are given by,

x̂̇_i(t) = (A_ii − L_iC_i)x̂_i(t) + Σ_{j=1, j≠i}^N A_ij x_j(t) + B_i u_i(t) + L_i y_i(t)     (2.84)
ŷ_i(t) = C_i x̂_i(t)     (2.85)

where x_j and u_i are measurable and N is the total number of components of the upper level. In the nominal case the state estimation error ε_i(t) = x_i(t) − x̂_i(t) obeys,

(2.86)

where ΔA_i and ν_i denote parameter variations and component faults of the ith component. The corresponding output estimation error e_i(t) = y_i(t) − ŷ_i(t) of the ith component then becomes,

(2.87)

where δ_i and μ_i denote sensor faults and measurement noise, respectively. In the ideal case, all coupling terms are measurable, all state and output estimation errors are perfectly decoupled from each other, and ΔA_i, ν_i, δ_i, μ_i only affect the error equations of the ith component. This implies that a system fault in the kth component only affects e_k(t), leaving the remaining errors e_i(t), i ≠ k, unchanged. This allows a unique failure isolation.
To derive the ESCOS, observe that in the second level the coupling terms Σ A_ij x_j(t) are not measurable. The idea is to replace them in the observer scheme by their estimates and estimation errors, which both are available from other observers. Then the state equation of the corresponding observer of the ith component becomes,

x̂̇_i(t) = (A_ii − L_iC_i)x̂_i(t) + Σ_{j=1, j≠i}^N [A_ij x̂_j(t) + L_j e_j(t)] + B_i u_i(t) + L_i y_i(t)

This leads to the following equation for the state estimation error:

ε̇_i(t) = (A_ii − L_iC_i)ε_i(t) + Σ_{j=1, j≠i}^N (A_ij − L_jC_j)ε_j(t)     (2.88)

From (2.88), the output estimation error e_i(t) can again be determined in the same way as above.
The parity space approach, originated by Deckert et al. (1977), yields a systematic exploitation of the analytical redundancy provided by the mathematical model of the system.
The basic idea used in this work is to identify the analytical redundancy relations of the system that were known well and those that contained substantial uncertainties. An FDI system (i.e., its residual generation process) is then designed based primarily on the well-known relationships (and only secondarily on the less well-known relations) of the system behavior. Chow (1980), and Chow and Willsky (1984), extracted and extended the practical idea underlying this application and developed a general approach to the design of FDI algorithms. Lou et al. (1986), and Massoumnia et al. (1988), further developed these ideas.
In addition to its use in specifying the residual generation procedure, this approach is also useful as it can provide a quantitative measure of the attainable level of robustness in the early stages of a design. This allows the designer to assess overall performance.
The basis for residual generation in analytical redundancy essentially takes two forms:
1. Direct redundancy: the relationship amongst instantaneous outputs of sensors; and,
2. Temporal redundancy: the relationship amongst the histories of sensor outputs and actuator inputs. Based on these relationships, outputs of (dissimilar) sensors (at different times) can be compared. The residuals resulting from these comparisons are then measures of the discrepancy between the behaviour of observed sensor outputs and the behaviour that should result under normal conditions.
In order to develop a clear picture of redundancy, consider the system described by equations (2.1), (2.2) in the following form:

x(k+1) = Ax(k) + Σ_{j=1}^r b_j u_j(k)     (2.89)
y_j(k) = c_j x(k);  j = 1, ..., m     (2.90)

where c_j is an n-dimensional row vector. Direct redundancy exists among sensors whose outputs are algebraically related, i.e., the sensor outputs are related in such a way that the variable one sensor measures can be determined by the instantaneous outputs of the other sensors. For the system (2.90), this corresponds to the situation where a number of the c_j's are linearly dependent. In this case, the value of one of the observations can be written as a linear combination of the other outputs. For example, one might have,

y_1(k) = Σ_{i=2}^m a_i y_i(k)     (2.91)

where the a_i's are constants. This indicates that under normal conditions the ideal output of sensor 1 can be calculated from those of the remaining sensors. In the absence of a failure in the sensors, the residual y_1(k) − Σ_{i=2}^m a_i y_i(k) should be zero. A deviation from this behavior provides the indication that one of the sensors has failed. Define,
C_j(k) = [ c_j
           c_jA
           ⋮
           c_jA^k ]

The well-known Cayley-Hamilton theorem implies that there is an n_j, 1 ≤ n_j ≤ n, such that,

rank C_j(k) = { k+1,  k < n_j
              { n_j,  k ≥ n_j     (2.92)

The null space of the matrix C_j(n_j−1) is known as the unobservable subspace of the jth sensor. The rows of C_j(n_j−1) span a subspace of R^n that is the orthogonal complement of the unobservable subspace. Such a subspace will be referred to as the observable subspace of the jth sensor, and it has dimension n_j.
Let ω be a row vector of dimension N = Σ_{i=1}^m (n_i + 1) such that ω = [ω_1 ... ω_m], where ω_j, j = 1, ..., m, is an (n_j+1)-dimensional row vector. Consider a nonzero ω satisfying,

ω [ C_1(n_1)
    ⋮
    C_m(n_m) ] x(k) = 0     (2.93)

Assuming that the system (2.89)-(2.90) is observable, there are only N − n linearly independent ω's satisfying (2.93). Let Ω be an (N−n) × N matrix with a set of such independent ω's as its rows. (Ω is not unique.) Assuming all the inputs are zero for the moment yields,
p(k) = Ω Y(k)     (2.94)

where,

Y(k) = [Y_1(k)^T ... Y_m(k)^T]^T,   Y_j(k) = [y_j(k) ... y_j(k+n_j)]^T;  j = 1, ..., m

The (N−n)-vector p(k) is called the parity vector. In the absence of noise and failures, p(k) = 0. In the noisy, no-fail case, p(k) is a zero-mean random vector. Under noise and failures, p(k) will become biased. Moreover, different failures will produce different (biases in the) p(k)'s. Thus, the parity vector may be used as the signature-carrying residual for FDI.
The matrix Ω may be generated by making direct use of (2.93). Let,

T = [ C_1(n_1)
      ⋮
      C_m(n_m) ]

From (2.93), it is seen that the rows of Ω span the orthogonal complement of the range space of T. This suggests that Ω can be generated by subtracting the orthogonal projection onto T from the identity operator. That is, Ω can be chosen to consist of the (N−n) independent rows of I − T(T^T T)^{-1} T^T.
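As an illustration of this construction, the sketch below (Python with NumPy/SciPy; the system matrices are illustrative) stacks the per-sensor matrices C_j(n_j), forms T and computes a basis Ω of its left null space — numerically via an orthonormal null-space routine, which is equivalent to taking independent rows of the projection formula above.

import numpy as np
from scipy.linalg import null_space

def sensor_stack(A, c_j, n_j):
    """C_j(n_j) = [c_j; c_j A; ... ; c_j A^{n_j}] for one sensor row c_j."""
    rows, r = [], np.atleast_2d(c_j).astype(float)
    for _ in range(n_j + 1):
        rows.append(r.copy())
        r = r @ A
    return np.vstack(rows)

def parity_matrix(A, C, orders):
    """Stack the per-sensor matrices into T and return a basis Omega of the
    left null space of T (rows omega with omega T = 0), equivalent to the
    independent rows of I - T (T^T T)^{-1} T^T."""
    T = np.vstack([sensor_stack(A, C[j], n) for j, n in enumerate(orders)])
    return null_space(T.T).T                  # (N - n) x N

if __name__ == "__main__":
    A = np.array([[0.9, 0.1], [0.0, 0.8]])
    C = np.array([[1.0, 0.0], [1.0, 1.0]])    # two sensors
    Omega = parity_matrix(A, C, orders=[2, 2])
    T = np.vstack([sensor_stack(A, C[j], 2) for j in (0, 1)])
    print(Omega.shape, np.abs(Omega @ T).max())   # N - n parity rows, Omega T ~ 0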
When the actuator inputs are not zero, (2.94) must be modified to take this effect into account. In this case,

p(k) = Ω [ Y(k) − B(n) U(k, n_o) ]     (2.95)

where B(n) is obtained by stacking the matrices B_j(n_j),

B_j(n_j) = [ 0        0      ⋯  0
             c_jB     0      ⋯  0
             c_jAB    c_jB   ⋯  0
             ⋮                   ⋮ ]

B = [b_1 ... b_r],   u(k) = [u_1(k) ... u_r(k)]^T,   U(k, n_o) = [u^T(k) ... u^T(k+n_o)]^T

and B_j(n_j) is an (n_j+1)-row matrix.
The quantity p(k) is the generalized parity vector, which is nonzero (or of nonzero mean if noise is present) only if a failure is present. The (N−n)-dimensional space of all such vectors is called the generalized parity space. Under the no-fail situation (p(k)=0), (2.95) characterizes all the analytical redundancies for the system (2.89)-(2.90) because it specifies all the possible relationships amongst the actuator inputs and sensor outputs. Any linear combination of the rows of (2.95),

Σ_{i=1}^m ω_i [Y_i(k, n_i) − B_i(n_i) U(k, n_o)] = 0     (2.96)

is called a parity equation or a parity relation; any linear combination of the RHS of (2.95) is called a parity function.
To provide insight into the nature of parity relations consider two simple examples.
Example 2.1: A single sensor. It is always possible to find a nonzero ω_j such that

(2.97)

or,

(2.98)

where a_t^j; t = 0, ..., n_j−1 is an r-dimensional row vector and ω_t^j; t = 0, ..., n_j−1 is the (t+1)st component of ω_j. Equation (2.98) represents a reduced-order ARMA model for the jth sensor alone. That is to say, the output of sensor j can be predicted from its past outputs and past actuator inputs according to (2.98). Based on the ARMA model, several methods of residual generation are possible.
Example 2.2: Temporal redundancy between two sensors. A temporal redundancy exists between sensor i and sensor j if there are,

ω^i = [ω_0^i ... ω_{n_i−1}^i  0],   ω^j = [ω_0^j ... ω_{n_j−1}^j  0]

satisfying the redundancy relation,
Example 2.3. To illustrate the residual generation procedure, consider a simple second-order system with the following parameters:
The nature of failure signatures contained in the parity vector depends heavily on the choice of D. Clearly, D should be chosen so that failure signatures are easily recognizable.
A possible approach which uses information about how failures affect the residuals has been examined by Potter and Suman (1977), and Daley et al. (1979). This method exploits the following phenomenon. A faulty sensor, say the jth one, contains an error signal λ(k) in its output,

y_j(k) = c_j x(k) + v(k) + λ(k)     (2.104)

The effect of this failure on the parity vector defined by (2.102) is,

p(k) ≅ D_j λ(k)

where D_j is the jth column of D. That is, no matter what λ(k) is, the effect of a sensor j failure on the residual always lies in the direction D_j. Thus, a sensor j failure can be identified by recognizing a residual bias in the D_j direction. D_j is referred to as the failure direction in parity space (FDPS) corresponding to sensor j.
It is now clear that D should be chosen to have distinct columns, so that a sensor failure can be inferred from the presence of a residual bias in its corresponding FDPS. In principle, a D with as few as two rows but m distinct columns is sufficient for detecting and identifying a failure among the m sensors. In practice, however, increasing the row dimension of D can help to separate the various FDPS's and increase the distinguishability of the different failures under noise conditions.
Now, consider the extension of this detection method to temporal redundancy relations. In this case, it is generally not possible to find a D to confine the effect of each component failure to a fixed direction in parity space. To see this, consider the parity relations (2.101). The parity vector can then be written in terms of y_1(k), y_1(k−1) and the past input, with coefficients involving the parameters a_11 and a_22 of the example (eq. 2.105).
Unless λ(k) is a constant, the effect (signature) of a sensor 2 failure is only confined to a two-dimensional subspace of the parity space. In fact, when temporal redundancy is used in the parity function method for residual generation, failure signatures are generally constrained to multidimensional subspaces of the parity space. These subspaces may in general overlap with one another, or some may be contained in others. If no such subspace is contained in another, identification of the failure is still possible by determining which subspace the residual bias lies in.
The theory developed in the preceding sections works well as long as the adopted system models represent adequately the monitored physical system and no noise or unexpected disturbances are present. These requirements are rather stringent, however, and are met in few real systems. As a consequence, following initial efforts in devising FDI algorithms for idealised situations, current research focuses on robust algorithms. There are a number of different, but somewhat overlapping, approaches to this problem, and in the following sections they will be briefly described.
where W is a p×m weighting matrix. Substituting (2.107), (2.108) and (2.109) into (2.110) yields,

e(k) = WCε(k) + Wξ(k) = Hε(k) + Wξ(k)     (2.111)

where,

H = WC     (2.112)

From equations (2.109) and (2.111), the complete response of the residual vector in shift operator form is,

e(z) = [W − WC(zI − A_c)^{-1}L] f_s(z) + WC(zI − A_c)^{-1}Q ξ(z) + WC(zI − A_c)^{-1}E d(z)     (2.113)

One can see that the residual is not zero, even if no faults occur in the system. Indeed, it can be difficult to distinguish the effects of faults from the effects of disturbances acting on the system. The effects of disturbances obscure the performance of fault detection and act as a source of false alarms. Therefore, in order to minimize the false alarm rate, one should design the residual generator such that the residual itself becomes decoupled with respect to disturbances. This is essentially the principle of a robust residual generator.
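A minimal numerical sketch of this decoupling idea is given below (Python with NumPy/SciPy; the matrices are illustrative, not from the text): the weighting W is chosen so that the disturbance entry point CE is annihilated (WCE = 0), which anticipates the condition used in the design steps that follow.

import numpy as np
from scipy.linalg import null_space

def decoupling_weight(C, E):
    """Return a weighting matrix W whose rows span the left null space of CE,
    so that W C E = 0; a solution exists when rank(CE) < m (number of outputs)."""
    W = null_space((C @ E).T).T
    if W.size == 0:
        raise ValueError("rank(CE) = m: exact disturbance decoupling impossible")
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    C = rng.standard_normal((4, 5))          # m = 4 outputs, n = 5 states
    E = rng.standard_normal((5, 2))          # q = 2 disturbance directions
    W = decoupling_weight(C, E)
    print(W.shape, np.abs(W @ C @ E).max())  # rows of W, W C E ~ 0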
In order that the residual e(k) be independent of uncertainties, it is necessary to null the entries in the transfer function matrix between residuals and disturbances, i.e. WC(zI − A_c)^{-1}E = 0, which, using the resolvent expansion, can be written as,

[a_1(z)I_p  a_2(z)I_p  ⋯  a_n(z)I_p] [ H
                                       HA_c
                                       ⋮
                                       HA_c^{n−1} ] E = 0
1. Compute the weighting matrix W to satisfy equation (2.118): the necessary and sufficient condition for this is rank(CE) < m.
2. Assign the left eigenvectors of the observer as the rows of H (corresponding to suitable eigenvalues).
Step (2) of the algorithm can be done by a transformation of the dual control problem. Obtaining the right eigenvectors of the dual control problem is equivalent to computing the left eigenvectors of the observer. The assignment of right eigenvectors for a controller is well developed (Moore, 1976). The assignability condition is that for each p_i, the corresponding left eigenvector l_i of A_c must belong to the row subspace spanned by the rows of C(p_iI − A)^{-1}.
If the left eigenvector assignability condition is not satisfied, a similar approach can be followed, that is to assign the right eigenvectors of the observer as columns of matrix E. In this case the corresponding conditions are:
Theorem 2.2 If WCE = 0 and all columns of E are right eigenvectors of A_c corresponding to any eigenvalues, equation (2.118) is satisfied.
The assignment of right eigenvectors of the observer (left eigenvectors of the dual controller) is a relatively new problem. Patton and Chen (1991) derived the following conditions:
Theorem 2.3 For a vector r_i to be a right eigenvector of A − KC corresponding to the eigenvalue p_i, either:
• r_i is a right eigenvector of A corresponding to p_i and Cr_i = 0, or
• r_i is not a right eigenvector of A corresponding to p_i and Cr_i ≠ 0.
If a number of right eigenvectors must be assigned, the gain matrix L must satisfy a set of equations like,

L C r_i = (A − p_iI) r_i

If all columns e_i of E must be assigned as the right eigenvectors of A_c = A − LC corresponding to eigenvalues p_i, the following equation must be satisfied:

L C e_i = (A − p_iI) e_i;  i = 1, ..., q     (2.119)

i.e.

L C E = A_p     (2.120)

where,
Therefore the right eigenvector assignment problem is to solve (2.120) and at the same time ensure that the observer is stable.
Theorem 2.4 The necessary and sufficient condition for a solution of equation (2.120) to exist is:
(2.125)

where,

(2.126)

This means that the state vector x(k) is separated into the measurable part y(k) and the unmeasurable part x*(k), which has to be estimated by the observer. From Eqs. (2.122), (2.123) and (2.125) it is obtained,

Mx*(k+1) − A_oMx*(k) = B_ou(k) + Ed(k) − C_Ry(k+1) + A_oC_Ry(k)     (2.127)

Multiplying (2.127) from the left with a suitable regular matrix yields,

NMx*(k+1) − NA_oMx*(k) = NB_ou(k) − NC_Ry(k+1) + NA_oC_Ry(k)     (2.129)

On the left hand side of (2.129) there is an expression with the unknown x*(k). All elements on the right hand side are known or measurable. By substituting,

(2.131)

which is a system of difference equations that has to be solved. Using the shift operator z, (2.131) can be rewritten as,

(zNM − NA_oM) x*(k) = u*(z)     (2.132)
H(z) = diag{ 0_{μ_0,ε_0}; L_{ε_1}(z), ..., L_{ε_s}(z); zI_{β_1} − J_u; zI_{ρ_s} − J_s; zJ_0 − I; L_{μ_1}^T(z), ..., L_{μ_p}^T(z) }     (2.134)
The ε_i; i = 0, ..., s, are the column indices and the μ_i; i = 0, ..., p, are the row indices. The expression 0_{μ_0,ε_0} corresponds to zero rows or columns. The matrix L_{ε_i}(z) is of dimension ε_i × (ε_i+1),

L_{ε_i}(z) = [ z  −1   0  ⋯   0
               0   z  −1  ⋯   0
               ⋮              ⋮
               0   0   ⋯   z  −1 ]

and the corresponding matrix for the row indices is,

L_{μ_i}^T(z) = [ z   0  ⋯   0
                −1   z  ⋯   0
                 0  −1  ⋯   0
                 ⋮          ⋮
                 0   0  ⋯  −1 ]     (2.136)

The matrix J_u is a β_1-dimensional matrix that has only unstable eigenvalues. J_s represents a ρ_s-dimensional Jordan matrix with stable eigenvalues only. J_0 is a Jordan matrix with all eigenvalues identical to zero. Consider now that part x_0*(k) of the state vector that corresponds to the J_0 block; this is determined by the difference equations,

−x_{02}*(k+1) + x_{01}*(k) = u_{01}*(k)
−x_{03}*(k+1) + x_{02}*(k) = u_{02}*(k)
⋮                                                      (2.137)
−x_{0ζ}*(k+1) + x_{0,ζ−1}*(k) = u_{0,ζ−1}*(k)
x_{0ζ}*(k) = u_{0ζ}*(k)

which are directly derived with the aid of Eqs. (2.133) and (2.134). It is easily seen that all components x_{0i}* are completely determined by the known signals u_{0j}*; x_{0i}* is then calculated with a maximum delay of ζ time shifts, where ζ is the dimension of the J_0 matrix.
Next, define,

ε = ε_0 + Σ_{i=1}^s ε_i + s     (2.138)

(2.139)

and partition the appropriately chosen matrix M from Eq. (2.125) into the matrices

(2.140)

where M_ε contains the first ε columns, M_u the following β_1 columns, M_s the next ρ_s, M_0 the next ζ, and M_μ the last μ columns of M.
Therefore, a linear combination of the state variables,

z(k) = Tx(k)     (2.141)

can be reconstructed,
i) without delay but with free choice of the eigenvalues of the estimation error dynamics matrix if,

(2.142)

ii) without delay and without free choice of all eigenvalues of the estimation error dynamics matrix if,

(2.143)

iii) with delay of a finite number of samples and with free choice of the dynamics of the estimation error dynamics matrix if,

(2.144)

The above constitute conditions for the existence, structure and eigenvalues of the resulting observers, as well as the basis for all possible matrices R. Numerically stable algorithms for the computation of an upper triangular form that contains all information of the Kronecker canonical form are available in Konik and Engell (1986).
Next, assume that the observer used is expressed by,

z(k+1) = Rz(k) + Sy(k) + Ju(k)

with the residual,

r(k) = L_1z(k) + L_2y(k)

This observer must fulfil the following robustness requirements:
i) lim_{k→∞} r = 0 for all u and d and for all initial conditions x_0 and z_0.
ii) A matrix T must exist, such that Tx_0 = z_0 implies Tx_k = z_k for all k.
These conditions lead to the well-known observer equations:

TA_o − RT = SC
TE = 0
J = TB     (2.145)
L_1T + L_2C = 0
Now, if (2.122), (2.123) are enriched to include failure modes, the system is given by,

x(k+1) = A_ox(k) + B_ou(k) + Ed(k) + Kf(k)
y(k) = Cx(k) + Fd(k) + Gf(k)     (2.146)

where f(k) denotes the fault vector. For this model, the set of equations that must be fulfilled in order to achieve disturbance and fault decoupling are Eqs. (2.145) and,

SF = 0
SG ≠ 0
TK ≠ 0     (2.147)
L_2F = 0
L_2G = 0

These equations can be solved with the Kronecker canonical form, outlined previously. Conditions for existence of solutions can be found in Patton et al. (1989b).
For a large scale system, the disturbance distribution matrix E can be computed directly from the neglected coupling and modelling-error terms (eq. 2.149).
Now, consider the situation where the system matrices are functions of a parameter vector a ∈ R^g:

ẋ(t) = A(a)x(t) + B(a)u(t)     (2.150)

If the parameter can be perturbed around a nominal condition a = a_0, (2.150) can be expanded about a_0 and the resulting disturbance distribution matrix is

E = [ ∂A/∂a_1 ⋮ ∂B/∂a_1 ⋮ ... ⋮ ∂A/∂a_g ⋮ ∂B/∂a_g ]     (2.152)

(2.153)

Here ‖·‖_F denotes the Frobenius norm, defined as the root of the sum of squares of the entries of the associated matrix. This optimization problem can be solved by the Singular Value Decomposition (SVD) of E:

(2.155)

where S and T are orthogonal matrices and σ_1 ≤ σ_2 ≤ ... ≤ σ_n are the singular values of E. As shown in Lou et al. (1986), the matrix E* that minimizes (2.154) is obtained from this decomposition.
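A minimal numerical sketch of this SVD route is given below (Python with NumPy/SciPy). The rule of zeroing the smallest singular values to obtain E* is an assumption made here for illustration, not the explicit expression of Lou et al.; all matrix sizes and data are placeholders. Once E* has a left null space, a matrix H with HE* = 0 can be computed as before.

import numpy as np
from scipy.linalg import null_space

def approximate_decoupling(E, drop=1):
    """Zero the `drop` smallest singular values of E to obtain a reduced-rank
    approximation E*, then return H whose rows span the left null space of E*."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    s_star = s.copy()
    s_star[-drop:] = 0.0                     # discard the weakest directions
    E_star = U @ np.diag(s_star) @ Vt
    H = null_space(E_star.T).T               # H E* = 0 exactly
    return E_star, H

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    E = rng.standard_normal((5, 19))         # full row rank: exact HE = 0 impossible
    E_star, H = approximate_decoupling(E, drop=1)
    print(np.abs(H @ E_star).max(), np.abs(H @ E).max())   # exact zero vs. small leakage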
When a single FDI design has to cover a number of operating points, its success depends on its robustness properties. In order to make the disturbance decoupling hold for all operating points, one must make:

HE_i = 0,  for i = 1, 2, ..., M     (2.157)

or,

(2.158)

If rank(P) ≤ n − 1, Eq. (2.158) has solutions and exact decoupling at all operating points is achievable. Otherwise, approximate decoupling must be used. This is equivalent to the solution of Eq. (2.154) and can be solved by defining an optimization problem:
Now, consider the case when the full-order system model is not available. A possible approach would be to obtain the nominal model {A_o, B_o, C_o, D_o} via identification, with the estimation error {ΔA, ΔB, ΔC, ΔD}. Normally, ΔA and ΔB are unknown but bounded:

A_1 ≤ ΔA ≤ A_2     (2.160)
B_1 ≤ ΔB ≤ B_2     (2.161)

where A_1, A_2, B_1 and B_2 are known and ΔA ≤ A_2 denotes that each element of ΔA is not larger than the corresponding element of A_2. This typifies an unstructured but bounded uncertainty. Consider ΔA and ΔB in a finite set of possibilities, say {ΔA_i, ΔB_i}; i = 1, 2, ..., M, within the intervals A_1 ≤ ΔA ≤ A_2 and B_1 ≤ ΔB ≤ B_2. This might involve choosing representative points, reflecting desired weighting on the likelihood or importance of particular sets of parameters. In this situation, a set of unknown input distribution matrices is obtained:

(2.162)

In order to make the disturbance decoupling valid for a wide range of model parameter variations, an optimal matrix E* should be chosen as close as possible to all E_i; i = 1, 2, ..., M. The optimization problem is thus defined as:
In most cases, not enough knowledge about the state space model of the system is available. What is usually at hand is the linearized low order model matrices (A, B, C, D). In order to account for modeling errors, the system is assumed to be in the form:

ẋ(t) = Ax(t) + Bu(t) + d_1(t)     (2.164)

where d_1(t) represents modeling errors. If d_1(t) can be obtained, it may be decomposed into Ed(t), with E a structured matrix, so as to apply the disturbance decoupling concept. Firstly, assume that d_1(t) is slowly time-varying, so that the system model can be rewritten in augmented form as:

(2.166)

(2.167)

Using the true system input and output data, an observer based on eqs. (2.166) and (2.167) can be used to estimate d_1(t). Then it is possible to obtain some information about the distribution matrix E. Further details of this method can be found in Patton and Chen (1992). Patton et al. (1992) further explored this idea and proposed the deconvolution method to estimate the vector d_1(t).
As far as the UIO method is concerned, appropriate optimal approximate solutions have been proposed by Wünnenberg and Frank (1987). In this approach, the "best" residual is found by solving the following minimisation problem:

ρ = min_w  ‖w^T V_0 H_2‖ / ‖w^T V_0 H_3‖     (2.168)

where,

H_2 = [ F           0           0          ⋯  0
        CE          F           0          ⋯  0
        CAE         CE          F          ⋯  0
        ⋮                                      ⋮
        CA^{s−1}E   CA^{s−2}E   CA^{s−3}E  ⋯  F ]

H_3 = [ G           0           0          ⋯  0
        CK          G           0          ⋯  0
        CAK         CK          G          ⋯  0
        ⋮                                      ⋮
        CA^{s−1}K   CA^{s−2}K   CA^{s−3}K  ⋯  G ]
and V_0 is used to ensure that,

V_0 [ C
      CA
      ⋮
      CA^s ] = 0

The optimal w then satisfies,

w^T (V_0 H_2 H_2^T V_0^T − ρ V_0 H_3 H_3^T V_0^T) = 0

This is a generalised eigenvalue-eigenvector problem. The minimal eigenvalue is the optimal performance index, while the corresponding eigenvector gives the optimal residual generator v. With this the optimal residual sequence is calculated by,
r(k) = v^T ( [ y(k−s)
               y(k−s+1)
               ⋮
               y(k) ]  −  H_1 [ u(k−s)
                                u(k−s+1)
                                ⋮
                                u(k) ] )

where,

H_1 = [ 0      0    0  ⋯  0
        CB     0    0  ⋯  0
        CAB    CB   0  ⋯  0
        ⋮                  ⋮ ]

These conditions have been obtained in the time domain. Ding and Frank (1989) have produced corresponding results in the frequency domain.
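The generalized eigenvalue problem above can be solved numerically as in the following sketch (Python with NumPy/SciPy). The matrices V_0, H_2 and H_3 are assumed to have already been formed as described; the dimensions and data used here are placeholders, and the mapping v = V_0^T w assumes that v is parameterized through V_0 as in the text.

import numpy as np
from scipy.linalg import eigh

def optimal_residual_direction(V0, H2, H3):
    """Solve w^T (V0 H2 H2^T V0^T - rho V0 H3 H3^T V0^T) = 0 for the smallest
    generalized eigenvalue rho; the associated eigenvector gives the optimal
    residual generator direction v = V0^T w."""
    M2 = V0 @ H2 @ H2.T @ V0.T
    M3 = V0 @ H3 @ H3.T @ V0.T
    rho, W = eigh(M2, M3)                    # eigenvalues in ascending order
    return rho[0], V0.T @ W[:, 0]

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    V0 = rng.standard_normal((3, 8))         # placeholders for the actual V0, H2, H3
    H2 = rng.standard_normal((8, 6))
    H3 = rng.standard_normal((8, 6))
    rho, v = optimal_residual_direction(V0, H2, H3)
    print("optimal performance index:", rho)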
In this section an approach for obtaining robust parity relations in the face of noise and parameter uncertainty will be given. The exposition follows Chow and Willsky (1984). The starting point is a model that has the same form as (2.89) but includes noise disturbance and parameter uncertainty:

x(k+1) = A(γ)x(k) + Σ_{j=1}^q b_j(γ)u_j(k) + w(k)     (2.169)

(2.170)

where γ is the vector of uncertain parameters, taking values in a specified subset Γ of R^M. This form allows the modeling of elements in the system matrices as uncertain quantities that may be functions of a common quantity. The vectors w and v are independent, zero-mean, white Gaussian noise vectors with constant covariance matrices Q(≥0) and R(>0), respectively.
The components of v̄(k) and Ū(k), and the rows of C̄(γ), Φ̄(γ) and B̄, are determined from (2.170) and the structure of Y(k). If, specifically, the ith component of Y(k) is y_s(k−σ), then the ith component of v̄(k) is,

v̄_i(k) = v_s(k−σ)

The vectors w̄ and v̄ are independent zero-mean Gaussian random sequences with constant covariances Q̄ and R̄ respectively. The matrix Q̄ is block diagonal with Q on the diagonal; R̄_{i,j} = R_{s,t}Δ_{σ,τ}, where the ith element of Y(k) is y_s(k−σ) and the jth element is y_t(k−τ). The ith row of C̄(γ), i.e., C̄(i,γ), is,

C̄(i,γ) = c_s A^{p−σ}

The ith row, Φ̄(i,γ), of Φ̄(γ) (which has pN columns) is,

Φ̄(i,γ) = [c_s A^{p−σ−1}, c_s A^{p−σ−2}, ..., c_s, 0, ..., 0]

Note that x(k−p) is a random vector that is uncorrelated with w̄ and v̄, and

E{x(k−p)} = x_0(k−p)
cov{x(k−p)} = Σ(γ)
Since this has a trivial solution (α = 0, β = 0), this optimization problem has to be modified in order to give a meaningful solution. Recall that a parity equation primarily relates the sensor outputs, i.e., a parity equation always includes output terms but not necessarily input terms. Therefore, α must be nonzero. Without loss of generality, α can be restricted to have unit magnitude. The actuator input terms in a parity relation may be regarded as serving to make the parity function zero, so that β is nominally free. In fact, β has only a single degree of freedom. Any β can be written as β = λU^T(k) + z^T, where z is a (column) vector orthogonal to U(k). The component z^T in β will not produce any effect on p(k). This implies that for each U(k), only β of the form β = λU^T(k) has to be considered, leading to the following problem:

where,

S = [ S_11  S_12
      S_21  S_22 ]

The normalized parity error ε̄*, the normalized parity coefficients, and the normalized parity function p̄*(k) are defined as follows:

ε̄* = ε*/σ
ᾱ* = α*/σ
β̄* = β*/σ
p̄*(k) = ᾱ* Y(k) − β̄* U(k)

where,

σ² = [α*, β*][α*, β*]^T = 1 + β* β*^T

The parity functions with the smallest normalized parity errors are preferred, as they are closer to being true parity functions under noise and model uncertainty, i.e., they are least sensitive to these adverse effects.
An additional consideration required for choosing parity functions for residual generation is that the chosen parity functions should provide the largest failure signatures in the residuals relative to the inherent parity errors resulting from noise and parameter uncertainty. A useful index for comparing parity functions for this purpose is the signature-to-parity error ratio π, which is the ratio between the magnitudes of the failure signature and the parity error. Using g to denote the effect of a failure on the parity function, π can be defined as,
For the detection and identification of a particular failure, the parity function that produces the largest π should be used for residual generation.
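As a simple illustration of this selection rule, the following sketch (Python with NumPy; the fault-free data, candidate parity rows and failure signatures are hypothetical) estimates the inherent parity error of each candidate from nominal data and ranks the candidates by π.

import numpy as np

def signature_to_error_ratio(parity_rows, Y_nominal, signatures):
    """pi_i = ||g_i|| / std(p_i): g_i is the (hypothesized) effect of the failure
    on parity function i, and p_i(k) = omega_i Y(k) is its value on fault-free
    data, i.e. the inherent parity error."""
    ratios = []
    for omega, g in zip(parity_rows, signatures):
        p = Y_nominal @ omega                 # parity error sequence on nominal data
        ratios.append(np.linalg.norm(g) / (p.std() + 1e-12))
    return np.array(ratios)

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    Y_nom = rng.standard_normal((500, 6)) * np.array([1, 1, 1, 1, 5, 5])  # noisy channels
    omegas = [np.array([1, -1, 0, 0, 0, 0.0]), np.array([0, 0, 0, 0, 1, -1.0])]
    gs = [np.array([1.0]), np.array([1.0])]   # same hypothesized signature magnitude
    pi = signature_to_error_ratio(omegas, Y_nom, gs)
    print(pi, "-> prefer parity function", int(np.argmax(pi)))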
2.8 Applications
The correct operation of a gas turbine is very critical for an aircraft and, if faults occur, the consequences can be extremely serious. There is therefore a great need for simple and yet highly reliable methods for detecting and isolating faults in the jet engine. Patton et al. (1992) presented an example for the detection of jet engine sensor faults using the procedure described in Section 2.7.1. The jet engine model used, illustrated in fig. 2.9, has the measurement variables N_L, N_H, T7, P6, T29 (the N variables denote compressor shaft speeds, the P variables denote pressures, whilst T represents temperature). The control inputs are the main engine fuel flow rate and the exhaust nozzle area.
A thermodynamic simulation model of a jet engine is utilised as a test rig to assess the robustness of the FDI scheme. This model has 17 state variables; these include pressures, air and gas mass flow rates, shaft speeds, absolute temperatures and static pressure. The linearized 17th order model is used here to simulate the jet engine system. The nominal operating point is set at 70% of the demanded high spool speed (N_H). For practical reasons and convenience of design, a 5th order model is used to approximate the 17th order model. The model reduction and other errors are represented by the disturbance term Ed(t) of Eq. (2.106a). The 5th order model matrices are:

A = [  -78   294   -22    21   -29
         7   -28     2    -2     3
     -1325  5326  -526   221  -477
      1081 -4445   377  -463   403
      2152 -8639   781  -575   782 ]

B = [ -0.0072   0.0030
       0.0035   0.0003
       1.2185  -0.0329
       1.3225   0.0201
      -0.0823   0.0244 ],   C = I_{5×5},   D = 0_{5×2}
As shown in Section 2.7.1, a necessary step for the robust residual generation design procedure is to find a matrix H satisfying Eq. (2.118) (i.e. HE = 0). The matrix E models structured uncertainty arising from the application of the 5th order observer to the 17th order plant, and is given by:

E = [E_1 : E_2 : E_3 : E_4]  (×10^3)

where numerical values for the E_i's are defined in Patton et al. (1992). From these values, rank(E) = 5 = n, and hence Eq. (2.118) has no solution. The singular values of E are {1.5, 5, 60, 198, 11268}, and the matrices S and T are omitted for brevity. The optimal low rank approximation of the distribution matrix E is,

E* = S [ diag(0, 5, 60, 198, 11268)   0_{5×14} ] T.

Based on this matrix, an observer-based robust residual generator can be designed. The observer design is simplified by choosing all eigenvalues at −100. In this case, the gain matrix K = −(100 I_{5×5} + A), as C is an identity matrix.
In fig. 2.10, the output estimation error norm is shown. This is very large and cannot be used to detect the fault reliably. This represents the non-robust design situation. Fig. 2.11 shows the fault-free residual. Compared with the output estimation error, the residual is very small, i.e., disturbance decoupling is achieved. This robust design can be used to detect incipient faults. In order to evaluate the power of the robust FDI design, a small fault is added to the exhaust gas temperature (T7); this simulates the effect of an incipient fault, the effect of which is too small to be noticed in the measurements. Fig. 2.12 shows the faulty output of the temperature measurement (T7) and the corresponding residual. The fault is very small compared with the output and, consequently, is not detectable in the measurement. It can be seen that the residual has a very significant increase when a fault has occurred in the system. A threshold can easily be placed on the residual signal to declare the occurrence of faults. A fault signal is now added to the pressure measurement signal for P6. The result is shown in fig. 2.13, which also demonstrates the efficiency of the robust residual in the role of robust FDI.
Figure 2.10 Norm of the output estimation error.
Figure 2.11 Absolute value of the fault-free residual.
Figure 2.12 Faulty output and residual in the case of a fault in T7.
Figure 2.13 Faulty output of the pressure measurement P6 and corresponding residual.
Among other things, the term factory of the future includes driverless transportation systems within the framework of computer-aided logistics. The transportation vehicles are mostly inductively guided along a defined path. The principle of electronic track-guidance is applied in airports, in container terminals at railway stations or harbours, within the service tunnel transportation system of the euro-tunnel and within modern public short-haul traffic systems. An example of this class of track-bounded transportation systems with automatic track-guidance is the standard city bus O-305 of Mercedes-Benz.
The supervision of measuring instruments in such vehicles is of utmost importance, since they are self-driven. Thus, automatic sensor failure detection techniques have a potential field of application in this area. In a recent paper, van Schrick (1993) presents such an example. The main points of this work follow.
The bus follows a nominal track marked by the electro-magnetic field of a cable running just under the road surface. The alternating current flowing through the cable generates an electromagnetic field that induces a voltage in the measuring instrument located concentrically in front of the bus. The induced voltage is a measure of the deviation of track, d(t), used as the only controller input. The digital controller calculates a steering signal that, with the aid of an active steering system, acts directly on the front wheels to minimize the distance between the bus and the nominal track.
Additionally, a second measuring instrument was introduced to enhance the riding comfort of the bus by disturbance rejection control (van Schrick, 1991). This instrument gives information on the directly measured steering angle, β(t), of the front wheels. Both measuring instruments, the one for the deviation of track and the one for the steering angle, have to be supervised. This is due to the very high safety requirements on such transportation systems.
A linearised model of fifth order in sensor coordinates for the lateral motion of the city
bus is given as follows:

ẋ(t) = A(p)x(t) + bu(t) + E(p)d(t)
y(t) = c^T x(t)

where the states x1(t) to x5(t) are the displacement d(t) between sensor and nominal
track, its velocity ḋ(t), the yaw angle rate ψ̇(t), the side slip angle α(t) and the steering
angle β(t). The control input u(t) is the steering angle rate β̇(t), and the controlled vari-
able y(t) = x1(t) is the measured track deviation d(t). Additionally, the disturbance vec-
tor d(t), consisting of the bending K(t), the side wind moment M(t) and the side wind force F(t),
acts on the system. For the investigations described in the following, only the bending K is
considered. If necessary, the effects of the disturbances M(t) and F(t) can be treated in the
same manner. The parameter vector p = [m v] contains the relative mass m, correspond-
ing to the friction coefficient μ, ranging from 9950 kg to 27000 kg, and the velocity v,
ranging from 0.6 m/s to 14 m/s. The input vector b and the output vector c^T are
b = e_5 and c^T = e_1^T. The system matrix A(p) and the disturbance input vector g_K(p)
of the disturbance input matrix E(p) are:
A(p) = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & a_{23} & a_{24} & a_{25} \\ 0 & 0 & a_{33} & a_{34} & a_{35} \\ 0 & 0 & a_{43} & a_{44} & a_{45} \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}, \qquad g_K(p) = \begin{bmatrix} 0 \\ g_{21} \\ 0 \\ 0 \\ 0 \end{bmatrix}
where the elements aij and g21 depend on the parameter vector p (cf. Darenberg (1987),
for their evaluation).
The 5th order dynamic controller used for track control can be written in state-space
form (the detailed equations are given in van Schrick, 1993). For residual evaluation a
scalar decision function is formed by weighting the residual vector r(t), where

w^T = [w_d  w_β]

is a weighting vector and r(t) the residual vector. In spite of the parameter variations Δp and the unknown input K(t), the
fact that the decision function should be minimal in the fault-free case and maximal in the
case of sensor faults simplifies the observer design. It is not required to minimise the
estimation errors but to minimise the decision function. For this reason, the optimisation
of the weighting vector w^T is included in the design procedure.
Figure 2.14 Overall structure of the IFO scheme: city bus O 305 with the 5th order dynamic track controller, a residual generator (5th order observer) and a residual evaluator (threshold logic).
Figure 2.15 j(t), no-fault case (without and with DRC).
Figure 2.16 j(t), 5% d-sensor fault (without and with DRC).
The ADIA algorithm consists of four elements: (1) hard sensor failure detection and isolation
logic; (2) soft sensor failure detection and isolation logic; (3) an accommodation filter;
and (4) the interface switch matrix.
Figure 2.17 Structure of the ADIA algorithm applied to the F100 engine system (engine protection, hard detection/isolation logic, signal paths and reconfiguration information).
In the normal or unfailed mode of operation, the accommodation filter uses the full set of
engine measurements to generate a set of optimal estimates, ŷ(t), of the measurements.
These estimates are used by the control law. When a sensor failure occurs, the detection
logic determines that a failure has occurred. The isolation logic then determines which
sensor is faulty. This structural information is passed to the accommodation filter. The
accommodation filter then removes the faulty measurement from further consideration.
The accommodation filter, however, continues to generate the full set of optimal esti-
mates of the control. Thus the control mode does not have to be restructured for any
sensor failure. The ADIA algorithm inputs as shown in fig. 2.17, are the sensed engine
output variables Ym(t), the sensed engine environmental variables em(t), and the sensed
engine input variables um(t). The outputs of the algorithm, the estimates ŷ(t) of the
measured engine outputs Ym(t), are used as input to the proportional part of the control.
During normal mode operation, engine measurements are used in the integral control to
ensure accurate steady-state operation. When a sensor failure is accommodated, the
likelihood based upon a weighted sum of squared residuals (WSSR). Assuming Gaussian
sensor noise, each sample of r_i has a certain likelihood L_i,

L_i = p_i(r_i) = k exp(-WSSR_i)
WSSR_i = r_i^T Σ^{-1} r_i
Σ = diag(σ_i²)

where k is a constant and the σ_i are the adjusted standard deviations. These standard de-
viation values scale the residuals to dimensionless quantities that can be summed to form
a WSSR. The WSSR statistic is smoothed to remove gross noise effects by a first order
lag with a time constant of 0.1 s. The log of the ratio of each hypothesis likelihood to the
normal mode likelihood is calculated. If the maximum log-likelihood ratio exceeds the
soft failure detection and isolation threshold, then a failure is detected and isolated and
accommodation occurs. If a sensor failure has occurred in N1, for example, all of the hy-
pothesis filters except the N1 filter will be corrupted by the faulty information. Thus each of the
corresponding likelihoods will be small except for L_1, so the ratio LR_1 will be the maximum
and it will be compared to the threshold to detect the failure.
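The likelihood-ratio logic just described is easy to prototype numerically. The following sketch is not part of the original algorithm; it only illustrates, under assumed sampling interval and threshold values, how a WSSR, its first-order smoothing and the maximum log-likelihood-ratio test could be coded:

import numpy as np

def wssr(residual, sigma):
    """Weighted sum of squared residuals for one sample: the residual is scaled
    by the adjusted standard deviations sigma and then summed."""
    return float(np.sum((residual / sigma) ** 2))

def smooth(previous, current, dt=0.02, tau=0.1):
    """First-order lag with time constant tau (0.1 s in the text); dt is an
    assumed sampling interval."""
    alpha = dt / (tau + dt)
    return previous + alpha * (current - previous)

def isolate(hypothesis_wssr, normal_wssr, threshold):
    """Log-likelihood ratio of each hypothesis against the normal mode.
    Since L_i = k*exp(-WSSR_i), log(L_i / L_normal) = WSSR_normal - WSSR_i."""
    log_ratios = normal_wssr - np.asarray(hypothesis_wssr)
    best = int(np.argmax(log_ratios))
    if log_ratios[best] > threshold:
        return best          # index of the isolated sensor failure
    return None              # no failure declared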
(Figure: soft sensor failure isolation logic — the log-likelihood ratios LR_i are compared, the maximum is selected and tested against the threshold; outcomes: failure isolated / no failure isolated.)
Initially, the soft failure detection/isolation threshold was determined by standard statisti-
cal analysis of the residuals to set the confidence level of false alarms and missed detec-
tions. The threshold was then modified to account for modeling error. It was soon ap-
parent from initial evaluation studies that transient modeling error was dominant in de-
termining the fixed threshold level. It was also clear that this threshold was too large for de-
sirable steady state operation. Thus, an adaptive threshold was incorporated to make the
algorithm more robust to transient modeling error while maintaining steady-state per-
formance. The adaptive threshold d_i was heuristically determined and consists of two
parts. One part, d_iss, is the steady-state detection/isolation threshold, which accounts for
steady-state, or low frequency, modeling error. The second part, d_EXP, accounts for the
transient, or high frequency, modeling error. The adaptive threshold is triggered by an
internal control system variable m_tran which is indicative of transient operation:

d_i = d_iss (d_EXP + 1)
τ ḋ_EXP + d_EXP = m_tran

The values of d_iss, τ and m_tran were found by experimentation to minimise false alarms
during transients. When the engine experiences a transient, m_tran is set to 4.5; otherwise
it is 0. The threshold time constant is τ = 2 s. The adaptive threshold expansion logic en-
abled d_iss to be reduced to 40% of its original value, which results in an 80% reduction in
the detection/isolation threshold d_i/d_iss. The adaptive threshold logic is illustrated in fig.
2.19 for a power lever angle (PLA) pulse transient.
(Figure 2.19 shows the transient indicator (MTRAN) and the adaptive threshold versus time, 0-25 s.)
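A minimal numerical sketch of the adaptive threshold described above; the steady-state level d_iss, the sampling interval and the transient timing are illustrative assumptions, not values from the original study:

import numpy as np

def adaptive_threshold(m_tran, d_iss=1.0, tau=2.0, dt=0.05):
    """Discretised version of d_i = d_iss*(d_EXP + 1) with
    tau*d_EXP' + d_EXP = m_tran (first-order lag driven by the transient flag)."""
    d_exp = 0.0
    thresholds = []
    for m in m_tran:
        d_exp += dt / tau * (m - d_exp)       # Euler step of the lag
        thresholds.append(d_iss * (d_exp + 1.0))
    return np.array(thresholds)

# m_tran jumps to 4.5 during a transient and is 0 otherwise (values from the text);
# the transient window (5 s to 8 s) is a hypothetical example.
t = np.arange(0.0, 25.0, 0.05)
m = np.where((t > 5.0) & (t < 8.0), 4.5, 0.0)
d_i = adaptive_threshold(m)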
For failure accommodation, two separate steps were taken. First, all seven of the filters
(the accommodation filter and the six hypothesis filters) were reconfigured (the appro-
priate residual in each filter is forced to zero) to account for the detected failure mode.
Second, if a soft failure was detected, the states and estimates of all seven filters were
updated to the values of the hypothesis filter which corresponds to the failed sensor.
Stability of each of the filters after reconfiguration was verified during algorithm evalu-
ation.
The algorithm was implemented in real-time to demonstrate its capabilities with a full
scale engine using hardware and software typical of that to be used in next generation
turbofan engine controls. The Multivariable Control (MVC)-ADIA implementation has
several distinct hardware and software features. Three CPUs were used, operating in
parallel. The schedules used to generate the engine model basepoints and table look-up
routines within the ADIA algorithm were coded in assembly language.
The real-time microcomputer implementation of the combined MVC-ADIA algorithm
performed extremely well. Sensor failure detection and accommodation were demon-
strated at eleven different operating points which included subsonic and supersonic
conditions and medium and high power operation of the engine.
Interest in on-line diagnosis of internal combustion engines has recently increased due to
new environmental regulations in the United States and in some European countries (e.g.
the EFTA partners). These regulations will, for example, require the following compo-
nents/faults to be diagnosed:
• individual fuel injectors
• O2 sensor(s)
• mass air flow sensor
• manifold pressure sensor
• throttle position sensor
• A/F controller
• inlet air temperature sensor
• misfire detection
• vacuum leak in intake manifold
• loss of power in individual cylinder
In order for a diagnostic strategy to be useful for such applications, it should permit real
time implementation in order to allow for continuous monitoring of system performance
on-board the vehicle. Within this context, many of the currently employed service diag-
nostic strategies, for example those based on expert system methods, are not suitable.
Rizzoni et al. (1993) discuss simulation and experimental results of a study aimed at di-
agnosing faults associated with an automotive engine exhaust emissions control system,
using the parity vector approach.
The engine model used was based on the works of Dobner (1982), Coats and Fruechte
(1982), Moskwa and Hedrick (1987), Cho and Hedrick (1989) and Grizzle et al. (1990).
The structure of the model is shown in fig. 2.20. The essential elements of the model are:
intake manifold, rotating dynamics, volumetric efficiency, oxygen (lambda) sensor model,
combustion and exhaust dynamics, and fuelling controller and metering system.
Details of the various subsystems can be found in Rizzoni et al. (1993). The whole sys-
tem depicted in fig. 2.20, after a suitable identification procedure, is put in the form:

x(k + 1) = Ax(k) + Bu(k) + Ed(k)
y(k) = Cx(k) + Du(k) + Δy(k)

where d(k) is a vector of plant or input faults, or unmeasured disturbance inputs, and
Δy(k) is a sensor fault vector. In this case, residuals can simply be defined as follows:
The model of fig. 2.20 was used to generate a set of residual generators, shown in fig.
2.21. Apart from the intake manifold model, which is modeled by a nonlinear static rela-
tion, each of the residual generators was obtained by employing system identification
methods to identify the relevant dynamics. Thus, the range of operation of the residual
generation strategy for these four subsystems is limited to the neighborhood of an op-
erating point.
(Figure 2.21: residual generators, including the intake manifold nonlinear model, the fueling model and the throttle-to-fuel dynamics.)
Another limitation of this design is that, because of the lack of suitable experimental ar-
rangements, the fueling dynamics of the engine have been simulated off-line; the engine
simulator reflects each of the subsections except for the dependence of the fueling time,
τ_f, on the estimated air mass. Thus, the fueling model may be thought of as a perturba-
tion model.
Note that perturbations in the load torque, ΔT_L, appear as an input (or output) in two of
the residual generators. This quantity is shown in parentheses to indicate that this distur-
bance input is not measured; part of the residual generation strategy consists of designing
a decoupling matrix W(z) that can remove the effect of the load torque from the residual
vector r(k). Details of this strategy can be found in Gertler (1991).
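The decoupling idea can be illustrated with a small sketch. It is not Gertler's design; it only shows one generic way to build a constant matrix W that annihilates an assumed load-torque direction in the residual vector, and the numerical values are hypothetical:

import numpy as np

def decoupling_matrix(g):
    """Return a matrix W whose rows span the complement of the direction g, so
    that W @ g = 0 and the unknown disturbance direction g (here: the assumed
    load-torque effect on the residuals) is removed from W @ r."""
    g = np.atleast_2d(g).reshape(-1, 1)
    # Orthonormal basis with g/||g|| as first column via a QR decomposition.
    q, _ = np.linalg.qr(np.hstack([g, np.eye(g.shape[0])]))
    return q[:, 1:].T        # rows orthogonal to g

g = np.array([0.0, 1.3, 0.0, 0.7])       # hypothetical load-torque direction
W = decoupling_matrix(g)
r = np.array([0.01, 0.40, -0.02, 0.22])  # hypothetical primary residual sample
r_decoupled = W @ r                       # insensitive to the load torque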
The residual generation strategy was tested using a mix of experiment and simulation. As
an example, one experiment is discussed, which included a throttle transient correspond-
ing to a change in throttle position of 7.5°, causing a change in engine speed from 3,000 to
2,700 rev/min, as shown in fig. 2.22. The sensor outputs were sampled in the crank angle
domain, at a rate corresponding to one sample every 180° of crankshaft rotation. Various
faults were injected at the 1800th sample. The faults consisted of
10% calibration errors in each of the sensors (throttle position, speed, manifold pres-
sure), and in one of the fuel injectors. Figs 2.23 and 2.24 depict the residual vector cre-
ated by the residual generators of fig. 2.21 for the no-fault case, and for a typical fault
condition of a 10% change in throttle calibration. As was observed, the transient caused
some difficulties in discriminating among faults. However, once the effects of the tran-
sient had subsided, the residuals behaved according to the structure illustrated in Table
2.1.
Figure 2.22 Experimental conditions for residual generation validation.
Figure 2.23 No-fault residuals; throttle (top).
Figure 2.24 10% throttle sensor fault (bottom).
This is of the form of Eqs. (2.146) with A_0 = 0, K = 0, F = 0, G = 0. In order to detect a fault
f_i(t) in the i-th axis of an n-degree-of-freedom robot, applying conditions (2.147) yields
the transformation matrix T:
T = [0_{n×1} ⋮ I_{n×n}]
l* = [0_{n×1} ⋯ 1 ⋯ 0_{n×1}]   (1 in the i-th column)
and
G = [0_{n×1} ⋮ -a_i 1_{n×1}]
L_1 = [0_{n×1} ⋮ -a_i 1_{n×1}]
L_2 = [0_{n×1} ⋮ a_i 1_{n×1}]
where a_i is a positive number determining the observer poles.
A matrix multiplication, f_i(t) = J(q(t)) f_d(t), decouples the torque on each general-
ized axis, representing unmodeled friction torques and any other additional torque dis-
turbances.
Extensive investigations on this robot showed that friction cannot be modeled in a simple
form, e.g. Coulomb and viscous friction. Instead, a displacement dependent friction char-
acteristic had to be introduced (see e.g. Schneider and Frank, 1992, Armstrong, 1988).
Typical friction characteristics of the r3 robot are displayed in fig. 2.25, where friction is
obtained by the observer-based approach while constant velocity movements are per-
formed. The residual, representing a friction torque, is not plotted as a time signal but as
a correlation with the corresponding positional displacement, such that different meas-
urements can easily be compared.
In experiments, external torques to the robot were calculated with the observer-based
method and also measured with a force/torque sensor. An external torque was applied to
the robot's tool center point while the robot performed a normal operation. As seen from
fig. 2.26, the calculated residual matched the measured one within a ±3Nm range. Note
that a dead-band of about 30Nm was needed when a fixed threshold was used.
Moreover, only a small torque was applied in this experiment, degrading the results since
the measurements were corrupted by noise.
Figure 2.25 Friction characteristics of the MANUTEC r3 robot (torque [Nm] versus position [deg]).
Figure 2.26 External torque estimation versus time [s]. Solid line: measured; dotted line: calculated.
References
Dobner D.J. (1982), Dynamic Engine Models for Control Development. Part I:
Nonlinear and Linear Model Formulation. General Motors Research Laboratories
Publication GMR-3783, January.
Dowdle J.R., Willsky A.S. and S.W. Gully (1983). Nonlinear generalised likelihood ra-
tio algorithms for manoeuver detection and estimation. Proceedings, 1982 American
Control Conference, Arlington, Virginia, June 1983.
Frank P.M. (1987). Fault diagnosis in dynamic systems via state estimation - A survey.
In S. Tzafestas, M. Singh, G. Schmidt (Eds.), System fault diagnostics, reliability and
related knowledge-based approaches, D. Reidel, Dordrecht, 35-89.
Frank P.M. (1990). Fault diagnosis in dynamic systems using analytical and knowledge-
based redundancy-A survey and some new results. Automatica, 26, 3, 459-474.
Gantmacher R. (1974). The theory of matrices, Volume II. Chelsea Publishing
Company, N.Y.
Gertler, J.J. (1991). Analytical redundancy methods in failure detection and isolation in
complex plants. Proceedings, IFAC/IMACS Symposium SAFEPROCESS '91, Baden-
Baden, September 10-13, pp. 9-21.
Gertler, J.J., Costin, M., Kowalczuk, Z., Fang, X., Hira, R. and Q. Luo (1991). Model-
based on-board fault detection and diagnosis for automotive engines. Proceedings,
IFAC/IMACS Symposium SAFEPROCESS '91, Baden-Baden, 503-508.
Green C.S. (1978). An analysis of the multiple model adaptive control algorithm. Ph.D.
Thesis, Report no. ESL-TH-843, Electronic Systems Laboratory, M.I.T.
Grizzle J.W., Cook J.A. and K.L. Dobbins (1990). Individual cylinder air to fuel ratio
control with a single EGO sensor. Proceedings, 1990 American Control Conference,
San Diego, CA, May 1990, 2881-2886.
Gustafsson D.E., Willsky A.S., Wang J.-Y., Lancaster M.C. and J.H. Triebwasser
(1978a). ECG/VCG rhythm diagnosis using statistical signal analysis, I: Identification
of persistent rhythms. IEEE Transactions on Biomedical Engineering, BME-25, 4,
344-353.
Gustafsson D.E., Willsky A.S., Wang J.-Y., Lancaster M.C. and J.H. Triebwasser
(1978b). ECG/VCG rhythm diagnosis using statistical signal analysis, II: Identification
of transient rhythms. IEEE Transactions on Biomedical Engineering, BME-25, 4, 353-
361.
Janssen K. and P.M. Frank (1984). Component failure detection via state estimation.
Proceedings, 9th IFAC World Congress, Budapest, Hungary.
Jenkins G.M. and D.G. Watts (1968). Spectral analysis and its applications. Holden-Day,
San Francisco.
Jones H.L. (1973). Failure detection in linear systems. Ph.D. dissertation, Dept. of
Aeronautics and Astronautics, M.I.T.
Karcanias N. and B. Kouvaritakis (1979). The output zeroing problem and its relation-
ship to the invariant zero structure. International Journal of Control, 30, 395-415.
Kasper R., Lückel J., Jäker K.P. and J. Schroer (1990). CACE tool for multi-input,
multi-output systems using a new vector optimisation method. International Journal
of Control, 51, 963-993.
Konik D. and S. Engell (1986). Sequential design of decentralised controllers using
decoupling techniques. Part 2: Numerical aspects and application. Proceedings, 4th
IFAC/IFORS Symposium on Large Scale Systems: Theory and Applications, Zurich,
Switzerland.
Kumamaru K. (1984). Statistical failure diagnosis for dynamical systems. Systems and
Control, 28, 77-86.
Lainiotis D.G. (1971). Joint detection, estimation and system identification. Information
and Control, 19, 75-92.
Lou X.C., Willsky A.S. and G.C. Verghese (1986). Optimal robust redundancy relations
for failure detection in uncertain systems. Automatica, 22, 333-344.
Massoumnia M.-A. (1986). A geometric approach to failure detection and identifica-
tion in linear systems. Ph.D. Thesis, Dept. of Aeronautics and Astronautics, MIT.
Massoumnia M.-A. and W.E. Vander Velde (1988). Generating parity relations for de-
tecting and identifying control system component failures. AIAA Journal of Guidance,
Control and Dynamics, 11, 60-65.
Mehra R.K. and J. Peschon (1971). An innovations approach to fault detection and di-
agnosis in dynamical systems. Automatica, 7, 637-640.
Mironovskii L.A. (1980). Functional diagnosis of dynamic systems. Automation and
Remote Control, 41, 1122-1143.
Moore B.C. (1976). On flexibility offered by state feedback in multivariable systems
beyond closed loop eigenvalue assignment. IEEE Transactions on Automatic Control,
AC-21, 689-692.
Moskwa J.J. and J.K. Hedrick (1987). Automotive engine modeling for real time
control application. Proceedings of the 1987 American Control Conference, Vol. 1, pp.
341-346.
Ono T., Kumamaru T. and K. Kumamaru (1984). Fault diagnosis of sensors using a
gradient method. Transactions, Society of Instrument and Control Engineers, 20, 22-
27.
Patton R.J. and J. Chen (1991). Robust fault detection using eigenstructure assignment:
A tutorial consideration and some new results. Proceedings, 30th IEEE Conference on
Decision and Control, Brighton, U.K., December 11-13, 2242-2247.
Patton R.J. and J. Chen (1992). Robust fault detection of jet engine sensor systems by
using eigenstructure assignment. Proceedings, AIAA Guidance, Navigation and Control
Conference, New Orleans, August.
Patton R.J., Chen J. and H.Y. Zhang (1992). Modelling methods for improving robust-
ness in fault diagnosis of jet engine systems. Proceedings, IEEE 31st Conference on
Decision and Control, Tucson, Arizona, December 1992, 2330-2335.
Patton R.J., Frank P.M. and R.N. Clark, Eds. (1989). Fault diagnosis in dynamic sys-
tems: Theory and applications. Prentice-Hall, Englewood Cliffs, NJ.
Patton R.J. and S.M. Kangethe (1989). Robust fault diagnosis using eigenstructure as-
signment of observers. In Patton R.J., Frank P.M. and R.N. Clark (Eds.), Fault
diagnosis in dynamic systems: Theory and applications, Prentice Hall.
Patton R.J. and S.M. Willcox (1986). Fault diagnosis in dynamic systems using a robust
output zeroing design method. Proceedings, First European Workshop on Failure
Diagnosis, Rhodes, Greece, August 31-September 3.
Patton R.J., Willcox S. and S.J. Winter (1987). A parameter insensitive technique for
aircraft sensor fault analysis. AIAA Journal of Guidance, Control and Dynamics, 10,
359-367.
Patton R.J., Zhang H.Y. and J. Chen (1992b). Modeling of uncertainties for robust fault
diagnosis. Proceedings, IEEE 31st Conference on Decision and Control, Tucson,
Arizona, December 16-18.
Potter J.E. and M.C. Suman (1977). Thresholdless redundancy management with arrays
of skewed instruments. Integrity in electronic flight control systems,
AGARDOGRAPH-224, 15-11 to 15-25.
Pouliezos A. (1980a). Fault monitoring schemes for linear stochastic systems. Ph.D.
Thesis, Brunel University, England.
Pouliezos A. (1980b). An iterative method for calculating the sample serial correlation
coefficient. IEEE Transactions on Automatic Control, AC-25, 834-836.
Pouliezos A. and G. Stavrakakis (1987). Linear estimation in the presence of sudden
system changes: An expert system. In P. Borne, S. Tzafestas (Eds.), Applied modeling
and simulation of technological systems, North-Holland, 41-48.
Pouliezos A. and G. Stavrakakis (1991). A two-stage real-time fault monitoring system.
Proceedings, European Robotics and Intelligent Systems Conference EURISCON '91,
Corfu, Greece, June 23-28.
CHAPTER 3
PARAMETER ESTIMATION METHODS
3.1 Introduction
Fault detection via parameter estimation relies on the principle that possible faults in the
monitored process can be associated with specific parameters and states of a mathemati-
cal model of the process, given in general by an input-output relation,

y(t) = f(u, e, θ, x)   (3.1)

where y(t) represents the vector output of the process, u(t) the vector input, x(t) the
partially measurable state variables, θ the nonmeasurable process parameters likely to
change and e(t) unmodeled or noise terms affecting the process. It is obvious, therefore,
that an accurate theoretical dynamic model of the process is necessary in order
to apply parameter estimation methods. This is usually derived from the basic balance
equations for mass, energy and momentum, the physico-chemical state equations and
the phenomenological laws for any irreversible phenomena. The models will then appear
in the continuous or discrete time domain, in the form of ordinary or partial differential
or difference equations. Their parameters θ_i are expressed in terms of process
coefficients p_j, like storage or resistance quantities, whose changes indicate a process
fault. Hence, the parameters θ_i of continuous time models have to be estimated. In this
case there is a minimum number of independently measurable quantities which permit the
estimation of the various states and parameters. As an example consider a simple dynamic
process model with lumped parameters, linearised about an operating point, which may
be described by the differential equation,

y(t) + a_1 y^(1)(t) + ... + a_n y^(n)(t) = b_0 u(t) + b_1 u^(1)(t) + ... + b_m u^(m)(t)   (3.2)
The process model parameters,
θ^T = [a_1 ... a_n  b_0 ... b_m]   (3.3)
are defined as relationships of several physical process coefficients, e.g. length, mass,
speed, drag coefficient, viscosity, resistances, capacities. Faults which become notice-
able in these physical process constants are therefore also expressed in the process model
parameters. If the physical process coefficients, indicative of process faults, are not di-
rectly measurable, an attempt can be made to detect their changes via the changes in the
process model parameters θ. The following procedure is therefore applicable in general:
(1) Establishment of the mathematical model of the normal process,
y(t) = f(u(t), θ)   (3.4)
(2) Determination of the relationships between the model parameters θ and the physical
process coefficients p,
θ = g(p)   (3.5)
and, where it exists, of the inverse relationship,
p = g^{-1}(θ)   (3.6)
(3) Estimation of the model parameters θ from the measured input and output signals,
and determination of their changes Δθ_j together with tolerance limits.
(4) Calculation of the process coefficients p and of their changes Δp_j.
(5) Decision on whether a fault has occurred, based either on the changes Δp_j cal-
culated in step 4 or on the changes Δθ_j and tolerance limits from step 3. If
decisions are made based on the Δθ_j, the affected p_i's can be easily determined from
3.5. This may be achieved with the aid of a fault catalogue in which the relationship
between process faults and changes in the coefficients Δp_j has been established.
Decisions can be made either by simply checking against predetermined
threshold levels, or by using more sophisticated methods from the field of
statistical decision theory. A fault decision should include the fault location, fault
size and time of occurrence. System reorganization should follow a positive fault
decision. Such an action is essential, since usually controller design depends on the
correctness of parameters. Robust schemes also benefit from this procedure.
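A schematic illustration of step 5 is sketched below; the coefficient names, tolerances and catalogue entries are hypothetical and only show how threshold checking against a fault catalogue might be organised:

import numpy as np

# Hypothetical fault catalogue: which coefficient deviations point to which fault.
FAULT_CATALOGUE = {
    "increased friction": {"damping": +1},     # sign of the expected change in p
    "leakage":            {"capacity": -1},
}

def decide(delta_p, tolerance):
    """Flag coefficients whose change exceeds its tolerance and look the
    resulting sign pattern up in the fault catalogue."""
    flagged = {name: int(np.sign(dp)) for name, dp in delta_p.items()
               if abs(dp) > tolerance[name]}
    for fault, pattern in FAULT_CATALOGUE.items():
        if all(flagged.get(coeff) == sign for coeff, sign in pattern.items()):
            return fault
    return None

fault = decide({"damping": 0.15, "capacity": -0.01},
               {"damping": 0.05, "capacity": 0.05})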
The basis of this class of methods is the combination of theoretical modeling and pa-
rameter estimation of continuous time models. A block diagram is given in Figure 3.1.
Since, however, a requirement of this procedure is the existence of the inverse relation-
ship (3.6), it may be restricted to well-defined processes.
Figure 3.1 Fault detection based on parameter estimation and theoretical modeling (theoretical modelling yields the relations p = f(θ); parameter estimation, calculation of the process coefficients, determination of the changes Δp, Δθ, fault decision).
The implementation of the full procedure requires, however, more effort in modeling the
process, more sophisticated and fault-sensitive identification methods and fast
processing hardware suitable for on-line operation.
In the following sections different approaches for each of the three stages of the fault
monitoring process (modeling, estimation, decision) will be presented in detail.
The starting point in the design of a fault monitoring system is the development of the
process model in the form of a set of equations in the continuous or discrete time do-
main. Usually these process models are nonlinear and should not be linearised, since a
process model for failure diagnosis should be valid over a large range of operating
conditions. This process description has to be translated into equations which allow θ to be
estimated. Since usually least-squares or related methods are used for the estimation
phase, these equations must be linear-in-the-parameters, i.e. of the following general
form,

f(y_1, ..., y_n) = Σ_{i=1}^{r} θ_i f_i(y_1, ..., y_n)

where the y_j are the measured quantities. The functions f_i may be nonlinear; time-derivatives
or integrals are also allowed, since the derivatives can be obtained from the original
signals by state variable filtering (SVF, Young, 1981) or standard difference techniques
(Pouliezos and Stavrakakis, 1989). More valuable than the estimated parameters θ are
the physical process coefficients p, since they are directly related to the monitored
process. Thus deviations of the process coefficients from their "normal" values can be
attributed to a significant change in the process itself and allow a precise fault diagnosis
(localisation of the fault) if there is a unique correspondence between θ and p. A simple ex-
ample will illustrate the importance of choosing the correct process model (Isermann,
1987).
Example 3.1 First order electrical circuit.
(Circuit diagram: a series resistance R between the input voltage u1 and the output voltage u2, with the capacitor C connected across the output; uC denotes the capacitor voltage.)
First approach: Defining voltage u1 as input, voltage u2 as output, and a_1 = RC, b_0 = 1, the
dynamic process model is:
Hence, θ_1 = 1/RC, θ_2 = 1/R and R = 1/θ_2, C = θ_2/θ_1, and the two process coefficients
can be determined uniquely.
This simple low order example shows that it depends on the selected input and output
variables whether a change (fault) in one of the process coefficients can be detected. This
capability is closely related to identifiability conditions for the process coefficients. If not
all of the process coefficients can be determined uniquely then some of them must be
assumed as not likely to fail.
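For the example above, recovering the physical coefficients from the estimated parameters is a two-line computation using the relations just derived; the numerical values below are merely illustrative:

def coefficients_from_parameters(theta1, theta2):
    """With theta1 = 1/(R*C) and theta2 = 1/R (as derived in Example 3.1),
    both physical coefficients follow uniquely from the estimates."""
    R = 1.0 / theta2
    C = theta2 / theta1
    return R, C

# e.g. estimates theta1 = 10 s^-1, theta2 = 0.001 Ohm^-1  ->  R = 1 kOhm, C = 100 uF
R, C = coefficients_from_parameters(10.0, 1e-3)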
Nold and Isermann (1986) give the following algebraic identifiability condition relating θ
and p:
Theorem 3.1 A system of equations,

θ_j = Σ_i c_{ji} Π_{ν=1}^{m} p_ν^{k_{jiν}},   j = 1, ..., r

provides unique solutions for the process coefficients p_1, ..., p_m in R^m, with the exclusion
of an open neighbourhood of 0, if the determinant

det[ u_1 ⋯ u_{n-m-1}  1  k_1 ⋯ k_m ] ≠ 0

where the u_j = [u_{1j}, ..., u_{nj}]^T are the base vectors of the equation system written in the form
θ = {c_{li}} z and the k_ν are the vectors of exponents. The number of base vectors has to be
n-m-1 ≥ 0. This result is illustrated by considering the following example.
The variables which can be measured are only u and i. Here the circuit diagram is already
a sufficient process description so that in this case, no block diagram is needed. One can
directly write down the equation in matrix form:
(Matrix equation: the measured voltage u and current i are expressed in terms of the internal voltages and currents U_C1, U_R2, U_C2, I_1, I_2 through a coefficient matrix H whose entries contain the elements R_1, R_2, C_1 s and C_2 s.)
The H-matrix is now transformed row-wise into upper triangular form. After transforming
the first five rows one obtains:
(The row-wise transformed matrix, upper triangular in its first five rows, with entries involving R_1, R_2, C_1 s and C_2 s.)
The last equation represents the estimation equation which can now be written in the
form,
(The estimation equation written in the form θ = {c_{li}} z, where the vector z contains monomials of the form K R_1^{k1} R_2^{k2} C_1^{k3} C_2^{k4} and the coefficient matrix {c_{li}} has entries 0 and ±1; the parameters θ_1, ..., θ_6 are obtained from this system. Two of the base vectors are

u_2^T = [0 0 0 0 0 -1 0 0]
u_3^T = [0 0 0 0 1 -1 0 0]
The determinant thus becomes,
(The corresponding 8×8 determinant of the base vectors and exponent vectors evaluates to 1.)
The determinant is unequal to zero, the number of base vectors is 8-4-1=3, r=5<n=8
hence the system is structurally identifiable.
However, the theorem does not give the solutions for the process coefficients; it just says
whether there is a unique solution or not. In this example, the process parameters can be
easily determined from the estimated parameters. Using,

θ_1 du(t)/dt + θ_2 u(t) = i(t) + θ_3 di(t)/dt + θ_4(t)

with,
θ_1 = C_1 + C_2
θ_2 = R_2 C_1 C_2
enough for real-time operation. For the first requirement some method of discounting
old data must be used, but not at the expense of accuracy. A standard way to do that is
to use sliding windows of data values. The windows may be either rectangular, with equal
weighting, or exponential, with some form of decreasing weights with distance from the
present. Appropriate techniques will be discussed in the next sections. The second
requirement is fulfilled by using recursive methods and perhaps employing some
modifications for reducing calculations. Memory requirement is not a problem with
today's shrinking components.
In the following sections, several estimation techniques that have been used for fault
detection will be presented. At this point the close relationship between identification
algorithms for slowly or fast-varying parameters and for fault detection purposes must
be pointed out. However, a subtle difference exists which should not be overlooked: a
fault detection mechanism assumes that a parameter value is known until a change
occurs. This means that any control loops utilizing these parameters should use the
known and not the estimated value of the parameters.
Most schemes for fault detection employing LS parameter estimation methods model
the observed system (3.1) as,

y(t) = G(q^{-1}; θ) u(t) + H(q^{-1}; θ) e(t)   (3.7)

where y(t) is the p-dimensional output, u(t) the r-dimensional input, e(t) is a zero-mean
white noise sequence with covariance Λ(θ), and G(q^{-1}; θ), H(q^{-1}; θ) are filters of
appropriate dimensions. Here q^{-1} denotes the backward shift operator, q^{-1}{u(t)} = u(t-1),
etc. Finally, θ denotes the n-dimensional parameter vector.
Equation (3.7) describes a quite general linear model which, by proper choice of the
matrices G, H, Λ and parameters θ_i, can be put into more familiar forms. An ARMAX
(autoregressive moving average with exogenous signals) model is obtained if,

G(q^{-1}; θ) = B(q^{-1})/A(q^{-1}),   H(q^{-1}; θ) = C(q^{-1})/A(q^{-1}),   Λ(θ) = λ²

where,

A(q^{-1}) = 1 + a_1 q^{-1} + ... + a_{n_a} q^{-n_a}
B(q^{-1}) = b_1 q^{-1} + ... + b_{n_b} q^{-n_b}
C(q^{-1}) = 1 + c_1 q^{-1} + ... + c_{n_c} q^{-n_c}
The parameter vector is taken as,
This structure can in some sense be viewed as a linear regression. To see this, rewrite the
model (3.9) as:

y(t) = φ^T(t)θ + e(t)   (3.9b)

where,

φ(t) = [-y(t-1) ... -y(t-n_a)  u(t-1) ... u(t-n_b)]^T   (3.9c)

Here, though, the regressors (the elements of φ(t)) are not deterministic functions.
Finally, note that a linear stochastic model in state space form,

x(t+1) = A(θ)x(t) + B(θ)u(t) + w(t)   (3.10)
y(t) = C(θ)x(t) + v(t)

where w(t) and v(t) are multivariate white noise sequences with zero means and co-
variances,

E{w(t)w^T(s)} = R_1(θ)δ_{t,s}
E{w(t)v^T(s)} = R_{12}(θ)δ_{t,s}
E{v(t)v^T(s)} = R_2(θ)δ_{t,s}

can be transformed into the general form (3.7) if the following definitions are made:

G(q^{-1}; θ) = C(θ)[qI - A(θ)]^{-1}B(θ)
H(q^{-1}; θ) = I + C(θ)[qI - A(θ)]^{-1}K(θ)
Λ(θ) = C(θ)P(θ)C^T(θ) + R_2(θ)

where K(θ) is the Kalman gain,
and P(θ) is the symmetric positive definite solution of the Riccati equation,
(3.11)
The argument t has been used to stress the dependence of θ̂ on time. The expression
(3.11) can be computed in a recursive fashion. Introduce the notation:

P(t) = [Σ_{s=1}^{t} φ(s)φ^T(s)]^{-1}   (3.12)

Since, trivially,
(3.13)
it follows that,
(3.14a)
The quantity ε(t) is the prediction error of y(t) made at time t-1, based on the model corresponding to the estimate θ̂(t-1). If
ε(t) is small, the estimate θ̂(t-1) is "good" and should not be modified very much. The
vector k(t) in (3.14b) should be interpreted as a weighting or gain factor showing how
much the value of ε(t) will modify the different elements of the parameter vector.
To complete the algorithm, (3.13) must be used to compute P(t), which is needed in
(3.14b). However, the use of (3.13) requires a matrix inversion at each time step. This
would be a time-consuming procedure. Using the matrix inversion lemma (Gantmacher,
1977), however, (3.13) can be rewritten in a more useful form. Then an updating
equation for P(t) is obtained, namely,

P(t) = P(t-1) - [P(t-1)φ(t)φ^T(t)P(t-1)] / [1 + φ^T(t)P(t-1)φ(t)]

The algorithm also needs initial values θ̂(0) and P(0). These can be either provided
from knowledge of system characteristics or calculated from an initial sample using
Also, under mild conditions the LS estimate is consistent, i.e. θ̂ tends to θ as N tends to
infinity, if,

E{φ(t)φ^T(t)} is nonsingular   (3.17)
E{φ(t)e(t)} = 0   (3.18)

Condition (3.17) is usually satisfied. A common cause of singularity is non-persistent
excitation of order n_b of the input. Remedies for this irregularity will be
discussed later. Condition (3.18) is usually not satisfied unless e(t) is white, an
assumption usually made for most systems. However, it must be stressed that violation of
(3.18) will render the whole fault detection scheme based on LS parameter estimation
invalid. In such cases the designer should resort to methods circumventing this problem,
for example instrumental variable methods (see Söderström and Stoica, 1988). Even
small biases should not be tolerated since these may trigger false alarms in sensitive
detectors.
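Before turning to these modifications, a compact sketch of one step of the basic recursive LS algorithm may be useful; it follows the recursions described above, with the matrix-inversion-lemma form of the covariance update, and the initialisation values are illustrative:

import numpy as np

def rls_update(theta, P, phi, y):
    """One step of the basic recursive least squares algorithm: prediction
    error, gain, parameter update and covariance update without any explicit
    matrix inversion."""
    eps = y - phi @ theta                       # prediction error
    denom = 1.0 + phi @ P @ phi
    k = P @ phi / denom                         # gain vector
    theta = theta + k * eps
    P = P - np.outer(P @ phi, phi @ P) / denom  # matrix inversion lemma form
    return theta, P

# A common initialisation when no prior knowledge is available: theta(0)=0, P(0)=a*I
n = 3
theta, P = np.zeros(n), 1e4 * np.eye(n)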
There are several approaches for modifying the recursive LS algorithm to make it
suitable as a real-time fault detection method:
• Use of a forgetting factor.
• Use of a Kalman filter as a parameter estimator.
• Use of sliding windows of data.
The approach in this case is to change the loss function to be minimised. Let the
modified loss function be,

V_t(θ) = Σ_{s=1}^{t} λ^{t-s} ε²(s)   (3.19)

The loss function used earlier had λ = 1, but now the forgetting factor λ is a number
somewhat less than 1 (for example 0.99 or 0.95). This means that with increasing t, the
measurements obtained previously are discounted. The smaller the value of λ, the
quicker the information in previous data will be forgotten. One can rederive the RLS
method for the modified criterion (3.19). The calculations are straightforward. The
recursive LS method with a forgetting factor is:
Equations (3.20) are often referred to as the Recursive Weighted Least Squares (RWLS)
identification method or the Decreasing Gain Least Squares (DGLS) method. Experience
with this simple rule for setting λ shows that a decrease in the value of the forgetting
factor leads to two effects:
(1) The parameter estimates converge to their true values more quickly, thus decreasing the
fault alarm delay time, t_d.
(2) But at the expense of increased sensitivity to noise. If λ is much less than 1, the
estimates may even oscillate around their true values.
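Equations (3.20) are the standard exponentially weighted recursions; a sketch of one such update, in its commonly used form, shows that only the covariance update changes with respect to the basic algorithm (the value of λ is illustrative):

import numpy as np

def rwls_update(theta, P, phi, y, lam=0.98):
    """RLS with an exponential forgetting factor lam (0 < lam <= 1): old data
    are discounted geometrically, which keeps the gain from dying out."""
    eps = y - phi @ theta
    denom = lam + phi @ P @ phi
    k = P @ phi / denom
    theta = theta + k * eps
    P = (P - np.outer(P @ phi, phi @ P) / denom) / lam
    return theta, P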
There are various ways around this problem:
Time-varying forgetting factor: In this method the constant λ in (3.20) is replaced by
λ(t). A typical choice is an exponential one, given by,

λ(t) = 1 - λ_0^t (1 - λ(0))

or recursively,

λ(t) = λ_0 λ(t-1) + (1 - λ_0)   (3.21)

Typical design values for λ_0 and λ(0) are 0.99 and 0.95 respectively.
Equations (3.20) with (3.21) in place of λ minimise the quadratic cost function:

V_t(θ) = Σ_{s=1}^{t} λ(s) ε²(s)
α(t) = λ(t) + φ^T(t)P(t-1)φ(t)
k(t) = P(t-1)φ(t) α^{-1}(t)
P(t) = [P(t-1) - k(t)α(t)k^T(t)] / λ(t)
Constant Trace: In the case of abruptly changing systems, the tracking capability, and
consequently the fast response to parameter changes, can be maintained by using the
forgetting factor to keep the trace of P constant. This idea results in the recursive
Constant Trace Least Squares (CTLS) algorithm (Shibata et al., 1988), implemented by
the following set of equations:

θ̂(i) = θ̂(i-1) - P(i-1)φ(i-1)ε(i)
ε(i) = [θ̂^T(i-1)φ(i) - y(i)] α(i)
P(i) = λ^{-1}(i) [P(i-1) - P(i-1)φ(i)φ^T(i)P(i-1) α(i)]

Here θ̂(0) and P(0) must be defined. This method eliminates the estimator wind-up
problem which occurs when a constant forgetting factor is used, and provides rapid
convergence after the onset of a parameter change.
Kalman filters: Assuming that the parameters are constant, the underlying model

y(t) = φ^T(t)θ + e(t)

can be described as a state space equation,

x(t+1) = x(t)   (3.26)
y(t) = φ^T(t)x(t) + e(t)   (3.27)

where the "state vector" x(t) is given by,

x(t) = [a_1 ... a_{n_a}  b_1 ... b_{n_b}]^T = θ   (3.28)

The optimal state estimate x̂(t+1) can be computed as a function of the measurements
up to time t using the Kalman filter. Note that usually the Kalman filter is presented for
state space equations whose matrices may be time varying but do not depend on the data.
The latter condition fails in the case of (3.27), since φ(t) depends on data up to (and
inclusive of) time t-1. However, it can be shown that also in such cases the Kalman
filter provides the optimal (mean square) estimate of the system state vector (Åström,
1971).
Applying the Kalman filter to the state model (3.26) will give precisely the basic re-
cursive LS algorithm. One way of modifying the algorithm so that time-varying
parameters can be tracked better is to change the state equation (3.26) to

x(t+1) = x(t) + v(t);   E{v(t)v^T(s)} = R_1 δ_{t,s}   (3.29)

This means that the parameter vector is modeled as a random walk or a drift. The
covariance matrix R_1 can be used to describe how fast the different components of θ are
expected to vary. Applying the Kalman filter to the model (3.29), (3.27) gives the
following recursive algorithm:

k(t) = P(t-1)φ(t) / [1 + φ^T(t)P(t-1)φ(t)]

Observe that for both algorithms (3.20) and (3.30) the basic method has been modified
so that P(t) will no longer tend to zero. In this way k(t) is also prevented from decreasing
to zero. The parameter estimates will therefore change continually.
In the algorithm (3.30), R_1 has a role similar to that of λ in (3.20). These design variables
should be chosen by a trade-off between fast detection (which requires λ "small" or R_1
"large") on the one hand and reliability on the other (which requires λ close to 1 or R_1
"small"). This trade-off may be resolved by fault simulation.
The Kalman filter interpretation of the RLS algorithm is also useful in another respect. It
provides suggestions for the choice of the initial values θ̂(0) and P(0). These values are
necessary to start the algorithm. Since P(t) (times λ²) is the covariance matrix of θ̂(t), it
is reasonable to take for θ̂(0) an a priori estimate of θ and to let P(0) reflect the
confidence in this initial estimate θ̂(0). If P(0) is small, then k(t) will be small for all t and
the parameter estimates will therefore not change much from θ̂(0). On the other
hand, if P(0) is large, the parameter estimates will quickly jump away from θ̂(0).
Without any a priori information it is common practice to take,

θ̂(0) = 0;   P(0) = aI

where a is a "large" number.
Increased flexibility in the choice of design parameters can be achieved if, additionally to
(3.29), one assumes,

E{e(t)e(s)} = r_2(t)δ_{t,s}

which results in the modified set of updating equations,

Φ(k) = [φ(1) φ(2) ... φ(k)]^T   (3.32)

and,

y = [y(1) ... y(k)]^T
Furthermore, for a moving window of length n_w, define,

Φ(k, k-n_w+1) = [φ(k-n_w+1) ... φ(k)]^T   (3.33)

Then, as shown in Appendix 3.A,

θ̂(k+1) = θ̂(k) - P(k+1)[Γ(k+1)θ̂(k) - δ(k+1)]
P^{-1}(k+1) = P^{-1}(k) + Γ(k+1)

where,

Γ(k+1) = φ(k+1)φ^T(k+1) - φ(k-n_w+1)φ^T(k-n_w+1)
It should be remembered that θ̂(k) is estimated using information from the last n_w
samples. Equations (3.34) form the sliding window least squares estimator (SWLSE).
Note that in this simple case a further reduction of P^{-1} is not needed, since only one
inversion is required. The reduction in computation is proportional to the length of the window,
since the dimensions of P, Γ and δ are independent of the window size.
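The following sketch implements the same add-newest/remove-oldest idea in a direct form; it is a plain windowed LS with a solve per step rather than the exact SWLSE recursions of (3.34), and the regularised initialisation is an implementation convenience:

import numpy as np
from collections import deque

class SlidingWindowLS:
    """Windowed least squares: the information matrix and vector are updated by
    adding the newest sample and removing the one that leaves the window."""
    def __init__(self, n, n_w):
        self.window = deque(maxlen=n_w)
        self.H = 1e-6 * np.eye(n)      # information matrix (regularised start)
        self.b = np.zeros(n)

    def update(self, phi, y):
        if len(self.window) == self.window.maxlen:
            phi_old, y_old = self.window[0]
            self.H -= np.outer(phi_old, phi_old)   # remove the oldest sample
            self.b -= y_old * phi_old
        self.window.append((phi, y))
        self.H += np.outer(phi, phi)               # add the newest sample
        self.b += y * phi
        return np.linalg.solve(self.H, self.b)     # estimate over the window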
The improvement in speed over the classical batch sliding window LSE is shown in the
operations count table for the scalar case in Table 3.1. No special methods for better
performance of individual operations (matrix inversion) are taken into account, since
these would apply equally well to both cases. It should be noted, however, that memory
requirements are not reduced, since at any one time all the window values must be
accessible. The scalar case considered may serve as a guideline for speed improvement
in the vector versions.
Table 3.1 Operations count for window size n_w (scalar output case).
diagonal matrix with some diagonal elements equal to zero. In such cases it is easy to
find V(t). Then orthogonal transformations are applied to the rectangular matrix (Q(t)
V(t)). The problem is to find an orthogonal matrix T(t) and a triangular matrix Q̄(t) such
that,

[Q(t) | V(t)] T(t) = [Q̄(t) | 0]   (3.38)

Then one has,

P(t) + R_1(t) = Q(t)Q^T(t) + V(t)V^T(t) = [Q(t) | V(t)] T(t) T^T(t) [Q(t) | V(t)]^T
4. For i = 1, ..., j-1, go through step 5. (If j = 1, skip step 5.)
5. Compute the updated elements U_{ij}(t) from U_{ij}(t-1) and the auxiliary quantities v_i
obtained in step 3.
6. Compute,
The scalar p_d obtained after the d-th cycle of steps 3-5 is the innovations variance,

p_d = λ(t) + φ^T(t)P(t-1)φ(t)

The algorithm is initialised by U(0)D(0)U^T(0) = P(0).
The U-D analogue of the Kalman filter updating given by (3.30) is discussed in
Thornton and Bierman (1977) and consists of the following equations:
At time t-1, U(t-1) and D(t-1) are given, as well as the factorisation R_1(t) = V(t)V^T(t),
with V(t) a full rank (n×s) matrix.
1. Compute k(t), Ũ(t) and D̃(t) by performing steps 1-6 of (3.40) (Ũ(t) and D̃(t)
are the matrices called U(t) and D(t) in (3.40)).
2. Define the (n+s)-column vector w_k^(0) as the k-th column of Ũ^T(t) stacked on top of
the k-th column of V^T(t); k = 1, ..., d.
3. Define the (n+s)×(n+s) diagonal matrix D̄ as the block diagonal matrix formed from
D̃(t) and the s×s identity matrix.
4. For j = n, n-1, ..., 2 go through steps 5-8.
5. Compute,

D_{jj}(t) = [w_j^(d-j)]^T D̄ w_j^(d-j)

6. For i = 1, 2, ..., j-1 go through step 7.
7. Compute,

U_{ij}(t) = [w_i^(d-j)]^T D̄ w_j^(d-j) / D_{jj}(t)
w_i^(d-j+1) = w_i^(d-j) - U_{ij}(t) w_j^(d-j)

8. Compute,
model (3.27) and (3.42) is given by,

α(t) = 1/p(t) + φ^T(t)P(t-1)φ(t)

When φ(t) tends to zero, θ̂(t) tends to θ̂(t-1) and P(t) tends to (2a - a²)^{-1} R_1(t). Favier et
al. (1988) proposed a U-D factorised version of (3.43), consisting of the following
steps:
Let P(t) = P = U D U^T and P̄ = P(t-1) = Ū D̄ Ū^T.
1. Form the matrices U_a and D_a from φ(t), Ū, D̄ and the weighting a, where U_a is
(n+1)×(2n+1) and D_a is (2n+1)×(2n+1).
2. Apply the modified Gram-Schmidt orthogonalisation procedure,

U_a = Ū_a T

where Ū_a is an (n+1)×(2n+1) unit lower triangular matrix, T is (2n+1)×(2n+1) such
that T D_a T^T = D̄_a, and D̄_a is (2n+1)×(2n+1) diagonal.
3. Write Ū_a and T as,

Ū_a = [1  0; u  U_0],   T = [v_1; ...; v_{2n+1}]

where v_i is the i-th row vector of T. The gain k and the factors U and D are then
given by,

k = u,   U = U_0,   with D formed from the D_a-weighted norms ‖v_i‖²_{D_a}   (3.44)
Table 3.2 Number of arithmetic operations used for updating P(t) once. The number of
parameters is n.
Fast algorithms rely on the exploitation of the so-called shift structure. To illustrate this
notion consider (3.9) again,

y(t) + a_1 y(t-1) + ... + a_k y(t-k) = b_1 u(t-1) + ... + b_k u(t-k) + e(t)

Note that the orders of the polynomials are the same and equal to k. Now define,

φ*(t) = [x(t); φ(t)] = [φ(t+1); x(t-k)]

Here the dimension of x is 2, since scalar signals are considered, and the dimension of φ is
n = 2k.
Since the relevant calculations for the derivation of fast formulae are rather involved, the
necessary recursive equations are only cited, and the interested reader is referred to the
work of Kalouptsidis and co-workers and Ljung and co-workers.
parameter estimate:
Yeh (1991) implemented a conventional and a U-D filter on a DSP32C processor using
assembly and C languages.
As explained in previous sections, the selection of the quantities that affect the con-
tribution of past data to the calculation of the estimate plays a crucial role in the fault
detection process. In fact, it may be argued that the tuning of these parameters is the
weak link in the detection chain, since usually they can be assigned only by simulation. In
a parameter identification context, especially for slowly- or rapidly-varying systems, the
role of the weighting parameters is well understood. Consider for example the process
model,

y(t+1) = a(t) y(t) + e(t+1)

where,

a(t) = -0.9 for t < 100
     = -0.3 for t ≥ 100

This model is a typical fault situation, where the estimated parameter jumps in a step-like
fashion. Less rapid changes can usually be well approximated by a series of step changes.
In Figure 3.3 the outcome of the estimation procedure using (3.20) for three different
values of λ is shown. A clear conclusion is therefore drawn: the larger the value of λ, the
better the quality of the estimate, but at the expense of slow response. On the other hand,
smaller values of λ result in poorer estimates which are, however, reached faster. A useful rela-
tion between λ and the effective window length is given by,

T_0 = 1/(1-λ)

This means that data older than T_0 time units have a weight less than e^{-1} ≈ 36% of that of
the most recent data. Analogous conclusions can be reached for the algorithm of (3.30)
for parameter R_1(t). However, in this case R_1 lies on the opposite side of the scale, close
to 0. Window length for sliding window algorithms can also be determined by similar
arguments, since λ is directly related to the effective data length.
To develop appropriate weight sequences for fault monitoring schemes, one has to
observe that fault monitoring differs from time-varying system identification in that the
quality of the parameter estimates does not have the same significance at every time instance. In
fault monitoring, parameters are assumed known until a change occurs; therefore, until
then the most important operation is change detection. Following a positive change
decision, the part of parameter identification becomes important, while change detection
may even be suspended. This is acceptable if subsequent faults can be assumed to happen
at longer intervals than those needed for parameter estimation.
This reasoning points to the use of two different weighting sequences: a pre-fault and a
post-fault one. The whole procedure is shown in Figure 3.4. Two problems remain:
how to decide on a fault occurrence and how to change the weighting sequence. The first
question will be answered in subsequent chapters. Let us concentrate on the second.
When a fault is detected, the gain in the estimation algorithm should be increased. This
means that P(t) should be increased. This can be achieved in many ways, but there are
mainly two methods that have been previously used. The first one is to decrease the
forgetting factor λ. The growth of P(t) is then nearly exponential. This approach can be
implemented by the variable forgetting factor method of Fortescue et al. (1981). The
necessary recursions are:
3. Gain

k(t) = P(t-1)φ(t-1) / [1 + φ^T(t-1)P(t-1)φ(t-1)]

4. Estimate   θ̂(t) = θ̂(t-1) + k(t)ε(t)
is kept constant and equal to Σ_0. In other words, the amount of forgetting will at each
step correspond to the amount of new information in the latest measurement, thereby
ensuring that the estimation is always based on the same amount of information. Thus,
from (3.57),

λ(t) = 1 - 1/N(t)

where,

N(t) = Σ_0 / [(1 - φ^T(t-1)k(t)) ε²(t)]

is the equivalent asymptotic memory length if λ = λ(t) were to be used throughout the
estimation. Since Σ_0 is related to the sum of the squares of the errors, one possible
guideline on how to choose it is to express Σ_0 as,

Σ_0 = σ² N_0

where σ² is the expected measurement noise variance based on real knowledge of the
process. Then N_0 will control the speed of adaptation, as it corresponds to a nominal
asymptotic memory length. Simulations have shown that if Σ_0 is chosen in this manner,
then, for a stationary process,

λ(t) → 1 - 1/N_0

and,

E{N(t)} ≈ N_0
Figure 3.4 Two-stage estimation procedure with pre-fault and post-fault weighting sequences (estimation of model parameters, fault decision, estimation of process parameters).
Figure 3.5 Simulation of a self-tuning estimator with variable forgetting factor, Σ_0 = 0.0125 (output and parameter estimate versus time, s).
This feature is extremely useful in a fault monitoring situation: if a fault occurs, all
information prior to the fault is really useless. The small value of λ(t) at this instant
automatically "restarts" the algorithm at the time of the fault. A drawback of
this approach is that "restarting" is applied to faulty and non-faulty parameters
simultaneously.
The second method is to add a constant times the unit matrix to P(t), in which case it is
increased instantaneously. Equation (3.20) will then be substituted by,
where β(t) is a nonnegative scalar variable which is zero except when a fault is detected.
When a fault is detected, a positive β(t) has the effect of increasing P(t). The final
problem is to choose a suitable β(t). When no fault is detected, β(t) is zero. When a fault
is detected, it is reasonable to let β(t) depend on the actual value of P(t) and on how
significant the alarm is. This may be done in many ways, and the following proposal is
just one possibility.
In the noise-free case, the progress of the estimation error, when θ(t) is constant, is
given by,

θ_e(t) = θ_e(t-1) - P(t)φ(t)ε(t)
       = (I - P(t)φ(t)φ^T(t)) θ_e(t-1)
       = U(t) θ_e(t-1)

All eigenvalues of U(t) are one, except the one corresponding to the eigenvector
P(t)φ(t). This eigenvalue determines the step length in the algorithm. A small eigenvalue
causes large steps, while an eigenvalue close to one means that the step length of the
algorithm is small. Using (3.58) the eigenvalue can be written as:

1 - φ^T(t)P(t)φ(t) = 1 - φ^T(t)P(t-1)φ(t) / [λ + φ^T(t)P(t-1)φ(t)] - β(t)φ^T(t)φ(t)
                  = λ / [λ + φ^T(t)P(t-1)φ(t)] - β(t)φ^T(t)φ(t)

The eigenvalue is obviously between zero and one as long as P > 0. Suppose now that an
eigenvalue equal to v(t) is desired when a fault is detected. Then β(t) has to be chosen as,

β(t) = (v_0(t) - v(t)) / (φ^T(t)φ(t))

The eigenvalue v(t) should lie in the interval,

0 < v(t) ≤ v_0(t)

in order to keep P(t) positive definite. In practice, this choice of β(t) must also be
combined with a test for nonsingularity of φ^T(t)φ(t).
It remains to determine a suitable v(t). One choice is to model v(t) as a piecewise linear
function of the significance of the fault alarm (Figure 3.6).
Figure 3.6 Choice of v(t).
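A sketch of the covariance-boost rule derived above; it assumes P is the covariance obtained after the regular RLS update, and the clamping of the desired eigenvalue is an implementation safeguard rather than part of the original derivation:

import numpy as np

def boost_covariance(P, phi, v_desired):
    """On a fault alarm, add beta(t)*I to P so that the eigenvalue associated
    with the direction P*phi drops from its current value v0 to the desired
    value v (shorter memory, larger correction steps)."""
    v0 = 1.0 - phi @ P @ phi          # current eigenvalue of I - P*phi*phi^T
    v = min(max(v_desired, 1e-6), v0) # keep 0 < v <= v0 so that beta >= 0
    beta = (v0 - v) / (phi @ phi)
    return P + beta * np.eye(len(phi)), beta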
The preceding sections indicate that most, if not all, traditional approaches to estimation
problems assume that the model is perfect, and hence the only source of modeling error
that is considered is due to process or measurement noise. However, in practical
situations it is frequently the case that the major source of modeling errors is the
approximate description of the system's response. This gives rise to the need for robust
(in the presence of modeling errors) parameter estimation techniques. The importance of
this problem has already been discussed in Chapter 2, with reference to residual-based
fault detection methods.
Efforts have also been made to produce robust parameter-estimation methods that could
be used in fault detection applications. This is a very important development, since it is
well known that if model mismatch exists, it is a source of false alarms, a condition that
will invalidate even the most sophisticated algorithm. Weiss (1988) computed an
uncertainty bound in the frequency domain which accounts for modeling errors. This
bound is then used to construct a test variable for fault detection. Carlsson et al. (1988)
developed a related strategy in which the unmodeled response is embedded in a
stochastic process, so that bounds can be computed on the unmodeled errors.
In this section two approaches to robust parameter estimation-based fault detection are
briefly summarized: the "black box" approach suggested by Wahlberg (1990) and the
"embedding" approach of Carlsson, extended by Kwon and Goodwin (1990). It must be
noted that both approaches are in fact off-line and have been mainly used for model
validation.
This approacb provides robust fault detection wben unmodeled dynamics, linearization
errors and noisy inputs are present. It fimctions by calculating upper bounds for
parameter estimates in tbe face of tbe previous anomalies.
Tbe model mismatcb is described by incorporating into tbe system description additive
unmodeled dynamics. System dynamics are described by tbe general linear model of
Equation (3.7). Two models are needed: tbe nominal model, witb transfer function,
and tbe unmodeled dynamics model witb tranfer function GL1(q-l). Tbe system output
tben satisfies,
y(k) == [G(q-l) + GL1(q-l)]u(k) + v(k)
== B(q-l; 8)uP(k) + rt..k) (3.60)
wbere v(k) models measurement noise ofzero mean and spectral density t/J(OJ) and,
1
uF(k) = I u(k)
A(q- )
TI(k) =Gß. (q-I)u(k) + v(k)
Equation (3.60) can be represented in standard linear regression form as,
y(k) = φᵀ(k)θ + η(k)    (3.61)
where,
φᵀ(k) = [u_F(k−1)  u_F(k−2)  ...  u_F(k−n_B)]    (3.62)
The parameters are estimated using ordinary least squares as,
θ̂ = [ΦᵀΦ]⁻¹ΦᵀY    (3.63)
where,
Φ = [φ(1)  φ(2)  ...  φ(N)]ᵀ    (3.64)
Y = [y(1)  y(2)  ...  y(N)]ᵀ    (3.65)
From (3.61) and (3.63), the following expression for the estimation error is derived:
θ_e = θ̂ − θ = [ΦᵀΦ]⁻¹Φᵀε
where,
ε = [η(1)  η(2)  ...  η(N)]ᵀ    (3.66)
Denoting the impulse response of G_Δ as {h_Δ(i)}, η(k) can be expressed as,
η(k) = Σ_{i=0}^{k} h_Δ(i) u(k−i) + v(k)
Two data records are considered, one from a non-faulty reference interval I_n and one from the interval I_f currently being monitored; applying the estimator to each yields,
Ĝ(q⁻¹, θ̂) = { Ĝ_n(q⁻¹) = G_n(q⁻¹, θ̂_n) for I_n ;  Ĝ_f(q⁻¹) = G_f(q⁻¹, θ̂_f) for I_f }    (3.69)
The fault detection procedure now amounts to comparing Ĝ_n and Ĝ_f (or θ̂_n and θ̂_f) and deciding if the observed changes can be explained satisfactorily in terms of the effects of noise or undermodeling. If not, then it may be concluded that a system fault has occurred. The covariance functions of (θ̂_n − θ̂_f) and (Ĝ_n − Ĝ_f) under non-faulty conditions will be used as measures of the uncertainty due to noise and undermodeling. An upper bound C for the covariance of (θ̂_n − θ̂_f) is given by Kwon and Goodwin, (1990), in equation (3.70); there, σ̄_v² is the equivalent noise variance corresponding to the upper bound S̄ of the power spectrum of v.
The first term on the right side of (3.70) accounts for the effects of undermodeling and the difference in input signals for the two experiments. Note that if there is no undermodeling or if the inputs are identical, these terms vanish. The second term corresponds to measurement noise. The higher the SNR (signal-to-noise ratio) is, the smaller the norm of this term. The matrix C can now be used to formulate appropriate test variables for fault detection. For example, one may use,
η = (θ̂_n − θ̂_f)ᵀ C⁻¹ (θ̂_n − θ̂_f)    (3.71)
(3.72)
These test variables are based on a comparison between the observed value and the expected value of (θ̂_n − θ̂_f)(θ̂_n − θ̂_f)ᵀ. If the test variable is larger than a fixed threshold, then this is evidence that a fault has occurred.
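A minimal numerical sketch of the test variable (3.71) follows, assuming the two parameter estimates and the covariance bound matrix C of (3.70) are already available; the numbers and the threshold are purely illustrative.

import numpy as np

def embedding_test_variable(theta_n, theta_f, C):
    # Test variable of Eq. (3.71): weighted distance between the estimates
    # obtained from the non-faulty and the monitored data sets.
    d = np.asarray(theta_n, float) - np.asarray(theta_f, float)
    return float(d @ np.linalg.solve(C, d))

theta_n = np.array([0.21, 0.43])              # estimate from reference interval I_n (illustrative)
theta_f = np.array([0.26, 0.40])              # estimate from monitored interval I_f (illustrative)
C = np.array([[4e-4, 1e-5], [1e-5, 6e-4]])    # covariance upper bound from (3.70) (illustrative)
eta = embedding_test_variable(theta_n, theta_f, C)
print("fault declared" if eta > 9.2 else "no fault")   # fixed threshold, application dependent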
Uncertainty bounds in the frequency domain can easily be obtained by extending the previous results. Kwon and Goodwin, (1990), have shown that the expected value of the difference between the estimated transfer functions in two non-faulty experiments can be expressed in terms of the gradient ∂G(e^{jω}, θ)/∂θ, the matrix C of (3.70) and the frequency vector
ω = [ω₁  ω₂  ...  ω_n]ᵀ
where G is the nominal transfer function, n is the number of frequencies and * denotes conjugate transpose. Based on this result, test variables analogous to the ones of equations (3.71), (3.72) can be devised.
where G_nΔ accounts for model mismatch due to input nonlinearity. Proceeding similarly to the linear case, bounds in the parameter space are obtained; the quantities involved include the lower-triangular matrix of squared-input terms,
[ u²(1)sign(u(1))        0                  0     ...
  u²(2)sign(u(2))   u²(1)sign(u(1))         0     ...
  ...                                                  ]
and,
R_n = E{H_n H_nᵀ},   H_n = [h_n(0)  h_n(1)  ...  h_n(N−1)]ᵀ
In the above, h_n(k) denotes the impulse response of G_nΔ and the remaining variables are as in (3.70). An example of this procedure will be given in the applications section.
Black box identification approach.
This method, proposed by Wahlberg, (1990), aims at providing robust fault detection
when undermodeling exists, i.e. when a finite order structure is used to model systems of infinite order. This is accomplished by using flexible models of a finite order in the frequency domain, which approximate the process in the frequency range of interest. The approach uses the high order black-box identification theory of Ljung, (1987).
The situation is again viewed as a standard hypothesis testing problem: given two data sets, I_n and I_f, decide if they relate to the same underlying system. The decision is made based on the estimates of the system parameter vector θ obtained from the two data sets. The general model described by (3.7) and parameterized by θ ∈ Rⁿ is used:
y(k) = G(q⁻¹; θ)u(k) + H(q⁻¹; θ)e(k)    (3.77)
Equation (3.77) is supposed to model a time-invariant exponentially stable linear system with additive noise of the form,
y(k) = G_T(q⁻¹)u(k) + v(k)
where v(k) is the part of the output that cannot be explained from the input signal in open-loop experiments. It is a zero mean, stationary stochastic process with spectral representation,
v(k) = Σ_{τ=0}^{∞} h_T(τ)e(k−τ);   h_T(0) = 1
where {e(k)} is a sequence of independent random variables with zero mean and variance λ_T.
The two hypotheses are then given by H₀: the two data sets are generated by the same underlying system, and H₁: they are not. The associated noise variance estimates are of the form,
λ̂ = (1/N) Σ_{k=1}^{N} [ L(q⁻¹)( y(k) − (B̂(q⁻¹)/Â(q⁻¹)) u(k) ) ]²
where,
λ̂ = (1/N) Σ_{k=1}^{N} [ D̂(q) y(k) ]²
The test variable is,
T = Σ_{m=1}^{f} [ |E(ω_m)|² + |F(ω_m)|² ]    (3.78)
where N₁, N₂ are the sample sizes for the two experiments, n₁, n₂ are the estimated orders of B(q) and f ≤ min(n₁, n₂). The input spectral density estimates may be given by a-priori information or estimated using a smoothed DFT. By noting that under H₀, E(ω_m) and F(ω_m), m = 1, ..., f, are asymptotically jointly complex normally distributed with zero mean and identity variance, the test variable T is χ² distributed with 4f degrees of freedom. Hence for a given confidence level α, a confidence interval can be calculated. If T falls outside this interval, a fault is declared.
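A short sketch of the resulting χ² test is given below, assuming the standardized complex residual vectors E(ω_m), F(ω_m) have already been computed for the f selected frequencies; scipy is used only for the χ² quantile.

import numpy as np
from scipy.stats import chi2

def wahlberg_test(E, F, alpha=0.05):
    # Test statistic of Eq. (3.78); under H0 it is chi-square with 4f
    # degrees of freedom, so the threshold is the (1 - alpha) quantile.
    E, F = np.asarray(E), np.asarray(F)
    T = float(np.sum(np.abs(E) ** 2 + np.abs(F) ** 2))
    f = len(E)
    threshold = chi2.ppf(1.0 - alpha, df=4 * f)
    return T, threshold, T > threshold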
Following the parameter estimation phase of the fault monitoring algorithm a decision has to be made, based on the estimated values, on whether a change/fault has occurred. There are two basic approaches to this problem:
• decide based on the estimated values of the model parameters θ̂ᵢ
• decide based on the calculated values of the process parameters p̂ᵢ using the inverse relationship (3.6).
It should be stressed that decisions made using the first approach do not prohibit inferences about the state of the process parameters. As an example consider the following set of relations amongst θ and p (Pouliezos and Stavrakakis, 1989):
θ₁ = R/L,   θ₂ = K_m N/L,   θ₃ = 1/L,   θ₄ = K_m/J_m,   θ₅ = β/J_m,   θ₆ = 1/(J_m N)
resulting in,
p₁ = R = θ₁/θ₃,   p₂ = L = 1/θ₃,   p₃ = K_m = θ₂/(θ₃N),   p₄ = J_m = θ₂/(θ₃θ₄N) = 1/(θ₆N),   p₅ = β = θ₂θ₅/(θ₃θ₄N)
Reasoning backwards, a change in p₃ = K_m would appear in θ₂, θ₄ and a change in p₁ = R in θ₁. It is obvious therefore that by constructing a table of interrelations, inference on a change of pᵢ can be made by simple observation of the θⱼ.
Let us examine these two approaches in detail.
Decision based on θ̂. Hägglund, (1984), proposes the following procedure for detecting a change in one or more of the estimated parameters:
Define the variables,
Δθ(t) = −θ̂(t) + θ̂(t−1)    (3.79)
w(t) = r₁w(t−1) + Δθ(t),   0 ≤ r₁ < 1    (3.80)
In the case a fault has occurred, w(t) can be viewed as an estimate of the direction of the parameter change. The test sequence that will be studied is {s(t)}, where s(t) is defined as,
s(t) = sign[Δθ(t)ᵀ w(t−1)]    (3.81)
The sign function makes the test sequence insensitive to the noise variance. It is now clear in principle how to carry out the fault detection:
• Inspect the latest values of s(t): if s(t) is +1 unusually many times, conclude that a fault has occurred.
Under normal operation, i.e. when the parameter estimates are close to their true values, s(t) has approximately a symmetric two point distribution with mass 0.5 at each of ±1. When a fault has occurred, the distribution is no longer symmetric, but the mass at +1 is larger than the mass at −1. To accumulate the most recent values of s(t), the stochastic variable r(t), defined as
r(t) = r₂r(t−1) + (1−r₂)s(t);   0 ≤ r₂ < 1
is introduced. The sum of the most recent values of s(t) is replaced by an exponential smoothing in order to obtain a simple algorithm. When the parameter estimates are close to the true ones, r(t) has a mean value close to zero. When a fault has occurred, a positive mean is expected. The parameter r₂ determines, roughly speaking, how many s(t) values should be included. For example r₂ = 0.95 corresponds to about 20 values, which is a reasonable choice in many applications. A small r₂ allows a fast fault detection, although at the price of reduced security against false alarms. When the signal to noise ratio is small, it is not possible to detect the faults as fast as otherwise. It is then necessary to have more information available to decide whether a fault is present. This can be achieved by increasing r₂.
For values of r₂ close to one, r(t) will have an approximately Gaussian distribution with variance,
σ² = (1 − r₂)/(1 + r₂)
Since r₂ is generally chosen in this region, it will be assumed that r(t) has a Gaussian distribution. If r(t) exceeds a certain threshold r₀, a fault may be concluded with a confidence determined from the value of the threshold. In the present algorithm, the threshold can be computed directly as a function of the rate of false alarms f_E. If a false alarm frequency equal to f_E is acceptable, a fault should be declared every time r(t) is greater than the threshold r₀, defined by,
P[r(t) ≥ r₀] = (1/(√(2π) σ)) ∫_{r₀}^{∞} exp[−x²/(2σ²)] dx = f_E    (3.82)
If a small value of the threshold is chosen to make it possible to detect faults quickly, the false detection rate will be high. This is seen in (3.82) where there is an inverse relation between r₀ and f_E. The determination of r₀ by this method has the advantage that it is formulated in terms of the expected frequency of false detections, which may be chosen to suit any particular application.
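The procedure can be condensed into a few lines; the sketch below follows the recursions (3.79)-(3.82) as written, with r₁, r₂ and the false-alarm rate chosen only for illustration.

import numpy as np
from scipy.stats import norm

def hagglund_step(theta, theta_prev, w_prev, r_prev, r1=0.9, r2=0.95):
    # One update of the change detector, Eqs. (3.79)-(3.81) plus the
    # exponential smoothing of s(t).
    dtheta = theta_prev - theta                  # (3.79)
    w = r1 * w_prev + dtheta                     # (3.80)
    s = float(np.sign(dtheta @ w_prev))          # (3.81)
    r = r2 * r_prev + (1.0 - r2) * s
    return w, r

def threshold_from_false_alarm_rate(f_E, r2=0.95):
    # Solve (3.82) for r0 with r(t) ~ N(0, (1 - r2)/(1 + r2)).
    sigma = np.sqrt((1.0 - r2) / (1.0 + r2))
    return sigma * norm.ppf(1.0 - f_E)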
This procedure is not sensitive to changes in noise level. This characteristic is not found in other fault detection methods. If this property is not needed, study of the quantity w(t)ᵀw(t) can result in more powerful tests. However, it does suffer from the fact that it cannot locate the faulty parameter. To overcome this shortcoming, the following test based on the time dependency of the estimated parameters can be performed:
Given the time-dependent quantity a(k) define the function f(a, N₁, N₂) as:
f(a, N₁, N₂) = (1/(N₂ − N₁ + 1)) Σ_{k=N₁}^{N₂} a(k)
Use the following detection criterion for each estimated parameter θᵢ, i ∈ [1, n]:
Jᵢ(t) = |θ̂_S,i(t) − θ̂_L,i(t)| / |θ̂_L,i(t)|
where,
θ̂_S,i(t) = f(θ̂ᵢ, t − N_S + 1, t)
θ̂_L,i(t) = f(θ̂ᵢ, t − N_L − N_S + 1, t − N_S)
and N_S and N_L, satisfying 0 < N_S < N_L, denote the short- and long-term estimator computation windows respectively. Then a decision on fault occurrence can be made according to the value of Jᵢ(t) as follows (a short computational sketch is given after the list):
If Jᵢ(t) ≤ Jᵢ,min ∀ i ∈ [1, n]: no change
If Jᵢ,min < Jᵢ(t) < Jᵢ,max for some i ∈ [1, n]: alarm for ith parameter
If Jᵢ(t) > Jᵢ,max for some i ∈ [1, n]: change in ith parameter
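The sketch below implements the criterion above, assuming a stored history of parameter estimates; window lengths and thresholds remain to be tuned, as discussed next.

import numpy as np

def window_mean(a, n1, n2):
    # f(a, N1, N2): mean of a(k) for k = N1, ..., N2 (1-based, inclusive).
    return np.mean(a[n1 - 1:n2])

def window_criterion(theta_history, t, Ns, Nl):
    # J_i(t): relative deviation of the short-term mean from the long-term
    # mean for every parameter; theta_history has shape (time, n).
    n = theta_history.shape[1]
    theta_S = np.array([window_mean(theta_history[:, i], t - Ns + 1, t) for i in range(n)])
    theta_L = np.array([window_mean(theta_history[:, i], t - Nl - Ns + 1, t - Ns) for i in range(n)])
    return np.abs(theta_S - theta_L) / np.abs(theta_L)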
This procedure suffers however from the difficulty of choosing the following parameters: N_S, N_L, Jᵢ,max, Jᵢ,min; i = 1, ..., n. These may be chosen through simulation studies only, since their probability distributions are not known. Furthermore the conducted tests showed that this method is difficult to implement because the detection criterion value fluctuations are much greater after model changes, making the choice of thresholds even more difficult.
Decision based on p̂. After calculation of the physical process coefficients, consider p̂(k) as a Gaussian vector with its components statistically independent and its realizations p̂(i) and p̂(j) at different sample instants i ≠ j statistically independent. It is assumed that the mean vector p̄(k) = E{p̂(k)} and the covariance matrix of p̂(k) are available (estimated as described below).
A fault will be declared by a significant deviation of the mean p̄ᵢ and/or variance σᵢ² of p̂ᵢ(k), the ith component of p̂(k), from the non-error values.
This is a classical hypothesis test problem and can be handled by the formulation of (m+1) hypotheses Hᵢ, 0 ≤ i ≤ m:
H₀ : no fault in the mean and/or the variance of p̂ (i = 0)
Hᵢ : fault of type i (significant deviation of mean and/or variance of p̂ᵢ), i = 1, ..., m
Each hypothesis Hᵢ can be associated with a Gaussian conditional density function, where μ(Hᵢ) and σ²(Hᵢ) denote the conditional mean and variance for hypothesis Hᵢ. Therefore, the non-error case is described by μ(H₀), σ²(H₀), whereas μ(Hᵢ), σ²(Hᵢ), 1 ≤ i ≤ m, describe a fault of parameter i.
Fault detection and localization is possible by computing appropriate logarithmic likelihood ratios. The algorithm used here for estimating the non-error statistics and the fault detection and localization is described by Geiger (1986), where a common window technique is used in order to calculate the relevant statistics of the fault occurrence. However, the procedure employed is non-iterative and this greatly increases computational time. An iterative procedure described by Pouliezos and Stavrakakis (1989) is:
(i) For the training/non-error case, H₀:
a) μᵢ(k) = (1/k)[(k−1)μᵢ(k−1) + p̂ᵢ(k)],   i = 1, ..., m,   k = 1, ..., N_s
The algorithm is initialized by μᵢ(1) = p̂ᵢ(1).
b) σᵢ²(k) = ((k−2)/(k−1)) σᵢ²(k−1) + (1/k)(p̂ᵢ(k) − μᵢ(k−1))²,   i = 1, ..., m,   k = 2, ..., N_s
(ii) For the monitored data, a moving window of the N_w most recent parameter estimates is used, e.g.
σᵢ²(k) = (1/N_w) Σ_{j=k−N_w+1}^{k} [p̂ᵢ(j) − μᵢ]²
These three iterative schemes need a starting window of N_w sample values p̂ᵢ in order to be initialized. The window size N_w is chosen so that reasonable rates for missed alarms and false detections are achieved. The advantage of using a small moving window of sample parameter values p̂ᵢ(k) is the improved speed of detection. This is offset, however, by an increasing false detection rate.
A fault in the ith parameter is declared at time k if the corresponding logarithmic likelihood ratio Λᵢ(k) exceeds a threshold; the ratio can be computed recursively, resulting in fewer operations per time update. The threshold may be set by defining an a-priori probability of no-fault P_i0. A suitable test is then,
Λᵢ(k) ≷ ln[P_i0 / (1 − P_i0)]
with a fault declared when the threshold is exceeded.
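The recursions of step (i) and a likelihood-ratio test are sketched below; the Gaussian form of the log-likelihood ratio and the value of P_i0 are assumptions made for illustration, since only the threshold rule is reproduced above.

import numpy as np

def update_nonerror_stats(k, p_hat, mu_prev, var_prev):
    # Training-phase recursions (i)a and (i)b for one parameter.
    mu = ((k - 1) * mu_prev + p_hat) / k
    var = (k - 2) / (k - 1) * var_prev + (p_hat - mu_prev) ** 2 / k if k >= 2 else 0.0
    return mu, var

def log_likelihood_ratio(window, mu0, var0, mu1, var1):
    # Assumed Gaussian log-likelihood ratio of the fault hypothesis (window
    # statistics mu1, var1) against the non-error statistics (mu0, var0).
    def loglik(x, mu, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return loglik(window, mu1, var1) - loglik(window, mu0, var0)

P_i0 = 0.95                                   # assumed a-priori probability of no fault
threshold = np.log(P_i0 / (1.0 - P_i0))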
An alternative discrimination measure is the Kullback discrimination index (KDI), proposed for fault detection purposes by Kumamaru et al. (1988). The KDI can be seen as a distortion measure to compare two probability density functions. In fault detection situations one pdf refers to an interval of no-fault, while the other to subsequent intervals where fault monitoring is in effect. If these intervals are denoted I₁, I₂, their estimates θ̂₁, θ̂₂(t), their points numbered by {1, 2, ..., N}, {1, 2, ..., t} respectively, and parametric models of the form given by equation (3.7) are considered, the KDI can be written as,
I_t^N[1,2] = ∫ p(Y_N | θ̂₁, U_{N−1}) ln [ p(Y_N | θ̂₁, U_{N−1}) / p(Y_N | θ̂₂(t), U_{N−1}) ] dY_N
where Yᵢ, Uᵢ are data collections up to point i for interval I₁, and p is the likelihood function. The index can be calculated iteratively using,
I_t^N[1,2] = I_t^N[1,2]⁽²⁾ + I_t^N[1,2]⁽³⁾
where,
I_t^N[1,2]⁽²⁾ = (1/2) Σ_{k=1}^{N} ||H₁⁻¹(G₁ − G₂)u(k)||²_{λ⁻¹}
The value of the threshold can be determined according to the statistical properties of the KDI. Under the normal situation Söderström and Kumamaru (1985), have shown that all the terms involved in the iteration have asymptotic χ² distributions with degrees of freedom equal to the dimension of the parameters included in their expressions. This index has performed quite well in adaptive control schemes, where a sensitive fault index is required because of the system's adaptation properties. In those cases monitoring of the estimation history alone is not enough to trigger fault alarms.
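As an illustration of the KDI as a distortion measure, the sketch below evaluates it for two Gaussian densities (e.g. fitted to the residuals or parameter estimates of the two intervals); this is the generic Gaussian form, not the model-based expression of Kumamaru et al.

import numpy as np

def kdi_gaussian(mu1, cov1, mu2, cov2):
    # Kullback discrimination index I[1,2] between N(mu1, cov1) and N(mu2, cov2).
    d = len(mu1)
    cov2_inv = np.linalg.inv(cov2)
    diff = np.asarray(mu2, float) - np.asarray(mu1, float)
    return 0.5 * (np.trace(cov2_inv @ cov1) + diff @ cov2_inv @ diff - d
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))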
In this section several examples of fault detection methods using parameter estimation applied to real technological problems will be presented. These examples use different mixtures of model/estimation/decision approaches and thus present an interesting framework for comparison.
Dalla Molle and Himmelblau (1987), have applied real time parameter estimation techniques for fault detection in an evaporator. The complexities of a real evaporator have been simplified, so that the model reduces to the set of equations denoted (3.83) below.
Figure 3.7 Evaporator configuration and notation.
As the heat transfer surface becomes fouled or scaled, the heat transfer rate is decreased and the efficiency of the process is reduced. On the other hand, composition at the input of the evaporator could be useful in determining if the previous unit was operating properly.
To illustrate the types of trajectories that occur for the two parameters, faults in UA and X_F were simulated. Noise was added to the process measurements to represent randomness, and was also introduced into the inputs. For the simulations all process parameters were assumed to remain constant (except for the fault parameters). The standard deviations of the noise factors are listed in Dalla Molle, (1985).
Two fault detection methods were used:
(a) Least squares with forgetting factor. As they are, equations (3.83) are not suitable for applying the standard L.S. procedure (3.20). However, as shown by Dalla Molle, (3.83) can be put into the form:
S(k)p = b(k) + ε(k)
where,
b(k) = [x(k) − x(k−1)]/Δ − Ax(k) − Bu(k) − r(k)
and S(k), r(k) contain the coefficient terms of the fault parameters and other non-linear terms respectively, and Δ is the discretization time constant. Then the parameters p(k) can be estimated from the following L.S. equations:
p̂(k+1) = [I_s − U(k)V(k)S(k+1)][p̂(k) + U(k)b(k+1)]
The initial values for the algorithm are given by,
(Figure: simulated trajectories of UA and X_F versus time, min.)
Gas turbine performance degrades over time due to the influence of many effects
including tip clearance changes in the rotating components, seal wear, blade fouling,
blade erosion, blade warping, foreign object damage, actuator wear, blocked fuel nozzles
and sensor problems. In some applications, such as in the commercial transport field, the
availability of reliable cruise data facilitates the use of performance trending techniques for
alerting maintenance personnel to emerging problems. However, the successful
implementation of any trending techniques to gas turbine performance data still depends
very largely on the skill and experience of the operator, especially when trying to
diagnose some faults to module or line replaceable unit level. This situation is further
exacerbated in the military area because combat aircraft, in particular, seldom operate
with their engines in a steady-state condition for extended periods. Thus, the selection of
a suitable data capture window to provide maintenance personnel with reliable steady-
state data is often difficult without resorting to dedicated tests, either on the ground or in-flight. In view of this, it would be convenient if operational transient engine data
could be used for assessing engine condition and for diagnosing some of the more
difficult engine faults.
Current generation military aircraft are often equipped with an Engine Monitoring
System (EMS) which can be configured to capture selected engine data under certain
conditions. These conditions include during each take-off, and in-flight if one or more of
the measured parameters exceed predetermined limit values. The take-off data, in
particular, has the potential to provide a consistent data base for assessing engine
condition provided the analytical means are available for extracting the fault information.
Because these data comprise engine accelerations from part-power positions, the current steady-state methods for assessing engine condition are not suitable. Methods have been developed for extracting fault information from gas turbine transient data (Baskiotis et al., (1979), Carlsson et al., (1988), Henry (1988), Smed et al. (1988)). However, these methods suffer in their ability to detect small changes which usually accompany the presence of degraded engine components.
A L.S. method of analysing transient engine data based on the stochastic embedding principle has been implemented by Merrington et al., (1991). This method, which has the potential to detect the presence of degraded engine components from the actual EMS take-off measurements, follows.
Exact models of aircraft engines are highly nonlinear (Merrill, 1984) and thus simplified linearized models are usually employed (Dehoff et al., 1978). For example, taking the engine fuel flow W_F as the input and the fan-spool speed N_L as the output, an appropriate linearized nominal model is given as follows:
G(q⁻¹) = B(q⁻¹, θ)/A(q⁻¹) = (b₁q⁻¹ + b₂q⁻²)/(1 + a₁q⁻¹ + a₂q⁻²)
The denominator A(q⁻¹) is determined from a-priori information about the system, e.g., approximate values of dominant poles or by some prior estimation experiments.
Using this system description the system output has the form,
ΔN_L(k) = B(q⁻¹; θ) ΔW_F^F(k) + η(k)    (3.87)
where,
ΔW_F^F(k) = (1/A(q⁻¹)) ΔW_F(k)
η(k) = G_nΔ [ΔW_F^F(k)]² + v(k)
Equation (3.87) can be put in the standard regression form of (3.61) if,
Two noise-free non-faulty data sets (CLF6 and CLF61) and a faulty data set (LTEF) with a −2% change in the low pressure turbine efficiency were chosen for the study (Figure 3.11). Note that LTEF has the same operating point as that of CLF6 but that CLF61 has a different operating point with a very similar output as LTEF.
Figure 3.11 Non-faulty data sets (— CLF6; – – CLF61) and faulty data set (LTEF) in aircraft engines.
Using the data sets and the theory of section 3.3.5, appropriate test variables for fault detection can be formulated. For example, equations (3.71), (3.72) may be used:
η = (θ̂_n − θ̂_f)ᵀ C⁻¹ (θ̂_n − θ̂_f)    (3.71)
(3.72)
The following constants were chosen: sampling period T_s = 0.02, number of data points N = 350, σ_v² = 0.15² (with a reference value of 100%) and the input ΔW_F was assumed to be corrupted by white noise with variance σ_u² = 0.003² (with a fuel range of 0 to 1). The fixed denominator was taken by prior experiments as a₁ = −1.8238 and a₂ = 0.8294 and the values of β_n and σ_n² as β_n = 0.0837 and σ_n² = 0.0818.
Simulation results for test T₁ are shown in Figure 3.12 and summarized in Table 3.3. Note that 100 trials were conducted with different noise realizations. These results show that this fault detection method works very well even under the effect of linearization errors.
Figure 3.12 Simulation results for test T₁.
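Forming the filtered regressor of (3.61)-(3.62) with the fixed denominator quoted above can be sketched as follows; the data arrays dWF and dNL are placeholders for the recorded fuel-flow and fan-speed deviations.

import numpy as np
from scipy.signal import lfilter

a1, a2 = -1.8238, 0.8294                      # fixed denominator A(q^-1) from prior experiments

def filtered_regressor(dWF, nB=2):
    # Filter the fuel-flow deviation through 1/A(q^-1) and stack lagged
    # values as the regression matrix of Eqs. (3.61)-(3.62).
    uF = lfilter([1.0], [1.0, a1, a2], dWF)
    N = len(uF)
    Phi = np.zeros((N, nB))
    for j in range(1, nB + 1):
        Phi[j:, j - 1] = uF[:N - j]           # column j holds u_F(k - j)
    return Phi

# Least-squares estimate of theta from an output record dNL, Eq. (3.63):
# theta_hat = np.linalg.lstsq(filtered_regressor(dWF), dNL, rcond=None)[0]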
The early detection of process faults is especially attractive for engines. In this example a centrifugal pump with a water circulation system, driven by a speed-controlled direct current motor, is considered (Figure 3.13, after Iserman, 1984). The goal is to detect changes (faults) in the d.c. motor, the pump and the circulation system based on theoretically derived process models and parameter estimation.
The dynamic models of the d.c. motor, the centrifugal pump and the pipe system are obtained by stating the balance equations for energy and momentum and by using special physical relationships. In order not to obtain too many parameters, appropriate simplifications have to be made, such as lumping more than one process coefficient together, e.g. the friction coefficients of the motor c_FM1 and of the pump c_FP, and the torque coefficient g of the pump.
Figure 3.13 Speed-controlled d.c. motor driving a centrifugal pump with water circulation system (after Iserman, 1984).
The resulting four basic equations will be used for parameter estimation in the following form:
(a) Armature circuit:
dΔI₁(t)/dt = a₁₁ΔI₁(t) + a₁₂Δω(t) + b₁ΔU₁(t)    (3.87)
(b) Mechanics of motor and pump:
dΔω(t)/dt = a₂₁ΔI₁(t) + a₂₂Δω(t) + a₂₃ΔṀ(t)    (3.88)
(c) Pipe system:
dΔṀ(t)/dt = a₃₃ΔṀ(t) + d₃ΔY(t)    (3.89)
These equations can also be collected in state-space matrix form, giving (3.90).
A block diagram of the modeled system is given in Figure 3.14. The parameters of (3.87)-(3.90) can be estimated by bringing them into the form of (3.9) and applying the least-squares method. The simple case of the d.c. motor and pump with closed valve and measured signals ΔU₁, ΔI₁ and Δω will be considered.
In this case Ṁ(t) = 0, so that only (3.87) and (3.88) are to be used. Both equations are written in the form of (3.9),
where,
y₁(t) = dΔI₁(t)/dt;   y₂(t) = dΔω(t)/dt
ψ₁ᵀ(t) = [ΔI₁(t)  Δω(t)  ΔU₁(t)]
θ₁ᵀ = [a₁₁  a₁₂  b₁]
ψ₂ᵀ(t) = [ΔI₁(t)  Δω(t)]
θ₂ᵀ = [a₂₁  a₂₂]
Using (3.91), five process coefficients can be calculated from the five estimated parameters a₁₁, a₁₂, b₁, a₂₁, a₂₂, among them:
R₁ = −a₁₁/b₁,   L₁ = 1/b₁,   Ψ = −a₁₂/b₁,   Θ = Ψ/a₂₁ = −a₁₂/(b₁a₂₁)
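A minimal sketch of this back-calculation is given below; the physical interpretations noted in the comments (armature resistance and inductance, flux linkage, total inertia) follow the motor-pump example and should be read as assumptions where the text above is incomplete.

def physical_coefficients(theta1, theta2):
    # theta1 = [a11, a12, b1] from the armature equation,
    # theta2 = [a21, a22] from the mechanics equation.
    a11, a12, b1 = theta1
    a21, _a22 = theta2
    L1 = 1.0 / b1            # armature inductance
    R1 = -a11 / b1           # armature resistance
    Psi = -a12 / b1          # flux linkage coefficient (assumed interpretation)
    Theta = Psi / a21        # total moment of inertia (assumed interpretation)
    return R1, L1, Psi, Theta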
Figure 3.14 Block diagram of the modeled system: armature circuit (d.c. motor), mechanics (d.c. motor and pump), pipe system.
Figure 3.15 Step responses for a change of the speed setpoint. u₁ = U₁/Ū₁, armature voltage (Ū₁ = 60 V); i₁ = I₁/Ī₁, armature current (Ī₁ = 0.5 A); ω = ω₁/ω̄₁, angular velocity (ω̄₁ = 62.83 s⁻¹, ≈ 600 rev min⁻¹).
Figure 3.18 Change of pump packing box friction by tightening and loosening of the cap screws.
Stavrakakis and Dialynas, (1991), have used recursive least squares estimation with forgetting factor and hypothesis testing techniques on the process parameter values, for improving the reliability performance of power substations. Following a positive fault decision, the substation is reconfigured according to a detailed fault tree. The fault detection methodology adopted was applied to the following power substation components:
A. Power transformers, modeled by their one-phase equivalent circuit, described by,
V_i = R₁I_i + L_1e dI_i/dt − M dI_o/dt    (3.94)
V_o = M dI_i/dt − R₂I_o − L_2e dI_o/dt    (3.95)
where,
V_i, V_o : actual input (primary) and output (secondary) voltages,
I_i, I_o : actual input (primary) and output (secondary) currents,
R₁, R₂ : primary and secondary winding resistances,
L₁, L₂ : primary and secondary winding self-inductances,
L_m : mutual inductance between windings on the same core, and,
M = L_m/a,   L_1e = L₁ + L_m,   L_2e = L₂ + L_m/a²
The faults that most frequently arise in practice in the power transformers, were
classified as follows:
1. Failures in the magnetic circuits (cores, yokes and clamping structure).
2. Failures in the windings (coils and minor insulation and terminal gear).
3. Failures in the dielectric circuit (oil and major insulation).
4. Structural failures.
By monitoring the estimated values of R₁, R₂, L_1e, L_2e, M and performing a hypothesis test using the likelihood ratio test, a change in these parameters can be detected, leading to a decision regarding one of the failures 1-4 described above.
B. Substation lines and cables, modeled by their equivalent one-phase circuit which
neglects entirely the susceptance and leakance, and is described by the simple first order
differential equation,
V_i = V_o + R I_i + L dI_i/dt    (3.96)
The most important failures occurring on the lines or cables of power substations are the short circuits which are generally due to insulation breakdown. By applying the previously described method on the parameters R and L of this model, short circuits can be detected and localised early, in this way avoiding further degradation of the system.
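A sketch of the corresponding estimation step is given below, assuming sampled records of V_i, V_o and I_i and a simple Euler approximation of the current derivative; the sampling period Ts and the array names are placeholders.

import numpy as np

def estimate_RL(Vi, Vo, Ii, Ts):
    # Fit Eq. (3.96), Vi - Vo = R*Ii + L*dIi/dt, by ordinary least squares.
    dIi = np.diff(Ii) / Ts                    # Euler approximation of dIi/dt
    Phi = np.column_stack([Ii[1:], dIi])      # regressors [Ii, dIi/dt]
    y = (Vi - Vo)[1:]
    (R, L), *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return R, L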
C. Synchronous generators. The model used corresponds to an unsaturated
cylindrical-rotor machine under balanced polyphase conditions, and is described by,
Category C: Outage on the busbar section(s). All the incoming or outgoing circuits
connected to these busbar sections are disconnected and each circuit can be transferred
to any of the available busbars by closing the appropriate breakers and isolators.
Category D: Outage on the components (breaker, isolator) of the busbar sectionalising branches. After isolation of the outage, the respective busbars are divided into two or
more parts not directly connected to each other. If this substation configuration is not
operationally accepted, all the circuits connected to the affected busbar sections can be
transferred to other busbars with the same restoration procedure followed after the
occurrence of an outage on busbar sections (Category C).
Category E: Outage on components belonging to a branch containing a transformer.
Since the power supply from the superior to the inferior voltage level is decreased,
alternative restoration procedures may exist and can be deduced from the list of the
transformer branches being open.
Category F: Outage on the remaining substation components. Alternative restoration
procedures may exist.
The basic steps of the developed algorithm for deducing the suitable substation
configuration after the diagnosis of a substation abnormality are the following:
(i) Consider the detected faults and simulate the corresponding outage.
(ii) Depending on the outage category:
(a) For outage category B, detect the isolators (and their corresponding branches)
belonging to the same interlocking sequence with those taken out. Deduce their second and third order combinations if the outage contains two or three isolators of such type respectively.
(b) For outage category E, detect the substation open branches containing
transformers.
(c) For outage categories F and A on outgoing circuit components, detect the
substation open branches not considered in steps (a) and (b) and their second and
third order combinations.
(d) For outage categories C and D, detect the breakers and isolators which may
close to transfer the disconnected circuits to healthy busbars. Deduce all the
alternative restoration procedures by considering the substation interlocking scheme.
(iii) Deduce the list of possible alternative restoration procedures by combining the
relevant switching actions obtained in step (ii).
(iv) For each circuit load-point to be considered:
(a) Read the paths from data base.
(b) Identify the closed and open paths.
(c) For each open path deduce the order of its discontinuity by counting the contained open components.
(d) For either total loss of continuity (no path in operation) or partial loss of
continuity (one or more paths in operation, the supplied load less than required),
consider all the possible alternative restoration procedures and for each of them:
• Detect the paths which can be closed by considering only the paths with order of
discontinuity less than or equal to the order of the procedure.
• If one or more paths can be closed, evaluate the load supplied to the load-point
being considered by performing a load-flow on the modified substation configuration. This configuration contains a limited number of nodes since all the substation
busbars connected to each other by branches having zero impedance are linked
together.
In order to illustrate the increased and more meaningful information for substation operation that can be achieved using the described computational techniques, a typical 400/150 kV high voltage substation was analysed. The substation employs the triple busbar scheme for all system busbars and its detailed one-line diagram is shown in Figure 3.19. It consists of 34 nodes, 58 branches and 109 components while its interlocking scheme is shown in Table 3.4. An operational substation configuration was studied by assuming the breakers and isolators status shown in Figure 3.19. Source points are assumed to be the circuit busbar L8 and the generator busbar L17 while load-points are the nodes L25 and L33. The minimal paths leading to each load-point from all sources were deduced and retained in compact form in a data base. Finally, parameter estimation methods and hypothesis testing on the process parameters were used to deduce the alternative restoration procedures which are available after the diagnosis of faults on the substation components. The category of each fault and the components to close are shown in Table 3.5. For category E and F faults, it has also been assumed that breakers 40 and 93 are open and 32 and 86 are closed.
Table 3.5 Substation configuration after restoration of supply
Figure 3.19 Detailed one-line diagram of a typical high voltage substation. L5, node; ×, circuit breaker (closed); ⊗, circuit breaker (open); /, isolator (closed); ○, isolator (open); transformer; —, line; generator.
Stavrakakis et al., (1990), describe a fast fault detection system for robotic D.C. motor
drives. The detection system is implemented on a commercially available parallel proc-
essing machine.
Using the global dynamic model of a 3 degrees of freedom robotic manipulator derived
by Tzafestas and Stavrakakis (1986), the state-space representation for the actuator of
the i'th link of the robot can be written as,
(3.98)
where,
V_i : applied armature voltage,
T_Li : disturbance torque referred to the link side of the drive shaft,
i_i : armature current,
ω_i : shaft angular velocity referred to the link side of the drive shaft,
N_i : gear ratio,
J_mi : moment of inertia of drive rotor,
K_mi : electromechanical constant of the motor (the back-emf constant is equal to the torque constant),
R_i : armature resistance,
L_i : armature inductance,
β_i : viscous friction coefficient.
The subscript i denotes the ith joint of the robotic manipulator. Define,
θ₁ = R_i/L_i,   θ₂ = K_mi N_i/L_i,   θ₃ = 1/L_i,   θ₄ = K_mi/(J_mi N_i), ...    (3.99)
i.e.
θᵀ = [θ₁  θ₂  θ₃  θ₄  θ₅  θ₆] ∈ R⁶
The following variables are measured for each motor:
• armature current,
• angular velocity,
• armature voltage,
• shaft torque.
The former two are the system outputs, whereas the latter are the system inputs. Input and output signal measurements are available at discrete times t = kT₀, k = 0, 1, ..., N, ..., where T₀ is the sampling time, and are denoted i_i(k), ω_i(k), V_i(k), T_Li(k). The following obser-
The fault detection algorithm for this case consists of the following steps (tasks) carried out at every sampling instant k:
Measurements: Measure i_i(k), ω_i(k), V_i(k), T_Li(k) and compute the derivatives i_i⁽¹⁾(k) and ω_i⁽¹⁾(k) by a third order backward formula (a computational sketch of the complete task sequence, including this step, is given after Task 4 below).
Task 2: Perform one iteration of the parameter estimation algorithm for parameters,
θ_bᵀ = [θ₄  θ₅  θ₆]
Task 3(a): Calculate the physical parameters p̂ᵢ(k), i = 1, 2, 3 from the previously computed estimates θ̂_a and θ̂_b using,
p̂₁(k) = R_i = θ̂₁(k)/θ̂₃(k),   p̂₂(k) = L_i = 1/θ̂₃(k),   p̂₃(k) = N_i K_mi = θ̂₂(k)/θ̂₃(k)    (3.100)
The case of a fault occurring in the gearbox is considered an event with probability zero.
Task 3(b): Redefine the data window by accepting the new estimates p̂ᵢ(k), i = 1, 2, 3, dropping the oldest estimates p̂ᵢ(k − N_w − 1) and recalculating the real time parameter mean and variance estimates (i.e. the parameter statistics are estimated over the N_w + 1 most recent parameter estimates). The recursive relations described in Section 3.4 are used.
Task 3(c): Compute the likelihood ratio for the hypothesis detection problem.
Task 3(d): Decide on whether a fault condition exists. The decision is taken by comparing the likelihood ratio obtained in Task 3(c) against a predefined threshold. To avoid false alarms, the fault condition is signalled if the threshold is exceeded in M consecutive instants. The optimal threshold value and M are best chosen by trial and error using simulation.
Task 4: Perform Tasks 3(a) to 3(d) for parameters p̂₄(k), p̂₅(k), using the corresponding relations.
The above procedure assumes that the algorithm is run initially on a fault-free d.c. motor. From this run the non-error statistics are obtained and are used subsequently in Tasks 3(b), 3(c), 4(b) and 4(c).
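The task sequence can be sketched as follows; the third-order backward-difference coefficients and the Gaussian form of the likelihood ratio are assumptions (they are not reproduced above), while the sampling time, threshold and M are taken from the example that follows.

import numpy as np

T0 = 1.0 / 2000.0            # sampling time (2 kHz)
M = 10                       # consecutive exceedances required before an alarm
THRESHOLD = 11.2             # likelihood-ratio threshold used in the example

def backward_diff3(x, k):
    # Assumed third-order backward difference for the derivative at sample k.
    return (11*x[k] - 18*x[k-1] + 9*x[k-2] - 2*x[k-3]) / (6*T0)

def physical_params(theta_a):
    # Eq. (3.100): p1 = R_i, p2 = L_i, p3 = N_i*K_mi from theta_a = [th1, th2, th3].
    th1, th2, th3 = theta_a
    return np.array([th1/th3, 1.0/th3, th2/th3])

def detection_step(p_window, mu0, var0, consec):
    # Tasks 3(b)-3(d): windowed statistics, assumed Gaussian log-likelihood
    # ratio against the non-error statistics, and the M-consecutive rule.
    mu1 = p_window.mean(axis=0)
    var1 = p_window.var(axis=0, ddof=1)
    lam = 0.5 * np.sum(np.log(var0/var1) + (p_window - mu0)**2/var0 - (p_window - mu1)**2/var1)
    consec = consec + 1 if lam > THRESHOLD else 0
    return lam, consec, consec >= M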
The effectiveness of the method was verified using simulated data. For this purpose the d.c. motor robotic actuator parameters were chosen as,
R = 1.04 Ω, L = 0.00089 H, K_m = 0.0224 V·sec/rad,
J_m = 0.00005 kg·m², β = 0.005 kg·m²/sec, N = 64.
A 2 kHz sampling frequency is considered. The non-error statistics are calculated using N_s = 300 samples, whereas the detection window was N_w = 50. The first parameter estimate to be used by the detection procedure was taken at time k = 70, giving a large initial sample. The likelihood ratio fault detection threshold value is 11.2 and M = 10. From sample time k = 1 to 130, the normal operating d.c. drive was simulated. A simulated fault occurred at k = 131, indicated by a 4.8% change in the armature resistance R_i (R_if = 1.09 Ω). A recursive least squares (RLS) estimator with a forgetting factor of λ = 0.95 for estimating θ_a, and λ = 0.99 for θ_b, is used. All estimates converge quickly to their respective true values. The exact estimated values are shown in Table 3.6.
A major factor for the success of the algorithm is the 2 kHz sampling rate. This means that the algorithm must be implemented on a computer capable of performing all the above calculations in 0.5 ms. The above procedure however is suitable for implementation in commercially available parallel processing machines, e.g. the INMOS transputer system. This algorithm was implemented in a system employing four processors operating as a two-stage pipeline as shown in Figure 3.20. At the input a measurement unit M feeds the first two processors. At the output, a fault decision unit operates as a separate unit, having however a light computational load, and it is therefore a low cost processor system. The numbers shown in Figure 3.20 correspond to the tasks performed by each machine according to the task partition described earlier. This implementation forms a 2-stage pipeline where its first stage consists of machines 1 and 2 and its second of machines 3 and 4. The FD machine, which is underutilised by the algorithm, leaves power for suitable presentation of the results. The computational complexity (i.e. the multiplications and divisions per recursion, MADPR) is 30 for each estimator and 60 for the detection procedure.
Table 3.6
              R1       L1         Km1·N1    Jm1·N1²    β1·N1²
True value    1.09     0.000890   1.4336    0.2048     20.48
The field of fault detection based on parameter estimation techniques is vast. The preceding sections present only a small sample of what has been developed. The interested engineer can look at the relevant references for more information. Some additional work follows in summarised form.
The team around Iserman has published several reports of applications of parameter estimation fault detection methods to industrial processes. Reiß (1991), developed models for drilling processes and applied the LS algorithm to the detection of tool wear in two machining centers. Wanke and Reiß (1991) and Reiß et al. (1990), applied similar techniques to milling machine drives. Janik and Fuchs (1991), used a singular value decomposition technique to enhance LS estimation performance in order to detect tool wear and grinding chatter of grinding processes. Neumann (1991), applied a two-step identification algorithm for the estimation of a parametric signal model with ARMAX structure. Signal spectra are then used for fault detection of machine tools. Freyermuth and Iserman (1991), have combined parameter estimation techniques with statistical feature classification methods. This idea was tested on detecting malfunctions of sensors, actuators and gears in industrial robots. Iserman (1991) and Iserman et al. (1990), proposed a general hybrid framework for machine fault detection using parameter estimation techniques with knowledge processing. Finally, Cho et al. (1992) studied the detection of broken rotor bars in induction motors. This was done by estimating the rotor resistance from measurements of stator voltage, stator current, stator excitation frequency and rotor velocity.
Appendix 3.A
Φ_{k+1} = [ φᵀ(k − n_w + 2)  ...  φᵀ(k)  φᵀ(k + 1) ]ᵀ
The updating iteration from k to k + 1 is considered first, and,
Φᵀ_{k+1}Φ_{k+1} = [ Φᵀ(k, k − n_w + 2)   φ(k + 1) ] [ Φ(k, k − n_w + 2) ; φᵀ(k + 1) ]
             = Φᵀ(k, k − n_w + 2)Φ(k, k − n_w + 2) + φ(k + 1)φᵀ(k + 1)
where,
Y_{k+1} = [ y(k − n_w + 2)  ...  y(k)  y(k + 1) ]ᵀ = [ y(k, k − n_w + 2) ; y(k + 1) ]
Hence,
Φᵀ_{k+1}Y_{k+1} = [ Φᵀ(k, k − n_w + 2)   φ(k + 1) ] [ y(k, k − n_w + 2) ; y(k + 1) ]
             = Φᵀ(k, k − n_w + 2)y(k, k − n_w + 2) + φ(k + 1)y(k + 1)
Defining,
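The moving-window update developed in this appendix can be sketched numerically as follows, using the standard rank-one update/downdate (Sherman-Morrison) of (ΦᵀΦ)⁻¹; variable names are illustrative.

import numpy as np

def rank_one(P, phi, sign=+1.0):
    # Update (sign=+1, add a sample) or downdate (sign=-1, drop a sample)
    # of P = (Phi^T Phi)^-1 for a single regressor row phi.
    Pphi = P @ phi
    return P - sign * np.outer(Pphi, Pphi) / (1.0 + sign * phi @ Pphi)

def moving_window_ls_step(P, g, phi_new, y_new, phi_old, y_old):
    # One step: include (phi_new, y_new), discard (phi_old, y_old);
    # g accumulates Phi^T Y over the current window.
    P = rank_one(P, phi_new, +1.0)
    g = g + phi_new * y_new
    P = rank_one(P, phi_old, -1.0)
    g = g - phi_old * y_old
    return P, g, P @ g      # P @ g is the windowed LS estimate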
Appendix 3.B
where W is upper triangular (or equivalently, find W and b′ = Tb directly).
The following algorithm is a numerically improved adaptation of the classical Gram-Schmidt orthogonalization procedure. When computations are made exactly (no round-off) the result is equivalent to the classical Gram-Schmidt result. However, when round-off errors occur, Björck (1967) has shown that the MGS procedure is much more accurate. The algorithm can be derived from the classical Gram-Schmidt orthogonalizing procedure, and is essentially the classical Gram-Schmidt procedure in reverse order (Kaminski, 1971). The MGS algorithm is stated in a form which computes W and b′ directly.
MGS Algorithm: For k = 1, ..., n compute,
σ_k = [A_k⁽ᵏ⁾ᵀ A_k⁽ᵏ⁾]^{1/2}
b′_k = (1/σ_k) A_k⁽ᵏ⁾ᵀ b⁽ᵏ⁾
W_kj = (1/σ_k) A_k⁽ᵏ⁾ᵀ A_j⁽ᵏ⁾,   j = k+1, ..., n
A_j⁽ᵏ⁺¹⁾ = A_j⁽ᵏ⁾ − (W_kj/σ_k) A_k⁽ᵏ⁾,   j = k+1, ..., n
(here a single suffix denotes a column of a matrix and a double suffix denotes an element of a matrix). If σ_k = 0 at any stage in the algorithm, then the rank of A is less than n.
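A direct transcription of the MGS algorithm is sketched below; the update of the right-hand side inside the loop follows the standard Kaminski/Björck formulation and should be read as an assumption where the printed algorithm is incomplete.

import numpy as np

def mgs(A, b):
    # Modified Gram-Schmidt: returns the upper triangular W and b' such that
    # the least-squares solution satisfies W x = b'.
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    m, n = A.shape
    W = np.zeros((n, n))
    b_prime = np.zeros(n)
    for k in range(n):
        sigma = np.sqrt(A[:, k] @ A[:, k])
        if sigma == 0.0:
            raise ValueError("rank(A) < n")
        W[k, k] = sigma
        b_prime[k] = (A[:, k] @ b) / sigma
        for j in range(k + 1, n):
            W[k, j] = (A[:, k] @ A[:, j]) / sigma
            A[:, j] -= (W[k, j] / sigma) * A[:, k]
        b -= (b_prime[k] / sigma) * A[:, k]
    return W, b_prime

# x = np.linalg.solve(W, b_prime) then gives the LS solution (W is upper triangular).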
References
Cho K.R., Lang J.H. and S.D. Umans (1992). Detection of broken rotor bars in induction motors using state and parameter estimation. IEEE Transactions on Industry Applications, 28, 3, 702-709.
Cordero A.O. and D.Q. Mayne (1981). Deterministic convergence of a self-tuning regulator with variable forgetting factor. Proceedings IEE, Part-D, 128, 1, 19-23.
Dehoff R.L., Hall W.E. Jr., Adams R.J. and N.K. Gupta (1977). F100 multivariable control synthesis program. AFAPL-TR-77-35, Vol. I and II.
Dehoff R.L. and W.E. Hall Jr. (1978). Models for jet engine systems. Part II: state space techniques and modeling for control. Control and Dynamic Systems, 14, 259-299.
Dalla Molle D.T. (1985). Fault detection via parameter estimation in a single effect evaporator. MS Thesis, University of Texas, Austin.
Dalla Molle D.T. and M.D. Himmelblau (1987). Fault detection in an evaporator via parameter estimation in real time. Fault Detection and Reliability: Knowledge-based and other approaches, Pergamon Press, 131-138.
Favier G., Rougerie C., Bariani J.P., de Amaral W., Gimena L. and L.V.R. de Amanda (1988). A comparison of fault detection methods and adaptive identification algorithms. Proceedings, IFAC Identification and System Parameter Estimation, Beijing, PRC, 535-542.
Fortescue T.R., Kershenbaum L.S. and B.E. Ydstie, (1981). Implementation of self-
tuning regulators with variable forgetting factors. Automatica, 17, 6, 831-835.
Freyermuth B. (1991). An approach to model based fault diagnosis of industrial robots. Proceedings, IEEE International Conference on Robotics and Automation, April 7-12, 1991, Sacramento, USA.
Freyermuth B. and R. Iserman (1991). Model based incipient fault diagnosis of industrial robots via parameter estimation and feature classification. Proceedings, European Control Conference ECC '91, 2-5 July 1991, Grenoble, France.
Gantmacher F.R. (1977). The theory of matrices. Chelsea Publishing Company.
Geiger G. (1982). Monitoring of an electrical driven pump using continuous-time parameter estimation methods. Proceedings, 6th IFAC Symposium on Identification and Parameter Estimation, Washington.
Geiger G. (1984). Fault identification of a motor-pump system using parameter estimation and pattern classification. Proceedings, 9th IFAC Congress, Budapest.
Geiger G. (1986). Fault identification using a discrete square root method. International Journal of Modeling and Simulation, 6, 1, 26-31.
Goodwin G.C. and M.E. Salgado (1989). Quantification of uncertainty in estimation using an embedding principle. International Journal of Adaptive Control and Signal Processing, 8, 232-345.
Hägglund T. (1984). Adaptive control of systems subject to large parameter changes.
Proceedings, IFAC 9th Triennial World Congress, Budapest, Hungary, 993-998.
Henry J.R. (1988). CF-18/F404 transient performance trending. AGARD, Paper No. 448, Quebec City.
Iserman R. (1984). Process fault detection based on modelling and estimation methods -
A survey. Automatica, 20, 387-404.
Iserman R. (1987). Experiences with process fault detection methods via parameter estimation. In System Fault Diagnostics and Related Knowledge-Based Approaches, S. Tzafestas et al. (eds.), D. Reidel.
Iserman R. (1991). Fault diagnosis of machines via parameter estimation and knowledge processing. Proceedings, IFAC/IMACS Symposium "SafeProcess '91", 10-13 September 1991, Baden-Baden, Germany.
Iserman R., Appel W., Freyermuth B., Fuchs A., Janik W., Neumann D., Reiss Th. and P. Wanke (1990). Model based fault diagnosis and supervision of machines and drives. Proceedings, IFAC 11th Triennial World Congress, Tallinn, Estonia.
Janik W. and A. Fuchs (1991). Process- and signal-model based fault detection of the grinding process. Proceedings, IFAC/IMACS Symposium "SafeProcess '91", 10-13 September 1991, Baden-Baden, Germany.
Kaminski P.G. (1971). Square root filtering and smoothing for discrete processes. Ph.D. Thesis, Dept. of Aeronautics and Astronautics, Stanford University.
Kalouptsidis N. (1987). Efficient transversal and lattice algorithms for linear phase multichannel filters. IEEE Transactions on Circuits and Systems, CAS-37, 805-813.
Kalouptsidis N., Carayannis G. and D. Manolakis (1984). A fast covariance type algorithm for sequential least squares filtering and prediction. IEEE Transactions on Automatic Control, AC-29, 8, 752-755.
Kalouptsidis N., Manolakis D. and G. Carayannis (1983). A family of computationally efficient algorithms for multichannel signal processing. Signal Processing, 5, 1, 5-19.
Kalouptsidis N. and S. Theodoridis (1987). Parallel implementation of efficient LS algorithms for filtering and prediction. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35, 11, 1565-1569.
Karaboyas S. and N. Kalouptsidis (1991). Efficient adaptive algorithms for ARX identification. IEEE Transactions on Acoustics, Speech and Signal Processing.
Kumamaru K., Söderström T., Sagara S. and K. Morita (1988). On-line fault detection
in adaptive control systems by using Kullback discrimination index. Proceedings, IFAC
Identification and System Parameter Estimation, 1135-1140.
Kwon O.-K. and G.C. Goodwin (1990). A fault detection method for uncertain systems with unmodeled dynamics, linearization errors and noisy inputs. Proceedings, 11th IFAC Triennial World Congress, Tallinn, Estonia, 367-372.
Liu J.S.H. (1977). Detection, isolation and identification techniques for noisy degradation in linear, discrete-time systems. Proceedings, 1977 CDC, 1132-1139.
Ljung L. (1987). System Identification: Theory for the User, Prentice Hall, Inc.
Englewood Cliffs, NJ.
Ljung L., Morf M. and D. Falconer (1978). Fast calculations of gain matrices for recursive estimation schemes. International Journal of Control, 27, 1-19.
Maguire L.P. and G.W. Irwin (1991). Transputer implementation of Kalman filters. IEE Proceedings-D, 138, 4, 355-362.
Manolakis D., Carayannis G. and N. Kalouptsidis (1980). Fast inversion of vector generated matrices for signal processing. Signal Processing: Theories and Applications, North-Holland, 525-532.
Merrill W. (1984). Identification of multivariable high-performance turbofan engine dynamics from closed loop data. Journal of Guidance, 7, 677-683.
Merrington G., Kwon O.K., Goodwin G. and B. Carlsson (1991). Fault detection and diagnosis in Gas Turbines. Transactions of the ASME, 113, 276-282.
Neumann D. (1991). Fault diagnosis of machine-tools by estimation of signal spectra. Proceedings, IFAC/IMACS Symposium "SafeProcess '91", 10-13 September 1991, Baden-Baden, Germany.
Nold S. (1987). Fault detection in AC-drives by process parameter estimation.
Proceedings, IFAC 10th Triennial World Congress, Munich, Germany.
Nold S. and R. Iserman (1986). Identifiability of process coefficients for technical failure diagnosis. Proceedings, 25th IEEE Conference on Decision and Control, Athens, Greece, Dec. 1986, 1587-1592.
Pot J., Falinower V.M. and E. Irving (1984). Regulation multivariable adaptative des fours. Colloque CNRS "Commande Adaptative. Aspects Pratique et Theoriques", St. Martin d'Heres.
Potter J.E. (1963). New statistical formulas. Memo 40, Instrumentation Laboratory, MIT.
Pouliezos A., Stavrakakis G. and C. Lefas (1989). Fault detection using parameter estimation - A survey. Quality and Reliability International, 5, 4, 283-290.
Pouliezos A. and G.S. Stavrakakis (1989). Fast fault diagnosis for industrial processes applied to the reliable operation of robotic systems. International Journal of Systems Science, 20, 7, 1233-1258.
Reiß T. (1991). Model based fault diagnosis and supervision of the drilling process. Proceedings, IFAC/IMACS Symposium "SafeProcess '91", 10-13 September 1991, Baden-Baden, Germany.
Reiß T., Wanke P. and R. Iserman (1990). Model based fault diagnosis of a flexible milling center. Proceedings, IFAC Triennial World Congress, Tallinn, Estonia.
Rhodes I.B. (1990). A parallel decomposition for Kalman filters. IEEE Transactions on Automatic Control, AC-35, 3, 322-326.
Shibata H., Ikeda Y., Maruoka G., Aoki S. and T. Ogawa (1988). Application of estimation techniques to failure detection for A.C. electric machines. Proceedings, IFAC Identification and System Parameter Estimation, Beijing, PRC, 1147-1152.
Smed T., B. Carlsson, C.E. de Souza and G.C. Goodwin (1988). Fault detection and diagnosis applied to gas turbines. Technical Report EE8815, Dept. of Electr. Engr. and Computer Science, Univ. of Newcastle, Australia.
Söderström T. and K. Kumamaru (1985). On the use of Kullback discrimination index
for model validation of fault detection. Report UPTEC 8520R, Uppsala University,
Sweden.
Söderström T. and P. Stoica (1988). System Identification. Prentice Hall.
Stavrakakis G.S. and E.N. Dialynas (1991). Efficient computer based scheme for improving reliability performance of power substations. International Journal of Systems Science, 22, 9, 1527-1539.
Stavrakakis G.S. and A. Pouliezos (1991). Fatigue life prediction using a new moving window regression method. Mechanical Systems and Signal Processing, 5, 4, 327-340.
Stavrakakis G.S., Lefas Ch. and A. Pouliezos (1990). Parallel processing computer implementation of a real time DC motor drive fault detection algorithm. IEE Proceedings, Part B, 137, 5, 309-313.
Thornton C.L. and G.J. Bierman (1977). Gram-Schmidt algorithms for covariance propagation. International Journal of Control, 25, 243-260.
Tzafestas S.G. and G.S. Stavrakakis (1986). Model reference adaptive control of industrial robots with actuator dynamics. IFAC/IFIP/IMACS International Symposium on Theory of Robots, Vienna, Austria, December 3-5.
Wahlberg B. (1990). Robust frequency domain fault detection/diagnosis. Proceedings, 11th IFAC Triennial World Congress, Tallinn, Estonia, 373-378.
Wanke P. and T. Reiß (1991). Model based fault diagnosis and supervision of the main and feed drives of a flexible milling center. Proceedings, IFAC/IMACS Symposium "SafeProcess '91", 10-13 September 1991, Baden-Baden, Germany.
Watanabe K. and D.M. Himmelblau (1983). Fault diagnosis in nonlinear chemical processes.
CHAPTER 4
AUTOMATIC EXPERT PROCESS FAULT DIAGNOSIS AND SUPERVISION
4.1 Introduction
Automatic fault diagnosis, supervision and control of very complex systems are becoming extremely important. This is the direct consequence of the occurrence of recent disasters because of unsatisfactory control or missed diagnosis of failures (Three Mile Island and Chernobyl are but a few examples). Control and fault diagnosis cannot be realized without a good methodology of modeling, i.e. representing the structure and behavior of the systems under consideration in the significant states of their operation.
The conventional methods of large scale modeling require comprehensive knowledge about the system consisting of conforming elements (e.g. a set of ordinary differential equations), and no gaps in the knowledge are allowed. Complex physical systems (e.g. a nuclear power plant, chemical processes) contain several types of elements and processes (e.g. nuclear, mechanical, electrical, electronic, etc.) with different types of description and eventually gaps in the available knowledge. The purely numerical-mathematical approach of large scale systems modeling could not offer an adequate methodology to solve the problems arising in this field; therefore, symbolic and artificial intelligence methods have been tried to obtain an adequate solution.
Diagnosis is currently one of the largest application domains of expert systems.
Strategies and capabilities for diagnosis have been evolving rapidly. Most of the past
applications involving diagnosis have been rule-based. That is, they use simple
production rules to provide a mapping between the possible causes and inputs of a
system and the possible faults.
The most primitive approach to automation would be to store diagnostic procedures in a
computer and activate them when symptoms arise. This approach is valid, however, only
when the symptoms are anticipated and the corresponding procedures can be predeter-
mined by the designer of the diagnostic system.
The leading wave of technology, however, provides powerful new techniques that are applicable in a broad range of situations. These techniques give the ability to build and reason about deep models and can operate with a wide range of information, such as learning from experience, probabilistic information, fuzzy reasoning and learning from examples. At the same time, a clearer picture has emerged about the range of strategies available and when they are most appropriate.
In this chapter these strategies will be examined and the nature of automatic expert diagnostic and supervision systems, with respect to them, will be revealed. A framework for coupling them together and their real-time implementation features will be provided. Examples from the recent expert diagnostic practice in industry will be presented to help the reader to delve into the matter.
An industrial Expert System (IES), in its most basic sense, is no more than a tool to organize and codify for the computer, the experience and thought processes of a human
with expertise concerning the operation of a technological process or an industrial plant
or a given piece of equipment.
Knowledge engineering (KE) is the process of building expert systems. Such systems are medium- to large-scale software products which are designed to solve problems of different kinds using a knowledge-based approach, where the knowledge is represented in an explicit manner. They have a wide area of applicability, particularly in industrial control.
Generic categories of knowledge engineering applications are interpretation, prediction,
diagnosis, design, planning, monitoring, debugging, repair, instruction and control. Such
systems normally contain two main components: the inference mechanism (the problem
solving component) and the knowledge base (which may actually comprise a number of
knowledge bases). Generally speaking, expert systems work best in narrow application
domains.
The process of building an expert system consists of two main activities which usually
overlap: acquiring the knowledge and implementing the system. The acquisition activity
involves the collection of knowledge about facts and reasoning strategies from the do-
main experts. Usually, such knowledge is elicited from the experts by so-called knowl-
edge engineers, using interviewing techniques or observational protocols. However, ma-
chine induction, which automatically generates more elaborate knowledge from an initial set of basic knowledge (usually in the form of examples), has also been extensively used.
In the system construction process, the system builders (i.e. knowledge engineers), the
domain experts and the users work together during all stages of the process, which tra-
ditionally has involved extensive prototyping.
To automate the problem solving process, the relevant task knowledge in the domain of
interest needs to be understood in great detail. However, acquiring the knowledge for
expert system building is generally regarded as a hard problem. This is not surprising, as
acquiring knowledge from an expert entails answering some really fundamental questions
such as:
• What is the relationship between knowledge and language?
• How can different domains be characterized?
• What constitutes a theory of problem solving?
The process of extracting knowledge from an expert is not the process of transferring a
mental model lying in the brain of an expert into the mind of the system builder, but the
formalization of a domain for the first time, and this is inherently a difficult process.
Ideally, models of conceptual structures of problem solving behavior are required as a
prerequisite to the knowledge transfer process. However, cognitive science approaches
have not yet yielded sufficient information to enable a full understanding of the knowl-
edge structures and problem solving strategies of experts to be applied, so that current
approaches are incomplete and often ad hoc.
The situation is further complicated by the fact that experts often have faulty memories
or provide inconsistencies. This means that separate validation of the expertise elicited
from experts is essential. Furthermore, experts exhibit cognitive biases such as overcon-
fidence, simplification, and a low preference for the abstract, the relative and conflicting
evidence. It is therefore important to test and validate expert systems both by analyzing
the expertise in the knowledge base and by examining failures in actual performance. As
far as possible, cognitive biases should be filtered out during the elicitation process.
A great deal of experimental evidence exists about the limitations of human decision
making and it has been suggested that the development of systems which mimic human
problem solving should be approached with some degree of caution. In order to reduce
the chances of bias, experts should be made aware of commonly found biases in
judgment, the elicitation process should include probes to foster the consideration of
alternatives, and when experts run through sample problems in the elicitation process, it
should be borne in mind that the way in which the problems are presented will have an
impact on the extent to which any derived rules exhibit cognitive bias.
Appraising the knowledge engineering process from a cognitive engineering viewpoint,
the following six stages, termed mainstream development, are suggested:
1. Knowledge elicitation.
2. Cognitive bias filtering.
3. Knowledge representation and control scheme selection.
KNOWLEDGE ENGINEERING
Developing knowledge-based systems is a far from trivial process. Those who build
knowledge-based systems for industrial systems supervision and diagnosis know that no
significant systems development can sensibly take place without a structured approach.
Systematic and structured approaches to KBS development available today can be found
in Hickman et al. (1989) and Luger and Stubblefield (1989).
Expert systems can be introduced into industrial systems to provide support for different
classes of people such as designers, operators and maintenance personnel. In general
such systems will be off-line (for designers and maintenance personnel) and on-line (for
operators). The knowledge engineering task will be different for each of these applica-
tions since the tasks involved will comprise different knowledge sources and structures.
One difference is that between technological/scientific knowledge and experiential
knowledge. This difference has been described as a knowledge of functioning versus a
knowledge of utilization. The former is used by designers and maintenance personnel,
whereas the latter characterizes the knowledge used by operators.
Off-line knowledge-based systems are not time critical. They may utilize several knowl-
edge sources including technical documents, reference literature, handbooks, ergonomic
knowledge, and knowledge about operator personnel (for use in user modeling). Whilst
their operation is not time critical, operator time constraints may still have to be taken
into account.
The most critical and challenging industrial expert systems are those developed for sys-
tem operation. They may encompass support for the automatic monitoring system as
well as support for the operators, and may provide heuristic control, fault diagnosis, con-
sequence prediction and procedural support. The latter is particularly suitable for consis-
tency checking of input sequences or for operator intent recognition (Johannsen and
Alty, 1991).
Expert systems will be more effective when linked to dynamic databases. Knowledge can
then be applied to new situations by periodically executing rules and queries. Because of
this linkage, knowledge acquisition costs can be amortized over many instances of reuse.
Users will not lose their motivation to employ the system because of the pain of
having to enter data each time. Furthermore, data-driven expert systems can propose and
rank suggestions to deal with the world as changes are observed (Kaiser et al. 1988).
Support expert systems work under time constraints because they are running in parallel
with the dynamic industrial process. These expert systems will depend upon a number of
knowledge sources related to knowledge of functioning and knowledge of utilization.
Additional knowledge such as that of senior engineers will be required.
Whilst a support expert system for predicting the consequences of some technical failure
will normally need only engineering knowledge, procedural support, diagnosis and heu-
ristic control modules will need operational knowledge as well. Since they will also have
to be integrated with the supervision and control system, they will need to support nu-
merical as well as symbolic knowledge.
The importance of signal and symbol processing has been emphasized by Nawab et al.
(1987) and Rouse et al. (1989). They point out that models of symbol processing are
much harder to identify than those of signal processing, because semantics and
pragmatics play a large role in symbol processing systems. In particular, they stress the
need for symbolic representations in industrial process control applications.
In all cases of knowledge-based systems development, it will be necessary to define
carefully the goals and functionalities of the various systems and their interdependencies
at an early stage. It is also important to realize that in the industrial environment, not all
applications are suitable for the application of knowledge-based techniques. For
example, existing numerical supervision and control systems are based upon thorough engi-
neering methodologies and replacement by knowledge-based techniques would, in most
cases, lead to performance degradation.
Finally, it must be realized that most industrial applications are very complex and this
makes the problem of acquiring and assembling the knowledge in the industrial envi-
ronment much more severe than in traditional computing domains. The elicitation and
conceptualization processes are liable to be far more complex and attempts to prove the
consistency of the knowledge will be very time-consuming. The full process is likely to
take years rather than months. In the absence of a powerful methodology, one is forced
to work with inadequate tools for some time to come.
The most time consuming portion of constructing an expert system is the knowledge
acquisition phase (Forsythe and Buchanan, 1989; Adelman, 1989). Conceptually, knowledge
engineering is a measurement problem. This measurement problem is a complex one
because there are five sources of variation: domain experts, knowledge engineers,
knowledge representation schemes, elicitation methods and problem domains. Adelman,
(1989), suggests the use of two or three distinctly different knowledge engineers,
knowledge representation schemes and elicitation methods when working with two or
more domain experts. The expert system development team will then be able to identify
which, if any, of these sources of variability result in disagreement in the predictions of
the knowledge base and, thereby, resolve them.
The techniques used in knowledge acquisition can be broadly divided into two catego-
ries: elicitation and machine induction. Strictly speaking, there is a continuum between
human-human elicitation and automatic induction. Three general principles have been
proposed for the acquisition process by Gruber and Cohen, (1987). They are concerned
with primitives and generalizations.
The first principle prescribes that task-level primitives should be designed in order to
capture important domain concepts defined by the expert. The knowledge engineer must
use a language of task-level terms rather than imposing implementation-level primitives.
This principle stresses the importance of separating out acquisition from implementation.
These task-level primitives must be natural constructs for describing information, hy-
potheses, relations, and actions, in the language of the domain expert. This would
suggest that task analyses should be combined with knowledge analyses.
The second principle suggests that explicit declarative representational primitives are
preferable to procedural descriptions. This principle is based upon the observation that
most experts understand declarative representations more easily. Formulating procedural
aspects in this way can facilitate acquisition, explanation, and maintenance. Gruber and
Cohen, (1987), suggest that an expert should be asked "for the parameters of a domain
that affect control decisions, and then to formulate control knowledge in terms of these
parameters" .
The third principle requires representations at the same level of generalization as the
expert's knowledge. Experts should not be forced to generalize except when absolutely
necessary and they should not be asked to specify information not available to them. An
example of an oversimplified generalization would be the requirement to categorize a
process variable as high, medium or low, when the expert needs to differentiate between
many more steps or even a full range of numbers.
Knowledge elicitation.
A number of techniques for knowledge elicitation are now in use. They usually involve
the collection of information from the domain expert(s) either explicitly or implicitly.
Originally, reports written by the experts were used, but this technique is now out of fa-
vour since such reports tend to have a high degree of bias and reflective thought. Current
techniques include interviews (both structured and unstructured), questionnaires or ob-
servational techniques such as protocol analyses and walkthroughs.
Knowledge elicitation methodologies have more in common with the field-work orienta-
tion of anthropology and qualitative sociology than with the experimental orientation of
many cognitive sciences. It is suggested that knowledge engineers also use the large
amount of literature and experience as weil as the much longer tradition of the social
sciences in field worle, particularly data-gathering methods such as face-to-face inter-
viewing. Some pitfalls of knowledge elicitation are described on the basis of this experi-
ence in the social sciences. In particular, some interviewing problems such as obtaining
data versus relating to the expert as a person, fear of silence and failing to listen, diffi-
culty in asking questions, interviewing without arecord as weIl as conceptual problems
such as treating interview methodology as unproblematic or blaming the expert, are
explained.
Interviews. In a structured interview, the knowledge engineer is in control. Such inter-
views are useful for obtaining an overall sense of the domain. In an unstructured inter-
view, the domain expert is usually in control; however, such interviews can, as the name
implies, yield a somewhat incoherent collection of domain knowledge. The result can be
a very unstructured set of raw data that needs to be analyzed and conceptualized. It is
Automatie expert process fault diagnosis and supervision 263
obviously important for the knowledge engineer to have some knowledge of the domain
before wasting the valuable time of the expert. This might be obtained through text-
books, manuals and other well-documented sources. Group interviews can be useful
particularly in the phase of cognitive bias filtering.
Far from coming naturally, interviewing is a difficult task that requires planning, stage-
management, technique and a lot of self-control. Forsythe and Buchanan, (1989), present
some ethnographic techniques that can be applied to the problem of identifying and miti-
gating difficulties of communication between knowledge engineers and experts during
interviews.
Questionnaires and rating scales. Questionnaires can be used instead of, or in addition
to, interviews. The interviews can be standardized in question-answer categories, or
questionnaires can be applied in a more formal way. However, in most cases the latter
should be handled in a relaxed manner, in order to build up an atmosphere of confidence
and not disturb the expert too much when applied in actual work situations.
Rating scales are formal techniques for evaluating single items of interest by asking the
expert to cross-mark a scale. Verbal descriptions along the scale such as from "very low"
to "very high" or from "very simple" to "very difficult" are used as a reference for the
expert. The construction, use and evaluation of rating scales is described very well in the
psychological and social sciences literature. Rating scales can also be combined with in-
terviews or questionnaires.
Observations. Observations are another technique for knowledge elicitation. They re-
quire little or no active participation of the expert. All actions and activities of the expert
are observed as accurately as possible by the knowledge engineer who makes recordings
of all the observed information. A special mixture of interview and observation tech-
niques are the observation interviews. Sequences of activities are observed, and questions
about causes, reasons and consequences are asked by the knowledge engineer during these
observations. The combined technique is very powerful because the sequence of activi-
ties is observable whereas decision criteria, rules, plans etc. are elicited in addition
through what-, how- and why- questions.
Protocol analysis. Protocol analyses are useful for obtaining detailed knowledge. They
can involve verbal protocols in which the expert thinks aloud whilst carrying out the
task, or motor protocols in which the physical performance of the expert is observed and
recorded (often on videotape). Eye movement analysis is an example of a very
specialized version of this technique. Motor protocols, however, are usually only useful
when used in conjunction with verbal protocols.
In a verbal protocol, the expert thinks aloud and a time-stamped recording is made of his
utterances. In such protocols, the expert should not be allowed to include retrospective
utterances. He or she should avoid theorizing about their behavior and should "only report
information and intentions within the current sphere of conscious awareness". As a ver-
bal protocol is transcribed, it is broken down into short lines corresponding roughly to
meaningful phrases. This technique can collect the basic objects and relations in the do-
main and establish causal relationships. From these a domain model can be built.
Experience with the use of verbal protocols for the analysis of trouble-shooting in the
maintenance work of technicians is described by Rasmussen (1984).
The critical decision method (CDM) as described by Klein et al., (1989), is a special
protocol analysis which elicits knowledge from experts and novices in a retrospective
way. Non-routine cases such as critical incidents are selected in order to discriminate the
true knowledge of the expert(s). Sources of bias are minimized by asking for
uninterrupted incident descriptions. Subsequently, the history of the incident is
reconstructed by means of time lines, and decision points are identified and probed. It is
claimed that, by using the critical decision method, knowledge can be elicited with
relatively little effort.
The cognitive task analysis approach. Roth and Woods, (1989), and Lancaster et al.
(in SIGART newsletter, p. 152, 1989) suggest a multi-phase progression from initial
informal interview techniques (to derive a preliminary mapping of the semantics of the
domain), to more structured knowledge elicitation techniques (to refine the initial
semantic structure), to controlled experiments designed to reveal the knowledge and
processing strategies utilized by domain practitioners.
The first phase (categorial knowledge structure) gives a preliminary cognitive description
of the task as a guide for further analysis. It is important here not to home in on specific
rules. One possibility is to get the experts to provide an overview presentation. Only
when an overview of the semantics of the application has been developed, can more
structured techniques be used.
The second phase (temporal event organization) concentrates on how practitioners per-
form their tasks; thus, there is an emphasis on observation and analysis of actual task
performance. It involves techniques such as critical incident review, discussion of past
challenges, or the construction of test cases on which to observe the experts at work.
During this phase the use of expert panels is also recommended in order to obtain a
corpus of challenging cases for identifying critical elements and strategies for handling
them.
The third phase (causal structures in knowledge) uses observational techniques under
controlled conditions to observe expert problem solving strategies. The practitioner is
observed and asked to provide a verbal commentary (i.e. the why and how of a particular
domain). The task can be deliberately manipulated, for example, by forcing the expert to
go beyond reasonable routine procedures. In some cases, the expert himself controls the
information gathering. Alternatively, it is controlled by the observer. Each approach
provides useful information; the former provides data on the diagnostic search process
and the latter on the effect (or bias) of particular types of information on expert interpre-
tations. Another useful technique is to compare the performance of experts with different
levels of expertise, so as to isolate what factors really account for superior performance.
Teachback interviewing. In this technique, the expert first describes a procedure to the
knowledge engineer, who then teaches it back to the expert in the expert's terms until the
expert is completely satisfied with the explanation. Johnson and Johnson, (1987),
describe this technique and illustrate its use in two case studies. Their approach is guided
by Conversation Theory, in which interaction takes place at two levels: specific and gen-
eral. The paper gives a useful set of guidelines on the strengths and weaknesses of this
technique.
Walkthroughs. More detailed than protocol analysis and often better because they can
be done in the actual environment, resulting in better memory cues. They need not, how-
ever, be carried out in real time. Indeed, such techniques are useful in a simulated envi-
ronment where states of the system can be frozen and additional questions pursued.
Time lines. Tables in which several items of knowledge are contained in columns. The
left column has to be filled with the time of occurrence of particularly interesting events
such as failures or operator actions. Related information about the behavior of the tech-
nical process, the automatic system and the human operators at these times is recorded in
separate columns with as much detail as is felt appropriate.
Formal techniques. These include multidimensional scaling, repertory grids and hierar-
chical clustering. Such techniques tend to elicit declarative knowledge. The most com-
monly used is the repertory grid technique based on personal construct theory. It is
used in ETS, (Boose, 1986), which assists in the elicitation of knowledge for
classification type problems. In ETS, the expert is interviewed to obtain elements of the
domain. Relationships are then established by presenting triads of elements and asking
the expert to identify two traits which distinguish the elements. These are called
constructs. They are then classified into larger groups called constellations. Various
techniques such as statistical, clustering and multidimensional scaling are then used to
establish classification rules which generate conclusion rules and intermediate rules
together with certainty factors. The experts are interviewed again to refine the
knowledge. ETS is said to save 2-5 months over conventional interviewing techniques.
The system has been modified and improved and is now called AQUINAS (Boose and
Bradshaw, 1988). To obtain procedural knowledge, techniques such as verbal protocols
can be used.
Hypertext as a means of knowledge acquisition. Hypertext is an approach to informa-
tion management in which information is organized as a network of nodes connected by
links. Nodes may contain text, graphics, audio, video and generally software for operat-
ing on numeric and/or symbolic data. While other software paradigms are promising
similar things, the essence of hypertext is that linking is machine-supported. At the de-
velopment level most hypertext environments feature control buttons (link icons) which
can be arbitrarily embedded within the content material by a user. Hypertext allows for
easy and intuitive access to documents and programs by linking dispersed yet interrelated
information throughout a document, a program or a series of documents/programs.
Traditional document structure is sequential, in other words there is a single linear se-
quence defining the order in which the text is to be accessed. Hypertext is nonsequential,
i.e. there is no single order that determines the sequence in which text is to be read.
Hypertext is simply the nonlinear presentation of any informational medium. Some com-
munities prefer to reserve this term for textual information only and use hypermedia as a
more general one. The term hypertext will be used here in the more general sense, as
applied to the broad scope of informational media including graphics, video disc, and
other such media. Hypertext systems have been used for information management and
intelligent computer aided instruction systems, and they should prove equally useful in
the area of expert systems technology.
A general knowledge acquisition tool designed around a hypertext concept could allow a
knowledge engineer to list important concepts, create nodes attached to these concepts
which explain their relevance, connect related concepts by linking their nodes, use
graphics to explain difficult concepts, and even critique information entered into the
system previously. In such a system, knowledge acquisition would not be confined to
linear input of information. The knowledge engineer could use the hypertext system to
compile knowledge gathered from an expert after interviewing, or (s)he could enter the
knowledge into the system as the expert sits there saying what information to encode.
It is the nonsequential capabilities of hypertext systems that make them attractive as
automated knowledge acquisition tools. The user of a hypertext system can dynamically
create new nodes for information, make notes to her/himself in these nodes, and attach
these nodes to the places in the hyperspace which have caused her/him to think of the
nodes. The user could even make a node and leave it unattached to any reference, but
displayed on the screen as a reminder of information which needs to be added to the
hypertext database. A hypertext system facilitates linking together related information in
hierarchical structures that resemble the relationships between the nodes of information,
regardless of how complex the structure of the relationship may be.
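To make the node-and-link organization concrete, a minimal data-structure sketch is given below; the class, field and node names are illustrative assumptions, not taken from any particular hypertext tool.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A hypertext node: a concept, definition, graphic reference or free-form note."""
    name: str
    kind: str              # e.g. "concept", "definition", "note", "graphic"
    content: str = ""
    links: List["Node"] = field(default_factory=list)

    def link_to(self, other: "Node") -> None:
        # Machine-supported linking: the essence of hypertext.
        self.links.append(other)

# A knowledge engineer jots down a concept and attaches explanatory nodes to it.
pump = Node("pump cavitation", "concept")
definition = Node("cavitation", "definition", "vapour bubbles collapsing at the impeller")
note = Node("ask expert", "note", "how is cavitation distinguished from bearing noise?")
pump.link_to(definition)
pump.link_to(note)        # a reminder could equally be left free-floating, unattached

print([n.name for n in pump.links])   # -> ['cavitation', 'ask expert']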
Consider an example session with a knowledge engineer or an expert sitting down for a
first pass at amassing the knowledge relevant to a project at hand. The user brings up the
hypertext knowledge acquisition tool and enters a name for the knowledge base. The
system confirms a new knowledge base and displays a node for relevant concepts. The
user enters in list format short phrases to describe the topics which (s)he believes the ex-
pert system will have to know about in order to complete its function. Each concept will
represent a link to a node in which that concept is described in further detail. There can
be multiple links from concepts to other nodes. Also, concepts can be linked to one an-
other to indicate their relationships (similarly to semantic network connections). To the
nodes describing the concepts, other nodes can be attached. Such nodes can be definition
nodes, primarily for use in the expert system for explanation and help facilities, graphic
nodes, also for user help in the expert system, note nodes, as discussed above, and more
informational nodes describing subconcepts within the main concepts. The key here is
that the node attachments are created by the user as (s)he sees best to represent the
knowledge required for implementing the expert system.
Another capability which the hypertext environment offers is the contribution of
knowledge to the knowledge base from multiple experts. Once an initial attempt has
been made at gathering the knowledge for the expert system, the experts can navigate
through the hypertext system and judge the validity of the knowledge represented. Each
expert can attach comment nodes to currently existing nodes which indicate what (s)he
thinks about the information contained within the original node. Experts can even
critique the information input by other experts (this is a capability which may or may not
be desirable, so there should be provisions for enabling and disabling such a feature).
There exists, however, a fundamental drawback in hypertext: user disorientation and
even confusion. Increasing the number of connections, or links, increases the possibility
that a user will get lost in irrelevant information. User disorientation may be particularly
severe for large scale applications such as those involved in the utilization of power plant
databases. The process of moving through a hypertext information base is referred to as
navigation. Tsoukalas et al., (1991), outline navigational tools based on the theory of
fuzzy graphs and fuzzy relations. These tools quantify context-dependent user
preferences and application-specific constraints in such a manner that a user may direct
(him)herself to an information island of interest. A numerical example and a prototype
for monitoring special material in a nuclear power plant are included.
Bottom-up and top-down knowledge capturing.
There are two competing views about the knowledge acquisition task which might be
described as bottom-up and top-down.
The bottom-up proponents aim to prise data and concepts out of the expert and then
refine them iteratively. The implication is that deeper mining will reveal more relevant
knowledge, but this assumes that there is a simple relationship between what is verbal-
ized by experts and what is actually going on in their minds.
The basic assumption underlying the bottom-up approach is that an expert system is
based upon a large body of domain specific knowledge, and that there are a few general
principles underlying the organization of the domain knowledge in an expert's mind.
However, the existence of underlying principles and causal relationships may be an indi-
cation that expert knowledge is somehow domain independent. So, expert behavior that
is seemingly domain-specific may originate from higher level problem solving methods
which are well structured and have some degree of domain independence. Currently, the
most popular heuristic rule generation procedures are based on schemes of top-down
induction of decision trees (TDIDT) (Gray, 1990).
The essence of a TDIDT program for decision tree generation is depth-first search. One
starts with some training set of example objects, each characterized by attribute values
and a class designation. The program selects and tests a binary attribute, resulting in two
recursive subproblems. Each subproblem involves a subset of the original example set.
The first subproblem is then analyzed in the same manner leading to two further recur-
sive subproblems. Search proceeds depth-first until either all objects associated with a
general subproblem are in the same class (completing a decision tree branch), or all at-
tributes have been utilized (demonstrating that either data are incorrect or the attribute
set is inadequate). When a branch is completed, the program backtracks to a previous
choice point and from there explores the subproblem(s) associated with alternative val-
ues for the tested attribute.
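A minimal sketch of this depth-first TDIDT scheme is given below; the attribute-selection heuristic is left as a simple placeholder, and the function names and the toy example set are illustrative assumptions, not taken from Gray (1990).

def build_tree(examples, attributes):
    """Depth-first TDIDT: examples are (attribute_dict, class_label) pairs,
    attributes are binary (True/False)."""
    if not examples:
        return None                       # empty branch: no training example reached this node
    classes = {label for _, label in examples}
    if len(classes) == 1:                 # all objects in the same class: branch complete
        return classes.pop()
    if not attributes:                    # attributes exhausted: data or attribute set inadequate
        return max(classes, key=lambda c: sum(1 for _, l in examples if l == c))
    attr = attributes[0]                  # placeholder heuristic: take the first attribute;
                                          # a real TDIDT program picks the most informative one
    rest = [a for a in attributes if a != attr]
    true_branch = [e for e in examples if e[0][attr]]
    false_branch = [e for e in examples if not e[0][attr]]
    return {attr: {True: build_tree(true_branch, rest),
                   False: build_tree(false_branch, rest)}}

# Toy training set: two binary symptoms and a fault class (purely illustrative).
examples = [({"high_vibration": True,  "high_temperature": False}, "bearing fault"),
            ({"high_vibration": True,  "high_temperature": True},  "bearing fault"),
            ({"high_vibration": False, "high_temperature": True},  "overload"),
            ({"high_vibration": False, "high_temperature": False}, "normal")]

print(build_tree(examples, ["high_vibration", "high_temperature"]))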
If the attribute set is sufficient, many decision trees may exist that correctly classify
training set examples. The program uses inductive inference to construct a decision tree
that correctly classifies other objects in addition to those in the training set.
Consequently, a decision tree must capture some meaningful relation between an object
class and its attribute values. Given several trees that correctly classify training set data,
the "simplest" is usually chosen on the grounds that it is more likely to capture the
inherent structure of the problem.
TDIDT methods restrict search by employing a heuristic measure: they consider combi-
nations of attributes appearing to have a high information content. This useful restriction
can make rule generation feasible, but other aspects of the TDIDT approach are unhelp-
ful. Its depth-first recursive search with backtracking causes it to impose inappropriate
context restrictions on rule search. These restrictions lead to opaque knowledge repre-
sentations and excessive sensitivity to noise.
Gray, (1990), proposed algorithmic modifications so that users can still measure heuristi-
cally the information content of attributes in order to guide the search. The program
iteratively examines all positive instances remaining to be covered, along with negative
training set instances. Moreover, the search does not take place with irrelevant context
restrictions. This algorithm is no more complex than usual TDIDT, it is just as fast, it is
less sensitive to noise, and it leads to clearer representations of the information present in
training set data.
Machine induction.
Machine induction is a special case of machine learning which encompasses heuristics for
generalizing data types, candidate elimination algorithms, methods for generating deci-
sion trees and rule sets, function induction and procedure synthesis. MacDonald and
Witten, (1989), developed a framework for describing such techniques that allows an
evaluation of the usefulness of any method in solving particular knowledge engineering
problems. They have concentrated upon decision tree and rule set generation approaches
because these techniques have been successfully used in a number of knowledge acquisi-
tion situations.
It is commonly observed that experts have great difficulty in explaining the procedures
which they use to arrive at decisions. Indeed, experts often make use of assumptions and
beliefs which they do not explicitly state, and are surprised when the consequences of
these hidden assumptions are pointed out. The inductive approach relies on the fact that
experts can usually supply examples of their expertise even if they do not understand
their own reasoning mechanisms. This is because creating an example set does not re-
quire any understanding of how different evidence is assessed or what conflicts were
resolved to reach a decision. Sets of such examples are then analyzed by an inductive
algorithm (one of the most popular being the ID3 algorithm, see knowledge acquisition
tools in the following) and rules are generated automatically from these examples.
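As a rough illustration of the last step, the sketch below turns a decision tree of the kind produced above into IF-THEN rules by collecting the attribute tests along each root-to-leaf path; it is a minimal sketch under the same toy representation as before, not the actual ID3 implementation.

def tree_to_rules(tree, conditions=()):
    """Collect one IF-THEN rule per root-to-leaf path of a TDIDT-style tree."""
    if not isinstance(tree, dict):                      # leaf: a class label (or None)
        if tree is not None:
            antecedent = " AND ".join(f"{a} is {v}" for a, v in conditions) or "TRUE"
            yield f"IF {antecedent} THEN class is {tree}"
        return
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

tree = {"high_vibration": {True: "bearing fault",
                           False: {"high_temperature": {True: "overload",
                                                         False: "normal"}}}}
for rule in tree_to_rules(tree):
    print(rule)
# IF high_vibration is True THEN class is bearing fault
# IF high_vibration is False AND high_temperature is True THEN class is overload
# IF high_vibration is False AND high_temperature is False THEN class is normal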
The problem with inductive techniques is that the rules induced depend both upon the
example set chosen and the inductive algorithm used. There is no guarantee that the rules
induced will be valid knowledge. Therefore, the approach normally involves a check with
the expert to validate the induced rules. It is not uncommon to cycle a number of times
through the induction process, refining the knowledge base with the domain expert.
The most important guidelines in the appropriate use of inductive techniques are:
• The technique is useful if there are documented examples or if they can be obtained
easily. It is not suitable where an unpredictable sequence of observations drives the
system (e.g. as in some real-time situations).
• The technique is consistent and unbiased and is very suitable for domains where
rules form a major part of the knowledge representation.
• Induction provides the knowledge engineer with questions, results and hypotheses
which form a basis for consultation with the expert.
• There is no explanation for the rules produced. All output must be examined criti-
cally.
• The process assumes that the example set is complete and current.
• Results should not be sensitive to small changes in the training set.
The inductive technique has been used successfully for weather prediction, predicting the
behavior of a new chemical compound, diagnosing plant disease, symbolic integration,
improved debt collection, and designing gas-oil separators.
Knowledge acquisition tools.
A large number of tools for supporting the knowledge acquisition process have been
developed in the academic environment and some of these have been mentioned already.
The general aim of all these tools is to minimize the number of iterations needed for the
whole knowledge engineering process by bridging the gap between the problem domain
and the implementation. Boose and Gaines, (1988), and Johannsen and Alty, (1991),
give a brief summary of the main tools under development. Some
tools endeavour to make the process fully automatic. KRITON, for example, has a set of
procedures and pre-stored interviews, and caters for incremental text analysis and
protocol analysis. Repertory grids are used to pull out declarative knowledge. An
intermediate knowledge representation system is suggested for supporting the
knowledge elicitation techniques. The knowledge representation scheme involves a
propositional calculus for representing transformations during the problem solving
process and a descriptive language for functional and physical objects. This is then
translated semi-automatically into the run-time system, but this commits the knowledge
engineer to a particular representation.
Other tools, for example KADS and ACQUIST, merely provide a set of tools to aid in a
more methodological approach. Thus, KADS aims only at producing a document de-
scribing the structure of the problem in the form of a documentation handbook.
The KADS methodology is based upon the following principles (Hickman et al., 1989):
• Knowledge and expertise should be analyzed before the design and implementation
starts, i.e. before an implementation formalism is chosen.
• The analysis should be model driven as early as possible.
• Expert problem solving should be expressed as epistemological knowledge.
• The analysis should include the functionality of the prospective system.
• The analysis should be breadth-first allowing incremental refinement.
• New data should only be elicited when previous data has been analyzed.
• All collected data and interpretations should be documented.
KRITON supports only bottom-up knowledge acquisition, but KADS supports both top-
down and bottom-up through a hypertext protocol editor (PED), and hierarchies are
developed and manipulated by a context editor (CE). Top-down is supported by a set of
interpretation models, each describing the meta-level structure of a generic task.
KEATS-1 provides a cross reference editing facility (CREF) and a graphical interface
system (GIS), to support data analysis and domain conceptualization. CREF organizes
the verbal transcript text into segments and collections and GIS allows the knowledge
engineer to draw and manipulate domain representations on a sketch pad. In KEATS-2,
these have been replaced by ACQUIST, a hypertext application for structuring the
knowledge from the raw text data. Fragments from the data are collected around con-
cepts, concepts are factored into groups, and groups into meta-groups. Links can then be
defined between any of these entities. The emerging structure is displayed graphically.
ACQUIST provides support for both bottom-up approaches (fragments to concepts to
groups to meta-groups) and top-down approaches (using what are called coding sheets
on which a caricature of the observed behavior of the domain expert is captured). In this
approach, the knowledge engineer uses a predefined abstract model to guide the
knowledge acquisition process. Use of such models (even if incomplete or inadequate)
can dramatically improve the knowledge acquisition process. The coding sheet is a set of
hypertext cards.
A further knowledge acquisition tool is ROGET. It conducts a dialogue with a domain
expert in order to acquire his/her conceptual structure. ROGET gives advice on the
basis of abstract categories and evidence. Initial conceptual structures are selected on
this basis. Only a small set of example cases has been tested on this system.
The systematic acquisition of knowledge about the faulty behavior of a technical system
was suggested by Narayanan and Viswanadham, (1987). A procedure involving the
development of a hierarchical failure model with fault propagation digraphs and cause-
consequence knowledge bases for a given system is proposed. It uses the so called
augmented fault tree as an intermediate knowledge representation. Fault propagation
digraphs describe the hierarchical structure of the system with respect to faults in terms
of propagation. The cause-consequence knowledge bases characterize failures of
subsystems dependent on basic faults by means of production rules. The knowledge
acquisition process can be reduced to defining parameters required by the knowledge
representation scheme and transforming human expertise into these parameter values.
The augmented fault tree is a conceptual structure which describes causal aspects of
failures, as in conventional fault trees, as well as probabilistic, temporal and heuristic
information (see also Contini, 1987). The production rules of cause-consequence
relations are derived from the augmented fault tree by decomposing it into mini fault
trees. The proposed methodology has reached a relatively high level of formal
description. However, it cannot yet deal with inexact knowledge by using ranges of pa-
rameters. An example of a failure event in a reactor system is given.
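To illustrate the idea of a fault propagation digraph, a minimal sketch is given below; the node names are hypothetical placeholders and the code is not the reactor example of Narayanan and Viswanadham (1987).

# Edges state that a fault in the source node can propagate to the target node.
propagation = {
    "coolant pump failure": ["low coolant flow"],
    "low coolant flow": ["high core temperature"],
    "high core temperature": ["reactor trip"],
    "sensor drift": ["high core temperature"],   # a second basic fault with the same consequence
}

def consequences(basic_fault):
    """Follow the digraph to list every subsystem failure reachable from a basic fault."""
    reached, frontier = set(), [basic_fault]
    while frontier:
        fault = frontier.pop()
        for effect in propagation.get(fault, []):
            if effect not in reached:
                reached.add(effect)
                frontier.append(effect)
    return reached

print(consequences("coolant pump failure"))
# -> the three downstream failures: low coolant flow, high core temperature, reactor trip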
Strategic knowledge acquisition tools have been built especially for diagnosis purposes:
TEIRESIAS, ASK, CART, ODYS, NEOMYCIN, HERACLES, BDM-KAT, CATS, ID3,
ELf, HELIOS, MILKAM, KSSO assistants and their evolutions are some of the more
recent knowledge elicitation systems which are in use in medical and industrial diagnosis
practice. The reader is referred to the SIGART Newsletter (1989), Mussi and Morpurgo
(1990) and Brule and Blount (1989) for details.
Having concluded that the human's cognitive limitations, biases, errors, and lack of
knowledge are major obstacles in diagnostic performance, it seems reasonable to attempt
full or partial automation of the diagnostic procedure in order to aid the human
diagnostician in real-time. One approach in this category that has recently received
widespread attention is that of expert systems (ES).
Some recent surveys of ES appropriate for fault detection/diagnosis of technological
processes are provided by Milne (1987), Tzafestas (1987, 1989, 1991), Majstorovic
(1990), Prasad and Davis (1993) and Dounias et al. (1993).
Several designs have been efficiently employed for fault diagnosis in various domains
such as medical, electronic, computer H/W and S/W, and industrial process diagnosis. A
brief description of the main characteristics of the most important current approaches is
given in the following.
Shallow reasoning approach.
Since shallow reasoning is highly domain-specific, diagnosis is fast if the symptom has
been experienced and thus has been included in the knowledge base. This reasoning
typically uses (production) rules which consist of antecedents and consequents. An ante-
cedent is a condition part and a consequent is an action part of a rule. If certain condi-
tions are met, then some actions are performed. For this reason, such a rule is often
called an IF-THEN rule. These rules can be classified by their behavior, i.e. self-managing
rules or meta-rules. A self-managing rule is one in which actions are performed without
referring to any other rules. A meta-rule is one in which its actions result from the
triggering of other rules.
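For concreteness, a minimal sketch of such IF-THEN rules is given below; the rule content and fact names are hypothetical and only illustrate the antecedent/consequent split, not any particular shell's rule syntax.

# A production (IF-THEN) rule: an antecedent that tests working-memory facts
# and a consequent that acts on them when the conditions are met.
rules = [
    {"name": "R1",
     "if": lambda facts: facts.get("vibration") == "high" and facts.get("temperature") == "high",
     "then": lambda facts: facts.update({"diagnosis": "bearing fault"})},
    {"name": "R2",
     "if": lambda facts: facts.get("spectrum_peak_at_shaft_freq") is True,
     "then": lambda facts: facts.update({"vibration": "high"})},
]
# A meta-rule would instead act on the rule set itself, e.g. enabling or ordering other rules.

facts = {"spectrum_peak_at_shaft_freq": True, "temperature": "high"}
for _ in range(len(rules)):               # enough passes for simple chains of rule firings
    for rule in rules:
        if rule["if"](facts):
            rule["then"](facts)
print(facts.get("diagnosis"))             # -> bearing fault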
A disadvantage, however, is that shallow reasoning is rigid in the sense that substantial
changes in the rules may have to be made if a single component is added or deleted.
Consequently, the number of rules becomes practically unmanageable as the number of
components of the system being diagnosed increases. If multiple faults can occur
simultaneously, then this approach becomes combinatorially explosive.
The classical examples of shallow expert diagnostic systems are MYCIN and
NEOMYCIN (see Section 4.2.2). The main shortcomings of this approach are:
• Difficult knowledge acquisition.
• Unstructured knowledge requirements.
• Diagnosability or knowledge-base completeness not guaranteed.
• Excessive number of rules.
• Knowledge-base highly specialized to the individual process.
These disadvantages can be overcome by decomposing the problem into smaller prob-
lems either in a hierarchical manner or according to unit operations.
Deep knowledge approach.
This approach is appropriate for man-made technological system diagnosis (causal ori-
ented systems) and is based on a structural and functional model of the problem domain.
A deep knowledge ES attempts to capture the underlying principles of a domain (or
process) explicitly, and so the need to predict every possible fault scenario is eliminated.
Obviously, this approach leads to expert system tools that are able to handle a wider
range of problem types and larger problem domains (Yoon and Hammer, 1988).
Deep reasoning, compared to shallow reasoning, is more flexible and thorough, but
slower. "More flexible" means that since deep reasoning is not domain-specific, it is eas-
ier to modify the model when a single component is added or deleted. Deep reasoning
may not be sensitive to the change. "More thorough" means that deep reasoning may
answer "what-if' type questions which may not be possible in shallow reasoning. This
implies that there is no limitation of fault coverage in deep reasoning. "Slower" means
that the speed of reasoning is slower than that of shallow reasoning because a deep
knowledge base does not contain every detail of a symptom.
The principal deep-knowledge diagnosis methods are:
• The causal knowledge search method.
• The physical system mathematical model method.
The hypothesis formulation/hypothesis testing method follows the usual human diagnos-
tics path, i.e. a cause for a system malfunction (upset) is postulated, the symptoms of the
postulated fault are determined, and the result is compared with the process observables.
Of course the search for the location of a fault can be narrowed by using appropriate
heuristics. Hypothesis testing requires qualitative simulation of the effects of the postu-
lated malfunctions. Qualitative (non numerical) simulation requires prediction of the
direction of deviation of measured variables of the process as a result of faults.
Qualitative simulation models need to be enriched with suitable heuristics and
precedence rules in order to be able to resolve competing causal influences on the same
process variable.
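A minimal sketch of this qualitative (non-numerical) simulation step is given below: faults and variables are related only by the predicted direction of deviation (+1, -1, 0), and the predicted pattern is compared with the observed one. The influence table and fault names are hypothetical.

# Qualitative influence of each postulated fault on measured variables:
# +1 = variable deviates upward, -1 = downward, 0 = no predicted deviation.
influences = {
    "heat exchanger fouling": {"outlet_temperature": +1, "pressure_drop": +1, "flow": -1},
    "pump wear":              {"outlet_temperature":  0, "pressure_drop": -1, "flow": -1},
}

observed = {"outlet_temperature": +1, "pressure_drop": +1, "flow": -1}

def hypothesis_matches(fault):
    """Hypothesis testing: does the qualitative prediction explain every observed deviation?"""
    prediction = influences[fault]
    return all(prediction.get(var, 0) == sign for var, sign in observed.items())

candidates = [fault for fault in influences if hypothesis_matches(fault)]
print(candidates)        # -> ['heat exchanger fouling']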
A means of interpreting observations made of a physical system across time, in terms of
qualitative physics theory, is described by Forbus, (1987), and Bandekar, (1989). The
theory described is ontology-independent as well as domain-independent. This means
that it only requires a qualitative description of the domain capable of supporting
envisioning, and domain-specific techniques for providing an initial qualitative description
of numerical measurements, even when noisy data must be handled. For the diagnosis
problem this theory provides a general method for testing fault hypotheses using an
analogous AI model, to test whether it actually explains the observed behavior. Trave-
Massuyes et al., (1990), present a qualitative calculus (the setting of qualitative equations,
orders-of-magnitude qualitative algebras, and techniques for solving qualitative equations) which
is a key point in qualitative simulation for automatic intelligent fault diagnosis and
supervision. Although qualitative calculus gives rise to numerous problems, this paper
and its authors' previous and current work provide a quite complete resolution scheme.
Ontological analysis approach. Another general knowledge engineering methodology
which is based on the deep knowledge approach is the so called ontological structure
analysis. Ontological analysis proceeds in a step-by-step articulation of the knowledge
structures needed to perform a task by following the objects and relationships that occur
in the task domain itself. The application of ontological analysis to practical
troubleshooting/diagnosis problems can be done very effectively by decomposing the
notion of ontological structure into three levels, namely:
1. Static ontology: Definition of actual physical objects in the problem domain and
their properties and relationships.
2. Dynamic ontology: Definition of the state space in which the problem solving must
occur and the actions that transform the problem from one state to another.
3. Epistemic ontology: Definition of the form of constraints and heuristics that control
navigation in the state space.
For example, in the case of electronic instrument troubleshooting, static ontology
encompasses the components and knobs of the instrument which are connected
electronically by nodes, and are grouped in blocks. Dynamic ontology defines the states
of the problem which consist of belief states and instrument states. A belief state consists
of diagnostic beliefs about the diagnostic condition of each module (component, knob or
block), which have associated justifications. An instrument state consists of knob set-
tings and signal inputs that are used to stimulate the instrument under test. Transfor-
mations are measurements of signals and electrical parameters. Measurements are
grouped into tests, and tests are grouped into strategies. Each test and strategy has
implications, i.e. diagnostic beliefs implied by the test results. Finally, epistemic ontology
defines appropriate types of knowledge that make it possible to choose effective
transformations and thus navigate the problem state space in a reasonable amount of
time. Most experts use the hypothesis formulation/hypothesis testing method presented
previously, i.e. they diagnose an instrument by having (or formulating) a set of
hypotheses about which modules might be good and bad. In order to test such
hypotheses, they have heuristic diagnostic strategies, which relate each module to the
method by which it may be tested.
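As a rough illustration of how the three ontological levels might be laid out for such a troubleshooting task, a minimal sketch with invented names follows; it is not the SPOONS/HIPE formalism mentioned below.

# Static ontology: the physical objects, their properties and connections.
static_ontology = {
    "components": {"amplifier": {"block": "output stage"},
                   "attenuator": {"block": "input stage"}},
    "connections": [("attenuator", "node_3", "amplifier")],
}

# Dynamic ontology: the problem states and the actions that transform them.
state = {
    "beliefs": {"amplifier": "suspect", "attenuator": "unknown"},   # belief state
    "knobs": {"gain": "max"}, "inputs": {"signal": "1 kHz sine"},   # instrument state
}
def apply_measurement(state, module, verdict):
    """A transformation: a measurement updates the diagnostic belief about a module."""
    state["beliefs"][module] = verdict
    return state

# Epistemic ontology: heuristics that choose which transformation to apply next.
def next_test(state):
    suspects = [m for m, b in state["beliefs"].items() if b in ("suspect", "unknown")]
    return f"measure output of {suspects[0]}" if suspects else "diagnosis complete"

print(next_test(apply_measurement(state, "amplifier", "good")))
# -> measure output of attenuator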
Formal tools exist for defining and communicating the ontological analysis. These tools
are illustrated in Freiling et al., (1986), with an example domain equation language called
SPOONS (SPecification Of ONtological Structure), and the presented results are based
on the knowledge-based system HIPE (Hierarchical Inference Processing Engine).
Hybrid reasoning.
Shallow reasoning has been widely used but, because of the disadvantages mentioned
above, deep reasoning has emerged. However, since it requires more search time and
thus shows an undesirable speed of reasoning for some complex systems, deep reasoning
alone is not satisfactory either. Hence, hybrid reasoning, combining these two
approaches has been attempted in order to perform the diagnostic process efficiently
(from shallow reasoning) and effectively (from deep reasoning). In other words, hybrid
reasoning utilizes both deep and shallow reasoning methodologies in an attempt to take
advantage of the strengths of each. Two directions exist:
1. Deep first, then shallow (D-S).
2. Shallow first, then deep (S-D).
Examples of an S-D approach are CHECK (Combining Heuristic and Causal
Knowledge) by Torasso and Console, (1989), and IDM (Integrated Diagnostic Model)
by Fink and Lusth, (1987). An example of a D-S approach is ISA (Integrated Status
Assessment) developed by Marsh, (1988).
The current trend in diagnosis is toward hybrid reasoning. Yet, there has been no
comparative study of the various types of reasoning. However, when the system to be
diagnosed is relatively small (this also implies a small number of rules), an S-D approach
seems to be preferred. For a large-scale system like a manufacturing plant, a D-S
approach is often chosen. The D-S type hybrid reasoning diagnosis model is analytically
described in Appendix 4.A for the interested reader.
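A minimal sketch of the S-D idea is given below: the fast shallow rules are tried first, with a fall-back to a deep, model-based search when no stored symptom pattern matches. Both knowledge sources here are hypothetical placeholders, not the CHECK, IDM or ISA designs cited above.

shallow_rules = {("high temperature", "high vibration"): "bearing fault"}   # experienced symptoms

def deep_diagnosis(symptoms):
    """Placeholder for model-based reasoning over a structural/functional model."""
    # e.g. propagate the symptoms through a component model and return candidate faults
    return ["unbalanced rotor", "loose coupling"]

def diagnose(symptoms):
    fast = shallow_rules.get(tuple(sorted(symptoms)))
    if fast is not None:                  # shallow first: immediate answer for known patterns
        return [fast]
    return deep_diagnosis(symptoms)       # then deep: slower, but covers unanticipated faults

print(diagnose(["high temperature", "high vibration"]))   # -> ['bearing fault']
print(diagnose(["abnormal noise"]))                        # -> ['unbalanced rotor', 'loose coupling']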
rule number (assuming that the rules are numbered from 1 to n). If m = 0, then the rule
reduces to a fact.
A syntax rule corresponding to the above logic rule for a homologous attribute gram-
mar (i.e. a grammar which, when processed by its interpreter, will give the same results as
those coming from the successful application of the above logic rules) has the form,
< R0 > ::= < R1 > < R2 > ... < Rm > -|
where the combination of the last two characters means the end of the syntax rule.
The parser of the homologous attribute grammar has the following features.
• No terminal symbols are used (i.e. it is degenerate).
• An extended stack is used for saving the attribute values as well.
• Calls to the attribute evaluator are included.
• A meta-variable, named FLAG, is used to show variable matching (when a value
mismatch occurs, FLAG takes the value false).
A false value of FLAG results in backtracking of the parsing process. The semantic
rules that perform the unification of variables can be written in a straightforward way.
The interpreter includes a facility for providing "why" and "how" explanations.
To deal with inexact knowledge (i.e. uncertain facts and rules or imprecise items of evi-
dence), each rule is assigned a certainty measure which can be the conditional probability
of the validity of its conclusion, given the corresponding premises (de Kleer, 1990). Each
of the premises is assigned a posterior probability evaluated from previous inference. The
updating of these posterior probabilities can be done using Bayes' inference rule. De
Kleer, (1990), applied a similar procedure for the construction of a diagnostic engine in
order to identify automatically the faulty components of a malfunctioning device in the
fewest number of measurements. A minimum entropy technique is used (Pandelidis,
1990) to select the next best measurement to be used in the diagnosis procedure.
Alternatively, one can use an upper or lower bound of the validity probability of each
item, i.e. the so called possibility and necessity measures, or the well known certainty
factors of Shortliffe employed in the expert system MYCIN, mentioned previously.
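As a small numerical illustration of the Bayesian updating step, the sketch below applies Bayes' rule to revise the probability of one fault hypothesis after one observed symptom; the probabilities are made up for illustration.

# Prior probability of the fault hypothesis and the rule's conditional probabilities.
p_fault = 0.10                      # P(fault)
p_symptom_given_fault = 0.90        # P(symptom | fault), the rule's certainty measure
p_symptom_given_no_fault = 0.20     # P(symptom | no fault)

def bayes_update(prior, likelihood, likelihood_complement):
    """Posterior probability of the hypothesis after the symptom has been observed."""
    evidence = likelihood * prior + likelihood_complement * (1.0 - prior)
    return likelihood * prior / evidence

posterior = bayes_update(p_fault, p_symptom_given_fault, p_symptom_given_no_fault)
print(round(posterior, 3))          # -> 0.333: the observation raises P(fault) from 0.10 to about 0.33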
In the previous sections of this chapter, rule-based knowledge systems for modeling
intelligent behavior and building expert systems for automatic process fault monitoring
were described.
However, most rule-based programs are extremely computationally intensive and run
quite slowly. The slow speed of execution has prohibited the use of rule-based
knowledge systems in domains requiring high performance and real-time response such
as in real-time process fault diagnosis. In this section various methods for speeding up
the execution of rule-based knowledge systems are explored. In particular, the role of
parallelism in high-speed execution of rule-based knowledge systems is examined and the
architectural issues in the design of computers for rule-based systems are studied. It is
shown that contrary to initial expectations, the speed-up that can be obtained from paral-
lelism is quite limited, only about tenfold. The reasons for this small speed-up are:
1. The small number of rules relevant to each change in data memory.
2. The large variation in the processing requirements of relevant rules; and
3. The small number of changes made to data memory between synchronization steps.
Furthermore, in order to obtain this limited factor of tenfold speed-up, it is necessary to
exploit parallelism at a very fine granularity. A suitable architecture to exploit such fine-
grain parallelism is a shared-memory multiprocessor with 32-64 processors. Using such a
multiprocessor, it is possible to obtain execution speeds of about 3800 rule-firings/sec
(Gupta et al., 1989).
A rule-based knowledge system is composed of a set of IF-THEN rules (also called
productions) that make up the rule memory, and a database of assertions called the
working memory. The assertions in the working memory are called working memory
elements. Each rule consists of a conjunction of condition elements corresponding to the
IF part of the rule (also called the left-hand side of the rule), and a set of actions
corresponding to the THEN part of the rule (also called the right-hand side of the rule).
The actions associated with a rule can add, remove, or modify working memory
elements, or perform input-output.
The rule interpreter is the underlying mechanism that determines the set of satisfied rules
and controls the execution of the rule-based knowledge system. The interpreter executes
a rule-based program by performing the following recognize-act cycle:
• Match: In this first phase, the left-hand sides of all rules are matched against the
contents of working memory. As a result a conflict set is obtained, which consists
of instantiations of all satisfied rules. An instantiation of a rule is an ordered list of
working memory elements that satisfies the left-hand side of the rule.
• Conflict resolution: In this second phase, one of the rule instantiations in the con-
flict set is chosen for execution. If no rules are satisfied, the interpreter halts.
• Act: In this third phase, the actions of the rule selected in the conflict-resolution
phase are executed. These actions may change the contents of working memory. At
the end of this phase, the first phase is executed again.
The recognize-act cycle forms the basic control structure in rule-based programs. During
the match phase, the knowledge of the program (represented by the rules) is tested for
relevance against the existing problem state (represented by the working memory).
During the conflict-resolution phase, the most promising piece of knowledge that is rele-
vant is selected. During the act phase, the action recommended by the selected rule is
applied to the existing problem state, resulting in a new problem state.
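A minimal sketch of the recognize-act cycle is given below; the rule and working-memory contents are invented, and conflict resolution is reduced to "pick the first instantiation", whereas real interpreters use recency, specificity and similar criteria.

working_memory = [("symptom", "high vibration"), ("symptom", "high temperature")]

rules = [
    {"name": "bearing-fault",
     "conditions": [("symptom", "high vibration"), ("symptom", "high temperature")],
     "action": ("diagnosis", "bearing fault")},
]

def recognize_act(rules, wm, max_cycles=10):
    for _ in range(max_cycles):
        # Match: find instantiations whose condition elements all appear in working memory.
        conflict_set = [r for r in rules if all(c in wm for c in r["conditions"])
                        and r["action"] not in wm]
        if not conflict_set:              # halt when no rule is satisfied
            break
        chosen = conflict_set[0]          # conflict resolution (simplified: first instantiation)
        wm.append(chosen["action"])       # act: modify working memory, then match again
    return wm

print(recognize_act(rules, working_memory))
# -> [..., ('diagnosis', 'bearing fault')]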
Parallelism can be used while performing each of the above three phases. It is
further possible to overlap the processing performed within the match phase and the con-
flict-resolution phase of the same recognize-act cycle, and that within the act phase of
one cycle and the match phase of the next cycle. However, it is not possible to overlap
the processing within the conflict-resolution phase and the subsequent act phase, because
the conflict resolution must finish completely before the next rule to fire can be
determined and its right-hand side evaluated. Thus, the possible sources of speed-up are:
1. Parallelism within the match phase.
2. Parallelism within the conflict-resolution phase.
3. Parallelism within the act phase.
4. Overlap between the match phase and the conflict-resolution phase of the same cycle.
5. Overlap between the act phase of one cycle and the match phase of the next cycle.
Parallelism within the match phase.
In the following, several ways in which parallelism may be used to speed up the match
phase are discussed.
Rule-level parallelism. When using rule-level parallelism, the rules in a program are
divided into several partitions and the match for each of the partitions is performed in
parallel. In the extreme case, the number of partitions equals the number of rules in the
program, so that the match for each rule in the program is performed in parallel. One of
the main advantages of using rule-level parallelism is that no communication is required
between the processes that perform a match for different rules or different partitions.
Contrary to all expectations, Gupta et al. (1989) show that the true speed-up expected
from rule-level parallelism is really quite small, only about twofold. Some of the reasons
for this are given below:
• Simulations show that the average number of rules affected per change in working
memory is around 28. (A rule is said to be affected by a change in working memory
if the new working memory element matches at least one of the condition elements of
that rule.) In most matches, determining the set of affected rules is much faster than
processing the state changes associated with the affected rules. Thus the number of
affected rules bounds the amount of speed-up that can be achieved using rule-level
parallelism.
• The speed-up obtainable from rule-level parallelism is further reduced by the variance
in the processing time required by the affected rules. The maximum speed-up that
can be obtained is proportional to the ratio tavg/tmax, where tavg is the average time
taken by an affected rule to finish match and tmax is the maximum time taken by any
affected rule to finish match (see the sketch after this list). The parallelism is inversely
proportional to tmax because the next recognize-act cycle cannot begin until all rules
have finished match. Note that nominal speed-up (or concurrency) is defined to be the
average number of processors that are kept busy in the parallel implementation.
Nominal speed-up is to be contrasted with true speed-up, which refers to the speed-up
with respect to the highest performance uniprocessor implementation, assuming that
the uniprocessor is as powerful as the individual nodes of the parallel processor. True
speed-up is usually less than the nominal speed-up because some of the resources in a
parallel implementation are devoted to synchronizing the parallel processes, scheduling
the parallel processes, recomputing some data that are too expensive to be
communicated, etc.
• The third factor that influences the speed-up is the loss of sharing in the data flow
network when rule-level parallelism is used. The loss of sharing happens because
operations that would have been performed only once for similar rules are now
performed independently for such rules, since the rules are evaluated on different
processors.
• The fourth factor that influences the speed-up is the overhead of mapping the parallel
algorithm on to a parallel hardware architecture. The overheads may take the form of
memory-contention costs, synchronization costs or task-scheduling costs.
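The bound stated in the second item above can be illustrated with a small calculation; the per-rule match times used here are invented for illustration only.

# Rough upper bound on rule-level parallel speed-up for one recognize-act cycle.
# Sequential match time is the sum of the per-rule match times of the affected
# rules; a rule-per-processor parallel match cannot finish before the slowest
# rule does, so speed-up <= (n_affected * t_avg) / t_max.

def rule_level_speedup_bound(match_times_ms):
    n = len(match_times_ms)
    t_avg = sum(match_times_ms) / n
    t_max = max(match_times_ms)
    return n * t_avg / t_max

# Hypothetical example: 28 affected rules, one of which is ten times slower
# than the others; the bound is far below the 28-fold ideal.
times = [1.0] * 27 + [10.0]
print(round(rule_level_speedup_bound(times), 1))   # -> 3.7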
Some implementation issues associated with using rule-level parallelism are now dis-
cussed. The first point that emerges from the previous discussion is that it is not advis-
able to allocate one processor per rule for performing match. If this is done, most of the
processors will be idle most of the time and the hardware utilization will be poor. When
using only a small number of processors, two alternative mapping strategies can be
considered. The first is to divide the rule-based program into several partitions so that
the processing required by rules in each partition is almost the same, and then allocate
one processor for each partition. The second strategy is to have a task queue shared by
all processors in which entries for all rules requiring processing are placed. Whenever a
processor finishes processing one rule, it gets the next rule that needs processing from
the task queue. Some advantages and disadvantages of these two strategies are given
below.
The first strategy is suitable for both shared memory multiprocessors and non-shared
memory multicomputers, since little or no communication is required between proces-
sors. The main difficulty, however, is to find partitions of the rule-based system that re-
quire the same amount of processing. Note that even if one finds partitions with only one
affected rule per partition, the variance in the cost of processing the affected rule still
destroys most of the speed-up. The task of partitioning is also difficult because good
models are not available for estimating the processing cost of rules, and also because the
processing cost of rules varies over time. A discussion of the various issues involved in
the partitioning task is presented in Carriero and Gelernter (1989).
The second strategy is suitable only for shared memory architectures, because it requires
that each processor has access to the code and state of all rules in the program (while it
is possible to replicate the code in the local memories of all the processors, it is not pos-
sible to do so economically for the dynamically changing state associated with the rules).
Since the tasks are allocated dynamically to the processors, this strategy has the advan-
tage that the load distribution problem is not present. Another advantage of this strategy
is that it extends very well to finer granularities of parallelism. However, this strategy
loses some performance due to the synchronization, scheduling, and memory contention
overheads.
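A minimal sketch of the second, task-queue strategy follows, using Python threads on a shared-memory machine purely for illustration; the per-rule match work is a placeholder function, and the rule names and the working-memory change are invented.

# Shared task queue for match: each entry names a rule needing (re)matching
# against the latest working-memory change; idle workers pull the next entry.
import queue
import threading

def match_rule(rule_name, wm_change):
    # Placeholder for the real per-rule match work.
    return f"{rule_name} matched against {wm_change}"

def worker(tasks, results):
    while True:
        try:
            rule_name, wm_change = tasks.get_nowait()
        except queue.Empty:
            return
        results.append(match_rule(rule_name, wm_change))
        tasks.task_done()

tasks = queue.Queue()
for rule in ("r1", "r2", "r3", "r4", "r5"):
    tasks.put((rule, "temperature high"))

results = []
threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)

Because tasks are claimed dynamically, the load balances itself, which is the main advantage the text attributes to this strategy; the price is the synchronization on the shared queue.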
Node parallelism. When node parallelism is used, activations of different multi-input
nodes in the data-flow network are evaluated in parallel.
It is important to note that node parallelism subsumes rule-level parallelism, in that node
parallelism has a finer grain than rule-level parallelism. Thus, using node parallelism, both
activations of two-input nodes belonging to different rules (corresponding to rule-level
parallelism), and activations of two-input nodes belonging to the same rule (resulting in
the extra parallelism) are processed in parallel.
The main reason for going to this finer granularity of parallelism is to reduce the value of
tmax, the maximum time taken by any affected rule to finish match. This decreased
granularity of parallelism, however, leads to increased communication requirements be-
tween the processes evaluating the nodes. When using node parallelism, a process must
communicate the results of a successful match to the successors of that two-input node.
However, no communication is necessary if the match fails. To evaluate the effectiveness
of exploiting node parallelism, it is necessary to weigh the advantages of reducing tmax
against the cost of increased communication and the associated limitation on feasible
hardware architectures.
Another advantage of using node parallelism is that some of the sharing lost when using
rule-level parallelism is recovered. If two rules need a node with the same functionality,
it is possible to keep only one copy of the node and to evaluate it only once, since it is no
longer necessary to have separate nodes for different rules. The gain due to the increased
amount of sharing is a factor of 1.3, which is quite significant.
Action parallelism. Usually when a rule fires, it makes several changes in the working
memory. Processing these changes concurrently, instead of sequentially, leads to in-
creased speed-up from rule, node, and intranode parallelism. This source of parallelism is
named action parallelism, since matches for multiple actions in the right-hand side of the
rule are processed in parallel.
Data parallelism. A still finer grain of parallelism may be exploited by performing the
processing required by each individual node activation in parallel. This task can be
speeded up using data parallelism (Carriero and Gelernter, 1989). Such parallelism is
expected to reduce tmax even further, and thus help increase the overall speed-up. The
disadvantage of exploiting data parallelism on conventional shared memory multiproces-
sors is that the overhead of scheduling and synchronizing these very fine grained tasks (a
few instructions) nullifies the advantages. However, exploiting data parallelism is not as
hard on highly parallel machines.
Parallelism in conflict resolution.
The conflict-resolution phase is not expected to be a bottleneck in the near future. The
reasons for this are:
• Current rule-based interpreters spend only about 5 percent of their execution time
on conflict-resolution. Thus the match phase has to be speeded up considerably be-
fore conflict-resolution becomes a bottleneck.
• In rule-level and node parallelism, the matches for the affected rules finish at differ-
ent times because of the variation in the processing required by the affected rules.
Thus many changes to the conflict set are available to the conflict-resolution proc-
ess while some rules are still performing match. Thus much of the conflict-resolu-
tion time can be overlapped with the match time, reducing the chances of conflict-
resolution becoming a bottleneck.
If the conflict-resolution does become a bottleneck in the future, there are several
strategies for avoiding it. For example, to begin the next execution cycle, it is not neces-
sary to perform conflict-resolution for the current changes to completion. It is only nec-
essary to compare each current change to the highest priority rule instantiation so far.
Once the highest priority instantiation is selected, the next execution cycle can begin.
The complete sorting of the rule instantiations can be overlapped with the match phase
for the next cycle. Hardware priority queues provide another strategy.
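The idea of comparing each incoming change only against the best instantiation seen so far can be sketched as follows; the numeric priorities and rule names are invented, and the priority scheme stands in for whatever conflict-resolution ordering the interpreter actually uses.

# Incremental conflict resolution: keep only the highest-priority instantiation
# seen so far, so the next cycle can begin as soon as matching finishes; the
# full sort of the remaining instantiations can be deferred and overlapped
# with the match phase of the next cycle.
best = None   # (priority, instantiation)

def on_new_instantiation(priority, instantiation):
    global best
    if best is None or priority > best[0]:
        best = (priority, instantiation)

# Instantiations arriving from the (possibly parallel) match processes.
for prio, inst in [(3, "rule-A"), (7, "rule-B"), (5, "rule-C")]:
    on_new_instantiation(prio, inst)

print(best)   # -> (7, 'rule-B'): the rule selected to fire in the next cycle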
Like typical software development, expert system development has a life cycle.
Validation is formally included in most expert system development frameworks, in the
form of phased or task-stepwise decomposition of the complete development process.
The term validation is often used inconsistently and is frequently confused with
evaluation. Validation is defined here to be distinct from evaluation.
Validation is the process of determining that an expert system accurately represents an
expert's knowledge in a particular problem domain. This definition of validation focuses
on the expert system and the expert. In contrast, evaluation is defined as the process of
examining an expert system's ability to solve real-world problems in a particular problem
domain. Evaluation focuses on the expert system and the real world. Grogono et al.
(1991) outline some of the issues involved in evaluating expert systems and cite almost
200 significant papers on this topic.
Validation has two dimensions, verification and substantiation. Verification is the
authentication that the formulated problem contains the actual problem in its entirety and
is sufficiently well structured to permit the derivation of a sufficiently credible solution.
Substantiation is defined as the demonstration that a computer model within its domain
of applicability possesses a satisfactory range of accuracy consistent with the intended
application of the model.
Among the many concerns expressed about developing and validating expert systems are
the following:
• What should be validated?
• How is it validated?
• What are the procedures for validation?
• How is bias controlled?
• How is validation integrated into development?
• How are costs controlled?
These concerns are particularly relevant when developing demonstration prototypes,
where cost and time resources are constrained. In these situations, it is easy to minimize
or overlook validation. All too often validation becomes highly informal and, as a
result, does not become an integral part of development. O'Leary et al. (1990), extending
Buchanan's and previous testing tasks, presented a specific formal validation paradigm
for prototype expert system development within time and cost constraints. It incor-
porates many of the descriptive elements addressed by others, and explicitly incorporates
validation into the development life-cycle approach for prototype development.
The validation process involves verification that the model sufficiently addresses the real
problem in its entirety, and substantiation that the model possesses a sufficient range of
accuracy. Verification and substantiation are evaluated through a three-stage procedure:
ensuring face validity, establishing subsystem validity and comparing input-output
transformations.
These stages and processes are related by the interaction of the knowledge engineering
team, the expert(s), the prototypical expert system and the real world. Central to the
validation process are the expert(s) and the knowledge engineering team, consisting of at
least two members. One member, the system designer, has primary responsibility for
knowledge acquisition and encoding the prototypical expert system. The other member,
the third-party validator, has primary responsibility for validation.
The development process begins as the system designer interacts with the expert to de-
velop a view of the expert system. (S)he then creates a tangible representation of this
view in the form of an initial prototype (Buchanan's identification, conceptualization,
formalization, and implementation tasks).
During formal validation (Buchanan's testing task), the third-party validator, the system
designer and the expert work closely together. The validator examines the prototype to
ensure that the system designer's view and the expert's view are consistently represented
and that the prototype is able to respond to domain-specific real world situations. This
examination iterates through three stages: face validity, subsystem validity, and input-
output comparison. As the team members find inconsistencies or unacceptable limitations
in the prototype, they make system reformulations, redesigns, and refinements, and re-
visit appropriate tasks. In this manner, validation becomes the driver as the initial proto-
type evolves into a demonstration prototype.
This paradigm is especially relevant to expert system endeavors where demonstrating
feasibility and potential performance is necessary or appropriate before making a sub-
stantial resource investment. As organizations consider integrating expert system tech-
nology into their repertoire of computer-based applications, it is important that experi-
ence precede development work.
A new class of diagnostic systems is emerging from recent programs directed toward ve-
hicle operator aids for fighter aircraft, submarines and helicopters. These systems are
neither static off-line aids nor real-time controllers. Instead they are expert control advi-
sory systems which span the time scales of both regimes. These systems interface with
controllers to interpret the error codes and to conduct tests and implement reconfigura-
tions. On the other hand, these systems also interact with the vehicle operator to priori-
tize their activity consistent with the operator's goals and to recommend diagnos-
tic/emergency procedures. The extension of the applicability of these methods to the
industrial fault diagnosis practice is straightforward.
System status (SS) is the function responsible for in-flight diagnosis of aircraft equipment
failures and SS examples will be used here to describe the requirements for diagnosis in
expert control advisory systems (Pomeroy et al., 1990; Passino and Antsaklis, 1988).
The diagnostic architecture developed for SS integrates a number of separate technolo-
gies to achieve coverage of all the requirements. This architecture is a fusion of statistical
fault detection techniques like Kalman filters (see Chapters 2 and 3) with artificial
intelligence techniques such as rule-based logic, blackboards, causal nets and model-
based reasoning (see Section 4.2.1). This approach exploits the strengths of each
technique and provides a mechanism for automated reasoning using both quantitative
and qualitative information. Furthermore, the concept of an "event" has been introduced
to track multiple faults and maintain diagnostic continuity through priority interrupts
from the SS controller. A specific application of this approach to jet engine diagnosis is
described by Pomeroy et al. (1990).
Levels of architecture. In the real-time environment of system status any diagnostic
activity must be structured so that it can be interrupted and restarted as SS control reacts
to new events and changing priorities. The diagnostic process must also provide answers
with varying degrees of resolution depending upon the time available for processing.
Both of these requirements are met by dividing the diagnostic process into four levels:
1. Monitor for abnormal data.
2. Generate hypotheses that might explain the abnormal data.
3. Evaluate the available data to confirm or rule out the hypothesized faults; if more
data are required, request tests to be done.
4. Execute the tests, and monitor for the results. Tests may consist of running models
of the systems, initiating non-intrusive built-in tests (BITs) in the systems, or
requesting operator approval for intrusive or operator-initiated tests.
These levels communicate through messages as shown in fig. 4.2, and each level is a
knowledge source within the SS blackboard control scheme. While these messages pro-
vide the internal/external communication functions of diagnosis, something more is
needed to provide coordination of the multiple diagnostic processes which can occur
with overlapping time frames. This problem is solved by linking the overall diagnostic
procedure to the concept of an event.
Events. An event is triggered by a new abnormality appearing in the bus data stream. An
event includes all of the subsequent diagnostic steps leading to isolation of the fault
which caused the abnormality. A frame-based data-structure is used to track each event
and keep it untangled from other events which may be proceeding through processing at
the same time. This structure also provides a record of the event that may be useful for
post-operation maintenance.
Figure 4.2 Message flow between the four diagnostic levels (Fault Monitor, Hypothesis Generator, Hypothesis Evaluator, Hypothesis Testing). Messages include bus data, new-data, data-abnormal*, fault-suspected*, eval-complete, test-requested, test-complete, fault-found* and fault-corrected*; testing draws on faulted models and operator initiated tests.
Each event is an instance of a general event class; event frames have the following slots:
BUS DATA: a list of data samples connected with the event; this is a "snapshot" of the
situation near the event, and may include later samples collected during testing.
ANOMALIES: a list of abnormal data items which triggered this event. This list is used
by the Fault Monitor to suppress further data-abnormal messages once an event has been
spawned; it provides a "we know about that and we're working on it" sort of behavior.
HYPOTHESES: a list of possible faults.
TESTS PENDING: a list of tests that are to be performed.
TESTS COMPLETED: a list of tests and their results.
FAULTS CONFIRMED/RULED-OUT: the hypotheses are sorted into one of these two
categories.
STATUS OF EVENT: is pending until it becomes resolved or unresolved.
Diagnosis stops when there are no new hypotheses.
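The event frame maps naturally onto a record structure. A minimal Python rendering is sketched below; the slot names follow the list above, while the field types and the example values are assumptions made for illustration.

# Frame-based data structure for a diagnostic event (one instance per new
# abnormality on the bus data stream).
from dataclasses import dataclass, field

@dataclass
class Event:
    bus_data: list = field(default_factory=list)        # snapshot of data samples near the event
    anomalies: list = field(default_factory=list)       # abnormal items that triggered the event
    hypotheses: list = field(default_factory=list)      # possible faults
    tests_pending: list = field(default_factory=list)   # tests still to be performed
    tests_completed: list = field(default_factory=list) # (test, result) pairs
    faults_confirmed: list = field(default_factory=list)
    faults_ruled_out: list = field(default_factory=list)
    status: str = "pending"                              # pending / resolved / unresolved

# Example: an event spawned by an abnormal turbine-temperature reading.
ev = Event(anomalies=["turbine_temp_high"], hypotheses=["sensor fault", "cooling fault"])
print(ev.status, ev.hypotheses)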
Interaction with other functions. Communication between SS Diagnosis and the SS
Limits Estimation and Corrective functions is provided by the activity of a causal net-
work (see Section 4.2.1.3).
Communication with the outside world consists of the input and output streams dis-
cussed earlier in connection with fig. 4.2. All communication between the system subsys-
tems is by means of the bus data stream, which implements the following division of la-
bor between SS diagnosis and the local system diagnosis:
1. All fault detection is performed within the local systems. Detection requires con-
tinuous screening of sensor data at the sampling rate of the local controller, and
detection processes are typically included in the control loop to protect against sen-
sor failure. Transmitting the sensor data to a central detection process in most cases
would require high bandwidth communication. Fault detection can be done more
efficiently in the local systems.
2. Isolation of faults is shared between SS and the local systems. In general, fault iso-
lation can be most efficiently done by the central diagnostic process (SS) which can
bring multiple sources of information to bear on the problem, and which can exe-
cute tests beyond the scope of the local systems.
3. On the other hand, there are classes of faults which must be isolated by the local
system in order to reconfigure quickly enough to avoid loss of control.
Thus the bus interface to SS normally reports only the results of continuously running
built-in tests (BITs), i.e. error alerts; in the case of jet engines these BITs are generated
by a Kalman filter that continuously compares the engine sensor data to outputs from an
engine model. Only when a fault occurs and SS begins isolation, does SS request access
to detailed data sampling streams.
Multiple faults that are related through a common mode can be addressed within the
event-based architecture by adding a Fault Predictor to the four functions in fig. 4.2.
Whenever a fault is found this predictor searches for common mode relations, e.g. func-
tionally connected or physically connected, and posts the names of components which
may be affected, to act as a focus mechanism for the hypothesis generator.
Process parameters and some process observables are gathered during the process exe-
cution, so they may be represented as discrete curves with time as the independent vari-
able. Frequently, an ideal curve can be associated with each process. This is what is
expected from a perfectly executed process.
Problems in operation are often identifiable when the input curve deviates from the ideal
curve. The deviation may be a difference in slope, amplitude, or duration between the
input and ideal curves. The difference in curves may be caused by malfunctioning equip-
ment, processing an already damaged part, or processing problems (e.g., operator er-
rors). In all of these cases, it is important to identify the problem in order to make ap-
propriate corrections. Analysis of curves is therefore an important tool for diagnosis.
Diagnostic techniques have been developed to analyze process parameters and observ-
ables that change over time (Dolins and Reese, 1992). These techniques can use specific
digital signal-processing algorithms to transform the input signal into symbolic data.
Knowledge-based diagnosis is performed on the symbolic data to determine malfunc-
tions. The monitoring system informs appropriate personnel of problems by sounding an
alarm or printing a message.
Curve analysis involves detecting and identifying deviations of an input curve from an
ideal curve. There are two alternative ways to perform analysis: one approach is to com-
pare the input curve to a set of curves that result from unsuccessful processing. Another
compares the input curve only to the ideal curve using qualitative analysis of the differ-
ences.
In the first approach, a knowledge base of abnormal curves is defined, where each curve
is a characteristic representative of a particular problem. Associated with each character-
istic abnormal curve is a diagnosis. If the input curve closely matches one of the abnor-
mal curves, then the associated cause of the problem is reported. The advantage of this
approach is implicit diagnosis; when the input curve matches successfully, it already has
an associated diagnosis. However, this approach has two disadvantages. First, it may be
difficult to build a complete knowledge base as the anomalous curves must be defined to
match closely with actual erroneous measurements. The second disadvantage is that
curves are hard wired, i.e., if the process changes, then the entire knowledge base must
be changed to support the new data describing the correct and incorrect behavior of the
process.
The second approach compares an input curve to the ideal curve only. Ideal and input
curves are composed of regions. A region is a continuous group of data points where
each point has approximately the same slope. Regions can be inclining, flat, or declining.
If a process engineer is uninterested in several contiguous regions, then (s)he may elect
to aggregate them into one region. In general, region divisions correspond to significant
changes in the process, e.g., an abrupt change in the value of a parameter. This approach
is possible if the user has some technique available to describe anomalous curves with
respect to the ideal curve. Such a description should allow the user to express deviations
using qualitative as well as quantitative criteria, and associate causes using symbolic
processing. Suppose one uses a technique based on this approach to interpret an input
curve that has a flat region with a longer duration than the ideal curve. The technique
should allow her (him) to describe the problem in terms of the flat region having a dura-
tion that lasts too long. Also, the user must be able to associate causes of problems with
the different anomalous curves. Several diagnostic systems have been developed to diag-
nose manufacturing problems based on this second approach (Dolins and Reese, 1992).
Dolins and Reese (1992) developed a technique that allows manufacturing and process
engineers to describe abnormal curves. The abnormal curves are described in terms of
their differences from the ideal curve, which is the curve that best describes a process
parameter or observable after a given industrial process successfully finishes processing.
Manufacturing engineers can describe the differences symbolically, e.g., "if the first re-
gion of the curve lasts too long then the machine must have a gas leak". The user can
also input numeric values to set tolerances for determining unacceptable input curves.
The technique is independent of any industrial process, and all domain specific informa-
tion is input to the program by the user, who is an expert in the process.
The technique has two operating modes: process definition and process monitor-
ing/diagnosis.
In process definition, the human expert has to describe the ideal curve and anomalies. An
ideal curve is initially input into the computer program, and the user manually selects
regions. Each region is an interesting feature in the ideal curve which corresponds to a
specific manifestation of the process.
After defining the ideal curve the human expert describes input curve anomalies by creat-
ing a knowledge-base of process-specific rules. Process-specific rules relate generic
tests to input and ideal curve regions for a given process. Generic tests are built-in func-
tions provided by the diagnostic technique that compare different symbolic attributes of
input and ideal regions. For example, length is a symbolic attribute of a region, and the
result of a comparison of the length of two regions can be described as either too long,
too short, or okay.
In the process monitoring/diagnosis mode, the technique analyzes input curves in two
steps: signal-symbol transformation and knowledge-based diagnosis, see fig. 4.3. The
signal-symbol transformation step identifies regions of the input curve by matching all of
the points of the input curve to the ideal curve. After all points are matched, the regions
of the ideal curve are used to find the regions of the input curve. The second step applies
the complete knowledge-base of process-specific rules to compare the regions of the
input curve to the regions of the ideal curve.
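A minimal sketch of the two-step analysis is given below; the region representation (start and end indices), the duration tolerance and the single "first region lasts too long" rule are all invented for illustration, and are much simpler than the point-by-point matching described by Dolins and Reese.

# Signal-to-symbol transformation followed by rule-based diagnosis of a curve.
# Regions are (start, end) index pairs; the symbolic attribute compared here is
# duration, classified as too_long / too_short / okay.

def compare_duration(input_region, ideal_region, tolerance=0.2):
    d_in = input_region[1] - input_region[0]
    d_ideal = ideal_region[1] - ideal_region[0]
    if d_in > d_ideal * (1 + tolerance):
        return "too_long"
    if d_in < d_ideal * (1 - tolerance):
        return "too_short"
    return "okay"

# Process-specific rules: (region index, symptom) -> diagnosis.
rules = {(0, "too_long"): "gas leak suspected in first process phase"}

def diagnose(input_regions, ideal_regions):
    findings = []
    for i, (rin, rid) in enumerate(zip(input_regions, ideal_regions)):
        symptom = compare_duration(rin, rid)
        if (i, symptom) in rules:
            findings.append(rules[(i, symptom)])
    return findings or ["no anomaly detected"]

ideal_regions = [(0, 10), (10, 25), (25, 30)]
input_regions = [(0, 16), (16, 28), (28, 33)]   # first region lasts too long
print(diagnose(input_regions, ideal_regions))   # -> ['gas leak suspected in first process phase']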
An expert is required to select an ideal curve for a particular process and input the curve
to the program. Some machines may have idiosyncrasies that make their ideal curve dif-
fer in shape from the ideal curves generated by the other machines of the same type. In
these cases, an ideal curve has to be defined for each machine.
Once the ideal curve is input, the expert divides the curve into meaningful regions, i.e.,
(s)he marks divisions where process-related changes occur. These regions are stored and
used later in the analysis. The expert also defines a set of rules for testing input curves.
Entering an ideal curve, dividing the ideal curve into regions, and defining rules are
initialization tasks required of the human expert. These tasks constitute the process
definition mode.
The diagnostic system can now run automatically without human intervention until an
error is detected, i.e., the program can operate in a process monitoring/diagnosis mode.
The combination of signal-to-symbol transformations and rule-based reasoning has sev-
eral advantages, but it is not a panacea for all diagnostic problems based on curve inter-
pretation. One disadvantage of the diagnostic technique is that two potential processing
problems may have identical input curves. In this case, a better diagnosis can only be
provided if more data are available and more reasoning is provided. A second disadvantage
of this technique is that an abnormality in a curve may mask other problems. One ap-
proach is to explain only the first difference between the ideal and input curves.
Figure 4.3 Curve analysis based diagnosis combining digital signal processing and rule-based
reasoning: the signal-symbol transformation feeds knowledge-based diagnosis driven by the knowledge base.
One advantage of this method is that the signal processing algorithm used to transform
the input signal into symbolic data allows the fast analysis of regions that vary with re-
spect to time. This is important because the durations of regions may vary due to unsuc-
cessful processing. Regions of the input curve, with varying durations, can match directly
to corresponding regions of the ideal curve. This processing allows the user to examine
regions symbolically.
A second advantage is that few false alarms are generated with this method. Problems
are detected by the process-specific rules, and the process engineer has complete control
over the criteria for judging acceptable and unacceptable traces. False alarms can only be
caused when process engineers define rules that incorrectly diagnose problems or incor-
rectly set thresholds.
The system's ease of use is a third advantage. Only an ideal curve and process-specific
rules have to be defined. Furthermore, few rules are needed for the system to be effec-
tive, which is unlike most knowledge-based systems. For these cases, a process engineer
may only need to define a single rule to detect a commonly occurring error.
Several applications to detect manufacturing problems as soon as they occur are dis-
cussed by Dolins and Reese (1992) to illustrate the general purpose use of this tech-
nique.
Petri nets are a powerful tool for system description (Al-Jaar, 1990). Nevertheless, up to
the present they have mainly been used only for simulation purposes. The problem of
process fault monitoring in an industrial plant can be stated as follows: the measurement
signals come from the system with a constant scanning rate. When processing these data,
a computer-based system should decide on-line in real time if an error has occurred or
not. To perform this, the computer program needs some expert knowledge about the
system (or the "total" process, which is composed of several partial processes, like big-
ger subsystems in a power plant or in a chemical factory) under consideration.
By modeling the system as a Petri net, failures with slow time constants are detectable in
real-time. Sensor or process errors which are manifested in signals related to physical
conservation quantities can be identified. After a fault is detected, a prognosis of the
future system's behavior can be provided.
The original Petri net theory only describes the causal correlation between places and
transitions within a system (an event is a consequence of another one). There were no
statements about its temporal behavior. This, however, is absolutely necessary for de-
scribing events and processes in the manufacturing area. There are different theories on
how to link Petri nets with time. In manufacturing, the processes (milling, drill-
ing, assembling, ...) are responsible for the consumption of time. This is the reason why
one has to associate time with the transitions. Thus, in the case of firing a transition, the
tokens of the places before the transition will be removed. When the firing time is over they
will be at the place behind the transition. An example is the time it takes a slide to move from
one limit switch to the following one.
The nets for diagnosis purposes represent, as a model, the temporal progress of the plant or
machine which is to be controlled. This explains why the nets used for control form
the basis for the construction of the nets for diagnosis (see fig. 4.4). The places in both
nets represent the inputs and the outputs of the PLC and, therefore, they are the interface
between control and diagnosis. Thus, in both nets the count and the indication of the
places must be identical.
The most important function of a diagnosis system is the monitoring component. Its ca-
pability defines the nature, the scope and the precision of the failure detection. Only after
the detection of a failure can a specific diagnosis start. The power of monitoring is
equivalent to the quality and quantity of information from the machines. This is especially
the case when sensors and actuators do not have their own information processing and,
therefore, are not able to monitor themselves.
The range of methods for monitoring depends strongly on the support for the methods
which is provided directly by the model.
Figure 4.4 Petri nets for control and for diagnosis. S1, S2: input signals from sensors; S3, S4:
output signals to actuators; SC: secondary condition, necessary for the firing of a transition (if the
transition fires, no tokens will be removed from the place before the transition).
Within the concept of monitoring, one can distinguish between a functional and a tempo-
ral comparison. The required state is determined by the interpretation of the Petri net
data structure. This takes place on the facility level as well as on the station level. The
actual state on the station level results from the inputs and the outputs of the PLC, which
are assigned to the places of the Petri net. On the higher levels, the actual state results
from the condensing of the state reports from the different PLCs which control plant
components such as single machines, conveyors, robots etc.
In order to show the different monitoring methods clearly, the following cases have to be
distinguished (see fig. 4.6):
1. The real process has kept to the required time.
2. The real process has fallen short of the required time.
3. The real process has exceeded the required time.
In the case of time monitoring, the duration of performing a real action is recorded and
compared to the required time. The required state of time is taken from the active transi-
tion in the Petri net. If more than one transition is simultaneously active, the time moni-
toring will be processed in a parallel way. If microcomputers are used on the facility level
and PLCs on the station level, their operating system provides several timers. These
timers can be used for monitoring.
Figure 4.5 Diagnosis of sensors: the runtime of a transition, from the start of the transition until the
occurrence of the required state, is compared with the required time T plus a tolerance Δt.
Figure 4.6 Different states in the Petri net based monitoring concept (S1, S3: sensors; S2: actuator).
The interpreter of the Petri nets within the diagnosis system always determines the next
required state and, by means of the time component, also the precise time of its occur-
rence one step in advance compared to the real plant. During the runtime of the system it
is important that the diagnosis program and the control program work concurrently
(Maäberg and Seifert, 1991).
In order to prevent the indication of a failure in cases of small deviations from the re-
quired time, a tolerance time is additionally implemented. A tolerance time can be clearly
assigned to a transition. After the required time of an active transition within the
diagnosis net model has passed, the component for monitoring of the tolerance time will
be activated and the required conditions of the places behind that transition will be up-
dated. Within the monitoring of the tolerance time, a continuous comparison between
the required and the actual state of the places which are directly connected to the transi-
tion is performed. If the required and the actual state of those places are equal (case 1 in
fig. 4.6), a failure has not occurred, the comparison between the required and the actual
state will be broken off, and control and diagnosis of the plant will be continued. If, even
after the tolerance time has expired, a deviation between the required and the actual state of
the places can be determined, a failure will be detected by the time monitoring (case 3
in fig. 4.6).
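The combination of required time and tolerance time can be sketched as follows; the way the actual place states are read from the PLC, the timing values and the place names are assumptions made for illustration.

# Petri net based time monitoring with a tolerance time per transition.
# After the required time of an active transition expires, the required state
# of the places behind it is compared with the actual state until either they
# agree (no failure) or the tolerance time also expires (case 3: failure).
import time

def monitor_transition(required_time, tolerance, required_state, read_actual_state):
    time.sleep(required_time)                    # wait out the required time
    deadline = time.monotonic() + tolerance
    while time.monotonic() < deadline:
        if read_actual_state() == required_state:
            return "ok"                          # required and actual state agree
        time.sleep(0.01)                         # continuous comparison
    return "failure"                             # tolerance time exceeded

# Hypothetical example: the limit switch (place S3) should carry a token 0.5 s
# after the slide starts moving, with a 0.2 s tolerance.
plc_inputs = {"S3": 0}
def read_actual_state():
    return dict(plc_inputs)

print(monitor_transition(0.5, 0.2, {"S3": 1}, read_actual_state))  # -> 'failure'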
In order to select and define the correct reaction of the diagnosis system in case of a fail-
ure, a thorough analysis of all possible failures by the operator of the plant is necessary.
The failures have to be classified according to their effects and the reactions correspond-
ingly defined. In case of serious failures the diagnosis system must react with an emergency
shutdown or emergency stop. Deviations which do not represent any failure can
be ignored; in such cases the diagnosis system causes reactions which do not
stop the plant, but make operation possible with those individual manufacturing
parameters (e.g. velocity of motion etc.) which are changed. Another possibility of the
limited operation (LO) of the plant is the activation of alternative predefined control
strategies, which, for example, transfer the plant into a secure condition. With this con-
cept minor failures can be compensated or even corrected by control instructions. As
soon as the classification of failures has finished and individual failures as well as combi-
nation failures have been assigned to the correct reactions, the results are made available
to the diagnosis system in the form of a so-called reaction model. This model consists of
IF-THEN rules. The causal correlations describe which preconditions lead to which
reactions. A mechanism that handles the rules can choose, after detecting a failure, the
correct rule and activate the planned reaction.
The essential module in the cooperation between the functions of monitoring, diagnosis
and therapy is the mechanism for handling the rules. It not only processes the reaction
model, as the collection of all failure rules, but also administers the failure vector as the
interface between the above mentioned mechanism and the module which compares the
actual and the required state.
The failure vector is structurally identical with the vector of the required and actual values
and consequently identical with the structure of the IF-part of the failure rules. Each
column of that part of the rule represents a vector which has as many elements as there
are states in the Petri net. This sort of data compatibility guarantees very fast
processing in the mechanism of rule handling (see fig. 4.7).
Figure 4.7 Concept of the mechanism which handles the rules in Petri net based fault diagnosis.
In the case of an emergency shutdown or emergency stop, control orders for stopping
the operations are given out, the diagnosis is stopped and a report about this interruption
is produced.
In case the reaction is a limited operation, the control orders which the operator
has defined in advance, in the form of a program for a PLC, are activated. For this
purpose, a message is sent from the station level to the area level, which immediately
selects the corresponding Petri net for control.
Such a diagnosis system must be designed in such a way that it can be imple-
mented on all levels of a hierarchical control structure. This concept is supported by the
capability of Petri nets to decompose complex systems. By means of Petri nets it is
possible to describe a manufacturing system as a rough net in which a transition, for
example, represents an individual machine, robot or conveyor belt. The places within the
net represent, according to their definition, static components such as a storage system or a
buffer for workpieces. The individual transitions, however, can be specified in greater
detail depending on their meaningfulness. A transition on a higher level represents an
entire net on a subordinate level. With that capability it becomes possible to model
individual function units like an actuator or the movement of a slide, as well as individual
places like a limit switch. This means that on the station, or PLC, level, individual units
and their functional sequences can be monitored and diagnosed, whereas on the higher
control levels the level of abstraction increases and, because of this, it is the entire plant or
the cooperation between individual machines that is monitored and diagnosed there.
The module which activates the reactions has the same allocation of tasks. Each control
level is autonomous in its reaction behavior. In case a failure is detected on the
component level, a reaction will be activated and a report will be sent to the area level
computer. If the reaction is a limited operation, a request is additionally sent to the area
level, which immediately transfers the replacement control program and its corresponding
diagnosis program to the PLC.
An event, for example the stoppage of an individual machine by the local PLC, causes a
change of the corresponding actual state on the higher level. On this level, a deviation
will be recognized by the monitoring component and an appropriate reaction will be
activated. Such a reaction may be the activation of a redundant machine on which the
workpieces can be further processed.
The advantages of the decentralized diagnosis tasks are the relief of the area level com-
puter, the uniformity of failure recognition algorithms at all control levels and therefore
the high response speed in case of deviations.
Maäberg and Seifert (1991) present a Computer Aided Automation Environment sup-
porting a user during all phases of the life cycle of an automated plant, beginning from
the planning and projection phase up to the runtime of the plant. Petri nets are the
integrating components. They are generated in the projection phase by translating the
function charts. In the realization phase they are used for simulation and planning and
finally in the running phase of the plant they are used for controlling, monitoring and
diagnosis of the plant.
Prock (1991) describes a new technique for on-line fault detection in real time, in a
process independent formulation, using place/transition nets. Place/transition nets are a
subclass of Petri nets. For readers who are unfamiliar with Petri net theory, some
basic definitions of place/transition nets (hereafter called pt nets) are given in Appendix
4.B. This formal presentation of pt net theory will help the reader to understand related
application examples as well as to deal with related diagnostic problems from
engineering practice.
Prock (1991) applied this method to the real time fault monitoring of a secondary
cooling loop of a nuclear power plant. The detection of abnormal process behavior or
measurement faults with low time constants was possible and a prognosis of the future
system behavior was given in the error case. Due to the simplicity of the fault detection
criterion no diagnosis of the failure localization could be provided. This is not a real
drawback because fast transients, as a consequence of serious faults, are well managed
by the automatic plant safety systems.
The Petri net fault monitoring methods are predestined for the surveillance of complex
technical systems like production lines or transport circuits (Wiele et al., 1988). Because
of the lack of a diagnosis feature, this method should be considered as part of an on-
line process information system which is able to trigger a (possibly off-line and thus more
practical from the implementation point of view) diagnosis and interpretation unit.
where Aij is a linguistic variable. A linguistic variable is a variable whose value can be
represented by a linguistic term used by experts such as "high", "normal" or "low" (i.e.
words or sentences in a synthetic language). A linguistic variable includes an adjective-
like term and its antonym, a modifier and a connective. The modifier is a measure of in-
tensity which is associated with a possibility distribution. This is often referred to as the
membership function in the literature. The fuzzy logic connectives are the well known
conjunction, disjunction and negation operations. The value of a linguistic variable can
be represented by a fuzzy set which permits the definition of a membership function μ, re-
flecting the degree to which an element belongs to the set. The membership function for
elicited expert knowledge about the fuzzy test limits can be represented by a piecewise
linear function. Such a function is presented in fig. 4.8. The four values a, b, c, and d are
numerical values stated by the experts in the process of knowledge acquisition. Bi is a
possible conclusion.
Figure 4.8 Representation of the fuzzy membership function: membership degree (μ) plotted over
the variable, with breakpoints a, b, c and d.
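The piecewise linear membership function of fig. 4.8 follows directly from the four expert-supplied values; in the sketch below the breakpoints and the example linguistic value are invented for illustration.

# Trapezoidal (piecewise linear) membership function defined by the four
# expert-supplied breakpoints a < b <= c < d of fig. 4.8.
def membership(x, a, b, c, d):
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:                       # rising edge between a and b
        return (x - a) / (b - a)
    return (d - x) / (d - c)        # falling edge between c and d

# Hypothetical linguistic value "high temperature" with breakpoints 60, 70, 90, 100.
for t in (55, 65, 80, 95):
    print(t, round(membership(t, 60, 70, 90, 100), 2))   # 0.0, 0.5, 1.0, 0.5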
Since the value Ai,j is represented by a fuzzy set, it is possible to associate it with the
grade of membership of a conclusion by means of the rules of fuzzy logic, even in cases where
the input value A* is not equal to that in the implication part of the rule, in contrast to
the "modus ponens" of traditional logic:
A*                        input value
A → B                     fuzzy statement
A* ∘ (A → B) = B*         fuzzy conclusion
The practical solution of the above equation for mB = 1 is shown in fig. 4.9.
Figure 4.9 Practical solution of the compositional equation: i = measured value; r = repeatability of
the measurement; z1 and z2 = intersections of the two lines y and y*; a, b, c, d = predetermined
values for the definition of the fuzzy sets.
The explanation is as follows: as the fuzzy set A is defined by the four values a, b, c, d and
by their membership degrees, the two lines of different slope and known equation
describing this set are y(+) and y(-). Similarly, as the set A* is also fuzzy, taking into
account the measuring errors that may occur, it is described by the lines y*(+) and y*(-).
The maximum ordinate of the intersection z between these two sets can be found by
means of the following three rules:
1. If z(1,2) ≤ 1 and z(2,1) > 1 then mB*(y) = z ≤ 1.
2. If z(1,2) > 1 and z(2,1) > 1 then mB*(y) = 1.
3. If z(1,2) > 1 and z(2,1) ≤ 1
The quantitative analysis of the possibility of a certain situation in the system described
by the fuzzy conditional statements, is made through the evaluation of its grade of mem-
bership according to Zadeh's compositional equation,
mB*(y) = max_{1≤i≤m} min( mB_i(y), min_{1≤j≤n} max_{x_j} min( mA*(x_j), m_{i,j}(x_j) ) )
taking into account the solution for the intersections mentioned above. Since each Bi,
i = 1, 2, ..., n, in the fuzzy conditional statements can be considered as a fuzzy singleton
over a domain consisting of certain situations y, the starting value for mBi while evaluat-
ing mB*(y) at the first level is 1, and at the second level the evaluated mB*(y) becomes the
starting value.
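A small numeric sketch of this max-min evaluation is given below; it takes mBi = 1 (the first-level case mentioned above), represents each symptom over a tiny discrete domain, and uses invented membership values, so it illustrates only the composition itself, not the weighting or the two-level evaluation.

# Max-min evaluation of fault possibilities from fuzzy symptom descriptions.
# For every fault i and symptom j, the overlap between the observed fuzzy value
# mA*(x) and the condition fuzzy set m_ij(x) is max_x min(mA*, m_ij); the grade
# of fault i is the minimum of these overlaps over all symptoms (taking mBi = 1),
# and the overall possibility is the maximum over all faults.
def possibility(observed, conditions):
    # observed[j]      : {x: membership} for symptom j as measured
    # conditions[i][j] : {x: membership} for symptom j in the statement for fault i
    grades = {}
    for fault, rule in conditions.items():
        overlaps = []
        for j, cond in rule.items():
            xs = set(observed[j]) | set(cond)
            overlaps.append(max(min(observed[j].get(x, 0.0), cond.get(x, 0.0)) for x in xs))
        grades[fault] = min(overlaps)
    return grades, max(grades.values())

# Invented two-symptom example over the discrete domain {normal, high}.
observed = {"temperature": {"high": 0.8, "normal": 0.2},
            "pressure":    {"normal": 0.9, "high": 0.1}}
conditions = {"cooling fault": {"temperature": {"high": 1.0},
                                "pressure":    {"normal": 1.0}},
              "sensor fault":  {"temperature": {"high": 1.0},
                                "pressure":    {"high": 1.0}}}
grades, overall = possibility(observed, conditions)
print(grades)    # -> {'cooling fault': 0.8, 'sensor fault': 0.1}
print(overall)   # -> 0.8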
The uncertainty of the knowledge in the knowledge base is taken into consideration by
giving different weight factors to fuzzy conditional statements. The choice of weight
factors is rather subjective. Trained artificial neural networks (ANNs), as generators of
membership functions and weight factors in fuzzy conditional statements, are potential
tools for the purpose of fuzzy logic process fault monitoring. Details and ANN
application examples are given in Chapter 5.
Zadeh's compositional inference rule is adopted as an inference mechanism. It accepts
fuzzy descriptions of the process symptoms and infers fuzzy descriptions of the process
faults by means of the fuzzy relationships described above.
The main characteristics of a fuzzy logic diagnosis system's performance are:
1. Automatic interpretation of relations among the tests (observation results) and pos-
sible situations, pointing out the process condition.
2. Detailed explanation of how the particular conclusion has been reached.
3. Indication of the possible causes of failures.
4. Description of the possible consequences.
5. Recommendation for process maintenance and repair under new circumstances.
It may be difficult for the above techniques to encompass the elements of the fail-
ures and symptoms and their logical connections perfectly. In other words, it is a plausi-
ble criticism against rule-based (fuzzy or not) diagnosis that its design may be beyond
human knowledge, since exceptional events can be introduced as soon as an improvement
has completed the system. A practical answer should be provided for this argument.
How should the exception be expressed and included to reinforce the diagnosis? In the
present chapter the exception is expressed in a practical form of fuzzy logic. First, the
logical form of the exception is derived as the conjunction of the dictative functions.
Second, the cancellation law in binary logic is fuzzified in order to give an arithmetic for
calculating the linguistic truth value for reasoning. The logical form of the exception is
derived in Appendix 4.C and the requirements for its use are clarified there as well. The can-
cellation law is extended to fuzzy logic in order to devise the diagnostic method with the
exception (Maruyama and Takahashi, 1985).
4.3.1 Automatic expert diagnostic systems for nuclear power plant (NPP)
safety
Research in applying expert systems software to nuclear power plants (NPP) has
increased substantially in the last two decades. The dynamically complex system of a
NPP is a most challenging topic for artificial intelligence (AI) specialists. Malfunction
diagnosis of NPP systems generally uses shallow knowledge incorporated in fault
models. This derives from the fact that most NPP systems receive signals from sensors
and that their possible malfunction causes and effects on system variables are well
known.
In recent years many important results have been obtained about representation and rea-
soning on the structure and behavior of complex physical systems using qualitative causal
models. The current AI trend in this area is qualitative reasoning using deep
knowledge representation of physical behavior (Soumelidis and Edelmayer, 1991).
The suitability and the limits of a qualitative model based on deep knowledge for fault
detection and diagnosis of the emergency feed water system (EFWS) in a NPP are pre-
sented here. The EFWS has been chosen because of its importance in the safe function-
ing of the NPP.
The EFWS is a standby system which is not operated during normal plant operation. The
role of the EFWS is to provide full cooling of the Reactor Coolant System in emergency
conditions. The EFWS is automatically activated in three cases of NPP malfunction:
1. Loss of offsite power (LOOP).
2. Low-low level in any steam generator.
3. Loss of alternating current (LOAC).
The possible malfunctions which can occur in the EFWS and their causes and effects on
system variables are well known. They are associated with cracks in the pump or
condensate storage tank (CST) casing, pipe or valve ruptures, and pump or valve
operation failures.
As the EFWS operates only in emergency conditions, the occurrence of a malfunction
in the EFWS can lead to catastrophic results. Safety assurance is an acute problem in
NPPs. It is expected that expert systems can contribute to the improvement of flexibility
and man-machine communication in NPPs.
The expert system diagnostic process is performed by a forward-chaining inference en-
gine that operates on the knowledge base. The inference mechanisms adopted in deep
modeling techniques are used.
The diagnostic process consists of two modules: Fault Detection and Fault
Diagnosis (see fig. 4.10).
The process starts with the Fault Detection module, which detects a symptom of
malfunction by observing any qualitative variation of the output parameters. Several
information sets are instantiated in the initialization phase. The process then continues
with the identification of the causes of malfunction by exploiting the information
contained in the Model. The Model actually represents the Knowledge Base. It contains
descriptions of the Physical System (generic components, initial measurements,
connections, possible measurements, actual components). Note that only the correct
system behavior is described in the model (Obreja, 1990). The Fault Diagnosis module
then propagates the observed qualitative variation through the system model using a
constraint propagation method (de Kleer, 1987). Thus, all possible fault models are
generated. This step ends when some input parameters (i.e. parameters in the LHS of
the rules) are unknown, thus making further propagation impossible. The qualitative
reasoning process can continue only if new measurements are taken.
The decision on the choice of optimum measurements is taken according to heuristic
criteria, i.e. probabilities of component failures. From these probabilities one can
compute candidate probabilities and Shannon's entropy function (de Kleer, 1990, 1987).
After the most appropriate point to measure next has been identified, the measurement is
taken, and the qualitative propagation is continued for this measurement.
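The measurement selection heuristic can be sketched as follows; the candidate probabilities, the predicted qualitative outcomes and the minimum-expected-entropy criterion shown here are a plausible reading of the de Kleer style approach, not the exact procedure used in the cited work.

# Choosing the next measurement point: pick the one whose predicted outcomes
# minimise the expected Shannon entropy of the candidate (fault) probabilities.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_entropy(candidates, predictions, point):
    # predictions[point][candidate] -> predicted qualitative outcome ('inc'/'dec'/'norm')
    by_outcome = {}
    for cand, p in candidates.items():
        by_outcome.setdefault(predictions[point][cand], []).append(p)
    return sum(sum(ps) * entropy([p / sum(ps) for p in ps]) for ps in by_outcome.values())

def best_measurement(candidates, predictions):
    return min(predictions, key=lambda pt: expected_entropy(candidates, predictions, pt))

# Hypothetical EFWS example: three fault candidates, two possible measurement points.
candidates = {"pump failure": 0.5, "valve stuck": 0.3, "CST crack": 0.2}
predictions = {"pump outlet flow": {"pump failure": "dec", "valve stuck": "dec", "CST crack": "dec"},
               "CST level":        {"pump failure": "norm", "valve stuck": "norm", "CST crack": "dec"}}
print(best_measurement(candidates, predictions))   # -> 'CST level'

The CST level is chosen because its predicted outcome separates the tank-crack candidate from the other two, whereas the outlet flow is predicted to decrease for all candidates and is therefore uninformative.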
The Knowledge Base contains qualitative information derived from the EFWS model.
This information is used by the Diagnostic Process as presented above. The EFWS
model is described by components, connections, and equations involving process variables
and design parameters. Components are manual isolation valves, pumps, tanks and "t-
connections", i.e. pipes.
Components are connected together by process variables. The component's behavior
involves process variables and design parameters. Design parameters have nominal val-
ues stated by design. Qualitative analysis considers design parameters as constants.
Figure 4.10 The expert system diagnostic process for NPP safety (the Model holds the component
descriptions and component interconnections from which the diagnosis starts).
Each possible malfunction affects some variables in a known way (inc or dec), but to an
unknown degree. Thus, the model and its analysis are intrinsically qualitative. The use
of simple dynamic models for the system and the components is indispensable for the
real-time implementation of the proposed diagnostic procedure. Even quite complicated
technical systems, such as NPPs, are made up of rather simple components. It is usually quite
clear what the input and the output of each component are and what their interactions are.
Therefore, it is usually easy to conceptually split the process into subprocesses with
simple interactions between them. For each process which is to be surveyed, a sub-
model is written. Each submodel is fed, in real time, with measurements of the variables
that influence the corresponding subprocess. If the usual relationship between the proc-
ess variables is broken, this indicates that something is wrong. A given submodel re-
ceives the same input as the corresponding subprocess and should also give the same
output. A fault in a given subprocess may after some time spread its influence over a
large part of the total plant and give abnormal values to all variables, but the normal
relation between these abnormal variables is still valid, except in the faulty subprocess.
The ability to say where the fault is situated, in addition to saying that there is a fault, is
an important advantage of this procedure.
Each diagnosis can be considered an independent, time-stamped object completing
within seconds of invocation. Parallel diagnosis invocations may exist,
facilitating the simultaneous analysis and detection of multiple faults.
In implementing expert diagnostic procedures in real-world problems, PC platforms are generally inadequate. For an efficient real-time implementation,
the model-based fault detection part should be coded in FORTRAN on a general-purpose number-cruncher, while the qualitative diagnostic reasoning should be performed on a dedicated AI workstation, utilizing for example the convenient and powerful
LISP development environment. Coupling between the two modules can be facilitated by
means of Ethernet hardware and multivendor network software glue such as the TCP/IP
(Arpanet) protocols. Use of TCP (transmission control protocol) ensures reliable
transfer of critical data between the computers involved in the system. As TCP is a
connection-oriented communications protocol, some overhead in connection
management can be noticed. More speed could be gained by using connectionless
protocols, e.g. UDP (user datagram protocol) or the lower-level IP (Internet protocol)
directly, but this would come at the expense of reduced safety and reliability, unless
carefully programmed by the application developer.
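As an illustration only, the sketch below shows how critical detection results could be shipped from the numerical module to the reasoning workstation over a TCP connection; the host name, port number and record format are hypothetical and are not part of the cited system.

```python
import json
import socket

# Hypothetical address of the AI workstation running the diagnostic reasoner.
REASONER_HOST, REASONER_PORT = "ai-workstation", 5005

def send_detection_result(residuals, symptom):
    """Send one detection record reliably over TCP (connection-oriented)."""
    record = json.dumps({"residuals": residuals, "symptom": symptom}).encode()
    with socket.create_connection((REASONER_HOST, REASONER_PORT), timeout=5.0) as sock:
        sock.sendall(record)      # TCP provides ordered, acknowledged delivery
        ack = sock.recv(16)       # simple application-level acknowledgement
    return ack == b"OK"

# Example call from the fault-detection module:
# send_detection_result([0.02, 0.31], "feedwater_flow_low")
```

A UDP or raw IP variant would avoid the connection-management overhead, but the sender would then have to add its own sequencing and retransmission logic to retain the same reliability.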
The methodology described above is applicable in a straightforward manner to other
complex plants (chemical plants, conventional power plants, marine equipment, etc.).
[Figure: plant schematic showing the Reactor Building (RHR) and the Turbine Building (Turbine, Generator).]
Table 4.2 shows the matrices Rijl and Pijl appearing in eqs. (4.15), (4.16), while Table 4.3
gives Eijl and hjl appearing in eqs. (4.26), (4.27). Unity was specified for the upper
bound Eiju only if Eijl was greater than zero in eq. (4.26); the upper bound was zero at
the zero lower bound. The domain (1 ≤ i ≤ 5, 1 ≤ j ≤ 7) corresponds to Rij, and (6 ≤ i ≤ 21,
4 ≤ j ≤ 9) to Eij.
Examples of diagnosis through exceptions.
Ascertaining whether the arithmetic given by eqs. (4.B-26), (4.B-27) is capable of
detecting a failure presents a problem. Several examples were solved by utilizing
only the exceptions, where the input to this diagnostic method was provided by the
linguistic truth value of the proposition "the jth symptom is recognized", which
determined both the lower and upper bound.
Table 4.2 Matrices of fuzzy relation of failures with symptoms.
(a) Lower bound Rijl appearing in eq. (4.15) (1 ≤ i ≤ 5, 1 ≤ j ≤ 7):

Rijl =
1.0 1.0 1.0 1.0 1.0 0.0 0.0
1.0 1.0 0.0 0.0 0.0 1.0 0.0
1.0 1.0 1.0 1.0 0.0 0.0 0.0
1.0 0.0 0.0 0.0 0.0 0.0 1.0
1.0 1.0 0.0 0.0 0.0 0.0 0.0

(b) Lower bound Pijl appearing in eq. (4.16):

Pijl =
0.9 0.9 0.6 0.6 0.3 0.0 0.0
0.9 0.6 0.0 0.0 0.0 0.3 0.0
0.9 0.6 0.3 0.3 0.0 0.0 0.0
0.9 0.0 0.0 0.0 0.0 0.0 0.6
0.9 0.6 0.0 0.0 0.0 0.0 0.0

Table 4.3 Matrix of alternative fuzzy relation Eijl and vector of exceptional proposition hjl.
(a) Lower bound Eijl appearing in eq. (4.27) (6 ≤ i ≤ 21, 4 ≤ j ≤ 9):

Eijl =
0.6 0.0 0.0 0.0 0.6 0.6
0.6 0.0 0.0 0.0 0.6 0.0
0.9 0.0 0.0 0.0 0.0 0.9
0.0 0.9 0.0 0.0 0.9 0.0
0.0 0.9 0.0 0.0 0.0 0.9
0.9 0.0 0.0 0.0 0.6 0.0
0.6 0.0 0.9 0.0 0.0 0.6
0.0 0.0 0.9 0.0 0.6 0.0
0.0 0.0 0.9 0.0 0.0 0.6
0.9 0.0 0.0 0.0 0.6 0.0
0.9 0.0 0.0 0.0 0.0 0.6
0.0 0.0 0.0 0.0 0.0 0.9
0.0 0.0 0.0 0.9 0.9 0.0
0.0 0.0 0.0 0.6 0.0 0.6
0.0 0.0 0.0 0.0 0.9 0.0
0.0 0.0 0.0 0.0 0.0 0.9
Example 4.1. "No symptom has been recognized". This example should take the linguistic value "completely false" for all j, and consequently the range of the a-cut is [0, 0]
for all a, so that,
bjl = 0.0, bju = 0.0; j = 1, ..., 9
The calculated failures were,
Aia = [0, 0]; i = 1, ..., 21
This resulted in producing the announcement that no failure exists.
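The interval arithmetic of eqs. (4.15), (4.16) and (4.26), (4.27) is developed earlier in the chapter and in Appendix 4.B. Purely as an illustration of the kind of computation involved, the sketch below applies a max–min composition between symptom truth-value intervals [bjl, bju] and an interval-valued relation matrix; the form of this composition is an assumption made here for illustration, not the book's exact formulas, and the data reuse only the first two rows of Rijl for brevity.

```python
def max_min_interval(R_low, R_up, b_low, b_up):
    """Max-min composition of an interval fuzzy relation with interval symptom values.

    R_low, R_up: per-failure rows of lower/upper relation bounds.
    b_low, b_up: lower/upper truth values of the symptoms.
    Returns one [lower, upper] truth interval per failure.
    """
    A = []
    for r_low, r_up in zip(R_low, R_up):
        lo = max(min(r, b) for r, b in zip(r_low, b_low))
        up = max(min(r, b) for r, b in zip(r_up, b_up))
        A.append([lo, up])
    return A

# Example 4.1: no symptom recognized -> every failure interval collapses to [0, 0].
R_low = [[1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]]   # first two rows of Rijl, for brevity
R_up = [[1.0] * 7, [1.0] * 7]                    # assumed unity upper bounds
print(max_min_interval(R_low, R_up, [0.0] * 7, [0.0] * 7))   # [[0.0, 0.0], [0.0, 0.0]]
```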
Example 4.2. This next example will diagnose a hypothetical state. Suppose a large
leakage occurs at the main steam line, not in the dry-well but in the reactor building, and
recall that such an event was assumed to be recognized as the exception in the present
exercise. In this example, the isolation valve for the main steam shall be closed due to a
decrease in pressure on the steam line, and the reactor may be stopped if the pressure in
the reactor vessel decreases. Consequently the temperature of the atmosphere rises and
the flow rate of the sump increases in the building. However, the flow rate of the main
steam line decreases rapidly because of the main steam isolation valve being closed.
The symptoms "pressure decrease in the main steam line", "high temperature in the
building" and "an increase in the flow rate of the sump in the building" are sharply
observed. The values b4l = 1.0, b8l = 1.0 and b9l = 1.0 were substituted for the lower
bounds of these symptoms. The solution Aia (i = 1, ..., 21) is obtained as:
A6a = [1, 1], A7a = [1, 1], A8a = (0.6, 1],
A11a = (0.6, 1], A15a = (0.6, 1], A16a = (0.6, 1],
A17a = (0.6, 1], A20a = (0.6, 1], A21a = (0.6, 1],
Aia = 0 (for other i's).
The solution indicates accurately a leakage in the main steam line inside the building by
reading the a-cut of A6, A7 and A8. At the same time, it suggests that a warning should
be given about leaking in the building of the residual heat removal system (A11), the
steam lines of the reactor core isolation cooling systems (A15, A16), the water line of the reactor
core isolation cooling system (A17) and the feed water system (A20, A21).
Participation of exception in diagnosis. It is not easy to decide how exception, as presented, should participate practically in identifying a failure. It is important for the diagnostic system to use a large amount of fragmentary information concerned with particular events. A small number of sharply aggregated implications become much more efficient when mixed with consideration of the fragmentary information, which serves as the
exception in the present sense.
Fig. 4.12 indicates a procedure of reinforcement of the diagnostic system by mixing exceptions with implications. Substitution of the recognized symptoms into eqs. (4.15),
(4.16) generally yields a failure in terms of linguistic truth values. When an engineer fails
to identify the failure firmly and/or an alternative failure weighs on her (his) mind, (s)he
should consider the exceptional proposition. In other words, the truth value of Pj of eq.
(4.9) must be close to true, and consequently (s)he may fall back on the exceptions. The
next example elaborates this procedure.
Example 4.3. "The flow rate of the building sump has slightly increased and the
differential flow rate of the clean-up system for the reactor water becomes large".
Lower bounds of 0.6 were adopted for these symptoms. The present
method first tries a failure inside the dry-well with the conventional fuzzified implication
written by eqs. (4.15), (4.16). The solution is of the form,
Aia = [0, 0.1]; i = 1, ..., 5
which indicates "no leakage exists inside the dry-well". This is a case where no failure
is obtained although several symptoms are recognized. It is found from calculation that
the antecedent of eq. (4.10) is deduced to be very true for the proposition,
(¬∃k (Ak ∧ Rkj))a = (0.9, 1]; k = 1, ..., 5. This enables the engineer to decide that
there might exist a failure elsewhere, and steps may be taken to examine by exception;
then the solution is of the form,
A17a = (0.6, 1], A19a = [1, 1], A21a = (0.6, 1], Aia = 0; i = 1, ..., 21 and i ≠ 17, 19, 21.
Figure 4.12 Procedure of reinforcement of the diagnostic system by mixing exceptions with implications.
The above solution reveals that a leakage in the clean-up system for the reactor water
inside the building (A19) exists, and that leakages in the water line of the reactor core isolation cooling
system (A17) and the feed water system (A21) in the building are indicated with a possible
grade.
Figure 4.13 shows the interactive procedure of the present diagnosis on a CRT terminal
of a personal computer. The truth value of 0.6 serves as a symptom for "the
differential flow rate increased a little in the clean-up system". This method builds up
a hypothesis from the symptom that the clean-up system is leaking in the dry-well, and
asks the engineer to ascertain whether the hypothesis generates the various symptoms or
not. However, no symptom is recognized based on the hypothesis (fig. 4.13(a)), and
then no failure is identified by the implications (fig. 4.13(b)). In the case that the engineer is able
to recognize the symptom "the flow rate increased in the building sump", the input
b9l = 0.6 and all symptoms yield the solution on the CRT (fig. 4.13(c)).
Start
Flow rate increase in dry-well sump **0.0
Flow rate increase in air condenser drain **0.0
Pressure increase in dry-well **0.0
Pressure decrease in steam line **0.0
Flow rate increase in main steam line **0.0
Flow rate increase in residual heat removal system **0.0
High differential flow rate of reactor water clean-up system **0.6
Possibly leaking in
reactor water clean-up system in dry-well
"Check follows"
Flow rate increase in dry-well sump **0.0
(a) First segment of a session with the diagnostic system. User responses follow the double asterisks.
**** Kind of failure ****
**** Possibly ****
main steam line in dry-well [0.0, 0.1]
residual heat removal system in dry-well [0.0, 0.1]
steam line of reactor core isolation cooling system in dry-well [0.0, 0.1]
reactor water clean-up system in dry-well [0.0, 0.1]
feed water system in dry-well [0.0, 0.1]
(b) Inferred failures by implication listed on the CRT terminal with their truth values.
**** Possibly ****
water line of reactor core isolation cooling system in building (0.6, 1.0]
reactor water clean-up system in building [1.0, 1.0]
feed water system in building (0.6, 1.0]
(c) Final segment of a diagnosis. Inferred failures by exception are displayed with their truth values.
Figure 4.13 Interactive diagnostic session on the CRT terminal.
During a complex industrial process (e.g. power system, chemical plant, etc.) disturbance, many alarms are presented to the operator, making it difficult to determine the
cause of the disturbance and delaying the corrective action needed to restore the
system to its normal operating state.
In order to provide continuous real-time analysis of the alarms generated by a SCADA
(Supervisory Control And Data Acquisition) system, a knowledge-based system, being
immune to emotional factors, can be used to assist the operators in analyzing a crisis
situation so that an optimal solution may be found as rapidly as possible.
A knowledge-based alarm processor can replace a large number of alarms with a few
diagnostic messages that describe the event(s) that generated the alarms. It may also pre-
sent advice to the operators when certain situations occur. The knowledge-based system
performs much of the analysis that a power system operator would have to perform.
Since it can quickly analyze the alarms and present a small number of concise messages,
the operator is given a clearer picture of the condition or conditions that caused the
alarms making it easier for the operator to take corrective action in a timely manner.
Because the system operator (in a power system for example) is very busy during a disturbance, a basic requirement of a knowledge-based alarm processor is that it must not
query the operator for any type of information. Since the SCADA system is also busy
processing alarms, collecting disturbance data and performing its normal functions such
as Automatic Generation Control, the knowledge-based system should not strain the
computer resources of the SCADA system. The knowledge-based system must be able
to handle multiple, independent power system disturbances, presenting diagnostic messages to the operators within a short period of time. Also, a diagnostic message must be
retracted if the conditions that caused the message to be generated are no longer valid.
Two basic approaches are possible in incorporating a knowledge-based alarm processor
(KBAP) into an Energy Management System (EMS) or other complex industrial process
environment: an embedded approach and an appended approach. In an embedded approach, the knowledge-based system is incorporated in the SCADA system. In an appended approach, a separate computer is used with a data link connecting the KBAP
with the SCADA computer.
The appended approach is selected here, mainly because a knowledge-based system is
processor and memory intensive. By implementing the KBAP on a separate computer,
the processing and memory load of the knowledge-based system is kept off the SCADA computer.
Figure 4.14 Structure of the knowledge-based alarm processor: rule file, SCADA alarm processor, data base management system, knowledge base and explanation facility.
The processing speed of the KBAP depends both on the hardware that the KBAP is implemented on as well as on the rate at which the SCADA system supplies alarms to the
KBAP. In other words, the limitation of the KBAP in presenting diagnostic messages to
the operators is mainly due to the limitation of the SCADA system in detecting and
generating alarms. The SCADA limitation is a result of varying RTU (Remote Terminal
Unit) scan rates as well as power system relay actions.
The collection of knowledge in the KBAP is referred to as the knowledge base. One way
of organizing the knowledge is to form rules and facts. The rules contain accumulated
knowledge in the form of IF-THEN constructs. The facts in the knowledge base are collected pieces of information related to the problem at hand. Rules express the relationships between facts. Using the current facts, the Inference Engine decides how to apply
the rules to infer new knowledge. It also decides the order in which the rules should be
applied in order to solve the problem.
The rules in the KBAP may be fired using forward chaining or backward chaining (see
Section 4.2.1). The difference between the two approaches is the method in which the
facts and rules are searched. In forward chaining, the Inference Engine searches the IF
portion of the rules. When a rule is found in which the entire IF portion is true, the rule is
fired. Forward chaining is a data-driven approach because the firing of the rules depends
on the current data. In backward chaining, the Inference Engine begins with a goal that is
to be proved. It searches the THEN portion of the rules looking for a match. When a
match is found, the IF portion of the rule is established. The IF portion may consist of
one or more unproven facts. These unproven facts become separate goals that must be
proved. Backward chaining is goal-driven because the rules are fired in an order that
attempts to prove a goal.
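The two firing strategies can be summarized by the minimal sketch below, using generic rules expressed as (premises, conclusion); the rule contents are hypothetical and only illustrate the two search orders described above.

```python
# Each rule: (set of premise facts, concluded fact). Hypothetical KBAP-style rules.
RULES = [
    ({"breaker_closed", "bus_voltage_low"}, "section_low"),
    ({"section_low"}, "report_low_bus_voltage"),
]

def forward_chain(facts):
    """Data driven: fire every rule whose entire IF portion is satisfied by current facts."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)      # the rule fires and new knowledge is inferred
                changed = True
    return facts

def backward_chain(goal, facts):
    """Goal driven: prove a goal by proving the premises of a rule whose THEN part matches."""
    if goal in facts:
        return True
    for premises, conclusion in RULES:
        if conclusion == goal and all(backward_chain(p, facts) for p in premises):
            return True
    return False

print(forward_chain({"breaker_closed", "bus_voltage_low"}))
print(backward_chain("report_low_bus_voltage", {"breaker_closed", "bus_voltage_low"}))
```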
Metalevel control rules improve system performance by selecting the object-level rules.
The object-level rules in the KBAP are the forward and backward chaining rules described above. Metalevel actions provide context switching and goal selection. Fig. 4.15
shows a conceptual representation of how the metalevel control rules are implemented in
the KBAP.
When a SCADA alarm arrives, the metalevel control rules are used by the Inference
Engine to generate one of two metalevel actions. A context switching action selects data-driven, or forward chaining, rules. That is, the metalevel control rules are used to select
the proper context based on the incoming alarm. When the context is selected, metalevel
control rules are used to produce one or more goal selection actions (if possible). A goal
selection action selects a hypothesis that the goal-driven, or backward chaining, rules
attempt to prove. As can be seen in figure 4.15, object-level actions result from applying
the forward and backward chaining object-level rules. For the KBAP, two object-level
actions are possible: diagnostic messages and advice for the operators.
The metalevel control rules contain heuristics that guide the Inference Engine in forming
hypotheses. Heuristics are rules of thumb that an expert uses when solving a problem;
they are used to narrow the search space for a solution. Backward chaining object-level
rules are generic in nature and are not related to any particular station (in the case
of a power system) or process subsystem (in the general case).
Figure 4.15 Conceptual representation of the metalevel control rules: incoming alarms trigger metalevel actions, which select and fire the object-level rules.
Fig. 4.16 shows how the metalevel control rules are internally organized. The order of
the metalevel control rule nodes is the same as the order of the rules in the rule file.
Each metalevel node is the head of a linked list of one or more premise nodes. A premise
node contains a premise clause. All of the premise clauses of a rule must be true in order
for the rule to fire. The Inference Engine scans the metalevel control list beginning with
"metahead", the head node. If all of the premise clauses of a rule are true, the Inference
Engine fires the metalevel control rule, triggering one of the two metalevel actions, context switching or goal selection. In the case of goal selection, the metalevel node contains the goal, or hypothesis, that the object-level rules attempt to prove. The station
node (for the case of a power system) shown in fig. 4.16 has two working hypotheses.
Each hypothesis is represented by a hypothesis node linked to the station node. A
hypothesis node points to the metalevel node that it is associated with. When a working
hypothesis is proved to be valid by the Inference Engine, the hypothesis node is linked to
the station node's validated conclusion chain.
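The list structure of fig. 4.16 can be pictured with a few record types, as in the hypothetical sketch below; the field names and the example rule are illustrative only, not those of the actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class PremiseNode:
    clause: Callable[[dict], bool]        # a premise clause tested against the current facts

@dataclass
class MetaNode:
    action: str                           # "context_switch" or "goal_selection"
    goal: Optional[str]                   # hypothesis carried by a goal-selection node
    premises: List[PremiseNode]
    next: Optional["MetaNode"] = None     # next node in the metalevel control list

def scan_metalevel(metahead: Optional[MetaNode], facts: dict):
    """Walk the list from 'metahead'; fire every node whose premise clauses all hold."""
    fired = []
    node = metahead
    while node is not None:
        if all(p.clause(facts) for p in node.premises):
            fired.append((node.action, node.goal))
        node = node.next
    return fired

# Hypothetical usage: one goal-selection rule reacting to a station undervoltage alarm.
n1 = MetaNode("goal_selection", "low_bus_voltage",
              [PremiseNode(lambda f: f.get("alarm") == "bus_undervoltage")])
print(scan_metalevel(n1, {"alarm": "bus_undervoltage"}))
```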
Each change of state detected by the SCADA system is reported by the SCADA alarm
processor to the KBAP Alarm Preprocessor. A state change includes points going into
alarm, points returning to normal, and supervisory control actions. The SCADA change
of state packets are converted by the Alarm Preprocessor into a form suitable for the
Inference Engine. Incoming packets are queued until the Inference Engine is able to
process them. The configuration point nodes contain the current symbolic value of the
point. When a change of state occurs on a point, the value of the point in the node is
updated.
Figure 4.16 The internal organization of the metalevel control rule nodes: "metahead" heads the list, each metalevel node is linked to its premise nodes, and hypothesis nodes are attached to a station (or process subsystem) node together with its chain of validated conclusions.
A newly validated conclusion may result in one or more new hypotheses being formed. A change
of state may also result in a previously validated conclusion becoming no longer true.
When this occurs, the Conclusion Processor is invoked to inform the SCADA system
that a previous message is no longer valid. This is similar to an alarm returning to
normal.
When an operator requests information on how a particular conclusion was reached, the
SCADA system man/machine interface sends the request over the data link to the
KBAP. The Explanation Facility is invoked to process the request (see fig. 4.14). The
Explanation Facility passes information back to the SCADA system showing the rules
that fired and the facts that caused the rules to fire, for presentation to the operators.
System operation.
This part of the section describes the operation of the KBAP for processing different
system components. Power system and general industrial plant components, such as motor
pumps and rotating machinery, are discussed.
Low voltage bus and electricity supply networks. For this example, the low voltage
bus has three bus sections. Fig. 4.17 shows some of the object-level rules which can be
used for fault diagnosis.
Object-level rule names are enclosed in colons and metalevel control rules are surrounded with double colons. Comments are enclosed in /* and */. Backward chaining is
used on the object-level rules in fig. 4.17, in an attempt to prove the goal "C6". If all of
the bus sections on a bus have low voltage and the bus breakers are closed, a single low
bus voltage message is presented to the operators. On the SCADA system, the operators
would receive a separate low voltage alarm for each bus section and possibly other
alarms such as under-voltage relay alarms. This simple example illustrates how a single
message can be presented in place of numerous SCADA alarms. As well as reducing the
number of messages that the operators receive, the KBAP also contains rules that diagnose the situation(s) that triggered the alarms.
Tesch et al. (1990) present a case study of a KBAP implementation for Wisconsin
Electric's Energy Management System. The KBAP, written in the C programming
language, uses a configuration database that contains the structure of the power system
as well as symbolic data values for each point monitored by the SCADA system. The
knowledge-based system continuously analyzes and processes SCADA alarms in real-time, presenting diagnostic messages to the power system operators without requiring
any information from the operators.
Brailsford et al. (1990) present a prototype KBAP system, named FAUST, for use in
132 kV and 33 kV electricity supply networks. All items of the electricity distribution
network, especially those at the higher voltages, are telemetered. Each telemetered item
is polled regularly (5–20 secs) and any changes of state and alarms are reported to the
/* If the first circuit breaker on the device is closed, establish conclusion "C1" as a fact.
The rule name is "Rule-1". The underscore character denotes a blank. The word "first"
is a position indicator. */
:Rule-1: If (first_circuit-breaker=closed) then C1;
/* Establish conclusion "C2" if the second circuit breaker on the device is closed. */
:Rule-2: If (second_circuit-breaker=closed) then C2;
/* This rule is fired if the first bus section voltage is low and conclusion "C1" is an
established fact. */
:Rule-3: If (first_bus-voltage=low) and (C1) then C3;
/* The rule order is not important for general rules. Rules are entered free format.
Comments may appear anywhere. */
:Rule-6: If (C4) and (C3) then C6;
/* Fire this rule if the third bus section voltage is low and conclusions "C2" and "C5"
are established facts. */
:Rule-4: If (third_bus-voltage=low) and (C5) and (C2) then (C4);
/* If the second bus section voltage is low, establish conclusion "C5". */
:Rule-5: If (second_bus-voltage=low) then C5;
/* Fire this rule if the bus voltage on the bus at the opposite end of a transmission line
is low and the bus voltage in the current station is normal. */
:Rule-15: If (opposite_bus-voltage=low) and (adjacent_bus-voltage=normal) then
R15;
Figure 4.17 General object-level rule examples for a low voltage bus.
Power transmission substations. Several additional benefits can be obtained when the
substation integrated control and protection system (ICPS) is used as part of the overall
Energy Management System (EMS). A number of substation ICPSs are being developed
around the world today, where the protective relaying, control, and monitoring functions
of a substation are implemented using microprocessors. In this design, conventional relays
and control devices are replaced by clusters of microprocessors, interconnected by multiplexed digital communication channels using fiber optic, twisted wire pairs or coaxial
cables. The ICPS incorporates enhanced functions of value to the utility and leads to
further advancement of the automation of transmission substations. More powerful processing capabilities can be established if an ICPS is used instead of the conventional
SCADA Remote Terminal Units at the substation. In addition, an extensive data base can
be available at the substation level. This data can be used to assist the dispatcher, protection
engineer and maintenance personnel during an emergency.
Fault diagnosis is carried out by operators using information on activated relays and tripped
circuit breakers. The faulty components are inferred by imagining a protective relay
sequence related to the incident and simulating backwards the relay sequence from the
observed data. An expert system will be very useful for this type of task, since the
problem involves a mass of data and uncertainties and cannot be described by a well-defined analytical model. For example, a rule to identify a failed breaker is:
Rule
{
  &1 (Relay operated = yes; considered = yes;);
  &2 (Breaker name = &1.br1; open = no; failed = no; status = on;);
  →
  modify &2 (failed = yes);
};
This rule implies that, if a relay has operated and one of its corresponding breakers con-
nected in the circuit has not opened, identify this breaker as a failed breaker.
Stavrakakis and Dialynas (1991) describe models and interactive computational techniques that were developed to model and detect automatically the available restoration
operations following a fault on substation equipment or a substation abnormality. These
abnormalities can be detected before the component enters into the failure stage, using
the computer-based diagnostic techniques described in Chapter 3. The developed computer-based scheme can be installed easily, through a rule-based expert system, in power
substation ICPSs in order to determine the optimal switching operations which must be
executed after a component fault has been detected. The development of a data base
containing all the necessary information concerning the component failure characteristics
and the average repair or replacement and switching times is possible. A supply reliability
index of substation load-points is also evaluated to quantify the reliability performance of
the substation.
On-line transient stability assessment of power systems is a strongly non-linear problem. It is also high-dimensional, since power systems are by essence large scale.
In general, on-line transient stability assessment (TSA) aims at appraising the power system robustness at the inception of a sudden, severe disturbance or fault, and whenever
necessary at suggesting remedial actions. A measure of robustness is the critical clearing
time (CCT); this is the maximum time a disturbance may act without causing the irrevocable loss of synchronism of the system machines. In particular, on-line TSA is used in
real-time operations and aims at performing on-line analysis and preventive control.
Indeed, because the transient stability phenomena evolve very quickly (in the range of a
very few seconds), a good way to face them efficiently is to prevent them.
Wehenkel et al., (1989), proposed a new approach to on-line transient stability
assessment of power systems, suitable for implementation in the SCADA system. The
main concern of this approach has been the application of an inductive inference method
in conjunction with analytic dynamic models and numerical simulations to the automatic
building of decision trees (DTs).
A DT is a tree structured upside down. It is composed of test and terminal nodes, starting at the top node (or root) and progressing down to the terminal ones. Each test node
is associated with a test on the attribute values of the objects, to each possible outcome
of which corresponds a successor node. The terminal nodes carry the information required to classify the objects. The methodology developed there is based on inductive
inference and more specifically on ID3, which is a member of the TDIDT (top-down induction of decision trees) family (see Section 4.2.1). Most of the inductive inference
methods infer decision rules from large bodies of preclassified data samples. The TDIDT
family aims at producing them in the form of decision trees, able to uncover the relationship between a phenomenon and the observable variables driving it. Adapted and applied to on-line transient stability assessment, the method intends to uncover in real-time the intrinsically very intricate relationships between static, pre-fault and/or pre-disturbance conditions of a power system and their impact on its transient behavior, in order to discover
the appropriate control actions needed.
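The core of an ID3-style split is the information-gain computation sketched below; the pre-classified samples (hypothetical pre-fault operating points labelled stable/unstable) and the attribute names are illustrative only and are not the actual TSA attribute set used by Wehenkel et al.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the class labels of a set of samples."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, attribute):
    """Entropy reduction obtained by splitting the node on one candidate attribute."""
    labels = [s["class"] for s in samples]
    remainder = 0.0
    for v in {s[attribute] for s in samples}:
        subset = [s["class"] for s in samples if s[attribute] == v]
        remainder += len(subset) / len(samples) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical pre-classified operating points (pre-fault conditions -> stability class).
samples = [
    {"load": "high", "reserve": "low",  "class": "unstable"},
    {"load": "low",  "reserve": "low",  "class": "unstable"},
    {"load": "high", "reserve": "high", "class": "stable"},
    {"load": "low",  "reserve": "high", "class": "stable"},
]
best = max(["load", "reserve"], key=lambda a: information_gain(samples, a))
print("split the top node on:", best)   # 'reserve' separates the classes perfectly here
```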
Industrial motor pumps.
To investigate the applicability of expert systems in the area of equipment diagnostics of
industrial plants, the principles of a knowledge-based system for real-time diagnosing of
motor pump malfunctions are presented.
The diagnostic approach used in the development of the knowledge base is based on the
method of decision tree (DT) analysis (see previously, Wehenkel et al. (1989)). Using information derived from the equipment maintenance manual and mechanical-electrical
drawings, and through observation of the procedures used by efficient engineering mechanics, a decision tree can be developed to mimic the way a human expert makes decisions and arrives at conclusions. The decision tree is then translated directly into the IF-THEN rules which make up the expert system's knowledge base. The manner in which a
decision tree can be translated into production rule language (PRL) rules is illustrated in
fig. 4.18, which shows a small portion of a decision tree developed to diagnose a faulty
pump starting switch and the corresponding rules written in PRL.
Using on-line DT analysis, the system leads the user through the appropriate procedures
required to quickly identify the faulty pump circuit component. Graphical displays can be
incorporated within the system to assist the user in locating the various components and
test points. Once the faulty component has been isolated, the system is capable of access-
ing a database which can provide personnel with information concerning specific
component part numbers, the availability and location of spare parts and the proper re-
pair action to be taken.
Rotating machinery. During the past few years, condition monitoring of gas turbines
and other industrial engines has blossomed into an economically viable activity for some
of the larger constructors and commercial operators. These gains have been spurred to a significant degree by the development of sophisticated software algorithms for the interpretation of the limited available sensors.
Software packages are being employed to realize many of these gains using artificial intelligence techniques (Doel, 1990). Classical tools that are currently being employed for
commercial engine condition monitoring, such as EGT (exhaust gas temperature) margin
trending, vibration monitoring, oil monitoring, under-cowl leak detection etc., can be
expressed analytically and performed automatically using the statistical aids and signal
analysis techniques described in Chapter 1.
These basic fault occurrence knowledge sources can be used for the creation of the
knowledge base and the inference engine of an expert fault diagnosis tool for rotating
machinery.
Before designing a failure diagnostic system, damage statistics of the specific system
should be considered, indicating those sections in which failures occur with a certain
frequency. Taking the rotating system of a turbomachine as an example, the defects
on moving blades occupy the highest percentage, concerning both the number (37%) and the
repair costs (26%), followed by casing failures (24%), where the financial effect is not as
severe (13%).
In contrast, less frequent failures of bearings (5%) are classified at a high cost level
(27%). Therefore, during the last years, research in monitoring and diagnosing of systems
focuses on the improvement of detecting damages of the rotating system, including
blades, rotor unbalance, cracks, bearings and others.
Pattern recognition methods have also proven to be very effective in failure detection
(see Section 1.3.2.C). For diagnostic purposes, time signals are utilized, which are ob-
tained at the machine with special sensors, like pressure transducers mounted directly
above moving or stator blades, accelerometers placed at the casing of the machine, and
shaft displacement measuring systems, which detect the shaft oscillations.
RULE #1
IF specific pump problem IS pump will not start from machine
AND control power is available
AND switches are set properly
AND NOT red LED on remote box lights when trying to start pump
AND pump will start from remote box
THEN probable defective pump starting switch
AND end of pump diagnosis

RULE #2
IF operating mode IS machine
AND main control switch is ON
AND cutter switch is OFF
THEN switches are set properly

RULE #3
IF 12 VDC LED is ON
OR 110 VDC LED is ON
THEN control power is available
Figure 4.18 Partial decision tree diagram and corresponding PRL rules for a motor pump fault diagnosis.
Consider a vibrating part of a rotating machine. The behavior of the part can be modeled
in real-time by the auto spectral density function of the signal of an accelerometer
attached to it. The goal is to make a diagnosis about the state of the vibrating part. Two
diagnoses can be made locally for this part using the peak on the spectra representing its
behavior at approximately 10.5 Hz. For example, if this peak is,
a) missing, then the part is probably broken;
b) too large, then the part is excessively vibrating.
The following rules can be defined for the part:
(rule-1
  if (null? peak)
  then (set! peak (peak-nearest-to
                    (spectral-peaks
                      (apsd <signal> <parameters>))
                    10.5)))

(rule-2
  if (and peak
          (or (< (frequency peak) 10)
              (> (frequency peak) 11)))
  then (conclude '(broken part)))

(rule-3
  if (and (not (broken part))
          (> (amplitude peak) amplitude-limit))
  then (conclude '(excessively-vibrating part)))
where,
• peak and amplitude-limit are instance variables of the part;
• (apsd <signal> <parameters>) and (spectral-peaks <spectral function>) are externally defined global functions;
• (peak-nearest-to <spectral peaks> <frequency>), (frequency <peak>), and (amplitude <peak>) are internal functions (i.e. methods) connected to the instance variable peak;
• (broken part) and (excessively-vibrating part) are assertions.
This example is intended to illustrate how numerical models (in the present case a
spectral density function of a signal) presented in Chapter 1 can be integrated in rules
aimed at solving a rotating machine diagnostic problem. For this purpose two steps are
taken:
1. The model is activated (i.e. the spectral function is computed) under specific conditions (in the present case, if peak has no value).
2. The results (which are numerical in nature) are compared to limit values (frequency and
amplitude limits in the present example), resulting in Boolean values which can be
combined with the assertions; a numerical sketch of these two steps follows.
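As a purely numerical counterpart of these two steps, the sketch below estimates the auto spectral density of an accelerometer signal and applies the same two checks (peak missing near 10.5 Hz, or peak amplitude above a limit). It assumes numpy and scipy are available; the signal is synthetic and the amplitude limit is an arbitrary, hypothetical value.

```python
import numpy as np
from scipy.signal import welch, find_peaks

fs = 200.0                                   # sampling frequency in Hz (assumed)
t = np.arange(0, 10, 1 / fs)
signal = np.sin(2 * np.pi * 10.5 * t) + 0.1 * np.random.randn(t.size)   # synthetic vibration

# Step 1: activate the numerical model (auto spectral density of the signal).
freqs, psd = welch(signal, fs=fs, nperseg=512)
peaks, _ = find_peaks(psd)

# Step 2: compare the numerical results to limit values and form Boolean assertions.
AMPLITUDE_LIMIT = 5.0                        # hypothetical limit for this part
near = [p for p in peaks if 10.0 <= freqs[p] <= 11.0]
if not near:
    print("assert: (broken part)")           # the expected peak is missing
elif psd[near[0]] > AMPLITUDE_LIMIT:
    print("assert: (excessively-vibrating part)")
else:
    print("part behaves normally")
```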
Given these features, it is not difficult to envision very powerful, linked diagnostic and
maintenance systems to diagnose engine problems using all available input in assisting
the maintenance process.
Detecting and masking transient failures of the SCADA system itself.
The effect of lightning poses significant problems to computer systems and electronic
equipment used in process supervision due to the possibility of a direct strike. Protection
against a direct strike is very expensive and seldom used (Noore et al., 1992).
When lightning strikes the plant, it produces both electric and magnetic fields that vary
with distance, frequency, and time. These fields can couple into power transmission and
communication lines and adversely affect the performance of computers and other
electronic equipment. Protection from equipment damage and power supply interruptions through shielding, insulation, lightning arrestors, and filtering circuits helps in reducing the effects due to lightning but is not completely effective. The field-induced voltages and currents cause transients that can create an upset in a digital or computer system. Problems due to transients had not been a significant area of research until recently. The reason for this is that electronic equipment built with vacuum tubes or discrete power
semiconductor devices is relatively immune to such transients; with these
devices, the short duration of a transient usually is not of serious consequence. However,
the rapid advancement of technology in recent years has led to the emergence of several
complex VLSI devices used in the design of computer-based data acquisition systems,
process monitoring systems, surveillance systems, and control systems to support both
critical and non-critical industrial plant activities. With the attendant problems that are
exacerbated as a direct consequence of the submicron geometry of advanced microelectronic semiconductor technology, transients of smaller magnitudes and shorter
duration can have a dramatic effect and cause a system upset in computers.
To detect transient faults efficiently within a short time period, the proposed detection
mechanism focuses on a set of hardware modules and control paths that are utilized by
the microcomputer system. For example, if a transient fault alters the contents of a register that is not used by the microcomputer for performing process monitoring or control
operations, it is not necessary to detect this fault. On the other hand, if a transient fault
causes a change in the contents of a register used by a specific application, then it is imperative to detect this critical fault. Thus, by monitoring only the control paths of the
application program and selected modules used by the microcomputer, the proposed
approach detects all critical transient faults and maintains a high fault coverage.
Let the global memory space M(0, n−1) be partitioned into several disjoint contiguous
physical blocks of memory M(0, i), M(i+1, j), M(j+1, ...), ..., M(..., k), M(k+1, n−1). Each
memory partition M(x, y) is bounded by the lower memory address x and the upper
memory address y, and the size of each memory partition is y − x + 1. Each physical
memory block M(x, y) is logically mapped to a set of instruction sequences I(x, y) that
are executed within the boundary (x, y). Depending on the type of application program,
the logical execution of the instruction sequences can, in general, be globally distributed
in a noncontiguous manner. Each instruction sequence I(x, y) is modular and represents a
subtask of the application program or a subroutine. Although instructions in the instruction sequence I(x, y) are physically contained in M(x, y) sequentially, the program can
logically execute several loops within the boundary (x, y). Instructions contained in I(x,
y) can be either of single word length or of multiple word lengths. The time T(x, y) for executing I(x, y) is exactly computed from the type of processor used in the design, the operating frequency of the clock, the response time of the memory or peripheral device
selected, the number of cycles required to execute a single- or multiword-length conditional or unconditional instruction, the number of loops in I(x, y), the manner in which
the loops are structured, and the number of passes per loop.
Given an application-specific program that is structured as described, the starting address x, or the lower boundary, of each logical set of instruction sequences I(x, y) can be
extracted. Associated with each I(x, y), the execution time T(x, y) can be obtained from
the program structure. This value is stored in the EPROM, whose address is given by x.
Similarly, all execution times are stored under the corresponding addresses. Addresses
that do not correspond to the starting address of an instruction sequence are used to
store all 0s in the EPROM. The transient fault-detection circuitry is shown in fig. 4.19,
and the detection algorithm is outlined as follows:
Let the system clock-out frequency of the microcomputer system be denoted by fsystem and
the countdown timer clock frequency by ftimer; the length of the counter is then
determined by these two frequencies.
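To make the detection idea concrete, the following simulation sketch (with hypothetical addresses and times, not the circuit of fig. 4.19) checks each monitored instruction sequence against the execution time stored for its starting address and flags a transient fault when the sequence overruns the allotted time, i.e. when the countdown timer would have expired.

```python
# EPROM image: starting address of each instruction sequence -> stored execution time (ms).
# Addresses that are not sequence boundaries would hold all zeros.
EPROM_TIMES = {0x0000: 12.0, 0x0400: 30.0, 0x0800: 7.5}   # hypothetical values

def monitor(trace):
    """trace: list of (boundary_address, measured_elapsed_ms) events in execution order.

    A transient fault is reported when a monitored sequence takes longer than the
    execution time stored in the EPROM for its starting address.
    """
    faults = []
    for addr, elapsed in trace:
        allowed = EPROM_TIMES.get(addr, 0.0)
        if allowed and elapsed > allowed:
            faults.append(addr)
    return faults

# Normal run versus a run in which a transient upset stretched the middle sequence.
print(monitor([(0x0000, 11.8), (0x0400, 29.7), (0x0800, 7.4)]))   # []
print(monitor([(0x0000, 11.8), (0x0400, 55.0), (0x0800, 7.4)]))   # [1024] -> fault at 0x0400
```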
For real-time monitoring and control operations, rollback of either instructions, segments
of programs, or entire programs is necessary after a transient fault is detected. A rollback
operation makes use of the concept of checkpoints. A checkpoint is a preselected point
in the execution sequence at which the state of the system is saved. Program rollback forces
the execution to restart at the last checkpoint. The selection of the number of checkpoints is important. If too many checkpoints are used, the program execution time is increased since the status of the processor has to be saved often. On the other hand, if the
checkpoints are fewer in number, the recovery time is increased since the program has to
be re-executed, starting from the previous checkpoint. Strategies for introducing optimal
checkpoints in a program can be found in Noore et al. (1992). Furthermore, it is shown
that transient fault detection and recovery can be performed in real time, which is an
important factor when monitoring, processing, and controlling critical industrial
process operations. The approaches presented have a wider range of applications and
can be extended to real-time, high-risk, and life-critical systems used in the aerospace,
banking, military, chemical and nuclear industries.
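A minimal checkpoint-and-rollback sketch (illustrative only, with a hypothetical state record): the monitored state is saved at each checkpoint, and when a transient fault is detected, execution is restored to the last saved state and resumed from there.

```python
import copy

class Rollback:
    """Save the monitored state at checkpoints; restore it when a fault is detected."""
    def __init__(self, state):
        self.state = state
        self.saved = copy.deepcopy(state)

    def checkpoint(self):
        self.saved = copy.deepcopy(self.state)   # saving costs time, so checkpoints are spaced out

    def rollback(self):
        self.state = copy.deepcopy(self.saved)   # recovery time grows with checkpoint spacing
        return self.state

ctx = Rollback({"setpoint": 4.2, "counter": 17})
ctx.checkpoint()
ctx.state["counter"] = -9999                      # simulated register corruption by a transient
print(ctx.rollback())                             # {'setpoint': 4.2, 'counter': 17}
```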
Figure 4.19 The transient fault-detection circuitry: the microcomputer system (clocked at fsystem) addresses the EPROM holding the stored execution times over the address bus and data lines, and a countdown timer clocked at ftimer monitors the executing sequence.
4.3.3. Expert systems for quick fault diagnosis in the mechanical and
electrical systems domains
Such a representation allows a human expert to use knowledge of function to diagnose problems on devices that may not be familiar but that are similar
to ones in which he/she has expertise.
Once the device has been represented to the computer using these functional primitives,
the resulting model can be used in two ways. First, it can be used as a basis for qualita-
tively simulating the device. Given specific values for the states of certain components in
the system (i.e., switch settings), it can determine the effects on the device by simulation.
This can be done both for correct and malfunctioning behavior. Malfunctioning behavior
is handled by providing symptoms of the failure at the beginning of the simulation (i.e.,
the switch is on, but the light is not). This simulation is based on the substances and their
qualitative levels as input and output to the components of the device.
The second way that the model can be used is to guide diagnosis. To do this, the functional expert system uses some very general diagnostic reasoning rules to examine the functional model and isolate the problem to a specific functional unit or units. These
rules are based upon the values of the inputs and outputs of the functional units. They
are (a minimal sketch follows the list):
1. If an output from a functional unit is unknown, ask the user;
2. If an output from a functional unit appears incorrect, check its input;
3. If an input to a functional unit appears incorrect, check the source of the input;
4. If the input of a functional unit appears to be correct but the output is not, assume
that something is wrong with the functional unit being examined.
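The following sketch applies these four rules recursively over a functional model; the model topology, the observation table and the unit names are hypothetical and serve only to illustrate the search order.

```python
# Hypothetical functional model: each unit -> the units feeding its input.
SOURCES = {"light": ["wiring"], "wiring": ["switch"], "switch": []}

# Observed behaviour of each unit's output: True = correct, False = incorrect, None = unknown.
OBSERVED = {"light": False, "wiring": None, "switch": True}

def output_ok(unit):
    value = OBSERVED.get(unit)
    if value is None:                       # rule 1: output unknown -> ask the user
        answer = input(f"Is the output of '{unit}' correct? [y/n] ")
        value = answer.strip().lower().startswith("y")
        OBSERVED[unit] = value
    return value

def diagnose(unit):
    """Rules 2-4: trace incorrect values upstream; blame the unit whose inputs are correct."""
    if output_ok(unit):
        return None                         # nothing wrong at or upstream of this unit
    for src in SOURCES.get(unit, []):
        if not output_ok(src):              # rule 3: an input is wrong -> check its source
            return diagnose(src)
    return unit                             # rule 4: inputs correct, output wrong -> this unit

# print(diagnose("light"))   # returns e.g. 'wiring' or 'light', depending on the answers given
```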
Once the problem has been isolated, more specific analysis can proceed in the same man-
ner at a lower level, since the knowledge representation can be hierarchical. When a fault
hypothesis is confirmed with enough confidence, an attempt is made to repair the be-
lieved problem. If the repair fails, then the expert restarts the diagnostic process. This
process is followed no matter what type of knowledge is guiding the human at any given
point in time, whether it be experiential, fundamental, common sense, or some other
type.
In Yoon and Hammer (1988), an aiding approach has been described and evaluated for
novel fault diagnosis in complex electromechanical systems. This approach is an alternative to the one presented in Section 4.3.1.2. The emphasis is on novel rather than routine faults. It contains a qualitative model that may correspond to the human's internal
model of the system. This model represents knowledge only of how the system behaves.
Therefore, this aiding approach does not rely on proceduralized knowledge. The qualitative model is the basis for much of the aiding that takes place.
Although the definition of novel fault diagnosis was rather narrow in this research, its
results may find generalization in any diagnostic task that involves causal reasoning
based on deep knowledge. The experimental results (fault diagnosis of an orbital
refueling system to refuel orbital satellites with hydrazine) confirmed that a deep-
reasoning diagnosis can be aided, without disturbing the human diagnostic procedure, by
providing relevant information.
However, the results also suggested that the aiding information should be compatible
with the human information processing. Thus, to design a successful information aid,
human diagnostic activity should be first understood from the viewpoint of human infor-
mation processing.
In diagnosis tasks, system behavior in different modes is investigated: normal behavior,
actual behavior, and hypothetical behavior. The results suggest that normal system behavior, while intuitively important, should not be presented prominently. This is perhaps
because the human is interested in actual behavior. Abnormal system behavior was found
to be very important in diagnosis. Simply presenting integrated, actual (observed) system
behavior topologically has a comparable effect in improving diagnostic performance.
There are several applications of troubleshooting expert systems which are currently
used in the industrial electrical-mechanical sector to perform quick and accurate diagno-
sis. By examining some of the practical existing systems, one can acquire information
which can aid the development of new fault diagnosis systems (see Chen and Ishiko,
1990).
Conceivably, General Electric's (GE's) Computer-Aided TroubleShooting system (CATS-1),
mainly used at locomotive minor maintenance repair shops, is the earliest and most well-known expert system in troubleshooting. In its current status, CATS-1 limits its troubleshooters to the ideas and disciplines of the expert whose knowledge was programmed in
the system. What GE intends to add to the system in future development is a rule editor
that facilitates the creation of new rules by the technician without his having to involve
knowledge engineers. AT&T Bell's Automated Cable Expertise (ACE), a
knowledge-based, cable-trouble analysis system to diagnose cable troubles, is used in the
Texas facilities at Dallas and Houston. ACE was designed to analyze, automatically,
the large amounts of daily complaints and maintenance reports and help the managers to
investigate the data contained in the complaint system. The Testing Operations
Provisioning Administration System-Expert System (TOPAS-ES) is a real-time,
distributed, multi-task expert system for switched circuit network maintenance. TOPAS-ES is one of the most eminent troubleshooting expert systems because of its ability to
handle the volume of trouble reports, perform its analysis, and continuously support
proper trouble resolution. AT&T Bell's Interactive Repair Assistant (IRA) is a
knowledge-based (KB) system that provides expert troubleshooting advice to the telephone company's mobile field technicians. Unlike most KB systems, IRA is capable of
delivering real-time expert advice to many remote users.
For the development of the following three expert systems, the vendors chose commercial expert system shells. The Termi-point Expert Maintenance System (TEMS) is an expert
system for supporting wire-bonding machine maintenance. TEMS can diagnose certain
machine problems in a few minutes, while the human experts solve the same problems in an
hour or more. Coordinate Measuring Machine Professional (CMM-PRO) is used by
Rockwell International in Colorado to assist in the precise calibration and servicing of
state-of-the-art computer-controlled coordinate measuring machines. When CMM-PRO
was put in use, it became a powerful training and learning tool for novice engineers. The
COOKER, a cooker sterilizer expert system used by Campbell Soup Co., can check large
sterilizer cookers for malfunctions and can guide the start-up and shutdown procedures
of the sterilizer operations. Since the system has been installed in eight plants, COOKER
has been used day and night by the plant personnel to diagnose cooker problems without
the need of consulting human experts.
EXACT (Expert system for Automobile air-conditioner Compressor Troubleshooting) is
an expert system developed by Toyoda Automatic Loom Works to support field engineers in conducting quickly the effective diagnosis of compressor troubles. EXACT runs
under three software environments on the PC: EXSYS, FOXBASE+, and AutoCAD.
The expert system module is handled by the EXSYS environment; EXSYS shows the
results of the diagnosis using an executable external program created in QuickBASIC.
The FOXBASE environment handles the user interface and database management features. On the results screen of the expert system, a pictorial information module is provided under the AutoCAD environment for locating the trouble area in the compressor.
Fifty real-world cases have been used to evaluate the system performance (Chen and
Ishiko, 1990).
The ALLY™ Plant Monitoring and Diagnostics (M&D) system. The integration of
various monitoring and diagnostic systems with plant process parameters is realized by
the Westinghouse ALLY™ Plant Monitoring and Diagnostic System. It is the environment where diagnostic evaluations are made on plant-wide equipment and systems
and the mechanism where information is managed and displayed.
Monitoring and diagnostics (M&D) systems are well recognized as a vital tool to help
maintain structure, system and component (SSC) integrity and reliability. These systems
are generally designed to measure key SSC parameters such that an assessment of their
performance can be made. Some of the more advanced systems have the added capability
of performing a diagnosis and determining equipment malfunctions and associated
contributing factors. For example, not only is it important to detect equipment performance deviations in a timely manner, but it is equally necessary to have the information explaining the occurrence. Then, and only then, can proper maintenance actions be prescribed and future failures/degradations be eliminated.
In order for such systems to be effective, they need to have access to a relatively large
and distributed set of information. Information directly acquired from the M&D system's
local sensor set is necessary but usually insufficient by itself. Global plant process parameters measuring the dynamic behavior of other associated equipment and systems are
equally necessary for a complete diagnosis and root cause analysis. Although this information has always been available through plant computers, accessing this information has
been difficult because these computers do not typically have sufficient I/O for diverse
M&D networks. Moreover, most M&D systems are designed without a common architecture, thereby making the connection to the plant computer information systems custom
engineered and uneconomic. This difficulty often results in a compromise between
standardization and cost. The M&D systems, networked together and linked to the plant
information systems, represent a tremendous wealth of information. So much, in fact,
that the distributed architecture of the M&D network becomes a necessity, simply be-
cause of the quantity of data it must process.
The basic design of ALLY™ implements a Blackboard Architecture, in which all M&D
systems and knowledge sources revolve around a common data base of information, the
Blackboard. As the name suggests, it is a mechanism where different knowledge
representations of the plant exist and are available for any application. When a
knowledge source executes, it reads the Blackboard for its necessary inputs, processes
them, and puts the results back on the Blackboard. Other knowledge sources, in turn,
use the newly acquired information to further enhance the current model of the plant.
The Blackboard architecture consists of the Blackboard itself, a Control Mechanism,
Knowledge Sources, and a Graphical User Interface.
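A toy sketch of the blackboard pattern follows; the entry names and the two knowledge sources are hypothetical and do not correspond to the actual ALLY™ knowledge sources. Each knowledge source reads its inputs from the shared blackboard and posts its results back for the others to use, under a simple control mechanism.

```python
blackboard = {"vibration_rms": 4.1, "bearing_temp": 92.0}    # shared data base of plant facts

def vibration_ks(bb):
    """Knowledge source: turns a raw vibration level into a qualitative assessment."""
    if "vibration_rms" in bb and "vibration_state" not in bb:
        bb["vibration_state"] = "high" if bb["vibration_rms"] > 3.0 else "normal"
        return True
    return False

def bearing_ks(bb):
    """Knowledge source: combines earlier results into a diagnostic conclusion."""
    if bb.get("vibration_state") == "high" and bb.get("bearing_temp", 0) > 80.0:
        if "diagnosis" not in bb:
            bb["diagnosis"] = "possible bearing degradation"
            return True
    return False

def control_mechanism(bb, knowledge_sources):
    """Keep invoking knowledge sources while any of them contributes new information."""
    while any(ks(bb) for ks in knowledge_sources):
        pass
    return bb

print(control_mechanism(blackboard, [vibration_ks, bearing_ks]))
```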
Along with time-based functionality, the system has been designed with high performance and fast execution in mind. This is the means by which the end users are provided
with time-relevant information. While most other commercial knowledge sources had
their beginnings in Lisp versions and were then converted to C, the ALLY™ plant monitoring
and diagnostic system has been written exclusively in C and C++ (an object oriented
programming language) with performance as a primary objective. Only standard UNIX
system calls, sockets for inter-processor communication, shared memory for intra-processor communications, and signals for interrupts are used. ALLY™ is currently implemented on both Sun SPARC and Hewlett Packard 9000 UNIX platforms. Requirements
are: 32 Mb of RAM, 100 Mb swap space, and 600 Mb hard disk. As a benchmark, the
HP version of the rule-based knowledge source has been measured to fire over 8000
simple rules per second.
JET-X is a PC-based expert system, developed by GE, for support of the TF34-GE-100
engine on the A-10 aircraft. JET-X was designed for use with the existing TF34 Ground
Station GEMS IV. GEMS provides for the display of trend and event data acquired by
the on-board Turbine Engine Monitoring System (TEMS). Both GEMS and TEMS are
capable of generating alarms that may indicate engine problems. Maintenance personnel
are expected to respond to these alarms by diagnosing potential problems and
performing appropriate maintenance actions. JET-X's principal role is to capture the
diagnostic procedures in an interactive environment that does not limit the size or
complexity of the decision trees used (Doel, 1990).
HELIX (Helicopter Integrated Expert) is a helicopter real-time diagnostic expert system
using a causal model of the helicopter's engines, transmission, flight control and rotors.
At the heart of the HELIX program is a Qualitative Reasoning System (QRS). The QRS
is a general mechanism to support the creation of hierarchical device models and reasoning about device behavior using qualitative physics. The HELIX qualitative model is represented as a set of constraints that define the normal behavior of the engines, transmission, flight control, and rotors of the helicopter. Aircraft health is assessed by
determining whether observations (sensor readings and pilot control inputs) are
consistent with the constraints of the model. If an inconsistency is detected, a process of
systematic constraint suspension is used to test various failure hypotheses.
Critical to the efficient operation of the HELIX program is the hierarchical model representation, which enables reasoning at various levels of abstraction. Using a top-down
approach, the diagnostic process exploits the hierarchy by beginning fault isolation with
the most reduced form of the model. To refine the diagnosis, a branch of the hierarchy
may be expanded until a component-level diagnosis is made. The hierarchy also greatly
reduces the complexity of multiple failure diagnosis. Rather than considering combinations of failures in all leaf components, the diagnosis can be restricted to combinations of
branches in the hierarchy.
HELIX has been successfully tested on a variety of simulated failures. By representing
only the normal behavior of the helicopter and testing hypotheses by constraint suspension, HELIX has been able to diagnose single or multiple failures without prior knowledge of failure modes. The approach represents a promising technique for automating
the qualitative reasoning required to diagnose novel failures and may form the basis for
extensive automation both in airborne and ground-based diagnostic systems (Hamilton,
1988).
ENGEXP is an integrated environment, in PC TurboPascal, for the development and application of expert systems for the quick fault diagnosis and repair assistance of
equipment and engines. The main components of ENGEXP are the user interface, the
expert system core, the feedback and knowledge acquisition tool, and a help facility for
the user. As usual, the expert system core contains the inference engine, the knowledge
base(s) and the explanation subsystem. A particular feature of the inference engine is that
it can resolve situations with multiple causes. The knowledge is organized into three
layers, as is done by the human experts. The qualitative analysis of the results obtained in
a large sample of applications showed a remarkable performance, while the quantitative
analysis showed a success rate over 95%. The response time is very good (always ≤ 2
sec., in many cases ≤ 0.5 sec.) despite the overhead caused by the user-friendly user
interface (Tzafestas and Konstantinidis, 1992).
FAKS is an expert system that produces diagnosis and recommendation reports on board ship for the Wartsila Diesel VASA series of ship engines. As the first step in the diagnosing process FAKS creates two mathematical models, one of the "Ideal Engine" based on laboratory data and one of the "Real Engine" based on data from the Diesel engine monitoring system. Every 15 minutes a comparison between these two models is made and the result is "flashed out" through the entire knowledge base. Diagnosis is the result of the automatic continuous evaluation of the engine and contributes to safe, continuous and economic operation. Good possibilities are also given to improve the safety at sea (Ahlqvist, 1990).
SCAR and SCAR-2 are rule-based systems for assisting electricians in quickly diagnosing faults in a shuttle car. A shuttle car is a vehicle used to transport coal from the working face, where mining takes place, to the primary haulage system, such as a conveyor belt. The shuttle car is a key element in the mining cycle, since no coal can be mined if a car is not available. Insight 2+, a microcomputer-based expert system development tool by Level Five Research Inc., was used to create the system. The program requires the user to specify the initial symptoms of the failed machine, and the most probable cause of failure is traced through the knowledge base, with the software requesting additional information, such as voltage or resistance measurements, as needed. A causal-reasoning approach was used to develop the production rules. Generalized systematic procedures for creating and organizing the knowledge base, the incorporation of on-screen presentations of the shuttle car circuit schematic, the development of reference and tutorial programs and a microcomputer implementation that resists moisture, oil and dust are some features of the more advanced SCAR-2 system (Novak et al., 1989).
HEPHAESTOS is an interactive expert system for quick fault diagnosis of electric machines. Electric machines play a very important part in furnishing power for all types of domestic and industrial applications. Any fault in an electric machine installed in a production line results in off-line repair, which disturbs the line production process. HEPHAESTOS is a rule-based expert system with efficient and quick reasoning, containing as much knowledge as possible concerning fault occurrence, repair and maintenance of electric machines. The knowledge required to build this expert system has been acquired from expert engineers working for years at workshops of the Public Power Corporation of Greece, from construction companies, and from many instruction manuals, maintenance handbooks and trouble charts. One year of real-life tests of the system has been performed. The knowledge base can be expanded according to the comments of the users (Protopapas et al., 1990).
IDM (Integrated Diagnostic Model) is a hybrid expert system for real-time fault diagnosis and repair assistance of mechanical and electrical devices which integrates shallow and deep knowledge. The shallow knowledge is stored in the experiential knowledge base and the deep knowledge is stored in the physical knowledge base. The physical knowledge base contains functional models of subsystems. Each of these two types of knowledge is structured and represented in a way that is natural for that type of knowledge. Each of the two knowledge bases has its own inference engine. Thus, each is an expert system in its own right. These two expert systems are then integrated into a single expert system via an executor that can draw on either expert system in diagnosing a particular problem. The executor contains its own knowledge base that maintains a global representation of what is known and unknown so far about the device under diagnosis. This knowledge base also provides a means of "translating" the knowledge between the experiential and fundamental knowledge bases. The result is an expert system that can solve problems even when no experiential knowledge exists to handle the situation, resulting in a more graceful degradation of capabilities at the periphery of its knowledge. It can also provide different levels of explanation. Two prototype systems have been
implemented, the first for fault diagnosis in a simple thermostat-controlled gas heating system, the second for fault diagnosis in the electrical system of an automobile. Recent work on the IDM involves extending the implementation to handle truly analog devices (Fink and Lusth, 1987).
Finally, DESPLATE is an expert system designed by Ng et al., (1990), to diagnose abnormal plan view shapes of steel plates in a plate mill. A brief overview of the production process in the plate mill and the problems that motivated the development of DESPLATE are provided. The goal of DESPLATE is to locate the possible causes, such as electrical failures, mechanical breakdowns or wear, operational errors, and pre-rolling conditions, for a particular abnormal shape, and to suggest appropriate remedies. Some of the issues that arose in developing DESPLATE are addressed, including the use of graphics for the user interface, forward and backward chaining techniques, and knowledge acquisition methodologies involving multiple experts from different disciplinary backgrounds.
4.3.4 Automatic expert fault diagnosis for machine tools, robots and CIM systems
objects specified by features into a set of different subsets, each with special importance. In the case of complex systems, one should use all possible information about the system. This is primary information from measurements and expert consultation and secondary information that is a result of signal processing (FFT, regression analysis, statistical aids, parameter estimation and so on). All information together is called features.
[Figure: structure of automatic monitoring: the PROCESS is observed through sensors, the MONITORING function reports to the OPERATOR and to a higher level.]
The capabilities of classification are fundamental ordering and structure building. In the structure building step one gets a cluster configuration. Clusters found by cluster methods are of a natural type. Classical non-fuzzy clustering methods have the disadvantage that the classifier yields only crisp membership values (0 or 1) and also a well-defined result for an object at a great distance from every class. This is very risky because there could be an unknown class. The fuzzy concept is then suitable for the description of overlapping classes and is able to refuse objects. This is possible because the membership of every class in such a case is very small. The essence of the fuzzy view is that the membership value of a set element need not be only 0 or 1 but can lie between 0 and 1 (in diagnosis this is the grade of membership in the normal state or a specific failure class). In Computers in Industry, (1986), a wear fault diagnosis system of conveyor belt rolls based on fuzzy classification is described.
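As a minimal illustration of this fuzzy view (not taken from the cited system), the following Python sketch assigns graded memberships between 0 and 1 to a normal class and two hypothetical wear classes from distances to class prototypes, and refuses an object when every membership is small; the prototypes, the exponential membership form and the refusal threshold are all assumptions made for this example.

```python
import numpy as np

def fuzzy_memberships(x, prototypes, sigma=0.3):
    """Graded memberships in [0, 1] from distances to class prototypes; when
    every membership is small the object can be refused as belonging to an
    unknown class."""
    return {name: float(np.exp(-np.linalg.norm(x - p) ** 2 / (2 * sigma ** 2)))
            for name, p in prototypes.items()}

# Hypothetical two-feature prototypes for conveyor-belt roll wear classes
prototypes = {"normal":        np.array([0.10, 0.20]),
              "moderate_wear": np.array([0.60, 0.50]),
              "severe_wear":   np.array([0.90, 0.90])}

mu = fuzzy_memberships(np.array([0.55, 0.45]), prototypes)
if max(mu.values()) < 0.2:          # all memberships small: refuse the object
    print("unknown class")
else:
    print(max(mu, key=mu.get), mu)
```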
Automatic monitoring is not limited to cell and plant monitoring as explained in the pre-
ceding section. In fact there are already some embedded monitoring functions in actual
machine tools such as:
A second type of detection which can be designed is the signal processing of the acoustic signal (see Chapters 1 and 6). It has been shown that it is relatively easy to analyze the cross-correlation existing between the frequency of the signal due to the turning of the parts and the cutting force measurement. The conclusions of such an analysis can be a basis for implementing an adaptive algorithm to improve the quality of machining, since the forming of the parts' turnings is an important parameter (Sahraoui, 1987; Neumann, 1990).
Tool wear can also be detected through power consumption using the empirical formula,
kWb = T · P
where kWb is the power required, T is the material removal rate (in 10⁻⁶ m³) and P is the amount of removed material per unit time.
The monitoring of sequential steps can be ensured by using a Petri net-based specification. Three modules M1, M2 and M3 can be envisaged: M1 to monitor the initialization of the machine, M2 for monitoring during operations and M3 for exception handling.
Faults in the machine tool motor (load, electric components) can be detected via pattern recognition and the spectrum of the current (see also Section 1.3.2). The quotient of amplitudes at the odd harmonics of 50 Hz in the current spectrum shows the form of the magnetic field. Changes in load or geometric changes in the machine influence the waveform of the magnetic field. Furthermore, the current amplitudes give some information about the momentary load and position in the working cycle. Faults in other mechanical components (gear box, belt drive, eccentric) can be detected by evaluation of rotation speed and parameter estimation (see Chapter 3).
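A small sketch of this kind of spectral check is given below, assuming a sampled motor-current signal and a 50 Hz supply; the chosen harmonic orders, window and synthetic test signal are illustrative and not taken from the text.

```python
import numpy as np

def odd_harmonic_quotients(current, fs, f0=50.0, orders=(3, 5, 7)):
    """Amplitudes of the current spectrum at odd harmonics of f0, returned as
    quotients A(k*f0)/A(f0); changes in these quotients hint at load or
    geometric changes in the machine."""
    n = len(current)
    spectrum = np.abs(np.fft.rfft(current * np.hanning(n))) / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    amp = lambda f: spectrum[np.argmin(np.abs(freqs - f))]
    a1 = amp(f0)
    return {k: amp(k * f0) / a1 for k in orders}

# Synthetic test signal: 50 Hz fundamental plus a small 3rd harmonic and noise
fs = 5000.0
t = np.arange(0, 1.0, 1.0 / fs)
i_meas = (np.sin(2 * np.pi * 50 * t) + 0.05 * np.sin(2 * np.pi * 150 * t)
          + 0.01 * np.random.randn(len(t)))
print(odd_harmonic_quotients(i_meas, fs))
```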
A rule-based system can be designed to detect on-line tool wear and machine malfunction from such information. The action on the machine tool is to correct the tool in advance or to invoke an emergency shutdown. These issues and implementation aspects are fully discussed by Arreguy et al. (1990) and Monostori et al. (1990).
Freyermuth, (1991), introduced a computer-assisted incipient fault diagnosis system for industrial robots, with the objective to detect and diagnose faults in the mechanical part of the devices at a relatively early stage to prevent subsequent damage. For this purpose a suitable combination of analytical and heuristic tools was developed. The analytical symptom generation procedure comprises the detailed mathematical modeling of the robot's different axes. The parameters of the models directly represent characteristic physical quantities (process coefficients) which are identified by specific continuous-time parameter estimation algorithms. The estimated quantities then undergo a statistical evaluation. Deviations of coefficients from nominal values are considered as symptoms (see also Chapter 3). The subsequent heuristic evaluation processes these symptoms based on specific fault-symptom trees and knowledge of fault statistics and process history, using a specific inference mechanism (see also Appendix 4.A). The developed diagnosis system can be realized on a PC with a process interface.
Pouliezos and Stavrakakis, (1989), proposed a fault detection mechanism for a robot/controller combination to achieve optimum robot performance at all times. A detection mechanism based on logarithmic likelihood ratios combined with parameter estimation through RLS with a forgetting factor, making it essentially a moving window estimator, has been shown to work quite satisfactorily (see Chapters 2 and 3). Fault symptom trees can then be elaborated to build an ES fault diagnosis tool as previously (see Appendix 4.A).
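A compact sketch of the general idea follows (not the authors' implementation): recursive least squares with a forgetting factor acts as the moving-window estimator, and a log-likelihood-ratio-type test on the recent residuals flags a change; the regressor structure, forgetting factor, window length, nominal noise level and threshold below are all assumed for illustration.

```python
import numpy as np

class RLSForgetting:
    """Recursive least squares with forgetting factor lam: old data are
    discounted geometrically, so the estimator behaves like a moving window."""
    def __init__(self, n, lam=0.98):
        self.theta = np.zeros(n)
        self.P = 1e3 * np.eye(n)
        self.lam = lam

    def update(self, phi, y):
        e = y - phi @ self.theta                        # prediction residual
        k = self.P @ phi / (self.lam + phi @ self.P @ phi)
        self.theta += k * e
        self.P = (self.P - np.outer(k, phi @ self.P)) / self.lam
        return e

def log_likelihood_ratio(residuals, sigma0):
    """Log-likelihood ratio of 'residual variance has grown' against the
    nominal variance sigma0**2, assuming Gaussian residuals."""
    r = np.asarray(residuals)
    s1 = max(r.var(), 1e-12)
    return 0.5 * len(r) * (np.log(sigma0**2 / s1) + s1 / sigma0**2 - 1.0)

# Illustrative run on synthetic data y = phi @ theta_true + noise, with a
# parameter jump halfway through standing in for a developing fault.
rng = np.random.default_rng(0)
theta_true = np.array([2.0, 0.5, 0.1])
est, window, sigma0, threshold = RLSForgetting(3), [], 0.05, 10.0
for step in range(400):
    if step == 200:
        theta_true[0] *= 1.5                            # simulated fault
    phi = rng.normal(size=3)
    y = phi @ theta_true + 0.05 * rng.normal()
    window = (window + [est.update(phi, y)])[-50:]
    if len(window) == 50 and log_likelihood_ratio(window, sigma0) > threshold:
        print("fault symptom at sample", step)
        break
```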
An application example concerning the finding of cause(s) for an observed symptom in a CIM setting, using hybrid expert diagnosis procedures (see Appendix 4.A), is described by Lee, (1990). As shown in fig. 4.21, this CIM system consists of 3 workstations (i.e. machines), 1 main conveyor, 3 subconveyors, 2 material handling robots, 1 programmable controller, and 1 computer terminal. The three machines are a CNC (Computer Numerical Control) machine, a GCA robot and a General Electric P50 robot. The two material handling robots are a Unimation PUMA 761 and a General Electric MH33.
[Figure 4.21: Layout of the CIM system, showing the computer terminal, programmable controller, main conveyor, Conveyors 1-3, the CNC machine, the GCA and P50 robots, and the PUMA 761 and MH33 material handling robots.]
As the raw material comes in on Conveyor 1 (in which there is no pallet), the PUMA picks it up and places it on the CNC. The CNC processes the raw material into the specific pattern by performing the turning operation. When the CNC finishes its operation, the PUMA picks it up and places it on a pallet on the main conveyor. The pallet arrives at the GCA on the main conveyor and is first identified as type x or y depending on its
bar code, which is read by the laser bar code reader located at the GCA. If the pallet is of type x, then the GCA does operation A and releases the pallet x onto the main conveyor. Similarly, if the pallet type is y, then operation B is performed. At the P50, operation C is performed if the pallet is of type y, and no operation if it is of type x. When the processed material (or part) arrives at the MH33, it is unloaded from the main conveyor to Conveyor 2 if it is of type x, and to Conveyor 3 if it is of type y. Finally, empty pallets travel to the CNC (at which point the cycle begins) via the main conveyor.
For the purpose of the diagnosis model, the CIM system is modeled in the way shown in fig. 4.21. Since there are 9 terminal nodes in the functional hierarchy, there are 9 shallow KBs. One of these shallow KBs is shown in the lower half of fig. 4.22 for the GCA. Every node in this shallow KB is represented by a pseudo-rule, defined as a rule of which the consequent part is not explicitly specified. In this sense, this shallow KB is a pseudo-shallow KB. To be complete, one has to specify an antecedent and a consequent. For example,
Referring to the strategy outlined in Appendix 4.A, it will be explained how it works through use of fig. 4.21, which shows both the deep KB and the shallow KB.
Suppose that one (or a sensor) observes a symptom indicating that a part is not being processed on a machine. This leads to the deep KB starting at F00 (however, if there is a good mapping heuristic, it leads to a specific node other than F00). At Stratum 0, using the breadth-first search, q11 (= 0.65) is greater than q12 (= 0.35). So Machines (F11) is the next destination. By the same search, one arrives at the terminal node GCA (F32) via Special-purpose Machines (F22). At this point, one moves on to the shallow KB attached to F32. That is, one is at R111, at which point the entropy calculation begins. To calculate the entropy at R211 (i.e. H211), the data for all child nodes at Echelon 3 must be known. H211 is given by,
H211 = -{w311 p311 ln p311 + w312 p312 ln p312 + w313 p313 ln p313}
     = -{(0.2)(0.2) ln(0.2) + (0.2)(0.4) ln(0.4) + (0.2)(0.4) ln(0.4)}
     = 0.2110
where,
w311 = (test cost of R311) / (maximum test cost in the shallow KB) = 0.2
and w312 and w313 are obtained similarly. In the same manner, H212 can be obtained (i.e. H212 = 0.9130).
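The calculation of H211 above can be reproduced in a few lines; the sketch below simply restates the weighted entropy used by the shallow-KB search, with the degrees of belief and normalized test costs quoted in this example.

```python
import math

def node_entropy(children):
    """Weighted entropy H = -sum(w * p * ln p) over the child rules of a node;
    w is a test cost normalized by the maximum test cost in the shallow KB."""
    return -sum(w * p * math.log(p) for w, p in children if p > 0)

# Child rules of R211 at Echelon 3: (normalized test cost w, degree of belief p)
children_R211 = [(0.2, 0.2), (0.2, 0.4), (0.2, 0.4)]
print(round(node_entropy(children_R211), 4))   # 0.211, as computed above
```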
Thus the decision can be made at R111. Since H211 is less than H212, the search moves down to R211. This procedure is repeated until the terminal node is reached. The results of the entropy calculations (and cost/belief ratios if there is a tie in entropy values) are shown in Table 4.4, where (a) is for the R211 group and (b) is for the R212 group. Note that the entropy for a terminal node is zero, because the terminal node has no children. From Table 4.4, the first terminal node to be tested is R313. If it turns out that there is no bar code on the pallet after the test, then this pallet is faulty. The diagnostic session stops at this point. However, if the problem still exists although this pallet is already taken care of, the observed symptom may be due to multiple faults. In this case, the diagnostic session resumes from R311, which has the next lowest entropy (actually the cost/belief ratio) at the same echelon. If the test result for R313 indicates no faults, the diagnostic session also resumes from R311. Should the entire shallow KB contain no faults, the diagnostic procedure has to go back to the terminal node of the deep KB (i.e. F32) and update q32.
Table 4.4 Entropy values and cost/belief ratios: (a) R211 group, (b) R212 group.
(a) Node (R)   Entropy (H)   Ratio (r)      (b) Node (R)   Entropy (H)   Ratio (r)
    211        0.2110                           212        0.9130
    311        0.0000        1.0                321        0.0000        2.0
    312        0.3576                           322        0.2220
    313        0.0000        0.5                323        0.0000        3.3
    421        0.0000        1.5                324        0.0000        2.7
    422        0.0000        1.0                451        0.0000        0.9
    423        0.0000        0.5                452        0.0000        0.7
By the updating heuristic in the foregoing section, one will find the parent node of F32 (where b = 3 and c = 2) as shown in Table 4.5 (see Appendix 4.A).
The second subscript GN(2) is 2. Therefore the parent node is F22. Also,
EN(3) = c - (cumulative N before GN(2)) = 2 - 1 = 1
Based on this result, it follows that q'32 = 0.
Node F2j   N2j   Cumulative N
F21        1     1    (1 < c = 2)
F22        2     3    (3 > c = 2)
F23        4     7
F24        2     9
(since N(b-1),GN(b-1) = N22 = 2 and t ≠ EN(b) = EN(3) = 1, t = 2). Also, considering k ≠ c = 2,
k = [c - EN(b)] + t = (2 - 1) + 2 = 3
q'33 = q33/(1 - q32) = 0.25/(1 - 0.75) = 1.0
At Stratum 2, the updated responsibility probability for the parent node F22 becomes,
q'22 = q22 - q22 × (q32 - q'32)
     = 0.75 - 0.75 × (0.75 - 0) = 0.1875.
By applying the same methods, the parent node of F22 is F11. The updating for F21 is,
q'21 = q21 + [q21/(1 - q22)] × q22 × (q32 - q'32)
     = 0.25 + {0.25/(1 - 0.75)} × 0.75 × (0.75 - 0) = 0.8125
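The three updating steps just illustrated can be condensed into a short sketch (illustrative only, using the q values quoted above; it is not the implementation of Lee (1990)):

```python
def update_after_innocent_child(q_parent, q_siblings, q_child):
    """Set the exonerated child's responsibility to zero, renormalize its
    siblings, and reduce the parent's responsibility by the same amount."""
    q_child_new = 0.0
    siblings_new = {name: q / (1.0 - q_child) for name, q in q_siblings.items()}
    q_parent_new = q_parent - q_parent * (q_child - q_child_new)
    return q_child_new, siblings_new, q_parent_new

# Values quoted above: child F32 (q32 = 0.75) of F22 (q22 = 0.75) is innocent;
# its sibling F33 has q33 = 0.25.
q32_new, siblings, q22_new = update_after_innocent_child(0.75, {"F33": 0.25}, 0.75)
print(q32_new, siblings["F33"], round(q22_new, 4))      # 0.0 1.0 0.1875

# The reduction propagates one stratum up: F21 (q21 = 0.25) becomes more suspect.
q21, q22, q32 = 0.25, 0.75, 0.75
q21_new = q21 + (q21 / (1.0 - q22)) * q22 * (q32 - q32_new)
print(round(q21_new, 4))                                # 0.8125
```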
Figure 4.23 Updated probabilities in deep KB for the CIM system diagnosis.
4.4 Conclusions
Methodologies for modeling failure knowledge about complex industrial systems and for utilizing these models for real-time reasoning about, and diagnosing, failures have been presented in this chapter. Their key component is the modular framework used for expert knowledge-based failure analysis. This modularity permits easy implementation of fault diagnosis programs, based on this framework, for different industrial systems, since the only domain-dependent components that must be replaced are the specific failure models.
The necessary theoretical background regarding the more recent methods is given in the Appendices of the Chapter for the interested reader. Many application examples from a wide range of industrial practice have been cited, in sufficient detail to give the reader the ability to develop his/her own applications.
In parallel with the development of automatic expert diagnosis and knowledge-based decision support systems (DSS), concern has been growing regarding the potential for catastrophic errors created by these systems and, what is worse, the potential for catastrophes whose causes cannot be established. Concern for the risks associated with expert systems is now so strong that it has spilled over into public discussion, such as a British television program in which American and British practitioners and critics argued the dangers of using expert systems in medical, industrial, military and other applications. There have also been calls for restrictions on the deployment of unsupervised or autonomous systems in safety-critical situations (The Boden Report, 1989).
Additional issues for the designers of ES, whether these are statistical, knowledge-based
or hybrid systems, fall into two main categories: performance and responsibility.
Performance issues.
1. The decision procedure used by the ES must perform well (make or recommend good decisions), even in the face of degraded data. Robustness entails being able to assess the reliability of information sources and to seek alternatives where necessary, as well as merely coping with uncertainty.
2. Few practical situations involve just one class of decision (e.g. diagnosis); ES theory must surely address the problem of deciding what decision is required.
3. Many practical automata must face rapidly changing situations, not only in the information available but also in the problem that needs to be solved. Expert systems must incorporate capabilities for altering their decision goals as circumstances develop.
The central requirements for meeting these demands include the ability of an ES to be rationally flexible, i.e.:
1. Recognize that a decision is needed.
2. Identify the kind of decision it is.
3. Establish a strategy for making it.
4. Formulate the decision options.
5. Revise any or all of the above in the light of new information.
An expert decision system should be capable of autonomously invoking and scheduling
these processes as circumstances demand. Classical ES theory offers little guidance for
developing the necessary techniques.
Responsibility issues.
One must also achieve a high level of communication between human supervisors and/or auditors wishing to examine and potentially to intervene in any aspect of the automatic expert decision process.
1. If the automatic expert decisions lead to errors, it must be possible to establish the reasons for those errors.
2. Where it is practical and appropriate, provision should be made for a skilled supervisor to exercise overriding control.
In general an automatic expert decision maker needs to be able to reflect on the decision procedure, to be able to examine the:
3. Decision options (what choices exist).
4. Data (the information available that is potentially relevant to a choice).
5. Assumptions (about viability of options, reliability of data etc.).
6. Conclusions (in light of data and knowledge of the setting).
Reflective capabilities should extend to the decision process itself, including:
7. The goals of the decision (what is the decision supposed to achieve).
8. The methods being pursued (what justifies the current strategy).
9. Characteristics of specific procedures (applicability conditions, reliability, completeness etc.).
A theory of expert rational decision making must acknowledge these requirements. Classical expert decision procedures may be optimal in the sense that they promise to maximize the expected benefits to the decision maker, but they must be viewed as unsatisfactory in other ways. It seems that connectionist models are drawing increasing interest as useful tools for expert decision making which can accomplish the above requirements. These methods allow an automatic, clever combination of different types of knowledge bases for a specific application, thereby reducing the effort required for an equivalent expert system development. The connectionist models also resemble the type of fine-grained parallelism of conventional symbol processing. The theory and practice of their application to robust real-time fault diagnosis are given in Chapter 5.
Expert systems can be written in conventional languages like C, FORTRAN or Pascal, which have the advantage of being widely known and of being the language in which other programs are written, so that no integration problems occur. However, they are poorly adapted to the expression and the handling of knowledge expressed by words. Languages specific to artificial intelligence, like Lisp or Prolog, do not have this disadvantage, but they are not yet widely used, and their standards are not well established. All these languages have the common drawback that they lack programming aids. In contrast, expert system building tools, especially shells, have the advantage that the knowledge representation, the inference engine, and many other facilities, such as reports, windows, menus and forms, are largely preprogrammed, facilitating the
accomplishment of the above requirements. In the early days, these tools were written in special languages, but nowadays conventional languages are used more and more often.
References
Gruber T. R. and P.R. Cohen (1987). Design for acquisition: Principles of knowledge system design to facilitate knowledge acquisition. International Journal of Man-Machine Studies, 26, p. 143.
Gupta A., Forgy C. and A. Newell (1989). High-speed implementations of rule-based systems. ACM Transactions on Computer Systems, 7, 2, p. 119.
Hamilton T. P. (1988). HELIX: a helicopter diagnostic system based on qualitative physics. Artificial Intelligence in Engineering, 3, 3, p. 141.
Hickman F. R. et al. (1989). Analysis for knowledge-based systems. A practical guide to the KADS methodology. Ellis Horwood Ltd., West Sussex, England.
Hudlicka E. and V. Lesser (1987). Modelling and diagnosing problem-solving system behavior. IEEE Transactions on Systems, Man and Cybernetics, 17, 3, p. 407.
Isermann R. and B. Freyermuth (1991). Process fault diagnosis based on process model knowledge - Part I and Part II. ASME Journal of Dynamic Systems, Measurement and Control, 113, p. 621.
Johannsen G. and L. Alty (1991). Knowledge engineering for industrial expert systems. Automatica, 27, 1, p. 97.
Johnson L. E. and N.E. Johnson (1987). Knowledge elicitation involving teachback interviewing. In A. Kidd (Ed.), Knowledge Acquisition for Expert Systems: a practical Handbook, Plenum Press, New York, p. 91.
Kaiser G. E. et al. (1988). Database support for knowledge-based Engineering Environments. IEEE Expert, Summer 1988, p. 18.
de Kleer J. and B.C. Williams (1987). Diagnosing multiple faults. Artificial Intelligence, 32, p. 97.
de Kleer J. (1990). Using crude probability estimates to guide diagnosis. Artificial Intelligence, 45, p. 381.
Klein G. A., Calderwood R. and D. Mac Gregor (1989). Critical decision method for eliciting knowledge. IEEE Transactions on Systems, Man and Cybernetics, 19, p. 462.
Kuan K. K. and K. Warwick (1992). Real-time expert system for fault location on high-voltage underground distribution cables. IEE Proceedings-C, 139, 3, p. 235.
Lee W. Y., Alexander S. M. and J.H. Graham (1990). A hybrid approach to a generic diagnosis model. Proceedings, Fourth International Conference on Expert Systems in Production and Operations Management, May 14-16, Hilton Head Island, SC, p. 264.
Luger G. F. and W.A. Stubblefield (1989). Artificial Intelligence and the design of Expert Systems. Benjamin/Cummings, N.Y.
Roth E. M. and D.D. Woods (1989). Cognitive task analysis: an approach to knowledge acquisition for intelligent system design. In G. Guida and C. Tasso (Eds), Topics in Expert System Design, North Holland, Amsterdam.
Rouse W. B., Hammer J. M. and C.M. Lewis (1989). On capturing human skills and knowledge: algorithmic approaches to model identification. IEEE Transactions on Systems, Man and Cybernetics, 19, p. 558.
Sahraoui A. E. K. et al. (1987). Combining Petri nets and AI techniques for monitoring. Proceedings, IEEE Conference on Robotics and Automation, Raleigh, U.S.A., April 1987.
SIGART Newsletter (April 1989). Knowledge Acquisition Special Issue, 108.
Soumelidis A. and A. Edelmayer (1991). Modeling of complex systems for control and fault diagnostics: a knowledge based approach. In S. Tzafestas (Ed.), Engineering Systems with Intelligence, Kluwer Academic Publ., The Netherlands, p. 147.
Stavrakakis G. S. and E.N. Dialynas (1991). Efficient computer-based scheme for improving the reliability performance of power substations. International Journal of Systems Science, 22, 9, p. 1527.
Takahashi R. and Y. Maruyama (1987). Practical expression of exception to diagnosis. Bull. Research Laboratory for Nuclear Reactors, Tokyo Institute of Technology, 12, p. 50.
Tesch D. B. et al. (1990). A knowledge-based alarm processor for an energy management system. IEEE Transactions on Power Systems, 5, 1, p. 268.
Torasso P. and L. Console (1981). Diagnostic problem solving: Combining heuristic, approximate and causal reasoning. Van Nostrand Reinhold, N.Y.
Trave-Massuyes L., Missier A. and N. Piera (1990). Qualitative models for automatic control process supervision. IFAC 11th World Congress, Tallinn, Estonia, p. 211.
Tsoukalas L. H., Upadhyaya B. R. and N.T. Clapp (1991). Hypertext-based integration for nuclear plant maintenance and operation. Proceedings, AI91 Conference, September 15-18, 1991, Jackson Lake, Wyoming, U.S.A.
Tzafestas S. G. (1987). A look at the knowledge-based approach to system fault diagnosis and supervisory control. In S. Tzafestas et al. (Eds), System fault diagnosis, reliability and related knowledge-based approaches, D. Reidel Publishing Company.
Tzafestas S. G. (1989). Knowledge Engineering approach to system modeling, diagnosis, supervision and control. Syst. Anal. Model. Simul., 6, 1, p. 3.
Tzafestas S. G. (1991). Second Generation Diagnostic Expert Systems: Requirements, Architectures and Prospects. In R. Isermann and B. Freyermuth (Eds.), Fault detection, supervision and safety for technical processes, IFAC Symposia Series, No. 6, 1992.
The diagnosis model developed here consists of a single Deep Knowledge Base (DKB)
and a number of Shallow Knowledge Bases (SKBs) attached to each of the terminal
nodes in the DKB. The DKB is constructed by viewing the whole system under consid-
eration as a hierarchical system. The SKB is also organized hierarchically, based on the
level of decision complexity. The reasoning process employed is a D-S type of hybrid
reasoning as shown in fig. 4.24.
[Figure 4.24: Hybrid reasoning scheme.]
every node in the deep KB and, in turn, "no test" implies "no decision-making". For this reason, every node is connected downwards by a one-way arrow. Except for F00, which represents the given system as a whole, it is assumed that functional blocks are independent of each other within each stratum. Every terminal node has its own shallow KB. With the exception of F00, a functional block Fij is defined as follows:
• i indicates the stratum number, i = 1, 2, ..., v
• j indicates the "overall" element number, j = 1, 2, ..., n
where v and n are arbitrary integers. This definition of a functional block is shown in fig. 4.25.
Let fij denote the failure probability of a node Fij. An fij value is obtained from a historical failure data set for the system under consideration. However, one can convert fij to qij, termed a responsibility probability, that is, a probability that Fij will be responsible for causing the symptom. Since the fij value represents an absolute probability from the historical data, the summation of failure probabilities of all the children nodes of any node may not be unity. Thus, for convenience in probability updating (explained later in this chapter), this sum is forced to unity by normalization of fij into qij:
Σ(j∈K) qij = 1   for i = 1, 2, ..., v and j = 1, 2, ..., n
where K is the set of children nodes which have the same parent node.
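A one-line normalization of this kind might look as follows (a sketch; the failure probabilities of the sibling group are invented):

```python
def responsibility_probabilities(failure_probs):
    """Normalize the historical failure probabilities fij of a sibling group so
    that the resulting responsibility probabilities qij sum to one."""
    total = sum(failure_probs.values())
    return {node: f / total for node, f in failure_probs.items()}

# Hypothetical sibling group under one parent node of the deep KB
print(responsibility_probabilities({"F21": 0.02, "F22": 0.06}))
# {'F21': 0.25, 'F22': 0.75}
```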
[Figure 4.25: Hierarchy of functional blocks in the deep KB. Stratum 0 holds F00; Stratum 1 holds F11, ..., F1n; Stratum 2 holds F21, F22, ..., F2n; and so on down to Stratum v (Fv1, ...).]
Shallow knowledge base. A shallow KB, attached to each of the terminal nodes in the deep KB, is organized in multiechelon form because every rule involves its own decision-making. This decides whether or not the rule is responsible for the observed symptom after testing the consequent of the rule. For this reason, every rule is connected by a
two-way arrow. Each level is called an echelon, as shown in fig. 4.26. Unlike Fij, a rule Rijk is defined as follows:
• i indicates echelon i; i = 1, 2, ..., e.
• j indicates group j; j = 1, 2, ..., s.
where s is the total number of elements in echelon i-1, and k indicates element k; k = 1, 2, ..., t; e, s, and t being arbitrary integers.
In the shallow KB, two attributes are associated with every rule Rijk at and below echelon 3 (i.e. Rijk where i ≥ 3, and j and k are arbitrary). One is pijk, defined as a degree of belief, e.g. obtained from the opinions of experts, which acts as a probability that Rijk is believed to be responsible for the observed symptom. Furthermore, similar to that of qij in the deep KB, pijk is designed to have the following property:
Σ(k=1..t) pijk = 1
[Figure 4.26: Multiechelon organization of a shallow KB. Rules Rijk are arranged from Echelon 1 down to Echelon e (e.g. R311, ..., R31t, R321, ..., R32t at Echelon 3).]
In other words, at and below echelon 3, the sum of the degrees of belief of all the elements within any arbitrary jth group is unity. The other attribute is cijk, defined as a test cost (in dollars), which accounts for the cost of testing the consequent part of Rijk.
which the entropic measure applied here is based. Given the degrees of belief and the test costs, the entropy of a rule Rijk is defined as follows:
Hijk = Hijk(w, p) = -Σ(x=1..t) w(i+1,k,x) p(i+1,k,x) ln p(i+1,k,x),   i ≥ 2
where,
Σ(x=1..t) p(i+1,k,x) = 1.
Here j just indicates the group number from the immediately higher echelon and x denotes the element k from echelon i+1. Moreover, note that the two quantities are not commensurate, i.e. cijk is in dollars and pijk is non-dimensional. For this reason, cijk is normalized as follows:
wijk = cijk / max(i,j,k) {cijk}
So, the tie is broken by selecting the lowest ratio which gives the lowest test cost. A tie
at other rules can be broken "arbitrarily" since these rules do not directly involve the
testing (i.e. the entropy of each of these rules is not zero). The overall feature of the
search in the shallow KB is similar to the best first search because the search is per-
formed in increasing order of entropy.
In Step 8, if, after going through the shallow KB, a fault cannot be found, one goes back to the deep KB and updates qij. The next subsection describes an updating heuristic (Lee et al., 1990).
Heuristic for updating probabilities. Since the search path in the shallow KB did not yield any faults, the responsibility probability of the terminal node (called Fbc) which
owns this shallow KB changes from the current value to zero. This enables one to update the probabilities in the deep KB.
In the light of the test results, the updated responsibility probability of Fbc, q'bc, becomes 0. The ripple effect of this updating spreads out to the rest of the elements within K, that is, the set whose elements have the same parent node. This K can be interpreted as a group. Then, in this fashion, the updating is propagated upwards through the parent nodes until the root is reached. In the following this updating heuristic is described.
To apply the updating heuristic, the number of elements of each of the nodes (i.e., how many children each node has) in the deep KB must be known. So let,
N00 = number of elements that F00 owns
Nij = number of elements that Fij owns, where i = 1, 2, ..., and j = 1, 2, ..., n.
These data can be obtained when the hierarchy of the deep KB is constructed.
Finding the parent node: First, the last element (called Fb-1,g) at Stratum b-1 is searched, by summing Nb-2,j over all j. That is,
for t = 1, 2, ..., Nb-1,GN(b-1) and t ≠ EN(b). Here, q'bk is proportionally increased, depending on its original weight (i.e. qbk), and thus normalized.
Updating at Stratum b-1: The parent node of Fbc will be less responsible for the observed symptom since one of its children turned out to be innocent. Thus the heuristic equation that represents this amount of reduction in responsibility (probability) is given below:
q'(b-1),GN(b-1) = q(b-1),GN(b-1) - q(b-1),GN(b-1) × (qbc - q'bc).
Apply the methods described above to determine the parent node and the element number at Stratum b-1. Let this parent node be Fb-2,GN(b-2) and this element number be EN(b-1).
Since one of the elements is innocent, the rest of the elements at Stratum b-1 will be more suspect. That is, the amount of reduction used in updating the parent node of Fbc will be proportionally distributed to the rest of the elements at Stratum b-1 with positive sign. Thus the following equation is given:
where k = [GN(b-1) - EN(b-1)] + t for t = 1, 2, ..., Nb-2,GN(b-2) and t ≠ EN(b-1).
Updating at Stratum u (b-2 ≥ u ≥ 1): Similarly to the methods above, one has,
q'u,k = qu,k + [qu,k / (1 - qu,GN(u))] × qu,GN(u) × (qu+1,GN(u+1) - q'u+1,GN(u+1)),   k ≠ GN(u)
The developed diagnostic model is applicable to many domains. Once the hierarchies for both a deep KB and its dependent shallow KBs are constructed, the diagnostic strategy of fig. 4.27 can work.
The probability updating is simple and straightforward relative to other techniques, such as Bayes' Rule and Dempster's Rule.
Multiple faults can be handled by multiple diagnosis sessions. In addition, it is desirable to have a scheme to classify symptoms when multiple symptoms are observed.
The main difficulty is that it is not obvious how to delimit the level of abstraction in a deep KB (i.e. how many strata are needed). The number of strata in the deep KB can affect the size (in terms of the number of rules) of each shallow KB because, as the deep KB is specified in more detail, the degree of specificity of the shallow KBs might decrease. This, in turn, implies a smaller size of the shallow KBs. There may be a trade-off between a larger deep KB (with smaller shallow KBs) and a smaller deep KB (with larger shallow KBs).
One of the critical factors with respect to the speed of diagnosis here is a mapping scheme which relates a symptom to a certain node of the deep KB. This node will serve as the start node in the entire diagnostic process. The lower the stratum of the start node, the faster the speed of diagnosis. Unless a mapping scheme is provided, the diagnostic process begins with the root node of the deep KB (see also Section 4.2.1.4).
Appendix 4.B Basic definitions of place/transition Petri nets and their use for on-line process failure diagnosis
i_S^T N = 0    (4.5)
With the help of these S-invariants a new principle for fault detection in complex systems can be formulated.
Supposing it is possible to map the structure of the total process as a pt net, the transport of the physical conservation quantity is represented by the firing of tokens. If the conservation quantity takes only a few discrete values and the signals measuring the number of tokens are not noisy, the process monitoring is easy: using eq. (4.4) it can be tested at each scanning time point whether the actual marking vector M(k), beginning from the initial marking M(0), is reachable. If M(k) is not reachable it can be concluded that an error has occurred. The algorithmic evaluation of this failure detection criterion is simple. Keeping in mind that the marking vector is integer valued, eq. (4.4) is therefore a linear diophantine equation system. It is sufficient to test the existence condition of (4.4) at each time step. Examples of systems governed by such well-defined, noise-free physical quantities are industrial production systems or automatic shunting yards.
In the case of plant fault monitoring, the measurement signals are noisy and their domain of definition is much larger than in the former case. Because of this, the simple evaluation of eq. (4.4) must fail.
Multiplying (4.4) with the transpose of the S-invariant and taking (4.5) into account yields:
i_S^T M(k) = i_S^T M(0)    (4.6)
For applications in power and other industrial plants, it is correct to assume that the net token flow across the envelope surface of the total process under consideration vanishes or is zero in the mean. Otherwise continuous plant operation is not possible. Moreover, it is assumed that the transitions fire without changing the number of tokens; in other words, the sum of arc-weights in front of and after a transition should be equal:
Σ(s∈*t) W(s,t) = Σ(s∈t*) W(t,s)    (4.7)
with *t := {s∈S : (s,t)∈F}, t* := {s∈S : (t,s)∈F}. Eq. (4.7) is a conservation law for the firing of tokens. Under both these conditions it is clear that an S-invariant exists which does not contain any elements other than 1, because each column sum vanishes. Such an S-invariant is called hereafter a covering S-invariant. Therefore equation (4.6) can be rearranged as
Σ(i=1..|S|) Mi(k) - Σ(i=1..|S|) Mi(0) = 0    (4.8)
The second sum in eq. (4.8) must be calculated only once at the initial time k=O.
Taking the noisy nature of the measurement values into consideration, a new fault crite-
rion for continuous total processes can be formulated:
Σ(j=1..|S|) [Mj(k) - Mj(0)] < ε    (4.9)
Eq. (4.9) is well suited for on-line process monitoring. The actual number of tokens per place, Mj(k), is compared with the initial token content of the total process. This is possible because the continuous total process is naturally an initial boundary problem, in contrast to the partial processes of the analytical redundancy methods. Therefore each slowly varying fault can be detected as soon as it surpasses the threshold ε. The size of ε depends on the sensor noise and can easily be determined in an initial learning period.
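A minimal sketch of this monitoring rule is shown below; the markings, initial token content and threshold ε are invented, and in practice ε would come from the learning period just mentioned.

```python
import numpy as np

def token_conservation_fault(M_k, M_0, eps):
    """Fault criterion in the spirit of eq. (4.9): compare the current total
    token count of the net with the initial one; a deviation larger than eps
    signals a sensor fault, a token source/sink, or a loss of zero-mean flow
    across the envelope."""
    return abs(float(np.sum(M_k) - np.sum(M_0))) > eps

M_0 = np.array([4.0, 2.0, 1.0, 3.0])   # initial marking (tokens per place)
eps = 0.5                              # threshold learned from the sensor noise
M_k = np.array([4.1, 1.9, 1.0, 2.2])   # a noisy later marking
print(token_conservation_fault(M_k, M_0, eps))   # True: about 0.8 tokens "lost"
```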
If eq. (4.9) does not hold, one of the following reasons must be true:
(i) A sensor fault has occurred, and one of the measured token numbers is erroneous;
(ii) Inside the total system a source or sink of tokens has arisen, which means the structure of the pt net has changed; or
(iii) The net token flow across the envelope surface of the total process is no longer zero mean, and the operation of the plant has become discontinuous.
Exactly which of these different faults has occurred cannot be recognized by using eq. (4.9); that is, fault location is not possible. For this to be done, more knowledge about the total process under consideration, in the form of quantitative or qualitative physical models, is needed. Details of the rule-based techniques and the way they can be combined with the present technique can be found in Section 4.2.1. It should be noted here that the process description in terms of pt nets is analogous to the state space formulation
M(k) = A M(k-1) + B u(k-1)    (4.10)
With the assumption that the state M is totally measurable and that it only represents physical quantities of the same kind (i.e. only masses or only temperatures), the state transition matrix A is equal to the identity matrix. Eliminating all previous time points k-1, k-2, ..., 1 in eq. (4.10) one gets,
M(k) = M(0) + B Σ(j=0..k-1) u(j)    (4.11)
This condition of observability (4.11) will be equivalent to the condition of reachability (4.4) if the sum of the control vectors in (4.11) is set to a vector v and the input matrix B is identified with the incidence matrix N. The analogy demonstrates that it is suitable to apply the Petri net description to problems of process fault monitoring. Moreover, the possibilities of the methods of Chapter 3 could be exploited through on-line monitoring of the evolution of the vector M(k).
Appendix 4.C Analytical expression for exception using fuzzy logic and its utilization for on-line exceptional events diagnosis
Definition of exception. Human thinking is characterized by the argument that the symbol (→) always appears in the universal proposition while the symbol (∧) occurs in the existential proposition. This can be represented formally by
∀x [P(x) → Q(x)]    (4.12)
∃x [P(x) ∧ S(x)]   (S(x) ≠ Q(x))    (4.13)
where P(x), Q(x) and S(x) are appropriate predicate functions. These equations give the interpretation that human thinking depends dominantly on the principle and simultaneously permits inconsistent remarks.
Conventional fuzzy diagnosis consists of two implications which take the form of eq. (4.12), viz., "For all j there exists a failure xi that correlates with a symptom yj, if yj is recognized" and "For all i there is a symptom yj that should be observed if xi appears". The sets of failures and symptoms X, Y are given in the form,
X = {x1, ..., xm},  Y = {y1, ..., yn}    (4.14)
where xi, yj are elements of X, Y respectively. Here the diagnosis system is given by the propositions,
Pj:  B(yj) → ∃xi (A(xi) and R(xi, yj));  j = 1, ..., n    (4.15)
Pi': A(xi) → ∃yj (B(yj));  i = 1, ..., m;  j = 1, ..., n    (4.16)
where the function A(xi) means that the failure xi appears, and B(yj) means that the symptom yj is being recognized. R(xi, yj) indicates the correlation between xi and yj.
The exception is defined by the logical form of eq. (4.13). The aphorism "Exceptio pro-
bat regulam" means actually that the exception can serve as an examiner of the rule. This
emphasizes that the exception has a large capability of testing rules in diagnosis. This
aphorism motivated Maruyama and Takahashi, (1985, 1987), to introduce the exception
into simply-structured diagnosis while neglecting a hierarchy on rules and/or a classifica-
tion of the failures. How should the exception be represented and utilized to reinforce
the diagnosis?
Suppose a fully-experienced engineer makes the comment "there can be special logic for finding a failure, while it is identified generally by such implications as eq. (4.12) or eqs. (4.15), (4.16)". When he is averse to using eq. (4.12), his logic might be transformed in the following manner:
¬(∀x [P(x) → Q(x)]) = ∃x ¬(¬P(x) ∨ Q(x)) = ∃x [P(x) ∧ ¬Q(x)]    (4.17)
where "¬" denotes negation.
[Figure: the universal proposition ∀x(P(x) → Q(x)) contrasted with the existential exception ∃x(R(x) ∧ S(x)) over the symptoms y1, ..., yn.]
For simplicity, the derivation of eq. (4.22) has been discussed from the binary logical point of view. Here, fuzzification of these equations shall be performed in accordance with fuzzy set theory, since recognition depends on the subjective tasks of a human. Defining the fuzzy sets on the spaces X, Y and X×Y:
f = (set of appearing failures)
s = (set of recognized symptoms)    (4.23)
e = (set of relations (xi, yj))
One can obtain fuzzy propositions Ai, Bj and Eij from A(xi), B(yj) and E(xi, yj) in the form,
Ai = (xi is f),  Bj = (yj is s),  Eij = ((xi, yj) is e)    (4.24)
where the truth values of these propositions are represented by the linguistic truth values of Ai, Bj and Eij. The linguistic truth values are fuzzy sets defined on the truth-value space. Substituting eq. (4.24) into eq. (4.22), the fuzzification of eq. (4.22) is written in the logical form,
Pe: ∃j [Bj ∧ ∃i (Ai ∧ Eij)];  i = 1, ..., M,  j = 1, ..., N    (4.25)
(4.27)
where ail, rijl and pijl stand for the lower bounds of the linguistic truth values Aia, Rija and Pija respectively, and bju is the upper bound of Bja. However, only when an engineer fails to find the failure by eqs. (4.26), (4.27) should the exceptional proposition of eq. (4.25) be utilized.
An effective technique for inferring the failure should be introduced to deal with exceptions in the actual situation. The cancellation law will be applied as a tool, since eq. (4.25) basically takes the form P ∧ Q. The cancellation law is written
P ∧ Q
-----    (4.28)
  Q
which is read practically as "If P ∧ Q is true then Q must be true". Defining the truth values of P, Q and P ∧ Q by p, q and t respectively, eq. (4.28) is transformed into the fuzzified expression,
(4.29)
(4.32)
qu = { 1,  pu = 1
     { 0,  pu < 1    (4.34)
qe = { te,        pe > t, pu = 1
     { (te, 1],   pe = t, pu = 1    (4.35)
     { 0,         otherwise
Diagnosis by utilization of exception. Assuming that the exception of eq. (4.35) holds, the truth value of "very true" is given by,
Bja = (bjl, 1]    (4.36)
and the failure, symptom and fuzzy relation are written,
Aia = (ail, aiu),  Bja = (bjl, bju),  Eija = (eijl, eiju)    (4.37)
then, substitution of eqs. (4.33)-(4.35) into eq. (4.25) generates the solution,
∨(i=1..m) (ai ∧ eij)u = { 1,  bju = 1
                        { 0,  bju < 1 ;   j = 1, ..., N    (4.38a)
(4.38b)
Equations (4.37) and (4.38) show that the truth value of the failure becomes meaningful only when Bj is sufficiently close to "completely true". Satisfaction of this condition allows one to obtain Ai by solving the inverse problem of the form (Pappis and Adamopoulos, 1992),
∨(i=1..m) (Ai ∧ Eij) = Bj ;   j = 1, ..., N    (4.39)
This can be illustrated conceptually in fig. 4.29. The failure is calculated from ∨(i=1..m)(Ai ∧ Eij) at the given smooth line of "true" Bj and the broken line of "very true" Bj, but there is no solution for the failure, since ∨(i=1..m)(Ai ∧ Eij) = 0 at the given broken line of "more or less true" Bj.
CHAPTER 5
FAULT DIAGNOSIS USING ANNs
5.1 Introduction
The theory and practical applications of Artificial Neural Networks (ANNs) are expanding at a very high rate, and the fields of application are increasing. It is not surprising, therefore, that fault diagnosis is one of the main areas in which ANNs have been used with promising results, along with similar progress in control and identification of non-linear dynamical systems.
Fault diagnosis using neural networks has the same structure as model-based methods: a set of signals carrying fault information is fed into a neural machine which outputs a vector fault signal indicating normal or faulty system operation. Thus, it can be seen that the main difference between the two approaches is in the diagnosis engine. The selection of the input set (called the training set), the neural machine and the output signal/classification method will be the central theme of this Chapter and will be examined in detail in subsequent sections.
ANN-based fault diagnosis is aimed at overcoming the shortcomings of model-based
techniques. These techniques require mathematical process models that represent the
real process satisfactorily. The model should not, however, be too complicated, because
calculations easily become very time-consuming. In methods relying on state-variable
estimation, state variables are seldom measurable and so nonmeasurable state variables
have to be estimated. For estimation, a non-linear dynamic process model must be lin-
earized around an operating point. This approach requires a relatively exact knowledge
of the parameters of the linearized or linear model. In addition the process must operate
near the point where linearization was done because the model is valid only in the neigh-
borhood of the operating point.
In fault detection based on parameter estimation the process model parameters have a complicated relationship to the physical process coefficients. Malfunctions usually affect the physical coefficients and the effect can also be seen in the process parameters. Because
not all physical process parameters are directly measurable, their changes are calculated via estimated process model parameters. The relationship between the model parameters and the physical coefficients should be unique and preferably exactly known. This is seldom the case.
The performance of model-based methods depends strongly on the usefulness of the model. The model must include every situation under study. It must be able to handle changes in the operating point. If the model fails, the whole diagnostic system fails. The sensitivity to modeling errors has become the key problem in the application of model-based methods. On the other hand, ANN methods do not require an analytical model of the process but need representative training data. The idea is that the operation of the process is classified according to measurement data. Formally this is a mapping from measurement space into decision space. Thus, data play a very important role in this method.
Model-based methods are usually computationally demanding, especially when nonlinear models are used. On-line modeling of the process requires even more computational expenditure. ANN methods are usually computationally easier, but the calculation task depends very much on the data and the actual problem.
Model-based methods are mostly difficult to change afterwards. The build-up of a model-based diagnostic system requires a lot of effort, and changing one equation easily leads to changes in many other equations or parameters. ANN methods are more flexible in the sense that changing the data means the same as changing the properties of the diagnostic system. On the other hand, changing the data may lead to repeating the diagnostic task all over again.
ANN-based fault diagnosis can be seen as a Pattern Recognition (PR) problem. Traditional pattern recognition and classification can be divided into three stages: measurements, feature extraction, and classification. First the appropriate data are measured. Then a feature vector is computed. The extraction should remove redundant information from the measurement data and create simple decision surfaces. Finally the feature vector is classified into one or more classes. When fault detection and diagnosis are combined, the classes are the following: normal operation, fault number 1, fault number 2, etc.
Traditionally, pattern recognition concentrates on finding the classification of features. The problem is how to calculate the features. There is no a priori basis for the choice of the calculation. It is difficult to know which of the features are essential and which are irrelevant. Inappropriate choices lead to the need for complex decision rules, whereas good choices result in simple rules (Himmelblau, 1978; Pao, 1989).
A human being has an amazing skill at recognizing patterns. A human often uses very complex logic in recognizing patterns and in classifying them. A human can pick up some examples of the classes but cannot determine any formal law for the classification. Mathematically this can be seen as an opaque mapping from pattern space into class-membership space. In computer pattern recognition, the opaque mapping has to be replaced by an explicitly described procedure - a transparent mapping (Pao, 1989).
Like human beings, neural networks are also trained with a group of examples. When a classification is realized with neural networks, the whole mapping from measurement space into decision space is done at the same time and the mapping is learned from training examples.
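As a minimal, self-contained illustration of such a learned mapping (not taken from any of the applications cited below), the sketch trains a tiny one-hidden-layer network in plain NumPy to map two measurements to three classes (normal, fault A, fault B); the architecture, training data and learning rate are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented training data: 2 measurements -> 3 classes (0: normal, 1: fault A, 2: fault B)
X = np.vstack([rng.normal([0.0, 0.0], 0.1, (100, 2)),
               rng.normal([1.0, 0.2], 0.1, (100, 2)),
               rng.normal([0.2, 1.0], 0.1, (100, 2))])
y = np.repeat([0, 1, 2], 100)
T = np.eye(3)[y]                                 # one-hot targets

# One hidden layer of sigmoid units, softmax output, trained by gradient descent
W1, b1 = rng.normal(0, 0.5, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.5, (8, 3)), np.zeros(3)
lr = 0.5
for epoch in range(2000):
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))     # hidden activations
    Z = H @ W2 + b2
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)            # class probabilities
    dZ = (P - T) / len(X)                        # cross-entropy gradient
    dH = dZ @ W2.T * H * (1 - H)
    W2 -= lr * (H.T @ dZ); b2 -= lr * dZ.sum(0)
    W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(0)

# Classify a new measurement vector
x_new = np.array([[0.95, 0.25]])
h = 1.0 / (1.0 + np.exp(-(x_new @ W1 + b1)))
print(int(np.argmax(h @ W2 + b2)))               # expected: 1 (fault A)
```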
It should be evident from the introductory comments that ANN-based fault diagnosis methods were first applied to complex, non-linear processes where previous attempts using conventional methods had failed. Chemical processes are such an example: Hoskins and Himmelblau, (1988), illustrated an artificial neural network approach to the fault diagnosis of a simple example process composed of three continuous-stirred-tank reactors in series. Watanabe et al. (1989) presented a network architecture to estimate the degree of failures. They used an example system of three measurements and five faults. Venkatasubramanian and Chan (1989) used a binary-input network to diagnose faults of a fluidized catalytic cracking process. They also compared the neural network approach with a knowledge-based approach. Sorsa et al. (1993) applied radial basis networks to detecting deactivation of catalysts in jacketed reactors.
In the Computer Integrated Manufacturing (CIM) field, relevant work has been reported by Barschdorff et al. (1991), who applied back propagation and condensed nearest neighbour techniques to wear estimation and state classification of cutting tools. Miguel et al. (1993) investigated the applicability of an Adaptive Resonance Theory (ART)-3 based neural network to the detection and diagnosis of rotative machine failure. Syed et al. (1993) used Kohonen maps for the real-time monitoring and diagnosing of robotic assemblies. Chow et al. (1991) implemented a three-layer feed-forward neural network for the real-time condition monitoring of induction motors. Yamashina et al. (1990) considered neural networks as a failure diagnosis tool for servovalves. Suna and Berns (1993) considered a back-propagation structure for pipeline fault diagnosis.
In the aerospace industry, Rauch et al. (1993) performed fault detection, isolation and reconfiguration of an F-22 aircraft using neural networks, while Feng et al. (1993) applied an ART-2 neural network for automatic diagnosis of a variable-thrust liquid rocket engine. Passino et al. (1989) used a multilayer perceptron as a numeric-to-symbolic converter in a failure diagnosis application on an aircraft example.
Finally, Naidu et al. (1990) and Konstantopoulos and Antsaklis (1993) implemented a neural model of a four-parameter controller suitable for sensor and actuator failures.
The outline of this chapter is as follows: in the first sections the theory of neural networks is outlined, together with the basic topologies of ANN-based fault diagnosis. This is followed by a detailed examination of the principles of ANN-based fault diagnosis. Finally, specific applications from representative fields are presented.
The connectivity of a neural network determines its structure. Groups of neurons could be locally interconnected to form clusters that are only loosely, weakly, or indirectly connected to other clusters. Alternatively, neurons could be organized in groups or layers that are directionally connected to other layers.
Several different generic neural network structures are useful for ANN-based fault detection. Examples are:
• The Pattern Associator (PA). This neural implementation is exemplified by feedforward networks. A sample feedforward network is shown in Fig. 5.1a. In section 5.4.2.1 this type of network structure is explored in detail. Its learning (or training) mechanism (the backpropagation approach and the generalized delta rule) is considered and the properties of the approach are explored.
• The Content-Addressable Memory or Associative Memory model (CAM or AM). This neural network structure is best exemplified by the Hopfield model. A sample structure is shown in Fig. 5.1b.
• Self-Organizing Networks. These networks exemplify neural implementations of unsupervised learning in the sense that they typically cluster, or self-organize, input patterns into classes or clusters based on some form of similarity.
Figure 5.1 (a) Feedforward NN structure (b) CAM/AM neural network structure
Although these network structures are only examples, they seem to be receiving the vast majority of attention.
The feedback structure of a recurrent network shown in Fig. 5.1b suggests that network temporal dynamics, that is, change over time, should be considered. In many instances the resulting system, due to the nonlinear nature of unit activation-output characteristics and the weight adjustment strategies, is a highly nonlinear dynamic system. This raises concerns with overall network stability, including the possibility of network oscillation, instability, or lack of convergence to a stable state. The stability of nonlinear systems is often difficult to ascertain.
Figure 5.2 A generic neuron model: inputs I_1, ..., I_D with weights w_1, ..., w_D, a bias weight w_(D+1) with fixed input I_(D+1) = 1, and output f(a).
Output functions. As shown in Fig. 5.4, a variety of functions that map neuron input activation into an output signal are possible. The simplest example is that of a linear unit, where,
o = f(a) = a    (5.1)
One particular functional structure that is often used is the sigmoid characteristic, where,
o = f(a) = 1 / (1 + e^(-λa))    (5.2)
Equation (5.2) yields o ∈ [0, 1]. Here λ is an adjustable gain parameter that controls the "steepness" of the output transition as shown in Fig. 5.4e. Typically, λ = 1.
Figure 5.3 Neuron activation characteristics. (a) McCulloch-Pitts model (b) Linear weighted threshold model. (T denotes the threshold, E the sum of activated excitatory inputs and I the sum of activated inhibitory inputs; the outputs are binary, o = {0, 1}.)
Another particularly interesting class of activation functions are bilevel mappings or thresholding units. For example,
o = f(a) = { 1, a ≥ 0;  0, a < 0 }    (5.4)
Fig. 5.4b shows this unit characteristic using a general threshold, T. Alternatively,
Figure 5.4 Examples of unit output characteristics: (a) linear; (b) thresholding with threshold T; (c) unit with lower threshold t and upper threshold; (d), (e) sigmoid characteristics for gains λ > 1, λ = 1, λ < 1 and λ = 0.
o = f(a) = { +1, a ≥ 0;  -1, a < 0 }    (5.5)
Thresholding units may be viewed as a limiting case of the sigmoidal unit characteristic
of (5.2). This is shown in Fig. 5.4e. In addition, thresholding units may be used to com-
pute Boolean functions.
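To make the preceding unit characteristics concrete, the short sketch below implements the output functions of Eqs. (5.1), (5.2), (5.4) and (5.5) and shows how the sigmoid approaches the thresholding unit as the gain λ grows. The numerical values are illustrative only, and NumPy is assumed to be available.

```python
import numpy as np

def linear(a):
    return a                                  # Eq. (5.1): o = f(a) = a

def sigmoid(a, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * a))     # Eq. (5.2): gain lambda controls the steepness

def threshold(a, T=0.0):
    return np.where(a >= T, 1.0, 0.0)         # Eq. (5.4) with a general threshold T

def bipolar_threshold(a):
    return np.where(a >= 0.0, 1.0, -1.0)      # Eq. (5.5)

a = np.linspace(-2.0, 2.0, 5)
for lam in (1.0, 5.0, 50.0):
    print(lam, np.round(sigmoid(a, lam), 3))  # a large gain approaches the thresholding unit
print(threshold(a), bipolar_threshold(a), linear(a))
```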
The characteristics of the previous activation functions suggest an instantaneous activation-to-output mapping. A more realistic model would involve delay or dynamics in the unit response. A model to incorporate dynamics might be,
da_i(t)/dt = -(1/τ_i) a_i(t) + (1/τ_i) a_i'(t)    (5.6)
where a_i(t) is the activation, a_i'(t) is the actual input activation and τ_i is the time constant of the ith unit. Equation (5.6) constrains the time change of individual unit states and enables a local "memory." Biases may be used, for example, to selectively inhibit the activity of certain neurons.
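As an illustration of the dynamic unit model of Eq. (5.6), the following sketch integrates the first-order equation with simple Euler steps; the time constant, step size and input are assumed values chosen only for the example.

```python
import numpy as np

def simulate_unit(a_input, tau=0.5, dt=0.01, a0=0.0):
    """Euler integration of da/dt = (-a(t) + a_in(t)) / tau, cf. Eq. (5.6)."""
    a, trace = a0, []
    for a_in in a_input:
        a += dt * (-a + a_in) / tau           # leaky integration towards the input activation
        trace.append(a)
    return np.array(trace)

response = simulate_unit(np.ones(300))        # step input: the activation relaxes towards 1
print(round(float(response[-1]), 3))
```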
As noted earlier, an ANN can be regarded as a collection of processing nodes and con-
nections. However, almost all neural networks have considerable structure beyond this
simple representation. For example, most ANN architectures group the processing nodes
into disjoint subsets, called layers, in which all the processing nodes have essentially the
same transfer function. Processing nodes can send connections to other processing nodes
in the same layer as well as to processing nodes on other layers.
Many ANN topologies have been proposed (Hopfield, 1982; Feldman and Ballard, 1982;
Rumelhart and McClelland, 1986; Kohonen, 1984). Each differs in the number and
character of the processing nodes, the connections, the training procedures, and whether
the input/output values are continuous or discrete. An extensive review of fundamental
developments in feedforward artificial neural networks from the past 30 years is given by
Widrow and Lehr (1990).
Supervised learning requires the pairing of each input vector with a target vector representing the desired output; together these are called a training pair. Usually a network is trained over a number of such training pairs. An input vector is applied, the output of the network is calculated and compared to the corresponding target vector, and the difference (error) is fed back through the network and weights are changed according to an
algorithm that tends to minimize the error. The vectors of the training set are applied sequentially, and errors are calculated and weights adjusted for each vector, until the error for the entire training set is at an acceptably low level.
In this section a neural network with a layered, feedforward structure and an error gradient-based training algorithm is presented. Although a single-layer network of this type, known as the perceptron, has existed since the late '50s (Minsky and Papert, 1969), it did not see widespread application owing to its limited classification ability and the lack of a training algorithm for the multilayer case. Furthermore, the training procedure evolved from the early work of Widrow (Widrow and Hoff, 1960) in single-element, nonlinear adaptive systems such as ADALINE (Widrow and Lehr, 1990).
The feedforward network is composed of a hierarchy of processing units, organized in a series of two or more mutually exclusive sets of neurons or layers. The first, or input, layer serves as a holding site for the values applied to the network. The last, or output, layer is the point at which the final state of the network is read. Between these two extremes lie zero or more layers of hidden units. Links, or weights, connect each unit in one layer to only those in the next-higher layer. Fig. 5.5 illustrates the typical feedforward network. The network as shown consists of a layer of d input units (L_I), a layer of c output units (L_o), and a variable number of internal or hidden layers (L_hi) of units.
The topology of the multilayer, feedforward network is not its essential point. What is more important is the rule by which the topology acquires intelligence, i.e. its learning rule. In fact, the delay in developing a proper rule can be blamed for the early stalling of neural network research. Even though an explicit learning formula was not developed until around 1986 by Rumelhart and McClelland, an existence proof by Kolmogorov laid the theoretical foundations for such a rule.
Kolmogorov's Mapping Neural Network Existence Theorem: Given any continuous function φ: I^d → R^c, φ(x) = y, where I is the closed unit interval [0, 1] (and therefore I^d is the d-dimensional unit cube), φ can be implemented exactly by a three-layer neural network having d processing elements in the input layer, (2d+1) processing elements in the (single) hidden layer, and c processing elements in the output layer. The processing elements in the hidden layer implement the mapping function,
z_k = Σ_{j=1}^{d} λ^k ψ(x_j + εk) + k    (5.7)
where x_j are the network inputs and the real constant λ, as well as the continuous real monotonically increasing function ψ, are independent of φ (although they do depend on d). The constant ε is a rational number, 0 < ε ≤ δ, where δ is an arbitrarily chosen positive
constant. Further, it can be shown that ψ can be chosen to satisfy a Lipschitz condition |ψ(x) - ψ(y)| ≤ c|x - y|^β for any 0 < β ≤ 1.
Figure 5.5 The multilayer feedforward network: input layer L_I, internal or "hidden" layers L_hi, and output layer L_o.
The output layer units implement the mapping
y_i = Σ_{k=1}^{2d+1} g_i(z_k),    i = 1, 2, ..., c
where the functions g_i, i = 1, 2, ..., c are real and continuous (and depend on ψ and ε).
The utility of this result is somewhat limited, however, since no indication of how to construct the ψ and g_i functions is given. For example, it is not known whether the commonly used sigmoidal characteristics even approximate these functions.
The Generalised Delta Rule (GDR) or back-propagation learning rule is the most widely used procedure for "tuning" multilayer, feedforward networks. It was proposed, as stated earlier, by Rumelhart and McClelland in 1986. At about the same time, Hecht-Nielsen reintroduced Kolmogorov's Theorem, thus opening a new era in neural network research. Since then, a number of variants of the GDR have been produced, aiming at overcoming certain shortcomings of the original approach or trying to fine-tune the algorithm to specific tasks such as system identification, control or pattern recognition. The back-propagation algorithm belongs to the class of back-coupled error correction procedures, which use a term that has to be computed from the network's output node in question, and is then transmitted back to adjust the weights.
Before giving the details of the rule, let us describe the basic operations in step format:
1. Apply input (stimulus) vector to network.
where,
E_p(w) = (1/2) Σ_j (d_pj - y_pj)^2    (5.9)
and w is the vector containing the network weights. Given a training input vector x_p, y_p is the output vector generated by the forward propagation of the activation through the network, and d_p represents the desired output vector associated with x_p.
Whether this objective function is appropriate in specific situations, e.g. in classification problems, is debatable, and modifications may be made if this is not the case.
The objective function is minimised by a gradient descent technique. Applying the minimisation criteria to the output node yields the following weight adjustment rule:
Δ_p w_ji = η δ_pj o_pi + α Δ_{p-1} w_ji    (5.10)
where Δ_p w_ji is the change in the weight from node i to node j after training sample p, η is a gain term usually called the learning rate, o_pi is the output of node i, and α is a momentum term that smoothes the effect of dramatic weight changes by adding a fraction of the weight change, Δ_{p-1} w_ji, from the previous training sample p-1. The error signal δ_pj is a measure of the distance from the activation level of node j to its desired level after training sample p.
The GDR provides two rules for calculating the error signal of a node. For an output node,
δ_pj = (d_pj - o_pj) f_j'(a_pj)    (5.11)
where d_pj is the desired activation level for node j with respect to the activity generated by the input pattern p.
The major contribution of the GDR lies in its formula for computing the delta values for the hidden nodes, since before the GDR there was no learning procedure for multilayer, feed-forward networks. For an arbitrary node j in a hidden layer, the rule for calculating the error signal is,
δ_pj = f_j'(a_pj) Σ_k δ_pk w_kj
where the summation is over all the k nodes to which node j sends output.
As the name back-propagation suggests, the basic idea behind this computation of error signals for the hidden nodes is to propagate back through the system errors that are based on observed discrepancies between the values of the output nodes and the expected output for a training pattern. The error signal focuses on adjusting the connection strengths that are most responsible for the output error. Then, if,
f_j(a_j) = 1 / (1 + e^(-a_j))
Its shortcomings are briefly discussed. Apart from these, a back-propagation network has a poor memory: when the network learns something new, it forgets the old. Despite its shortcomings, back-propagation is very widely used.
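The sketch below summarizes, under stated assumptions, the GDR computations just described for a network with one hidden layer: forward pass, output and hidden error signals, and the momentum-smoothed weight update of Eq. (5.10). The toy training pairs, layer sizes, learning rate η and momentum α are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# toy training pairs: 2 inputs -> 1 output, targets kept away from 0/1 as discussed later
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[0.1], [0.9], [0.9], [0.1]])

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)   # hidden -> output weights
dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)
eta, alpha = 0.5, 0.9                                       # learning rate and momentum

for epoch in range(5000):
    for x, d in zip(X, D):
        h = sigmoid(x @ W1 + b1)                            # forward pass
        y = sigmoid(h @ W2 + b2)
        delta_out = (d - y) * y * (1 - y)                   # output error signal (cf. Eq. 5.11)
        delta_hid = h * (1 - h) * (W2 @ delta_out)          # back-propagated hidden error signal
        dW2 = eta * np.outer(h, delta_out) + alpha * dW2    # weight change with momentum (cf. Eq. 5.10)
        dW1 = eta * np.outer(x, delta_hid) + alpha * dW1
        W2 += dW2; b2 += eta * delta_out
        W1 += dW1; b1 += eta * delta_hid

# after training, the outputs should approach the 0.1/0.9 targets
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```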
Recently, there has been a concentrated effort towards the design and analysis of learning algorithms that are based on Lyapunov stability theory. Polycarpou and Ioannou (1992) present a general formulation for modeling, identifying and controlling nonlinear dynamical systems using various neural network architectures, and obtain analytical results concerning the stability of these schemes. Gaussian radial-basis-function networks have also been used for adaptively controlling dynamical systems with unknown nonlinearities.
A special architecture of RNN dispenses with the so-called hidden layers: instead, the input pattern is enhanced with additional high-order terms, and one usually finds that a flat net with no hidden layers then suffices for the purpose. This kind of architecture is called a Recurrent High-Order Neural Network (RHONN).
High-order networks are expansions of the first-order Hopfield and Cohen-Grossberg
models (described later) that allow higher-order interactions between neurons. RHONNs
have a superior storage capacity, while stability properties of these models for fixed
weight values have been proved. Furthermore, several authors have demonstrated the
feasibility of using these architectures in applications such as grammatical inference and
target detection.
Kosmatopoulos et al. (1992) present efficient learning algorithms for recurrent high-order neural models and analyze their stability properties.
The performance of RHONNs during the learning phase is superior to the performance of conventional architectures (backpropagation, etc.). Convergence is achieved in far fewer iterations; e.g., for a task of learning 75 input/output pairs (each input being a pattern of 30 real-valued features and each associated output a single scalar), a backpropagation architecture needs 50-80 iterations, while a RHONN architecture needs only 6 iterations in order to converge with a system error in the order of 10^-3.
Recurrent neural network (RNN) models are characterized by a two-way connectivity
between neurons. This distinguishes them from feedforward neural networks, where the
output of one neuron is connected only to neurons in the next layer. In the simple case,
the state history of each neuron is determined by a differential equation of the form
ẋ_i = -a_i x_i + b_i Σ_j w_ij y_j
where x_i is the state of the i-th neuron, a_i, b_i are constants, w_ij is the weight connecting the j-th input to the i-th neuron, and y_j is the j-th input to the above neuron. Each y_j is
either an external input or the state of a neuron passed through a sigmoidal function, i.e., y_j = S(x_j), where S(·) is a sigmoidal nonlinearity.
The dynamical behavior and stability properties of neural network models of this form have been extensively studied by Hopfield, as well as by other researchers. These studies showed encouraging results in application areas such as associative memories, but they also revealed limitations of this simple model.
In a recurrent second-order neural network the total input to the neuron is not only a linear combination of the components y_j, but also of their products y_j y_k. Moreover, one can pursue this line further to include higher-order interactions represented by triplets y_j y_k y_l, quadruplets, etc. This class of neural networks forms a recurrent higher-order neural network (RHONN).
Consider now a RHONN consisting of n neurons and m inputs. The state of each neuron
is governed by a differential equation of the form
ẋ_i = -a_i x_i + b_i [ Σ_{k=1}^{L} w_ik Π_{j ∈ I_k} y_j^{d_j(k)}(t) ]
where {I_1, I_2, ..., I_L} is a collection of L not-ordered subsets of {1, 2, ..., m + n}, a_i, b_i are real coefficients, w_ik are the (adjustable) synaptic weights of the neural network, and d_j(k) are non-negative integers. The state of the i-th neuron is again represented by x_i, and y = [y_1 y_2 ... y_{m+n}]^T is the vector consisting of the inputs to each neuron, defined by,
y = [y_1, ..., y_n, y_{n+1}, ..., y_{n+m}]^T = [S(x_1), ..., S(x_n), u_1, ..., u_m]^T
where u = [u_1 u_2 ... u_m]^T is the external input vector to the network. The function S(·) is a monotone increasing, differentiable sigmoidal function of the form
S(x) = α / (1 + e^(-βx)) - γ
where α, β are positive real numbers and γ is a real number. In the special case that α = β = 1, γ = 0, we obtain the logistic function, and by setting α = β = 2, γ = 1, one obtains the hyperbolic tangent function; these are the sigmoidal functions most commonly used in neural network applications.
The weights of the RHONN are adjusted using the learning algorithm proposed by Kosmatopoulos et al. (1992).
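A minimal simulation sketch of the RHONN state equation and the sigmoid S(·) defined above follows; the network size, the index sets I_k, the coefficients and the unit exponents d_j(k) = 1 are all assumptions made for the example, and the weights are kept fixed rather than adapted by the learning algorithm.

```python
import numpy as np

def S(x, alpha=1.0, beta=1.0, gamma=0.0):
    # logistic function for (alpha, beta, gamma) = (1, 1, 0), hyperbolic tangent for (2, 2, 1)
    return alpha / (1.0 + np.exp(-beta * x)) - gamma

n, m = 2, 1                                   # two neurons, one external input
a = np.array([1.0, 1.0]); b = np.array([1.0, 1.0])
# assumed high-order terms: I_1 = {1}, I_2 = {2}, I_3 = {1, 3} (index 3 is the external input u_1)
I = [[0], [1], [0, 2]]
W = np.array([[0.5, -0.3, 0.8],               # w_ik for neuron 1
              [0.2,  0.6, -0.4]])             # w_ik for neuron 2

def rhonn_step(x, u, dt=0.01):
    y = np.concatenate([S(x), u])             # y = [S(x_1), ..., S(x_n), u_1, ..., u_m]
    z = np.array([np.prod(y[idx]) for idx in I])   # product over each index set I_k (d_j(k) = 1)
    dx = -a * x + b * (W @ z)
    return x + dt * dx                        # Euler step of the state equation

x = np.zeros(n)
for _ in range(1000):
    x = rhonn_step(x, u=np.array([1.0]))      # constant external input
print(np.round(x, 3))
```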
The multilayer feedforward and the Hopfield neural networks both exemplify supervised learning. In this section, neural networks based on unsupervised learning are shown. Specifically, networks that are used to determine natural clusters or feature similarity from unlabeled samples are explored. The "cluster discovery" capability of such networks leads to the descriptor self-organizing.
Fundamentally, unsupervised learning algorithms (or "laws") may be characterized by
first-order differential equations (Kosko, 1990). These equations describe how the
network weights evolve or adjust over time. Often, some measure of pattern associativity
or similarity is used to guide the learning process, which usually leads to some form of
network correlation, clustering, or competitive behavior.
Fig. 5.7 typifies an ART1 architecture. The FA subsystem may be viewed as the "bottom" layer that both "holds" the input pattern and, through the bottom-up weights b_ij (the interconnection strength from FA unit j to FB unit i), forms the FB layer excitation.
Figure 5.7 ART1 architecture: neural subsystem FB, neural subsystem FA, and input pattern x.
Within the FB layer the units are laterally connected with weights
w_ij = { 1, i = j;  -ε, i ≠ j }
so that each FB unit receives the lateral input a_j = o_j - ε Σ_{k≠j} o_k,
where o_j = f_j(a_j) and f_j(·) is a unit activation-output mapping function that must be monotonically nondecreasing for positive a_j and zero for negative a_j, and is a fundamental part of the MAXNET structure (Lippman, 1987). Thus, only one pattern class is designed to "win" if the overall network converges for a given input pattern. The reader will note some similarity of this local competition or inhibition-based structure with the Kohonen structure discussed subsequently. The overall ART1 architecture, then, is a cooperative-competitive feedback (recurrent) structure.
The previous static network architecture description only partially characterizes the operation of the network. Because of the feedback structure, temporal dynamics are an important component of both recognition (recall) and learning (encoding). These actions are governed by the additional control signals shown in Fig. 5.7 and enable two phases of operation: an attentional phase engages FA units only when an input pattern is presented; an orienting phase successively "thins" units in FB, until a winner, or pattern class, is found. If it is not possible to determine a winner, an uncommitted FB unit is used to represent this new pattern class, thus facilitating learning.
When presented with an input pattern, the ART network implements a combined recognition/learning paradigm. If the input pattern is one that is the same as, or close to,
one previously memorized, the desired network behavior is that of recognition, with possible reinforcement of the FB layer on the basis of this experience. The recognition phase is a cyclic process of bottom-up adaptive filtering (adaptive since the weights b_ij are changeable at each iteration) from FA to FB, selection of a stored pattern class in FB (the competition), and mapping of this result back to FA, until a consistent result at FA is achieved. The competition winner output is fed back top-down from FB to form FA activations that may be viewed as encoded or learned expectations. This is then the network state of resonance and represents a search through the encoded or memorized patterns in the overall network structure. If the input pattern is not recallable, the desired behavior is for the FB layer to adapt or learn this class, by building or assigning a new node henceforth representing this pattern class.
An algorithm that accommodates binary input {-1, 1} features is as follows:
1. Select ε, ρ and initialize the interlayer connections as follows:
t_ij^0 = 1,  ∀ i, j    (5.17)
b_ij^0 = 1 / (1 + n)    (5.18)
Equations (5.17) and (5.18) are specific cases of more general constraints that must be placed on the initial values of t_ij and b_ij. Equation (5.17) satisfies the so-called template learning inequality, whereas (5.18) satisfies the direct access inequality.
2. Present the d-dimensional binary pattern x = (x_1, x_2, ..., x_d)^T to the FA layer.
3. Using b_ij, determine the activations of the FB layer units, that is, each unit has activation,
a_i^FB = Σ_j b_ij x_j    (5.19)
4. Select as the winner the FB unit with the largest activation, as in (5.20). Related to the winner unit in (5.20) is the function m(j), used for weight updates and shown in (5.25).
5. The top-down verification phase begins. Using the winner unit found in step 4, this result is then fed back to FA via the top-down, or t_ij, interconnections using,
a_i^FA = t_ij o_j^FBwin    (5.21)
for each unit in FA. The fed-back FA unit activations (or outputs) are then compared with the given input pattern. This is an attempted confirmation of the winning unit class found in step 4. Numerous comparisons are possible, with the overall objective of determining whether the top-down and input activations are sufficiently close. For example, since the inputs are binary, the comparison
||a^FA|| / ||x|| > ρ    (5.22)
may be used. In ART1, ||x|| = Σ_i |x_i|. Here ρ is a design parameter representing the "vigilance" of the test, that is, how critically the match should be evaluated.
6. If (5.22) is true, that is, the test succeeds, the b_ij and t_ij interconnections are updated to accommodate the results of input x using discrete versions of the slow learning dynamic equations:
m(j) = { 1, if o_j = o_j^FBwin;  0, otherwise }    (5.25)
If the test of (5.22) fails, this unit is ruled out and step 4 is repeated until a winner can be found or there are no remaining candidates.
Equations (5.23)-(5.25) represent one example of a learning strategy in the ART approach. Carpenter and Grossberg (1987) provide separate "slow" and "fast" learning procedures. Parameters a_1 and a_2 control the rate at which the system learns or adapts and must be chosen carefully. Learning rates that are too slow yield systems that are rigid (or nonadaptive in the extreme case). Conversely, learning rates that are too fast cause the system to display chaotic (or what is termed "plastic") behavior. In the extreme case, the system tries to learn every input pattern as a new class. Thus, a trade-off
exists between system insensitivity to novelty (truly new patterns) and an overly plastic behavior. Simplified versions of these updating strategies are shown in Pao (1989).
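The following is a much-simplified sketch of the ART1 recognition/vigilance cycle of steps 1-6 above. It uses the common {0, 1} input coding, a fast-learning template update in place of the slow-learning dynamics of Eqs. (5.23)-(5.24), and an assumed choice function; it is meant only to illustrate the bottom-up competition, the vigilance test and the commitment of new units.

```python
import numpy as np

def art1_cluster(patterns, rho=0.7, max_units=10):
    templates = []                                 # top-down templates, one per committed FB unit
    labels = []
    for x in patterns:
        winner = None
        candidates = list(range(len(templates)))
        while candidates:
            # bottom-up choice: an assumed normalized-overlap choice function (cf. steps 3-4)
            scores = [np.sum(templates[j] * x) / (0.5 + np.sum(templates[j])) for j in candidates]
            j = candidates[int(np.argmax(scores))]
            match = np.sum(templates[j] * x) / max(np.sum(x), 1e-9)   # vigilance test (cf. Eq. 5.22)
            if match >= rho:
                templates[j] = templates[j] * x    # fast learning: intersect template with x
                winner = j
                break
            candidates.remove(j)                   # reset this unit and search again
        if winner is None and len(templates) < max_units:
            templates.append(x.copy())             # commit an uncommitted unit to the new class
            winner = len(templates) - 1
        labels.append(winner)
    return labels

X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]], dtype=float)
print(art1_cluster(X, rho=0.6))                    # groups the first two and last two patterns
```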
Choosing the dimension of the feature map involves engineering judgment. Some PR
applications naturally lead to a certain dimension; for example, a 2-D map may be
developed for speech recognition applications, where 2-D unit clusters represent phonemes. The dimensions of the chosen topological map may also influence the training time of the network. It is noteworthy, however, that powerful results have been obtained by just using 1- and 2-D topologies.
Once a topological dimension is chosen, the concept of an equivalent dimension neighborhood (or cell or bubble) around each neuron may be introduced. An example for a 2-D map is shown in Fig. 5.9. This neighbourhood, denoted N_c, is centered at neuron u_c, and the cell or neighborhood size (characterized by its radius in 2-D, for example) may vary with time (typically in the training phase). For example, initially N_c may start as the entire 2-D network, and the radius of N_c shrinks as iteration proceeds. As a practical matter, the discrete nature of the 2-D net allows the neighborhood of a neuron to be defined in terms of nearest neighbors; for example, with a square array, the 4 nearest neighbors of u_c are its N, S, E, and W neighbors; the 8 nearest neighbors would include the "corners." In 1-D, a simple distance measure may be used.
Each unit u_i in the network has the same number of weights as the dimension of the input vector and receives the input pattern x in parallel. The goal of the self-organizing network, given a large, unlabeled training set, is to have individual neural clusters self-organize to reflect input pattern similarity. Defining a weight vector for neural unit u_i as w_i = (w_i1, w_i2, ..., w_id)^T, the overall structure may be viewed as an array of matched filters, which competitively adjust unit input weights on the basis of the current weights and goodness of match. A useful viewpoint is that each unit tries to become a matched filter, in competition with other units.
Assume that the network is initialized with the weights of all units chosen randomly. Thereafter, at each training iteration k and for an input pattern x(k), a distance measure d(x, w_i) between x and w_i, ∀i in the network, is computed. This may be an inner product measure (correlation), the Euclidean distance, or another suitable measure. For simplicity, the Euclidean distance is adopted. For a pattern x(k), a matching phase is used to define a winner unit u_c, with weight vector w_c, using,
d(x, w_c) = min_i { d(x, w_i) }    (5.26)
Thus, at iteration k, given x, c is the index of the best matching unit. This affects all units in the currently defined cell, bubble or cluster surrounding u_c, N_c(k), through the global network updating phase as follows:
w_i(k+1) = w_i(k) + α(k)[x(k) - w_i(k)],  if i ∈ N_c(k)
w_i(k+1) = w_i(k),  if i ∉ N_c(k)    (5.27)
Note that (5.27) corresponds to a discretized version of the differential adaptation law:
dw_i(t)/dt = α(t)[x(t) - w_i(t)]  for i ∈ N_c(t),  and  dw_i(t)/dt = 0  otherwise    (5.28)
Clearly, (5.28) shows that d(x, w_i) is decreased for units inside N_c, by moving w_i in the direction (x - w_i). The weight vectors outside N_c are left unchanged. The competitive nature of the algorithm is evident since, after the training iteration, units outside N_c are relatively further from x. That is, there is an opportunity cost of not being adjusted. Again, α is a possibly iteration-dependent design parameter.
The resulting accuracy of the mapping depends on the choices of N_c(k), α(k), and the number of iterations. Kohonen cites the use of 10,000-100,000 iterations as typical. Furthermore, α(k) should start with a value close to 1.0, and gradually decrease with k. Similarly, the neighborhood size N_c(k) deserves careful consideration in algorithm design. Too small a choice of N_c(0) may lead to maps without topological ordering. Therefore, it is reasonable to let N_c(0) be fairly large (Kohonen suggests 1/2 the diameter of the map), shrinking N_c(k), perhaps linearly, with k to the fine-adjustment phase, where N_c(k) consists only of the nearest neighbors of unit u_c. Of course, a limiting case is where N_c(k) becomes one unit.
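A compact sketch of the self-organizing map training loop of Eq. (5.27) follows; the map size, the number of iterations and the linear schedules chosen for α(k) and the neighbourhood radius are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
grid = 8                                    # an 8x8 map
W = rng.random((grid, grid, 2))             # one 2-D weight vector per unit
coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"), axis=-1)

n_iter = 2000
for k in range(n_iter):
    x = rng.random(2)                                       # unlabeled training sample
    dist = np.linalg.norm(W - x, axis=-1)                   # matching phase: Euclidean distance
    c = np.unravel_index(np.argmin(dist), dist.shape)       # winner unit u_c (cf. Eq. 5.26)
    alpha = 0.9 * (1.0 - k / n_iter)                        # alpha(k) decreasing towards 0
    radius = max(1.0, (grid / 2) * (1.0 - k / n_iter))      # N_c(k) shrinking with k
    in_cell = np.linalg.norm(coords - np.array(c), axis=-1) <= radius
    W[in_cell] += alpha * (x - W[in_cell])                  # Eq. (5.27): only units inside N_c move

print(np.round(W[0, 0], 3), np.round(W[-1, -1], 3))         # inspect two corner units of the map
```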
As pointed out in the introduction, an ANN-based fault monitoring scheme has, in a way, the same structure as a corresponding model-based scheme: a number of signals, which are deemed typical of the state of the process that is being monitored, is fed into a neural machine. The neural machine outputs a fault vector which is manipulated by a decision logic to decide if a fault has occurred and possibly isolate it and estimate its size.
Feedforward networks trained with the GDR (back-propagation) learning rule are used in the majority of published applications. The approach's elegant structure is, however, offset by two factors:
• It may not converge.
• It has a slow convergence, of order O(N^3), where N is the number of weights.
The first problem can usually be overcome by multiple starts with different random weights and by a low value of the learning rate η (Lippman, 1987). To accelerate the learning procedure, dedicated parallel hardware can be used for the computations. The extent of both drawbacks seems to depend on the parameters η (learning rate) and α (momentum factor). Unfortunately, their optimum values cannot be determined a priori and, furthermore, they may change during the training (i.e. they are time-varying). Their adaptive setting is the subject of ongoing research (Cho and Kim, 1993).
Since the appropriate choice of a network topology cannot be made a priori, it is good practice to compare the performance of various topologies and choose the best performer. This procedure is not straightforward, however. Not only does one have to compare different topologies, but also different configurations of the same topology. In a feedforward, multilayered ANN these configuration choices have to be found empirically. Moreover, the number of hidden layers and the number of nodes per hidden layer must also be found by experiment. This is a well known drawback in implementing this kind of ANN. Node activation functions must also be chosen amongst the class of possible alternatives (threshold, sigmoid, hyperbolic, Gauss etc.). It follows therefore that a logical procedure for optimum network topology is to search each proposed topology for its best configuration and then choose the best amongst the best. Sorsa et al. (1991) have used this idea in comparing three topologies: a single-layer perceptron, a multilayer perceptron and a counter-propagation network which combines a Kohonen layer for classification with an ART architecture for mapping. Results were obtained on a simulated model of a heat exchanger and a continuous stirred tank reactor. Their results will be detailed in the examples section.
What is really asked of an ANN-based fault diagnosis system is to recognize fault patterns inherent in signals carrying fault information. Thus, as already pointed out, a fault diagnosis problem can be viewed as a pattern recognition problem.
Neural networks have been used for pattern recognition for some time and there exist some powerful theorems in this area. In fact, Mirchandani and Cao (1989) have shown that in a d-dimensional space, the maximum number of regions that are linearly separable using h hidden nodes is given by,
M(h, d) = Σ_{j=0}^{d} C(h, j),  if h > d
M(h, d) = 2^h,  if h ≤ d
This theorem holds for hard-limiting nonlinearities, i.e. binary outputs. However, the conclusions can be extended to other types of nonlinearities (sigmoid etc.). The behaviour of such networks is of course more complex, because the decision regions are bounded by smooth curves instead of straight line segments.
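The quoted bound is easy to evaluate; the small sketch below computes M(h, d) for a few illustrative values of h and d.

```python
from math import comb

def max_regions(h, d):
    """Mirchandani-Cao bound: maximum number of linearly separable regions in d dimensions."""
    return 2 ** h if h <= d else sum(comb(h, j) for j in range(d + 1))

print(max_regions(3, 2))   # 3 hidden units in the plane: 1 + 3 + 3 = 7 regions
print(max_regions(2, 5))   # h <= d: 2^h = 4 regions
```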
In traditional pattern recognition techniques the pattern classification is carried out through a series of decision functions. A classification of a d-dimensional pattern space with M clusters may be viewed as a problem of defining hyperplanes to divide the d-dimensional Euclidean space into M decision regions. More complex decision functions will be needed for linearly nonseparable decision regions. Moreover, probability models are often employed under the premise of prior probabilities, because perfect typical reference pattern examples are not easy to obtain. How to select the suitable decision function forms and how to modify the relevant parameters of the decision functions are not easy questions for the traditional pattern recognition methods.
Similarly to traditional decision theory, neural networks perform the classification by creating decision boundaries to separate the different pattern classes. However, unlike traditional classifiers, when a classification is realized with neural networks, the decision functions do not need to be given beforehand. The whole mapping from sample space into decision space is developed automatically by using the learning algorithm. The knowledge of fault patterns is stored distributively in the highly interconnected nonlinear neuron-like elements. Moreover, it is these nonlinear activations in the network that lead to the strong classification ability of artificial neural networks for high-dimensional pattern spaces.
The usual pattern vector employed in ANN-based fault diagnosis has dimension equal to the number of faults that must be detected. In theory, a 1 in the ith position indicates an ith-type fault, while a zero pattern vector signals normal operation. In practice, however, the network is trained for the values of 0.9 and 0.1 for the fault and no-fault cases respectively, since 0 and 1 are limiting cases for the sigmoid activators (usually employed) and would stall the learning procedure if used. After training, a fault of type i is declared if the ith element of the output pattern vector exceeds a threshold. This threshold must be defined considering false alarm rates and is usually calculated by simulation. A value of 0.5 is a safe guess.
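A hedged sketch of the fault-declaration logic just described is shown below; the fault names and the 0.5 threshold are illustrative, and in practice the threshold would be set from the simulated false alarm rate.

```python
import numpy as np

FAULTS = ["fault_1", "fault_2", "fault_3", "fault_4"]   # illustrative fault names

def declare_faults(output_pattern, threshold=0.5):
    """Return the (possibly multiple) faults whose output element exceeds the threshold."""
    return [name for name, o in zip(FAULTS, output_pattern) if o > threshold]

print(declare_faults(np.array([0.12, 0.87, 0.08, 0.64])))   # -> ['fault_2', 'fault_4']
```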
Note that with this formulation, multiple faults can be detected, if the network has been
trained with this situation. An alternative procedure, mimicking parameter estimation
techniques, would be to produce a system parameter vector as an output pattern vector.
In this way, the neural network would act as a parameter estimator. Fault decision would
then be accomplished, using any of the methods discussed in Chapter 2. This decision
phase could also be implemented by a neural network. Thus the inputs to this second
network would be the parameter estimates, while its output would be a pattern vector
having the structure discussed previously.
The appropriate selection of input training data is a very important stage in the development of an ANN-based fault diagnosis system. There is little guidance in the literature regarding the choice of representative sets of examples for training with continuous inputs, because most studies involve binary inputs. Most studies also used a closed set of possible inputs. Training on dynamical systems, however, requires continuous signals.
The first step of the procedure is to decide on the system parameters that are representative of the system's condition. This is of course application-dependent, but it may be safely assumed that the input/output signals of a state-space representation of the plant will be adequate. It may be necessary to do some pre-processing on the input signals, such as scaling or filtering. The total number of samples needed depends on the network's characteristics, i.e. topology, activation functions, learning rules etc. It is evident that a small training sample is a desired system characteristic.
The training sample must contain signals from every possible fault situation of the plant and in a representative range of values. This may be impractical or even dangerous in certain situations of critical faults (e.g. nuclear reactors, aircraft) and simulation data is then needed. This in part offsets the comparative superiority of neural networks regarding the point of model necessity. Even more, it is true that most published research in ANN-based fault diagnosis relies on simulated process models. Is this a sign that this approach is not implementable? This question cannot be answered now, since it is acceptable to use simulated models in early stages of development of new ideas.
In this section, it is hoped to clarify many of the points discussed earlier and illustrate the applicability of the various methods.
The cited examples span a considerable part of industrial fields where ANN-based fault
diagnosis is proposed as an alternative to other techniques. The presentation is structured
in such a way as to highlight and enlighten the following crucial points:
• process model and fault models
• network topology, configuration and learning rule
• input training signals
• output pattern vector
• results
The examples that follow are only a representative sample of the available literature and additional references are cited at the end of the chapter.
The field of Chemical Engineering is especially suited for applying ANN-based fault diagnosis systems. The nature of chemical processes, i.e. nonlinear, nonstationary and uncertain dynamic plants, can be accommodated by neural network structures.
Because modern chemical plants are extremely complex, they are susceptible to equipment malfunction and operator error. The complexity hampers the operator's ability to diagnose and eliminate potential process upsets or equipment failures before they can occur (Himmelblau, 1978). Hence, a continuing question in chemical engineering is how to use the process state vector to make or aid decisions about possible action or control at each time increment. Current techniques rely on expert systems, modeling using classical techniques in the time or frequency domains, and statistical analysis.
Network architecture. Fig. 5.11 shows the network architecture used for fault detection and diagnosis. It consists of six inputs corresponding to the six state variables of the system, three hidden nodes, and six output nodes corresponding to the six respective process faults listed in Table 5.1. The GDR rule with various learning rates was used to train the network (0.25 < η < 0.9, α = 0.99).
Input training signals. Twelve measurement patterns were used to train the network: the 11 measurement patterns listed in Table 5.2, plus one measurement pattern for a normally operating system. These measurement patterns were obtained from a digital computer simulation program designed to model the dynamics of an arbitrary combination of reactors and/or vessels. A second-order reaction (2A → 2B) was assumed to occur between components A and B in each tank with a frequency factor of 5.0×10^14 (lb mole)/(ft^3)(min) and an activation energy of 4.47×10^4 Btu/lb mole. Because the
values of the sensor readings in the input patterns to the network are factors in the equations that update the values of the weights in the learning procedure, measurements with larger magnitudes exert a greater influence on learning. To remove this bias, the simulated measurement data were scaled (as indicated by the preprocessing boxes depicted in Fig. 5.11) so that the inputs to the network varied continuously over the range of -1.0 to 1.0.
Figure 5.11 Network architecture: six inputs and six output nodes corresponding to faults A-F.
Since one of the handicaps that fault detection and diagnosis must overcome is imprecise sensor measurements, random noise was added to the inputs, modifying them to,
x' = x + n
where n was a noise vector in which each element contained a random real value in the range of 0-10% of the input, and x' was the input vector with the added noise.
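The two pre-processing steps mentioned for this example, scaling the measurements into the range -1.0 to 1.0 and corrupting the inputs with up to 10% random noise, can be sketched as follows; the operating ranges and sensor values are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def scale_to_unit_range(x, lo, hi):
    return 2.0 * (x - lo) / (hi - lo) - 1.0                 # map each variable's range onto [-1, 1]

def add_noise(x, level=0.10):
    return x + rng.uniform(0.0, level, size=x.shape) * x    # random value in 0-10% of each input

x = np.array([310.0, 2.1, 0.45])                            # illustrative raw sensor readings
lo = np.array([250.0, 0.0, 0.0])                            # assumed operating range per variable
hi = np.array([400.0, 5.0, 1.0])
x_scaled = scale_to_unit_range(x, lo, hi)
print(np.round(add_noise(x_scaled), 3))
```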
Output pattern vector. During training, the target values used for the output nodes were set to 0.1 and 0.9 rather than 0 and 1. A learning criterion of 0.01 for each pattern error E_p (of Eq. 5.9) was set to terminate the learning process.
Results. Noisy data require more complex decision regions. The learning results are shown in Fig. 5.12. By using a sigmoid discriminant function, a multilayer feedforward ANN and the GDR learning procedure, the network could properly classify inputs. The plots in Fig. 5.12 exhibited the same general trends as in the linearly separable case. Note, however, that the range of the number of hidden nodes for low convergence rates becomes smaller, especially for the high learning rates.
To demonstrate the generalization capabilities of the ANN, the percent misclassification was examined for this example with three hidden nodes and the learning rate parameter set to 0.40. To test the ability of the network to recognize new input patterns, 35 measurement patterns not used in the training process were presented to the trained network. These new measurement patterns were chosen to be representative of the measurement space for the six faults and the normal operating system. Fig. 5.13 shows the percent of correct classifications as a function of the number of input patterns (with added noise) used to train the network.
Figure 5.12 Learning results: time steps to convergence vs. number of hidden nodes, for learning rates 0.25, 0.40, 0.65 and 0.90.
Perfect generalization occurred when only two measurement patterns in the training set had been used for each fault group (i.e., 12 input patterns total in the training set, two of which were used for the normal operating system). Increasing the number of measurement patterns maintained the same excellent performance level. However, some failure to generalize efficiently was observed for the training set containing only seven training patterns, because the training set was so restricted that it was not representative of the general mapping. This result was not surprising since the seven-pattern training set included no example patterns that were representative of the crossover between the normal and faulty regions.
Figure 5.13 Generalization capacity vs. training set size: percent correct responses on new input patterns as a function of the number of input patterns in the training set.
Additional references. Venkatasubramanian (1989, 1990) and his research team have reported experiments on a similar CSTR. They have used a feedforward neural network with backpropagation and compared two methods of presenting input patterns: raw time-series data and moving-average data values. Two methods of discretizing the desired output were also compared: a linear and an exponential one. Extensive experimentation showed better performance of the "linear" discretization, while raw time-series data input produced slightly better results.
Sorsa et al. (1991) conducted a comparison study on a simulated process consisting of a heat exchanger and a CSTR. They compared the performance of a single-layer perceptron, a multilayer perceptron and a counterpropagation network consisting of a Kohonen and an ART layer. The process had 14 noisy measurements and 10 typical faults. The multilayer perceptron with 4 hidden nodes, using a hyperbolic tangent as the nonlinear element, was able to correctly identify the faults in all cases. The same group (Sorsa et al., 1993) have also investigated the use of radial basis function ANNs for fault diagnosis of dynamical processes not in steady-state operation. An orthogonal least squares algorithm developed by Chen et al. (1991) is used to train the network. A simulated CSTR with set-point changes is used to test the validity of the proposed approach and preliminary results are promising.
Due to their efficient problem solving capabilities, parallel processing model, and ability to spontaneously react to environment changes, neural networks have prompted interest in their application to various real-time dynamic manufacturing systems (Moon, 1990; Lo and Bavarian, 1991; Gien et al., 1993).
Diagnostic problem solving methods, based on either deep reasoning (from first principles) or shallow reasoning (Davis, 1984), are considered to be unsuitable for domains with changing and short-lived processes (Reed et al., 1988). In a typical application of a robotic assembly, a sequence of short-lived processes (typical of robot operations) brings about continuous changes to the state of the assembly components. Such characteristics necessitate flexible and adaptable solutions with efficient real-time response capabilities for the detection of and the recovery from unexpected problems during execution. The commonly used expert system solution for monitoring and diagnosing can be inefficient and inflexible (Schutte et al., 1987; Zeilingold and Hoey, 1990), particularly when it involves a large number of rules (leading to a large and computationally expensive search space) which require frequent updates due to the environment changes.
Specifically, the monitoring and diagnosing of assembly execution errors, although recognized as an important problem in robotic assembly (de Mello and Sanderson, 1990; Kusiak and Finke, 1988; Chang and Wee, 1988), has not been addressed adequately. A solution to this problem is generally difficult due to the real-time assembly constraints and the complex dynamic interactions between various components, such as the robot, conveyor system, tools, and parts. The real-time constraints are particularly critical to the solution when computationally intensive sensory information (tactile, vision, etc.) is to be processed and accessed for the purpose of monitoring and diagnosing during each assembly step.
Syed et al. (1993) have proposed a neural network approach to the solution of this problem, by implementing an unsupervised map, namely the Kohonen map, in a robotic subassembly involving a fastener in a dishwasher power module.
Process and fault models. As just mentioned, a subassembly involving a fastener in a dishwasher power module is considered. A tactile sensor is attached to the end-effector of the robot arm. A generic robot operation, such as (PICKUP fastener FROM table), may have several execution instances for the same part. These instances can differ from one another in terms of the robot-part surface contact point and/or the approach angle of the robot end-effector. Therefore, each operation instance has its own particular part handling error characteristics.
To construct a neural map, one needs to establish a correlation between the error characteristics and the observed assembly interaction data. For example, a mean part surface contact area of between 0.0 and 0.5, measured using a tactile sensor, is correlated with a possible error in the execution of the (LIFT part) operation. The size of the surface contact area indicates how properly the part is grasped.
The input to each dimension in the neural map must be represented numerically. The output data from sensory systems are normally in numerical form and, thus, can be directly used as input to neural maps. However, robot operation identifiers have to be converted into corresponding numerical values. This can be accomplished by assigning each operation an equal numerical interval. Table 5.3 summarizes the numerical representation for a 2-D type-II map, when the fastener is used in the subassembly. For the input vector (ξ_1, ξ_2), the value of ξ_1 denotes the operation type and ξ_2 represents the mean measurement value.
In Fig. 5.14, the shaded areas in the 2-D space correspond to the abnormal regions. The abnormal regions are defined by all possible values of the input vector (ξ_1, ξ_2).
Figure 5.14 Network training inputs. Figure 5.15 Intermediate node positions. (Both figures plot mean measurement values against operation types.)
Results. The final map obtained through the training is utilized during the assembly process to monitor the robot operation. A winning node is determined whenever a sample of sensory input vectors (ξ_1, ξ_2) is applied to the network. The correctness of the robot operation is deduced from the winning node j, the density of nodes (d) in its neighborhood region, and the density threshold d_t.
The threshold value and the neighborhood region are defined heuristically by an operator. An example size for the neighborhood region is a rectangle of 0.08×0.15 relative to the ξ_1 and ξ_2 axes, respectively. For this example, the value of the density threshold, d_t, is set to 10. With these values for threshold and neighborhood region, let us consider two different operation instances from Table 5.3. For the operation instance in column 2, an input vector (0.38, 0.5) will indicate an execution error since d will be 15. On the other hand, an input vector (0.25, 0.9) for the operation instance in column 4 will not show an execution error as the value of d will be zero.
Additional references. Yamashina et al. (1990) used a feedforward, multilayer ANN to diagnose failures in a pneumatic servovalve used in automated production systems. A time-series vibration signal is monitored by an accelerometer and the resulting data are summarized by six characteristic parameters. Four types of failures were considered, and a separate neural network was designed for each case. A conjugate gradient method coupled with a variable metric method was seen to produce more reasonable search
directions and avoid oscillations. The diagnosis performance was very promising, reaching false alarm probabilities of 0.01 and lower.
Figure 5.16 Final node positions (mean measurement values vs. operation types).
Fault diagnosis in rotating machines has also been an application area of ANNs. Chow et al. (1991) have again used a feedforward, multilayer ANN to study the diagnosis of two of the most common types of incipient faults in a single-phase squirrel-cage induction motor: stator winding fault and bearing wear. Accuracies of 97.3% were reported using a network of 16 hidden nodes, trained from 35 training data patterns by the back-propagation algorithm.
Tinghu et al. (1993) have used similar techniques to analyze and diagnose five types of typical faults in rotating machinery (unbalance, seal rub, misalignment, rotor crack, oil whirl), based on standard frequency spectrum waveform features which are represented by power ratios in nine different frequency intervals.
Barschdorff et al. (1993) have investigated the ability of neural networks to diagnose tool wear in cutting processes like turning or grinding. They have used cutting force components and vibrations of the workpiece holder as suitable indicators of tool wear. A typical feedforward network is compared to a Condensed Nearest Neighbor (CNN) network developed earlier (Barschdorff and Bothe, 1991). Results showed some benefits of the CNN against back-propagation networks. They also indicated that process parameters can be used as inputs to increase the variety of cutting conditions under which the system operates efficiently.
Fault location in power systems is defined as the identification of a fault or double fault among system components such as transmission lines, buses, transformers and circuit breakers in substations, through analyzing the on/off status of several relaying systems or the tripped order of circuit breakers. The difficulties in the estimation derive from the malfunction of the relaying systems or circuit breakers themselves, that is, they sometimes do not operate when they should, or they do the switching when they should not.
Conventional methodologies which have been applied so far to this problem in power systems include:
• Logical expression (Wake and Sakaguchi, 1984).
• Expert systems (Matsumoto and Sakaguchi, 1983).
• Parameter estimation techniques (Stavrakakis and Dialynas, 1991).
Ogi et al. (1991) have used a modular neural network approach for power system and equipment diagnosis. Despite its shortcomings, the GDR was used as the learning rule in this application as well.
Plant and fault models. An example system with six buses, two transformers and two transmission lines with their protective relaying systems is used, as shown in Fig. 5.17. A fault location related to buses, transmission lines and transformers must be estimated from the on/off status of relaying systems or circuit breakers, in addition to the hypothesis of malfunctions in the relaying systems or circuit breakers themselves. The following is a list of the components of the sample power system:
• Relays (for buses A, B, C, transformers T1, T2 and lines L1, L2; the naming suffixes are listed below): 26
• Circuit breakers (CB): 11
• Fault components given: 10
  Buses (A1, A2, B1, B2, C1, C2): 6
  Transmission lines (L1, L2): 2
  Transformers (T1, T2): 2
The names of relays include a suffix which has the following meanings:
m: Main protective
p: Primary backup protective
s: Secondary backup protective
t: Third backup protective (which covers opposite direction of s)
A 3-layer feedforward network was used for fault location. Its input layer received the on/off status of relays and circuit breakers and its output layer indicated which components had failed. This indication is given by the largest element of the output vector.
Training. To train the network, a back-propagation learning algorithm with epoch training was used, that is, weight updates were performed after presentation of the entire data set. The convergence criterion used was the usual absolute maximum error between the desired and actual output. Training patterns which satisfied the criterion were progressively excluded from the weight update sequence. The training patterns consisted of 41 input/output pairs. The first 10 of them were concerned with normal operation, 22 with a circuit breaker malfunction and 9 with a relaying system malfunction (Table 5.4).
Results. To test the efficiency of the network, its response to non-trained patterns was examined. Two experiments were conducted: in the first, two fault components with all relays and circuit breakers operating normally were simulated, while in the second a single
fault with two circuit breakers malfunctioning was applied. The results showed that
ANNs with more than 50 hidden units were able to classify the non-trained patterns
correctly.
Table 5.4 Input patterns.
Figure 5.17 The sample power system: buses, transformers and transmission lines with their protective relays (e.g. T2m, T2p, T2s, T2t) and circuit breakers.
The four-parameter controller (Nett, 1988), illustrated in Figs. 5.18 and 5.19, is a generalization of the familiar two-parameter controller (Antsaklis, 1992). This controller has two vector inputs and two vector outputs, resulting in a controller with four matrix parameters. Its various elements are: r is the reference input, a the diagnostic controller output designed to reproduce the failures, y_c the controller output that is manipulated by the plant and should be considered the ideal actuator input, u_c the manipulated controller input, n_a an exogenous input accounting for unmodeled sensor signals, u the actual actuator output, y the plant output fed into the sensor, z the plant variables not used by the controller and w the unmanipulated plant input.
The linear controller can be described by the following relation,
[y_c; a] = [K_11  K_12; K_21  K_22] [r; u_c]
The objective of the additional controller output a is to identify and reproduce the sensor and actuator failures f_s and f_a. Thus, the overall control objectives are:
• Achieve set point tracking
Figures 5.18 and 5.19 The four-parameter controller configuration: commands, actuator, diagnostics and measurements, with signals r, z, a and w.
These requirements lead to certain conflicts (Nett et al., 1988): reproducing sensor or
actuator failures at the diagnostic output contradicts the requirement for noise and
disturbance rejection. Also, sensor diagnostic performance has to be traded against
actuator diagnostic performance.
Relying on the fact that a nonlinear controller should outperform its linear version,
Konstantopoulos and Antsaklis (1993) implemented a four-parameter controller in a
neural network. The general structure of a system designed for actuator failures is
shown in Fig. 5.20.
The neural controller has two main inputs: a reference signal r and the output of the plant y_p. Experience has shown that delayed system signals enhance training performance. For this reason, delayed reference inputs, plant outputs and controller outputs were input to the neural network. It also has two outputs: a diagnostic output a_act, and y_c, which
can be considered the ideal actuator input. The controller was trained with the following
objectives:
• To achieve set point tracking and isolate and reproduce actuator faults.
Figure 5.20 Neural four-parameter controller for actuator failures: reference input r, plant output y_p, controller output y_c, diagnostic output a_act, actuator input u, actuator failure f_act and noise n_act.
Reference tracking was achieved, whereas better reproduction of actuator failures was obtained by assigning a larger weight to the diagnostic output.
Additional references. Naidu et al. (1990) have developed a neural network sensor failure detection system along the lines suggested by Nett's work on the four-parameter controller just discussed. The back-propagation topology was used and compared to Finite Integral Squared Error (FISE) diagnostics as well as to the nearest neighbor classifier, for an Internal Model Control (IMC)-controlled system involving an uncertain linear, time-invariant, first-order plant and linear or nonlinear plants that lie within the model uncertainty bounds. Detailed studies produced promising results.
Historically, utilities and other operators of nuclear plants have relied on human op-
erators to monitor the plants and to diagnose any problems which occur. With the
notable exceptions of Three Mile Island in the United States and Chernobyl in the former
Soviet Union, this approach appears to have worked reasonably well. However, there is
clear evidence linking these accidents and a number of troublesome operational incidents
("near accidents") over the years to "operator error". One possible solution is to
automate the plants and "take the operator out of the loop". For a variety of reasons
(legal, regulatory, and other), such a solution is not practical at the present time. The
alternative approach of "backstopping" the operators by providing them with the results
of automated surveillance (including diagnostics) of the overall plant, is considered here.
The large number of process parameters and system interactions pose difficulties for the
operators, particularly during abnormal operation or emergencies. During such
situations, individuals can be affected by stress or emotion which may influence their
performance in varying degrees. Taking some of the uncertainty out of their decisions, by
providing real-time diagnostics and assistance, has the potential of increasing plant
availability, reliability and safety by avoiding errors that lead to trips or that endanger the
safety of the plant. The emerging technology of ANNs offers a method of implementing
real-time monitoring and diagnostics in a nuclear power plant. The various advanced
technologies, generally regarded as being within the scope of artificial intelligence,
especially neural networks, are believed to be appropriate for these tasks. The overall
objective is to provide the operator with necessary information about the power plant in
a way that would be useful, timely and non-intrusive. Special emphasis is given to the
early detection of abnormalities and deviations from normal operation, with the intent
that the operator could take corrective or compensating action, if appropriate.
Generally, the developed technology involved three specific tasks that were undertaken
using a variety of methods. These were:
1. Diagnostics based on pattern recognition in time-records and related
representations of variables (e.g. spectral densities),
2. Feature detection based on recognition of patterns in data, and
3. Modeling of phenomena and systems with interpretation of input-output relation-
ships.
Many projects involved both pattern recognition and modeling. In most cases,
comparison of predicted results (based on models developed from data taken when the
system was working properly) or patterns (learned by neural network models from data
presented to it) with actual results or patterns is involved. Often, data had to be
preprocessed to put it into an acceptable form (e.g., a fast Fourier transformation of the
time-series to produce a spectral plot of the data) before it could be introduced into a
neural network. Once a neural network has been trained to recognize the various
conditions or states of a complex system, it takes only one recall cycle of the neural
network, typically a few milliseconds, to detect or identify a specific condition or state. If
the neural network is implemented in hardware, the detection or identification is almost
instantaneous. Typically, the measured variables from the nuclear power plant systems
are analog signals that must be sampled, digitized, preprocessed, and normalized to
expected peak values before they are introduced into neural networks. The neural
networks are usually simulated on modern high-speed digital computers that carry out
the calculations serially. However, it is possible to implement neural networks using
specially designed microchips where the network calculations are truly carried out in
parallel, thereby providing virtually instantaneous outputs (microsecond response times)
for each set of inputs.
Plant-wide monitoring using neural networks. The technique for sensor validation has
been expanded into a plant-wide monitoring system by Upadhyaya and Eryurek (1992),
using an autoassociative neural network where the inputs and the outputs are the same
variables. The number of artificial neurons in the intermediate layer (usually about two or
three times the number of input nodes) was selected to minimize the training time and
maximize the ability of the neural network to generalize. The neural network was trained
over the range of operation using the same data for the input vector and the desired
output vector. Backpropagation using a sigmoidal function with an adjustable coefficient
was used to train the network when the system was operating properly. Under these
conditions, the neural network outputs represent estimates of the instantaneous values of
the output variables, and all of these estimates are virtually identical to the actual
outputs. When a sensor begins to drift or a failure is introduced into a data channel, the
actual value (neural network input) changes, but the corresponding predicted value
(neural network output) remains virtually unchanged. Hence, monitoring the differences
between the estimates predicted by the neural network (outputs) and the actual values
from the system (inputs) provides a method of identifying drift or instrumentation system
(or sensor) failure. An alternative interpretation of these differences might be that the
input-output relationship of the system from which the signals come, may have changed
due to system failure or changes of some sort in the system. This technique was applied
to data from eighteen signals from the Experimental Breeder Reactor-II (EBR-II)
during increase in power from 45% to 100%. Errors between estimates predicted by the
trained neural network and the actual values during normal operation were usually less
than 0.5%. When one of the sensors failed or an error was introduced into one of the
sensor outputs (the inputs to the neural network), the corresponding output of the neural
network changed only slightly. Hence, a difference between the network output (the
predicted value of the variable) and the actual variable identified that sensor or
instrumentation channel as the one with a problem.
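To make this monitoring scheme concrete, the following sketch (our own illustration, not the authors' code; the data, network size and the 0.5% residual threshold are assumptions loosely based on the figures quoted above) trains an autoassociative network on normal-operation data and flags channels whose residuals grow:

import numpy as np
from sklearn.neural_network import MLPRegressor

n_sensors = 18
X_normal = np.random.rand(500, n_sensors)          # placeholder for normal-operation data

net = MLPRegressor(hidden_layer_sizes=(2 * n_sensors,),   # ~2-3x the input nodes, per the text
                   activation='logistic', max_iter=2000)
net.fit(X_normal, X_normal)                        # inputs and targets are the same variables

def check_sensors(x, threshold=0.005):             # ~0.5% of the normalized range (assumed)
    # Return indices of channels whose residual exceeds the threshold.
    estimate = net.predict(x.reshape(1, -1))[0]
    residual = np.abs(x - estimate)
    return np.where(residual > threshold)[0]

x_now = X_normal[0].copy()
x_now[3] += 0.05                                   # simulate a drifting sensor on channel 3
print(check_sensors(x_now))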
Monitoring of check valves for operability. Although there are many possible failure
mechanisms for check valves, the most common problems associated with check valve
failures are due to system flow oscillations or system piping vibrations which induce
check valve component wear and thus component failure. A technique involving the use
of a neural network for the analysis of acoustical data from check valves to evaluate their
status has been reported by Ikonomopoulos, Tsoukalas and Uhrig (1992). The power
spectral density (PSD) of the sampled time-series at a point on the check valve body
near the hinge pin is used as the input to the neural network, and the PSD of the sampled
time-series at another point on the check valve body near the backstop is the desired
output of the neural network. The network is trained while the flow varies over the
normal range of operation when the valve is known to be operating properly. The neural
network is then used in a monitoring mode to predict the output sensor PSD from the
input PSD and a comparison is made between the predicted and actual output PSDs.
Deviations indicate that the interrelationship between the input and output signals has
changed due to a change (failure) of the valve. Analysis of time-records from two
piezoelectric accelerometers attached to the body of a check valve on a large Boiling
Water Reactor Nuclear Power Plant has been used to demonstrate this process.
Comparison of spectra between identical 30-inch check valves (one broken and one
normal), operating under identical conditions clearly demonstrated that this technique
can identify the failed valve. The index of normal system behavior is the mean square
difference, obtained by summing the square of the difference in individual spectral values
between the predicted and actual spectra of the failed valve, divided by the number of
spectral values. Records for three 6-inch check valves (one normal and two that failed
for different reasons) operating under identical conditions indicated that failures with
different degrees of severity give different values of mean square differences.
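A hedged sketch of this comparison follows (the sampling rate, signals and stand-in predictor are assumptions, and scipy's Welch estimator is used in place of whatever PSD estimator was actually employed); it shows how the mean square difference index described above can be computed:

import numpy as np
from scipy.signal import welch

fs = 10_000.0                                      # assumed sampling rate

def psd(x):
    _, Pxx = welch(x, fs=fs, nperseg=1024)
    return Pxx

def mean_square_difference(psd_predicted, psd_actual):
    # Mean of the squared differences between individual spectral values.
    d = psd_predicted - psd_actual
    return float(np.mean(d ** 2))

# Stand-in signals; in the reported application these come from two
# piezoelectric accelerometers on the check valve body.
x_hinge = np.random.randn(8192)
x_backstop = np.random.randn(8192)

psd_out_actual = psd(x_backstop)
psd_out_predicted = psd(x_hinge) * 0.9             # placeholder for the trained network's output
print(mean_square_difference(psd_out_predicted, psd_out_actual))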
Loose parts monitoring. The detection of loose parts in the primary or secondary
system of a nuclear power plant is based on the identification of sounds produced by
tumbling parts hitting a pipe wall, the tube sheet of a steam generator, or other surfaces
that are part of the coolant system boundary. The spectrum of the sound is dependent
upon the size and shape of the loose part and the materials of construction of both the
part and the system boundary. Once the sound spectrum for parts of different sizes and
shapes (those most likely to break loose, e.g., the hinge pin of an upstream check valve)
has been measured, a pattern matching technique can then be used to identify the part or
parts. To date, neural networks have not been used for this purpose, but they have been
used in Germany to negate false alarms of a loose parts monitor. In one of the German
plants, there is a metal to metal contact (not caused by a loose part) that produces a
sound that is detected by the loose parts monitor. To overcome this problem, a neural
network was trained to identify the unique sound of the metal contact. Then when this
sound occurs, the neural network identifies it and enables the loose parts monitor alarm.
Thermal margin (DNBR) estimation. Although the thermal margin of the reactor core
(i.e., the difference between the predicted departure from nucleate boiling ratio, DNBR,
and the limiting DNBR) is not directly measurable, it is a
parameter that is critically important for safety in nuclear power plants. The approach
used was to train a neural network to map the plant variables being monitored to the
DNBR as calculated by the computer code COBRA. The neural network was trained
over the range of input variables that was expected to occur during the fuel cycle. A fully
connected three-layer feedforward neural network was used for estimating DNBR
performance of the core. The output layer had a single processing element (PE)
representing the DNBR for the given plant operating parameters, which were the input
variables. A statistical sensitivity analysis relating the DNBR to the various parameters
indicated that the major parameters affecting the DNBR of a PWR during plant
operation were the core inlet temperature, the core power (or heat flux), the enthalpy rise
peaking factor, the core inlet flow rate and the system pressure; hence, the input layer
had five PEs. The DNBR obtained from the neural network using data not used in the
training process showed that under steady state conditions, the results agreed with those
obtained from COBRA calculations within ±2.5% in virtually all cases.
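The mapping just described can be sketched as follows (an illustration under our own assumptions; the random placeholder targets stand in for the COBRA-calculated DNBR values and the network size is arbitrary):

import numpy as np
from sklearn.neural_network import MLPRegressor

# Five monitored plant variables: inlet temperature, core power, enthalpy-rise
# peaking factor, inlet flow rate and system pressure (normalized placeholders).
X = np.random.rand(1000, 5)
dnbr_cobra = 1.3 + X @ np.array([0.2, -0.4, -0.1, 0.3, 0.1])   # placeholder for COBRA targets

net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=3000)
net.fit(X, dnbr_cobra)

dnbr_estimate = net.predict(X[:1])[0]              # on-line estimate for one operating point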
Diagnosis of nuclear power plant transients. When a nuclear power plant is operating
normally, the readings of the instruments in a typical control room form a pattern (or
unique set) of readings that represents a normal state of the plant or system. When a
disturbance occurs, the instrument readings undergo a transition to a different pattern,
representing a different state that may be normal or abnormal depending upon the nature
of the disturbance. The fact that the pattern of instrument readings undergoes a transition
to a new state that is different, is sufficient to provide a basis for identifying the transient
or the change of the state of the system. In implementing such a transient diagnosis
system in a nuclear power plant a large number (perhaps 20 to 200) of output variables
from the plant are sampled simultaneously, normalized to expected maximum values,
preprocessed if necessary, and transmitted to the input layer of a neural network. The
unique pattern among these 20 to 200 variables represents the condition of the plant at
that particular instant. When the system is operating at a steady state or changing slowly,
the pattern of variables at each sampling instant remains the same or changes slightly,
and the output of the neural network remains the same. However, at a time Δt after a
transient begins, the sampled values form a different pattern (i.e., the relationship
between the variables has changed and continues to change as the transient progresses).
When the sampled values are fed to a trained neural network, it gives an indication of the
state of the system. Successive sets of sampled inputs will indicate the same transient is
under way if the pattern is adequately developed. Indeed, there is a whole group of pat-
terns associated with each unique transient that must be included in the training set.
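A schematic version of this diagnosis loop is sketched below (the number of variables, expected maxima, training data and network size are all assumptions):

import numpy as np
from sklearn.neural_network import MLPClassifier

expected_max = np.random.uniform(1.0, 100.0, size=22)   # one expected peak per monitored variable

def preprocess(sample):
    # Normalize each sampled variable to its expected maximum value.
    return sample / expected_max

# Placeholder training set: rows of normalized samples, labels = transient class.
X_train = np.random.rand(300, 22)
y_train = np.random.randint(0, 7, size=300)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X_train, y_train)

def diagnose(raw_sample):
    pattern = preprocess(raw_sample)
    return clf.predict(pattern.reshape(1, -1))[0]        # index of the identified transient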
Neural networks trained on simulator transients. Work by Bartlett et al., (1992), has
demonstrated the validity of the concepts discussed above. The training simulator at
TVA's Watts Bar Nuclear Power Plant provided data for some 22 to 27
variables for seven different accident transients (loss of coolant in the hot leg of the
reactor coolant system (RCS), loss of coolant in the cold leg of the RCS, main steam-line
break in containment, main feedwater line break in containment, total loss of off-site
power, control rod ejection, and steam generator tube leak). Simultaneously sampled
values of these time records at equally spaced time intervals, constituted the input
vectors to the neural network for training. An "auto-adaptive" stochastic learning
technique was developed by Bartlett and Uhrig, (1992), and applied to a special neural
network with a dynamic node structure (e.g., the number of nodes in each of the three
hidden layers used was optimized). A new method for the stochastic optimization of
these interconnections using a Monte Carlo training procedure was developed to train
this network to identify these seven different nuclear power plant transients. This general
approach has been continued by Kim, Aljundi and Bartlett, (1992), using data from the
San Onofre Nuclear Power Training Simulator.
Guo and Uhrig, (1992), simplified the diagnostic neural network by using a modified
backpropagation technique to train a neural network with extensive lateral inhibition in
the middle layer. Twenty two inputs and seven outputs (one for each transient used in
the training) were used. The middle layer had sixteen PEs arranged in a four by four
array with negatively weighted connections in both directions between adjacent
(including diagonal) PEs and self-feedback on each node with positive weights. In all
cases, this neural network was able to detect the transient before the plant tripped, even
in the presence of 2% noise. For fast transients, the diagnosis was almost instantaneous.
This work was extended through the use of a sensitivity analysis to determine the most
important input variables for each transient. Then individual modular neural networks
with the five or six most important input variables and a single output were used to
detect each transient. These modular networks were much simpler, did not require lateral
feedback in the middle layer, and gave equally good, if not better, results. The problem
was that it was necessary to develop the complex neural network with lateral feedback in
order to utilize sensitivity analysis to identify the most important variables for each
transient. To overcome this problem, a genetic algorithm optimization was performed to
select the most important variables for each transient.
Using neural networks to identify abnormal events. Neural network techniques have
been applied by Ohga and Seki, (1991), to identify an abnormal event that caused a trip
in a BWR in Japan. A primary feature of the system was that the result of the neural
network analysis was confirmed using the knowledge base on plant status when each
event occurs. The neural network recognized the change patterns of the state variables
and output the event code corresponding to the abnormal event. The neural network had
three layers with 40, 4, and 3 nodes in the three layers. Five kinds of state variables were
used in the neural network. For each state variable, eight data values were acquired
before, at, and after a plant trip. Sampling times were different before and after the trip.
Data were normalized and sent to the neural network. The event identification method
was tested using a workstation. The test data were prepared, based on the simulated
results of a transient analysis program. Data were prepared for different plant
configurations (changing fuel burn-up, beginning or end of fuel cycle, and abnormal
progression speed of any variable, etc.). Test results showed that the neural network
could identify a trained event even when the plant conditions were different from those
used during training and when the data acquisition system contained noise.
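The input preparation described above can be sketched as follows (the five state-variable names and normalization ranges are our assumptions; only the 5 x 8 = 40 layout follows the text):

import numpy as np

def build_input(records, ranges):
    # records: dict of 5 state variables -> 8 sampled values taken before, at and
    # after the trip; ranges: variable -> (min, max) used for normalization.
    parts = []
    for name, samples in records.items():
        lo, hi = ranges[name]
        parts.append((np.asarray(samples) - lo) / (hi - lo))   # normalize each variable
    return np.concatenate(parts)                               # length 5 * 8 = 40

records = {v: np.random.rand(8) for v in
           ["reactor_power", "pressure", "water_level", "feedwater_flow", "steam_flow"]}
ranges = {v: (0.0, 1.0) for v in records}
x = build_input(records, ranges)       # vector fed to the 40-4-3 network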
Connectionist expert system. An expert system that has a neural network in its
knowledge bases is called a connectionist expert system. A backpropagation neural
network model was applied to a connectionist expert system for the identification of
transients in a nuclear power plant by Cheon et al. (1992). Connectionist expert systems
that incorporate neural networks into the diagnostic process yield great benefits in
terms of speed, robustness, and knowledge acquisition and demonstrate the feasibility of
connectionist expert system's applications to the identification of transients in nuclear
power plants. When a transient disturbance occurs, the sensor outputs or instrument
readings undergo a transient from the existing pattern to a different pattern that
represents a different state of the plant. The transient identification is approached from a
pattern-matching perspective in that an input pattern is constructed from symptoms, and
that symptom pattern is matched to an appropriate output pattern that corresponds to the
transient that occurred.
The connectionist expert system has significant advantages over the traditional rule-
based expert system. Results showed that once the network had been fully trained with
various patterns, it could identify the transient easily, even with incomplete or distorted
patterns. Furthermore, multiple transients were identified.
The connectionist expert system approach is most appropriate for classification problems
in environments where data are abundant and noisy, and where humans tend to generate
brittle and perhaps contradictory IF-THEN rules. Since connectionist expert systems are
very fast, they are well suited for real-time applications.
Hybrid (neural-fuzzy) system for transient identification. A unique hybrid system has
been developed by Tsoukalas, Ikonomopoulos, and Uhrig, (1991), for the identification
of transients in complex systems. It couples a rule-based expert system using fuzzy logic
to a pretrained artificial neural network and uses a model-reference approach to help in
the identification of noisy data. The expert system performs the basic interpretation and
processing of the model data. A set of pretrained artificial neural networks provides the
models of the transients. Membership functions (from fuzzy logic) that condense
information about a transient into a form convenient for a rule-based identification
system and characterize the transients are the outputs of the neural networks. After
training, the system is capable of performing faster than real time. To demonstrate the use
of this system, two classical transients, (a) a rupture of the main steam line and (b) a
rupture of a main feedwater line, were simulated on a computer. Three parameters,
pressurizer pressure, hot leg temperature and steam level indication were chosen for
differentiating these transients. Time series of these three variables during the transient
were the inputs to the neural network. The output is a membership function of a fuzzy
system. Tests showed that this system is capable of differentiating between the two
accidents, even when the three inputs are corrupted with up to 20% random noise.
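A schematic sketch of the hybrid scheme is given below (our assumptions throughout; the lambda placeholders stand in for the pretrained networks that produce the membership degrees):

import numpy as np

def membership_degrees(pressurizer_pressure, hot_leg_temp, steam_level, nets):
    # Concatenate the three time series and let each pretrained network return a
    # membership degree in [0, 1] for its transient.
    x = np.concatenate([pressurizer_pressure, hot_leg_temp, steam_level])
    return {name: float(net(x)) for name, net in nets.items()}

nets = {"main steam line rupture": lambda x: 0.9,        # placeholders for trained networks
        "main feedwater line rupture": lambda x: 0.2}
degrees = membership_degrees(np.zeros(10), np.zeros(10), np.zeros(10), nets)
identified = max(degrees, key=degrees.get)               # best-supported transient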
Control rod wear recognition. Wear of cladding on power control rods of nuclear
power plants causes rod clusters to be replaced prematurely. To mitigate wear by
repositioning the control rods, it is necessary to identify the location of each wear scar
and to measure its depth during inspections. Boshers et al. (1992) have described a
method that combines algorithmic, rule-based and neural network methods to perform
the inspection (currently performed by human operators), to identify the wear scars and
find certain quantitative properties of the scars. The functions of the neural network based
software are:
1. To identify and extract wear sections.
2. To determine wear information including wear position, peak wear depth, and
cross-sectional area loss, and
3. To organize this information into a data base.
Initial feature extraction and neural network processing for wear recognition had been
implemented. The neural network grouped all types of wear (single peak, double peak,
etc.) into one class for the initial recognition purpose. The prototype neural network
used eight features, initially extracted from the time records, as inputs, and the proper
values for these eight quantities as the desired outputs. A threshold was established
which, if exceeded, indicated that a specific fault feature had been detected. In preliminary
tests, output values exceeded 50% of the threshold value, but there were no
misclassifications.
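The thresholding step can be sketched trivially (the threshold value below is an assumption, not the one used in the reported work):

import numpy as np

THRESHOLD = 0.8

def detect_wear_features(network_outputs):
    # Compare each of the eight network outputs with the established threshold;
    # exceedance is reported as a detected wear feature.
    outputs = np.asarray(network_outputs)
    return np.where(outputs > THRESHOLD)[0]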
Expert systems are especially useful in situations where the knowledge of an expert is ex-
plicitly accessible. In other words, one must be capable of translating the knowledge into
a model which consists of data and rules and which describes sufficiently the behavior of
the real world. However, if an expert cannot explain how (s)he solves a certain problem
(when intuition is used), then an expert system to model this knowledge is not of much
help. This is not the case for neural networks: they are able to find by self-learning the
solving path of the problem. However, typical problems that exist when using neural
networks are:
(a) finding the right architecture and (b) the right pre-processing of the inputs.
Therefore it is recommended, before solving a complex problem using neural networks,
to look for appropriate literature on similar topics. In general though, neural nets can be
quite successful in certain areas and are, in particular, more flexible than other methods.
From this perspective one could say that in certain cases it would be appropriate to
partition the knowledge into "formalisable" (precise) knowledge and "unformalisable"
(vague) knowledge. The formalisable knowledge could be modeled by an expert system,
while the unformalisable knowledge could be handled by a neural network.
Figure 5.22 Global development cycle for integrating neural nets in Expert Systems.
A number of reasons for justifying the need for real-time expert systems are the
following:
• There exist too many inputs to be monitored effectively by humans.
• There exist too many complex relations between the inputs, which are essential to
make a decision.
• Decisions should be taken faster than a human could do.
• The system should run 24 hours a day, 7 days a week without loss of quality.
• Hard to find qualified personnel.
• Too many staff needed to run the system effectively.
Some real-world examples are:
• Monitoring of the Hubble Space Telescope (NASA) (6000 sensors).
• Network monitoring (Bank of Canada).
• Monitoring car manufacturing.
• Nuclear Power Plant Risk simulator.
• Satellite monitoring and control (ESA).
• Process monitoring and control of chemical processes.
• Traffic Control.
One of the tools that can be used for these kinds of applications is RTworks. RTworks is a
software development tool for real-time monitoring and control applications. RTworks
makes use of the client/server concept: it breaks its major tasks into three types of
processes: inference engine processes, data acquisition processes and human interface
processes (graphical interface). With a traditional expert system shell, the inference engine,
data acquisition, and user interface would be all grouped together into one large process,
potentially tying up resources, such as memory and CPU, and making it difficult to react
quickly to critical events. In this case, one could distribute processes over several
computers (e.g. LAN).
The inference engine of a real-time expert system should be adapted to cope with real-
time applications. In addition to the inference strategies of forward chaining and backward
chaining, the inference engine should also offer time-driven rules. For instance a rule
could run periodically (every 10 seconds). Another needed feature is the use of
temporal reasoning in rules (e.g. if during the last 5 minutes the power has been
decreased for 15 seconds, then ... ). The speed of the inference engine should be fast
enough to be used by the real-time application. Typically, the inference engine of
RTworks processes about 12,000 rules per second.
What does real-time mean? There exist many definitions of real-time. It is commonly
assumed to mean "fast", in the sense that a system is considered real-time if it processes
data quickly. A better definition states that "the system responds to incoming data at a
rate faster than it arrives". An overview of typical response times is shown in Table 5.5.
It shows that real-time does not have to mean very fast; many traditional business
applications have "real-time" elements to them.
The integration of RTworks (or any other relevant software) and NeuralWorks can be
accomplished as follows: first the expert system performing the monitoring task is built.
Rules must be defined that will detect an error during monitoring.
Next a neural network is trained for the fault diagnosis task, which should be able to give
an analysis of the error (cause, impact etc.). To accomplish this, the required inputs must
be determined first and, second, relevant example data must be gathered. After
successful training one can integrate the net in RTworks (or any other relevant software)
by translating it into C-code. The next steps are as follows:
1. Translate the neural net into C-code.
2. Link the compiled C-code to the inference engine as a new user-defined function
(rtlinkie nn.c).
3. Build rules that activate the diagnosing subsystem (function).
Some typical example rules are:
IF "deviated behavior"                                  && Error found?
THEN error = TRUE;                                      && yes
IF error                                                && when error is detected
THEN cause = NEURALNETDIAG(input1, input2, ...);
     Info("_hci", "Cause of the problem is:", cause);   && send a message to the operator
In RTworks it would furthermore be possible to divide the incoming data into a monitoring
and a diagnosis data group. On startup only monitoring data will be received. In case
there is an error, only data relevant to analysing the problem will be received. In this way
one can reduce the data transfer.
References
Antsaklis P.J. (1992). Neural networks for the intelligent control of high autonomy sys-
tems. Intelligent Systems Technical Report 92-9-1, Department of Electrical
Engineering, University of Notre Dame.
Barschdorff D., Monostori L. and T. Kottenstede (1993). Wear estimation and state
classification of cutting tools in turning via artificial neural networks. Proceedings,
International Conference on Fault Diagnosis TOOLDIAG '93, Toulouse, France, April
5-7,669-677.
Baba N. (1989). A new approach for finding the global minimum of error function of
neural networks. Neural networks, 2.
Barschdorff D., Monostori L., A.F. Ndenge and G.W. Wöstenkühler (1991).
Multiprocessor systems for connectionist diagnosis of technical processes. Computers in
Industry, 17, 131-145.
Bartlett E. and R.E. Uhrig (1992). Nuclear Power Plant Status Diagnostics Using An
Artificial Neural Network. Nuclear Technology, 97.
Boshers J.A., Saylor C., Kamadolli S., Wood R. and C. Isik (1992). Control Rod Wear
Recognition Using Neural Nets. In D.J. Sobajic (Editor), Proceedings of the 1992
Summer Workshop on "Neural Networks Computing for the Electric Power Industry",
Stanford, CA, August 17-19.
Carpenter G.A. and S. Grossberg (1987). A massively parallel architecture for a self-
organizing neural pattern recognition machine. Computer Vision, Graphics and Image
Processing, 37, 54-115.
Carpenter G.A. and S. Grossberg (1987). ART2: Self-organization of stable category
recognition codes for analog input patterns. Applied Optics, 26, 3, 4919-4930.
Chang K. and W.G. Wee (1988). A Knowledge-based planning system for mechanical
assembly using robots. IEEE Expert, 18-30.
Cheon S.W., Kang G.S. and S.H. Chang (1992). Application of Neural Networks to
Connectionist Expert System for Identification of Transients in Nuclear Power Plants.
Proceedings, 2nd International Forum. Expert Systems and Computer Simulation in
Energy Engineering, Erlangen, Germany, March 17-20, pp 22-1-1 to 22-1-5.
Chen S., Cowan C.F.N. and P.M. Grant (1991). Orthogonal least squares learning al-
gorithm for radial basis function networks. IEEE Transactions on Neural Networks, 2,
2, 302-309.
Cho S.B. and J.H. Kim (1993). Rapid back-propagation learning algorithms. Circuits,
Systems and Signal Processes, 12, 2.
Chow M.-Y., Mangum P.M. and S.O. Yee (1991). A neural network approach to real-
time condition monitoring of induction motors. IEEE Transactions on Industrial
Electronics, 38, 6, 448-453.
Cybenko G. (1989). Approximation by superpositions of a sigmoidal function.
Mathematics of Control, Signals and Systems, 2, 303-314.
Davis R. (1984). Diagnostic reasoning based on structure and behavior. Artificial
Intelligence, 24, 347-410.
Doremus R. (1992). SAMSOM: Severe Accident Management System On-Line
Network. In D.J. Sobajic (Editor), Proceedings of the 1992 Summer Workshop on
"Neural Network Computing for the Electric Power Industry", Stanford, CA, August 17-
19.
Feldman J.A. and D.H. Ballard (1982). Connectionist models and their properties.
Cognitive Science, 6, 205-254.
Feng X., Zhang Y. and Q. Chen (1993). Fault simulation of a variable thrust liquid
rocket engine based on neural networks. Proceedings, International Conference on
Fault Diagnosis TOOLDIAG '93, Toulouse, France, April 5-7, 787-791.
Gien D. et al. (1993). A neuro-fuzzy approach for real-time diagnosis on flexible manu-
facturing cells. Proceedings, Sixth International Conference on Neural networks and
other industrial and cognitive applications, Nimes, France, October 25-29.
Guo Y. and K.J. Dooley (1992). Identification of change structure in statistical process
control. International Journal of Production Research, 30, 7, 1655-1669.
Guo Z. and R.E. Uhrig (1992). Using Modular Neural Networks to Monitor
Accident Conditions in Nuclear Power Plants. Proceedings of the SPIE Technical
Symposium on Intelligent Information Systems, Application of Artificial Neural
Networks III, Orlando, FL, April 20-24.
Guo Z. and R.E. Uhrig (1992). Use of Artificial Neural Networks to Analyze Nuclear
Power Plant Performance. Nuclear Technology, 99.
Himmelblau D.M. (1978). Fault detection in chemical and petrochemical processes.
Elsevier Publishers, Amsterdam.
Hopfield J.J. (1982). Neural networks and physical systems with emergent computa-
tional abilities. Proceedings of the National Academy of Sciences (Biophysics), 79,
2554-2558.
Hoskins J.C. and D.M. Himmelblau (1988). Artificial neural network models of knowl-
edge representation in chemical engineering. Comput. Chem. Eng., 12, 881-890.
Ikonomopoulos A., Tsoukalas L.H., Mullens J.A. and R.E. Uhrig (1992). Monitoring
nuclear reactor systems using neural network and fuzzy logic. Proceedings, 1992
Topical Meeting in Advances in Reactor Physics, March 8-11, Charleston, U.S.A.
Ikonomopoulos A., Tsoukalas L.H. and R.E. Uhrig (1992). Use of Neural Networks to
Monitor Power Plant Components. Proceedings of the American Power Conference,
Chicago, IL., April 13-15, 1992
Kim K., Aljundi T.L. and E. Bartlett (1992). Confirmation of Artificial Neural Networks:
Nuclear Power Plant Fault Diagnostics. Transactions of the American Nuclear Society,
66, Chicago, IL, November 15-20.
Kim H.K., Lee S.H. and S.H. Chang (1992). Neural Network Model for On-Line
Thermal Margin Estimation of a Nuclear Power Plant. Proceedings of the Second
International Forum, Expert Systems and Computer Simulation in Energy Engineering,
Erlangen, Germany, March 17-20, pp 7-2-1 to 7-2-6.
Kohonen T. (1984). Self-organisation and associative memory. Springer-Verlag, Berlin.
Konstantopoulos I.K. and P.J. Antsaklis (1993). The four parameter controller: A neural
network implementation. Proceedings, IEEE Mediterranean Symposium on New
Directions in Control Theory and Applications, Chania, Greece, June 21-23.
Kosko B. (1990). Unsupervised learning in noise. IEEE Transactions on Neural
Networks, 1, 1, 44-57.
Kosmatopoulos E.B., Ioannou P.A. and M.A. Christodoulou (1992). Identification of
Nonlinear Systems Using New Dynamic Neural Network Structures. Proceedings, 31st
IEEE Conference on Decision and Control, Tucson, Arizona, USA, December 16-18.
Kosmatopoulos E.B., Christodoulou M.A. and P.A. Ioannou (1993). Learning laws
that ensure exponential error convergence. Proceedings, 32nd IEEE Conference on
Decision and Control, San Antonio, Texas, USA, December 15-17.
Kusiak A. and G. Finke (1988). Selection of process plans in automated manufacturing
systems. IEEE Transactions of Robotics and Automation, 4, 4.
Lippmann R.P. (1987). An introduction to computing with neural nets. IEEE ASSP
Magazine, 4, 4-22.
Lo A. and B. Bavarian (1991). Scheduling with neural networks for flexible manufactur-
ing systems. Proceedings, IEEE International Conference on Robotics and Automation,
818-823.
Mirchandani G. and W. Cao (1989). On hidden nodes for neural nets. IEEE
Transactions on Circuits and Systems, 36, 661-664.
Moon Y.B. (1990). Forming part-machine families for cellular manufacturing: A neural-
network approach. International Journal of Advanced Manufacturing Technology, 5,
278-291.
Matsumoto K. and T. Sakaguchi (1983). Methods to determine the restoration plan of
power system by a knowledge based system. Transactions, IEE of Japan, 103 B, 3.
de Mello L.S.H. and A.C. Sanderson (1990). AND/OR Graph representation of as-
sembly plans. IEEE Transactions on Robotics and Automation, 6, 2, 188-199.
Miguel L.J., Baeyens E. and J.L. Coronado (1993). Application of an ART-3 based neu-
ral network to fault diagnosis in dynamic systems. Proceedings, International
Conference on Fault Diagnosis TOOLDIAG '93, Toulouse, France, April 5-7, 713-717.
Minsky M. and Papert S. (1969). Perceptrons - An introduction to computational ge-
ometry. MIT Press, Cambridge, Mass.
Naidu R.S., Zafiriou E. and T.J. McAvoy (1990). Use of neural networks for sensor
failure detection in a control system. IEEE Control Systems Magazine, 10, 49-55.
Nett C.N., Jacobson C.A. and A.T. Miller (1988). An integrated approach to controls
and diagnostics: the 4-parameter controller. Proceedings, 1988 American Control
Conference, 824-835.
Ogi H., Tanaka H. and Y. Akimoto (1991). Module neural network application for
power system/equipment diagnosis. Proceedings of ESAP '91, April 1991.
Ohga Y. and H. Seki (1991). Using a Neural Network for Abnormal Event Identification
in BWRs. Transactions of the American Nuclear Society, 63, 110-111.
Pao Y. (1989). Adaptive pattern recognition and neural networks. Addison Wesley,
N.Y.
Passino K.M., Sartori M.A. and P.J. Antsaklis (1989). Neural computing for numeric-
to-symbolic conversion in control systems. IEEE Control Systems Magazine, 9, 44-52.
Peng Y. and J.A. Reggia (1989). A connectionist model for diagnostic problem solving.
IEEE Transactions on Systems, Man and Cybernetics, 19, 2, 285-298.
Polycarpou M.M. and P.A. Ioannou (1992). Neural Networks as On-Line
Approximators of Nonlinear Systems. Proceedings, 31st IEEE Conference on Decision
and Control, Tucson, Arizona, USA, December 16-18.
Rauch H.E., Kline-Schoder R.J., Adams J.C. and H.M. Youssef (1993). Fault detection,
isolation and reconfiguration for aircraft using neural networks. Proceedings, AIAA
Conference on Guidance, Navigation and Control, August '93.
Rauch H.E. and D.B. Schaechter (1992). Neural networks for control, identification and
diagnosis. Proceedings, World Space Congress, Washington, D.C., August 28-
September 5.
Ray A.K. (1991). Equipment fault diagnosis - A neural network approach. Computers
in Industry, 16, 169-177.
Reed N.E. et al. (1988). Specialized Strategies: An alternative to first principles in diag-
nostic problem solving. AAAI, 364-368.
Roh M.S., Cheon S.W., Kim H.G. and S.H. Chang (1992). Prediction of Nuclear
Reactor Parameters using Artificial Neural Network Models. Proceedings of the 2nd
International Forum on "Expert Systems and Computer Simulation in Energy
Engineering", Erlangen, Germany, March 17-20.
Rumelhart D.E. and J.L. McClelland (1986). Parallel Distributed Processing -
Explorations in the microstructure of cognition, Volume 1: Foundations. MIT Press,
Cambridge, Mass.
Rumelhart D.E. and J.L. McClelland (1986). Parallel Distributed Processing -
Explorations in the microstructure of cognition, Volume 2: Psychological and biological
models. MIT Press, Cambridge, Mass.
Schutte P. et al. (1987). An evaluation of a real-time fault diagnosis expert system for
aircraft applications. Proceedings, 26th IEEE Conference on Decision and Control.
Sorsa T. and H.N. Koivo (1991). Applications of artificial neural networks in process
fault diagnosis. Proceedings, IFAC Fault Detection, Supervision and Safety for
Technical Processes, Baden-Baden, Germany, September 10-13.
Sorsa T., Koivo H.N. and H. Koivisto (1991). Neural networks in process fault diagno-
sis. IEEE Transactions on Systems, Man and Cybernetics, 21, 4, 815-825
Sorsa T., Suontausta J. and H.N. Koivo (1993). Dynamic fault diagnosis using radial
basis function networks. Proceedings, International Conference on Fault Diagnosis
TOOLDIAG '93, Toulouse, France, April 5-7, 160-169.
Stavrakakis G.S. and E.N. Dialynas (1991). Efficient computer-based scheme for im-
proving the reliability performance of power substations. International Journal of
Systems Science, 22, 9, 1527-1539.
Suna R. and K. Berns (1993). Pipeline diagnosis using backpropagation networks.
Proceedings of the Sixth International Conference on Neural networks and other indus-
trial and cognitive applications, Nimes, France, October 25-29.
Syed A., El-Maraghy H.A. and N. Chagneux (1993). Application of Kohonen maps in real-time
monitoring and diagnosing of robotic assembly. Proceedings, International Conference
on Fault Diagnosis TOOLDIAG '93, Toulouse, France, April 5-7, 780-786.
Tinghu Y., Binglin Z. and H. Ren (1993). A neural network methodology for rotating
machinery fault diagnosis. Proceedings, International Conference on Fault Diagnosis
TOOLDIAG '93, Toulouse, France, April 5-7, 170-178.
Tsoukalas L.H., Ikonomopoulos A. and R.E. Uhrig (1991). Hybrid Expert System-
Neural Network Methodology for Transient Identification. Proceedings of the American
Power Conference, Chicago, IL, April 29-May 1.
Wake T. and T. Sakaguchi (1984). Method to determine the fault components of power
system based on description of structure and function of relay system. Transactions,
IEE of Japan, 101 B, 10.
Wasserman P.D. (1989). Neural Computing: Theory and Practice. Van Nostrand
Reinhold, N.Y.
Widrow B. and M.A. Lehr (1990). 30 years of adaptive neural networks: Perceptron,
Madaline and Back-propagation. Proceedings of the IEEE, 78, 9.
Upadhyaya B. R. and E. Eryurek (1992). Application of Neural Networks for Sensor
Validation and Plant Monitoring. Nuclear Technology, 97, p 170.
Vaidyanathan R. and V. Venkatasubramanian (1990). Process fault detection and diag-
nosis using neural networks: Dynamic processes. Proceedings, AIChE Annual National
Meeting, Chicago, U.S.A.
Venkatasubramanian V. and K. Chan (1989). A neural network methodology for proc-
ess fault diagnosis. AIChE Journal, 35, 1993-2002.
Watanabe K., Matsuura I., Abe M., Kubota M. and D.M. Himmelblau (1989). Incipient
fault diagnosis of chemical processes via artificial neural networks. AIChE Journal, 35,
1803-1812.
Widrow B. and M.E. Hoff (1960). Adaptive switching circuits. 1960 IRE WESCON
Conv. Record, Part 4, August 1960,96-104.
Widrow B. and R. Winter (1988). Neural nets for adaptive filtering and adaptive pattern
recognition. IEEE Computer Magazine, March, 25-39.
Yamashina H., Kumamoto H., Okumura S. and T. Ikesaki (1990). Failure diagnosis of a
servovalve by neural networks with new learning algorithm and structure analysis.
International Journal of Production Research, 28, 6, 1009-1021.
Zeilingold D. and J. Hoey (1990). Model for a space shuttle safing and failure detection
expert system. Proceedings, 5th Conference on Artificial Intelligence for Space
Applications.
CHAPTER 6
IN-TIME FAILURE PROGNOSIS AND FATIGUE LIFE PREDICTION OF STRUCTURES
6.1 Introduction
The in-time failure prognosis and safety assessment of today's high risk industrial
structures implies the accurate estimation of the residual lifetime of the structure in the
course of its service. Reduction of the operation cost, estimation of the structural aging,
structure life extension, prevention of catastrophic accidents, environmental protection,
are some of the aspects that have to be considered in the management of complex high-
risk industrial systems such as nuclear power plants, chemical plants, off-shore structures,
marine structures, gas (LNG, LPG) installations etc.
To achieve a realistic safety assessment, the capability of modeling correctly the
uncertainties, of updating the estimates on the basis of any new data available and of
using field expert knowledge and heuristics is required.
Thus, data have to be collected on material properties, defect
distribution (position and size), degradation mechanisms affecting the structure, records
and forecasts about loads and environment, and assumptions about the states which are
considered dangerous for the component.
A whole series of inspection instruments and techniques have evolved over the years and
new methods are still being developed to assist in the process of assessing the integrity
and reliability of parts and assemblies. Non-destructive testing (NDT) evaluation
methods are widely used in industry for checking the quality of production, and also as
part of routine inspection and maintenance in service.
Because of the obvious importance of the subject, and the fact that most of the
inspection methods are based on well-established scientific principles, there is a great
number of publications suitable for use in the engineering practice (see periodicals as
"NDT International", "Materials evaluation", etc.). In the present chapter the concept of
the in-time failure prognosis and realistic safety assessment of, mainly, metallic structures
will be first defined and the recent NDT methods with some of their representative
applications will be presented. The concept of inspection will also be clarified. Analytical
modeling and expert knowledge modeling approaches will be described for damage
mechanism analysis and in-time failure prediction in structures with the ability to use
fresh data and information continuously or periodically coming from the component or
the structure during operation for modifying and improving the prediction.
Application examples from the nuclear, marine, mechanical and manufacturing sectors
will be presented to illustrate these ideas.
6.2.1 Introduction
Figure 6.1 Origins of some defects found in materials and components (diagram not reproduced).
The basic principles and major features of the main non-destructive testing (NDT)
systems are given in Table 6.1.
Table 6.1
Electrical methods (eddy currents, acoustic emission): detection of surface defects and
some sub-surface defects; can also be used to measure the thickness of a non-conductive
coating, such as paint, on a metal. Can be used for any metal.
The various non-destructive test methods can be used, in practice, in many different ways
and the range of equipment available is extensive. Compact and portable equipment is
available which can be used, both inside a test house or out on site, or the basic test
principle can be incorporated in some large inspection system dedicated to the
examination of large quantities of a single product or a small range of products or the
components of a structure.
This applies to all the test methods described in this Chapter. When non-destructive
testing systems are used, care must be taken and the processes controlled so that not
only qualitative but quantitative information is received and that this information is both
accurate and useful. If non-destructive testing is mis-applied it can lead to serious errors
of judgment of component quality.
The main non-destructive testing systems are briefly described in the next section.
Liquid penetrant inspection is a technique which can be used to detect defects in a wide
range of components, provided that the defect breaks the surface of the material. The
principle of the technique is that a liquid is drawn by capillary attraction into the defect
and, after subsequent development, any surface-breaking defects may be rendered visible
to the human eye. In order to achieve good defect visibility, the penetrating liquid will
either be coloured with a bright and persistent dye or else contain a fluorescent
compound. In the former type the dye is generally red and the developed surface can be
viewed in natural or artificial light, but in the latter case the component must be viewed
under ultra-violet light if indications of defects are to be seen. There are five essential
steps in the penetrant inspection method. These are:
Surface preparation. All surfaces of a component must be thoroughly cleaned and
completely dried before it is subjected to inspection. It is important that any surfaces to
be examined for defects must be free from oil, water, grease or other contaminants if
successful indication of defects is to be achieved.
Application of penetrant. After surface preparation, liquid penetrant is applied in a
suitable manner, so as to form a film of penetrant over the component surface. The liquid
film should remain on the surface for a period sufficient to allow for full penetration into
surface defects.
Removal of excess penetrant. It is usually necessary to remove excess penetrant from
the surface of the component. Some penetrants can be washed off the surface with water,
while others require the use of specific solvents. Uniform removal of excess penetrant is
necessary for effective inspection.
Development. The development stage is necessary to reveal clearly the presence of any
defect. The developer is usually a very fine chalk powder. This may be applied dry,
but more commonly is applied by spraying the surface with chalk dust suspended in a
volatile carrier fluid. A thin uniform layer of chalk is deposited on the surface of the
component. Penetrant liquid present within defects will be slowly drawn by capillary
action into the pores of the chalk. There will be some spread of penetrant within the
developer and this will magnify the apparent width of a defect. When a dye penetrant is
used the dye colour must be in sharp contrast to the uniform white of the chalk-covered
surface. The development stage may sometimes be omitted when a fluorescent penetrant
is used.
Observation and inspection. After an optimum developing time has been allowed, the
component surface is inspected for indications of penetrant "bleed back" into the
developer. Dye-penetrant inspection is carried out in strong lighting conditions, while
fluorescent-penetrant inspection is performed in a suitable screened area using ultra-
violet light. The latter technique causes the penetrant to emit visible light, and defects are
brilliantly outlined.
The liquid penetrant process is comparatively simple as no electronic systems are
involved, and the equipment necessary is cheaper than that required for other non-
destructive testing systems. The establishment of procedures, and inspection standards
for specific product parts, is usually less difficult than for more sophisticated methods.
The technique can be employed for any material except porous materials, and, in certain
cases, its sensitivity is greater than that of magnetic particle inspection. Penetrant
inspection is suitable for components of virtually any size or shape and is used for both
the quality control inspection of semi-finished and finished production items and for
routine in-service inspection of components.
The system is used in the aerospace industries by both producers for the quality control
of production and by users during regular maintenance and safety checks. Typical
components which are checked by this system are turbine rotor discs and blades, aircraft
wheels, castings, forged components and welded assemblies. Many automotive parts,
particularly aluminum castings and forgings, including pistons and cylinder heads, are
subjected to this form of quality control inspection before assembly. Penetrant testing is
also used for the regular in-service examination of the bogie frames of railway
locomotives and rolling stock in the search for fatigue cracking.
Magnetic particle inspection is a sensitive method of locating surface and some sub-
surface defects in ferro-magnetic components. The basic processing parameters depend
on relatively simple concepts. In essence, when a ferro-magnetic component is
magnetised, magnetic discontinuities that lie in a direction approximately perpendicular
to the field direction will result in the formation of a strong "leakage field". This leakage
field is present at and above the surface of the magnetised component, and its presence
can be visibly detected by the utilization of finely divided magnetic particles. The
magnetic particles which are used for inspection may be made from any ferro-magnetic
material of low remanence and they are usually finely divided powders of either metal
oxides or metals. The particles are classified as dry or wet according to the manner in
which they are carried to a component. Dry particles are carried in air or gas suspension
while wet particles are carried in liquid suspension. The application of dry particles or
wet particles in a liquid carrier, over the surface of the component, results in a collection
of magnetic particles at a discontinuity. The "magnetic bridge" so formed indicates the
location, size, and shape of the discontinuity.
While the search coil is generally protected by a plastic casing, the end of the ferrite core
often projects beyond the plastic case. Eddy current test probes do not need any coupling
fluid between them and the testpiece, unlike ultrasonic probes, because they are coupled
to the material by a magnetic field, and consequently little if any surface preparation is
necessary prior to inspection. Many types of inspection probes have been designed but
generally they can be divided into surface probes and hole probes.
It is necessary to calibrate eddy current test equipment, and reference testpieces for
calibration purposes should be made from material of similar type and quality to that
which is to be tested so as to have the same conductivity value. A test-block should
contain a series of defects of known size and shape and these are frequently made by
making several fine saw-cuts of varying but known depth.
In many situations users will also use defective parts containing, for example, fatigue
cracks as reference and calibration pieces.
One method of representing the signals from eddy current inspection probes is by the
phasor technique or phase analysis. When it is only necessary to detect changes in one of
the parameters which affect impedance and all other factors are constant, then the
measurement of a change in impedance value will reflect a change in that parameter.
However, there are many instances where it becomes necessary to separate the responses
from more than one parameter, and to separate the reactive and resistive components of
impedance. This requires the use of more sophisticated instruments but in this way it
becomes possible to identify the type of defect present and not merely its position.
There is a phase difference between the reactive and resistive components of the
measurement voltage. Consider the voltage as vectors A and B. The frequency is the
same for both and, therefore, the radian velocity ω will be the same for both (ω = 2πf).
The resistive and reactive components of a measurement (probe coil) voltage can be fed
to the "X" plates and "Y" plates respectively of a cathode ray oscilloscope and displayed
as a two-dimensional representation.
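A small sketch of this decomposition follows (the test frequency and voltage values are invented for illustration only):

import numpy as np

f = 100e3                               # assumed test frequency in Hz
omega = 2 * np.pi * f                   # radian velocity, omega = 2*pi*f

def components(voltage_amplitude, phase_rad):
    # Split the probe-coil voltage into its resistive (in-phase) and reactive
    # (quadrature) components, which would drive the "X" and "Y" plates.
    v = voltage_amplitude * np.exp(1j * phase_rad)
    return v.real, v.imag

x_plate, y_plate = components(1.0, np.deg2rad(35))   # one point of the impedance-plane display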
The impedance changes caused by various types of defect or by changes in conductivity
will give screen displays as shown in figs. 6.3b and c.
The eddy current system is a highly versatile system and can be used to detect not only
cracks but several other conditions, including corrosion. Corrosion of hidden surfaces as,
for example, within aircraft structures, can be detected using phase-sensitive equipment.
It is a comparative technique in that readings made in a suspect area are compared with
instrument readings obtained from sound, non-corroded material (Hagemaier et al.
1985).
An eddy current test system can also be used for the routine inspection of aircraft
undercarriage wheels. The wheel is placed on a turntable and the probe coil which is
mounted at the end of an adjustable arm, is positioned near the bottom of the wheel. As
the wheel turns on the turntable so the probe arm moves slowly up the wheel, giving a
close helical search pattern. It is necessary to use a second probe to check under the
wheel flange. A hand-held probe is used for this part of the inspection.
Figure 6.3
(a) Vector point.
(b) Impedance plane display on oscilloscope, showing differing conductivities.
(c) Impedance plane display, showing defect indications.
The ability of eddy current techniques to determine the conductivity of a material has
been utilized for the purpose of checking areas of heat-damaged skin on aircraft
structures. If the type of aluminum alloy used in aircraft construction becomes over-
heated it could suffer a serious loss of strength. This is accompanied by an increase in the
electrical conductivity of the alloy. The conductivity of sound material is generally within
the range of 31 to 35 per cent IACS. Defective or heat-damaged material would show a
conductivity in excess of 35 per cent IACS.
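This acceptance band can be expressed as a trivial check (the band is taken from the text; the treatment of low readings is our assumption):

def classify_conductivity(iacs_percent):
    # Sound aluminum alloy skin reads roughly 31-35 %IACS; higher readings
    # suggest heat damage.
    if 31.0 <= iacs_percent <= 35.0:
        return "sound material"
    if iacs_percent > 35.0:
        return "possible heat damage"
    return "outside expected range - recheck calibration"

print(classify_conductivity(37.2))   # -> possible heat damage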
Ultrasonic techniques are very widely used for the detection of internal defects in
materials, but they can also be used for the detection of small surface cracks. Ultrasonics
are used for the quality control inspection of part processed material, such as rolled slabs,
as well as for the inspection of finished components. The techniques are also in regular
use for the in-service testing of parts and assemblies.
Sound waves are elastic waves which can be transmitted through both fluid and solid
media. The audible range of frequency is from about 20 Hz to about 20 kHz but it is
possible to produce elastic waves of the same nature as sound at frequencies up to 500
MHz. Elastic waves with frequencies higher than the audio range are described as
ultrasonic. The waves used for the non-destructive inspection of materials are usually
within the frequency range 0.5 MHz to 20 MHz.
Piezo-electric materials form the basis of electro-mechanical transducers for ultrasonic
NDT. The original piezo-electric material used was natural quartz. Quartz is still used to
some extent but other materials, including barium titanate, lead metaniobate and lead
zirconate, are used widely. When an alternating voltage is applied across the thickness of
a disc of piezo-electric material, the disc will contract and expand, and in so doing will
generate a compression wave normal to the disc in the surrounding medium. When
quartz is used the disc is cut in a particular direction from a natural crystal but the
transducer discs made from ceramic materials such as barium titanate are composed of
many crystals fused together, the crystals being permanently polarised to vibrate in one
plane only.
Wave generation is most efficient when the transducer crystal vibrates at its natural
frequency, and this is determined by the dimensions and elastic constants of the material
used. Hence, a 10 MHz crystal will be thinner than a 5 MHz crystal. A transducer for
sound generation will also detect sound. An ultrasonic wave incident on a crystal will
cause it to vibrate, producing an alternating current across the crystal faces. In some
ultrasonic testing techniques two transducers are used - one to transmit the beam and the
other acting as the receiver - but in very many cases only one transducer is necessary.
This acts as both transmitter and receiver. Ultrasound is transmitted as a series of pulses
of extremely short duration and during the time interval between transmissions the crystal
can detect reflected signals.
The presence of a defect within a material may be found using ultrasonics with either a
transmission technique or a reflection technique.
Normal probe transmission method. In this method a transmitter probe is placed in
contact with the testpiece surface, using a liquid coupler, and a receiving probe is placed
on the opposite side of the material (see fig. 6.4).
If there is no defect within the material, a certain strength of signal will reach the
receiver. If a defect is present between the transmitter and receiver, there will be a
reduction in the strength of the received signal because of partial reflection of the pulse
by the defect. Thus, the presence of a defect can be inferred.
Figure 6.4 Normal probe transmission method, showing the transmitter probe, a defect within the testpiece and the receiver probe.
The reflection method has certain advantages over the transmission method. These are:
(a) The specimen may be of any shape.
(b) Access to only one side of the testpiece is required.
(c) Only one coupling point exists, thus minimizing error.
(d) The distance of the defects from the probe can be measured.
The information obtained during an ultrasonic test can be displayed in several ways.
''A'' scan display. The most commonly used system is the "A" scan display (see fig. 6.8).
A blip appears on the CRT screen at the left-hand side, corresponding to the initial pulse,
and further blips appear on the time base, corresponding to any signal echoes received.
The height of the echo is generally proportional to the size of the reflecting surface but it
is affected by the distance travelled by the signal and attenuation effects within the
material. The linear position of the echo is proportional to the distance of the reflecting
surface from the probe, assuming a linear time base. This is the normal type of display for
hand probe inspection techniques.
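Because the linear position of an echo is proportional to the distance of the reflecting surface (for a linear time base), the depth of a reflector can be estimated from the pulse-echo transit time and the sound velocity in the material. The sketch below illustrates this relation; the longitudinal velocity of roughly 5900 m/s for steel and the example transit time are assumed, illustrative values.

    # Estimate reflector depth from an "A" scan echo: depth = v * t / 2,
    # the factor 2 accounting for the out-and-back travel of the pulse.
    def echo_depth_mm(transit_time_us, velocity_m_per_s=5900.0):
        """Depth in mm from pulse-echo transit time in microseconds."""
        return velocity_m_per_s * (transit_time_us * 1e-6) / 2.0 * 1000.0

    # A backwall echo at 20.3 us in steel corresponds to roughly 60 mm depth.
    print(round(echo_depth_mm(20.3), 1))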
Figure 6.8 "A" scan display.
(a) reflections obtained from defect and backwall.
(b) representation of "A" scan screen display.
A disadvantage of the "A" scan is that there may be no permanent record, unless a
photograph is taken of the screen image, although more sophisticated modern equipment
has the facility for digital recording.
"B" scan display. The "B" sean enables a reeord to be made of the position of defeets
within a material. The system is ilIustrated in fig. 6.9. There needs to be co-ordination
between the probe position and the traee, and the use of "B" sean is eonfined to
automatie and semi-automatie testing teehniques. With the probe in position "1" the
indication on the screen is as shown in fig. 6.9, with (i) representing the initial signal and
(ii) representing the backwall. When the probe is moved to position "2", line (iii) on the
display represents the defeet. This representation of the testpieee eross-seetion may be
recorded on a paper chart, photographed, or viewed on a long-persistenee screen.
Figure 6.9 "B" scan display, showing the initial signal (i), backwall (ii) and defect (iii) traces for probe positions "1" and "2".
"e" scan display. While the "B" sean gives a representation of a side elevation of the
testpieee, another method, termed "C" sean ean be used to produee a plan view. Again,
the "C" scan display is eonfined to automatie testing (Nielsen (1981), Yanagi, (1983».
Identijication 0/ de/ects.
By means of uItrasonic methods not only ean the exact position of internal defeets be
determined but it is also possible, in many eases, to distinguish the type of defeet. In the
following, the various types of signal response received from particular types of defect
will be eonsidered.
(a) Defect at right angles to the beam direction. When no defect is present, a large echo
signal should be received from the backwall. The presence of a small defect should give a
small defect echo and some reduction in the strength of the backwall echo. When the
defect size is greater than the probe diameter the defect echo will be large and the
backwall echo may be lost (fig. 6.10), depending on the depth of the defect in relation to
beam spread in the far zone.
(b) Defects other than plane defects. Areas of micro-porosity will cause a general
scattering of the beam, giving some "grass" on the CRT trace and with loss of the
backwall echo (fig. 6.11a). A large spherical or elliptical inclusion or hole would tend to
give a small defect echo coupled with a small backwall echo (fig. 6.11b), while a plain
trace showing no echo at all could be an indication of a plane defect at some angle other
than normal to the path of the beam (fig. 6.11c).
Figure 6.11 (a) Micro-porosity, (b) Elliptical defect, (c) Angled defect.
(c) Laminations in thick plate. The plate should be completely scanned in a methodical
manner, as shown in fig. 6.12. The indications of laminations are a closer spacing of
echoes and a more rapid fall-off in the size of the echo signals. Either or both of these
indications are signs of lamination (see fig. 6.13).
(d) Lamination in thin plate.
A thin plate may be considered to be a plate of thickness less than the dead zone of the
probe. A sound plate will show a regular series of echoes with exponential fall-off of
amplitude. A laminated region will show a close spacing with a much faster rate of
amplitude fall-off. The pattern may change from an even to an irregular outline. It is this
pattern change which, in many cases, gives the best indication of lamination in thin plate
(fig. 6.14).
Figure 6.13 Indication of lamination in thick plate: (a) good plate; (b) laminated plate.
(e) Weld defects. Ultrasonic testing using angle probes in either the reflection or
transmission mode is a reliable method for the detection of defects in butt welds and for
determining their exact location. It is, however, fairly difficult to determine with certainty
the exact nature of the defect, and much depends upon the skill and experience of the
operator. If, following ultrasonic inspection, there is any doubt in the mind of the
operator about the quality of a weld, then it would be wise to check the
suspect area radiographically.
Figure 6.14 Indication of lamination in thin plate: (a) good plate; (b) laminated plate.
Figure 6.15 Detection of radial defects in: (a) tubes; (b) solid bar; a normal probe in position A will not show the defect but an angle probe at B will.
(f) Radial defects in cylindrical tubes and shafts. A radial defect in a cylindrical member
is not generally detectable using normal probe inspection, as the defect will be parallel to
the ultrasonic beam. In these circumstances the use of an angle probe reflection
technique will clearly show the presence of defects (fig. 6.15).
As has been seen in the foregoing paragraphs, ultrasonic test methods are suitable for the
detection, identification and size assessment of a wide variety of both surface and sub-
surface defects in metallic materials, provided that there is, for reflection techniques,
access to one surface. There are automated systems which are highly suitable for the
routine inspection of production items at both an intermediate stage and the final stage of
manufacture. Using hand held probes, many types of components can be tested, including
in situ testing. This latter capability makes the method particularly attractive for the
routine inspection of aircraft and road and rail vehicles in the search for incipient fatigue
cracks (Yanagi, 1983). In aircraft inspection, specific test methods have been developed
for each particular application and the procedures listed in the appropriate manuals must
be followed if consistent results are to be achieved. In many cases a probe will be specially
designed for one specific type of inspection.
In nuclear plants, chemical plants, pipelines, vessels, off-shore and marine structures,
material damage due to fatigue loading can be detected and dimensioned
efficiently using ultrasonic NDT methods (Landez et al. 1992). Continuous
ultrasonic monitoring can be used to survey zones of high stress concentration or cracked zones.
Ultrasonic probes are permanently fixed on the region to monitor the critical area or (in
case of an existing crack) the crack tip and the crack root (by diffraction and reflection
respectively). Any change in the UT signal amplitude indicates that a modification took
place in the inspected zone (formation of a crack or propagation of the existing one).
Under the assumption that no additional non-linear effects affect the measurement, it has
been shown that this method can detect damage from nearly 10% of the life span.
6.2.2.4. Radiography
X-rays and γ-rays penetrate solid matter and are absorbed to an extent that depends on the thickness and density of the
material. Thus radiography can be used for the inspection of materials and components
to detect certain types of defect.
The use of radiography and related processes must be strictly controlled because
exposure of humans to radiation could lead to body tissue damage.
Radiography is capable of detecting any feature in a component or structure provided
that there are sufficient differences in thickness or density within the testpiece. Large
differences are more readily detected than small differences. The main types of defect
which can be distinguished are porosity and other voids and inclusions, where the density
of the inclusion differs from that of the base material. Generally speaking, the best results
will be obtained when the defect has an appreciable thickness in a direction parallel to the
radiation beam. Plane defects such as cracks are not always detectable and the ability to
locate a crack will depend upon its orientation to the beam. The sensitivity possible in
radiography depends upon many factors but, generally, if a feature causes a change in
absorption of 2 per cent or more compared with the surrounding material, then it will be
detectable.
Radiography and ultrasonics (see § 6.2.2.3) are the two methods which are generally
used for the successful detection of internal flaws that are located well below the surface,
but neither method is restricted to the detection of this type of defect. The methods are
complementary to one another in that radiography tends to be more effective when flaws
are non-planar in type, whereas ultrasonics tend to be more effective when the defects
are planar.
Radiographic inspection techniques are frequently used for the checking of welds and
castings, and in many instances radiography is specified for the inspection of
components. This is the case for weldments and thick-wall castings which form part of
high-pressure systems.
Radiography can also be used to inspect assemblies to check the condition and proper
placement of components. It is also used to check the level of liquid in sealed liquid-filled
systems. One application for which radiography is very weIl suited is the inspection of
electrical and electronic component assemblies to detect cracks, broken wires, missing or
misplaced components and unsoldered connections.
Radiography can be used to inspect most types of solid material but there could be
problems with very high or very low density materials. Non-metallic and metallic
materials, both ferrous and non-ferrous, can be radiographed and there is a fairly wide
range of material thicknesses that can be inspected (Grangeat et al., 1992). The
sensitivities of the radiography processes are affected by a number of factors, including
the type and geometry of the material and the type of flaw.
Although radiography is a very useful non-destructive test system, it possesses some
relatively unattractive features. It tends to be an expensive technique, compared with
other non-destructive test methods. The capital costs of fixed X-ray equipment are high
but coupled with this considerable space is needed for a radiography laboratory,
including a dark room for film processing. Capital costs will be much less if portable X-
ray sets or γ-ray sources are used for in situ inspections, but space will still be required
for film processing and interpretation.
The operating costs for radiography are also high. The setting up time for radiography is
often lengthy and may account for over half of the total inspection time. Radiographic
inspection of components or structures out on sites may be a lengthy process because the
portable X-ray equipment is usually limited to a relatively low energy radiation emission.
Similarly, portable radio-active sources emitting γ-radiation tend to be of fairly low
intensity. This is because high-intensity sources require very heavy shielding and thus
cease to be truly portable. In consequence, on-site radiography tends to be restricted to a
maximum material thickness of 75 mm of steel, or its equivalent. Even then, exposure
times of several hours may be needed for the examination of thick sections. This brings a
further disadvantage in that personnel may have to be away from their normal work posts
for a long time while radiography is taking place.
The operating costs for X-ray fluoroscopy are generally much lower than those for
radiography. Setting-up times are much shorter, exposure times are usually short
and there is no need for a film processing laboratory.
Another aspect which adds to radiography costs is the need to protect personnel from
the effects of radiation, and stringent safety precautions have to be employed. This safety
aspect will apply to all those who work in the vicinity of a radiography test as well as to
those persons directly concerned in the testing.
High-frequency waves, at frequencies within the range 50 kHz to 10 MHz, are emitted
when strain energy is rapidly released as a consequence of structural changes taking
place within a material. Plastic deformation, phase transformations, twinning, micro-
yielding and crack growth result in the generation of "acoustic" signals which can be
detected and analysed. Hence, it is possible to obtain information on the location and
structural significance of such phenomena.
Basically there are two types of acoustic emission from materials - a continuous type and
an intermittent or burst type. Continuous emission is normally of low amplitude and is
associated with plastic deformation and the movement of dislocations within a material,
while burst emissions are high-amplitude short-duration pulses resulting from the
development and growth of cracks.
Acoustic emission inspection offers several advantages over conventional non-
destructive testing techniques. For example, it can assess the dynamic response of a flaw
to imposed stresses. When a crack or discontinuity approaches critical size there is a
marked increase in emission intensity, and hence, a warning is given of instability and
catastrophic failure. Also, it is possible to detect growing cracks of about 2×10^-4 mm in size.
Optical inspection probes. Optical inspection probes are a major aid to visual
inspection as they permit the operator to see clearly inside pipes, ducts, cavities and other
openings to which there is limited access. The basic parts of an inspection probe system
are the objective lens head which is inserted into the cavity, the viewing eyepiece, and the
illumination system. The development of fiber optical systems has permitted major
advances to be made in the design and construction of inspection probes.
Optical inspection probes are of two general types, rigid or flexible, but within both of
these categories there are many different sizes and designs available.
A rigid inspection probe comprises an optical system with a viewing eyepiece at one end.
Illumination is conveyed to the inspection point through an optical fiber bundle and both
the optical and illumination systems are enclosed within a stainless steel tube. Light from
an external source, which is usually a variable intensity mains and/or battery-operated
quartz-halogen lamp, is conveyed to the probe through an optical fiber light guide.
Rigid probes are produced in many sizes from the smallest, with tube diameters of 2 mm
or less, up to large probes with tube diameters of 15 or 20 mm. The maximum usable or
working length of a probe is the extent to which it can be inserted into an opening; it is
not a constant and it varies with the value of probe diameter. Probes of all diameters are
propagate across the component surface and within the component body. The intensity
of the incident laser impulses is such that no damage is caused to the surface of the
testpiece. The emission from a second laser illuminates the surface of the testpiece and
ultrasonic echoes returning to the testpiece surface cause deflections which cause a
modulation of the reflected light from the illuminating laser.
The third component of the system is an interferometer which analyses the modulated
reflected light signal and converts it into a signal which can be presented on the screen of
a cathode ray tube in a manner similar to the usual type of ultrasonic signal display.
The main advantages of this technique are that no mechanical coupling is necessary and
the acquisition of results is rapid. Laser-based ultrasonic interrogation systems are in use,
currently, to detect the existence of piping and liquid metal level in cast steel ingots.
Although the sensitivity of the system is lower than that of some of the more
conventional techniques, for example, ultrasonic pulse-echo testing, the system has
attracted some interest for the continuous monitoring of components on process lines in
the manufacturing industries.
Time-of-flight diffraction. A new ultrasonic technique has been developed, namely
time-of-flight diffraction (TOFD), which relies on the diffraction of ultrasonic waves
from crack or defect tips, rather than reflection, as in pulse-echo. The technique is very
useful in determining the true size of fatigue cracks, even though a crack may be pressed
together by the applied load or residual stress network. With conventional pulse-echo
testing, complete or partial transmission of the wave pulses across the "closed" crack can
lead to errors in the analysis of crack size, because of the reduction in amplitude of the
reflected signals. TOFD is so called because it relies on the wave propagation times to
indicate and locate the diffraction source. An example of applying the technique is shown
in fig. 6.16.
Figure 6.16 Probe and wave path geometry as used to measure the size of a crack adjacent to a weld in a welded joint.
The low signal-to-noise ratio often necessitates signal averaging, and comparison and
subtraction of surface waves also may be necessary.
Crack depth gauges. Cracks which appear at the surface of a material can be readily
detected using liquid penetrant or magnetic particle inspection methods, but neither of
these methods will give an accurate assessment of the depth of a crack. Crack depth
gauges are frequently used in conjunction with these other non-destructive tests to give a
measure of the depth of flaws which have been located. One simple but effective device
for this consists of two closely spaced electrical contacts. The gauge is placed on the
surface of the material and the electrical resistance between the two contact points
measured. When the gauge is placed on the testpiece with the contacts on either side of a
surface crack, the measured resistance will be greater as current now has to follow an
extended path around the crack. The meter scale is generally calibrated to give a direct
reading of crack depth.
Thermography. Thermography is concerned with the mapping of isotherms, or
contours of equal temperature, over the surface of a component. Heat-sensing materials
or devices can be used to detect irregularities in temperature contours and such
irregularities can be related to defects. Thermography is particularly suited to the
inspection of laminates. The conduction of heat through a laminate will be affected by the
presence of flaws in the structure, resulting in an irregular surface temperature profile.
Typical flaws which can be detected are unbonded areas, crushed cells, separation of the
core from the face plates and the presence of moisture in the cells of honeycomb
structures.
Thermographic methods may either be of the direct contact type, in which heat-sensitive
material is in contact with the component surface, or indirect contact, in which a heat-
sensitive device is used to measure the intensity of infrared energy emitted from the
surface.
Pulses of heat energy, from a source, are directed at the component under test. It is
usual, but not essential, to direct the incident energy on to one surface of a component
and observe the effects at the opposite surface after conduction through the material.
Flaws and irregularities in structure will affect the amount of conduction in their vicinity.
If it is impossible to have access to both surfaces, the technique can still be used. The
heat energy incident on the surface will be conducted away through the material at
differing rates, depending on whether or not flaws are present.
Direct contact methods include the use of heat-sensitive paints and thermally quenched
phosphors. Indirect contact methods, which offer greater sensitivity, involve the use of
infra-red imaging systems with a TV-video output.
Heat-sensitive paints. Heat-sensitive photo-chromic paints are effective over a
temperature range from about 40°C to 160°C. Some paints show several color changes
within their reaction temperature range and, with careful application, will have a
sensitivity of the order of ±5°C. When heat reaches the painted surface by conduction
through the material the paint colour changes, usually with a bleaching effect. Where a
flaw impedes conduction the colour will be unchanged. On the other hand, if heat energy
is directed at the painted surface the reverse effect will show up as heat is conducted
away from the surface more rapidly through good regions than through defective areas.
Thermally quenched phosphors. These are organic compounds which emit visible
light when excited by ultra-violet radiation. The brightness of the emission decreases as
the temperature of the compounds increases. Phosphors are available that are useful at
temperatures up to about 400°C and with a resolution of ±1°C.
Thermal pulse video thermography. In this system no physical contact is necessary
with the material and very rapid rates of inspection are possible. A high-intensity heating
source is used to send pulses of infra-red energy into the material. The surface is scanned
by an infra-red thermal imager with a TV-video output. This system, again, can either be
used for sensing heat transmitted through the component or for single-sided inspection
when only one surface is accessible. Very good sensitivities are possible. Digitized image
processing to provide image enhancement is also possible.
NDT is gradually adapting to the use of the most recent developments in digital signal
and image processing. By signal processing (SP) is meant digital techniques that
transform an input signal into an output signal or into parameters. In this very broad
sense, not only averaging and Fourier transform but also TOFD, SAFT, 3-D
reconstruction, expert system based signal classification, adaptive learning network and
neural networks can be classified as SP techniques (more details are given in the
following).
Non-destructive testing generally results in an inverse problem: given a set of external
measurements, compute the location and size of defects inside the material. Although the
basic equations differ from one method to the other ("direct problem" formulations are
different), their common feature is that they cannot be simply inverted because of (i) noise,
(ii) lack of measurements, (iii) incomplete modeling and (iv) combinations of all of these.
NDT noise itself is rarely the widespread additive Gaussian white noise. A first example
is the response of coarse grained materials when tested with ultrasonics: each grain
behaves as a reflector so that the whole response does not resemble a random white
noise and classical averaging does not work. In EC (eddy current) testing of some steam
generator tubes, flattening noise can be decomposed into several narrow band
components (it is therefore non-random, but it is not "stable" either). Moreover, noise
and flaw frequency spectra are the same: if this flattening noise is bandpass-filtered, the
useful information is also filtered out. Another example originates from gammagraphy
testing of thick wall samples. The radiographs are corrupted by a granular noise due to
thickness and film, which has to be modeled adequately before processing.
Digital signal processing could be difficult and hence expensive to implement. Therefore
one has to implement it only in those cases where all other means have failed. This is why
smart experimental setups and acquisition schemes have been developed. Among them,
the Synthetic Aperture Focusing Technique (SAFT), Time of Flight Diffraction (TOFD)
and numerous enhancements of these basic techniques have been developed (Ludwig
and Roberti, 1989). SAFT is some sort of "beamforming" already known in array
processing for underwater acoustics or RADAR. For each pixel of the insonified
specimen the A-scan signals received at n transducers are averaged after time-shifting.
The shifts are computed from the different distances between one transducer and the
pixel under study. Scanning the sample results in an enhanced image because of
constructive addition of waveforms.
The aim is to better understand the content of the measured signals and then to be able to
simulate it (together with its accompanying noise) for study and method evaluation
purposes. The field of image processing has become a big consumer of sophisticated
modeling based on stochastic processes on one hand (Boolean and Markov models)
and Bayesian procedures on the other hand. Boolean models represent the granular noisy
part of images by assuming it is a Poisson-type random spatial distribution of some basic
pattern (usually the convex part of a Gaussian whose parameters are randomly chosen).
This proved appropriate for radiograph modeling. Markov models account for the
reasonable idea that statistical relationships between one pixel and the rest of an image
are summarized in a window around this pixel. These models are used in many
processing tasks and particularly in reconstruction. Time-frequency domain methods are
required for extracting physical parameters of interest when these involve joint variations
of time and frequency. Techniques based on the Wavelet Transform appear to be suitable
for acoustic signal processing, particularly for the detection and description of bursts, whose
arrival time and waveforms are unknown. Lastly, the resolution procedure based on the
classical Bayes' rule is an elegant way to introduce human "knowledge" on the desired
image and on its transformation. To sum up, besides the mathematical aspects that can
discourage NDT persons, the important point in this general stochastic approach is its
ability to take prior knowledge into account as probability laws and disturbing noise as
random processes. It is amusing (although well known) that introducing such apparently
complex tools leads at the end to tractable calculations and interesting results (Singh and
Udpa (1986), Ludwig and Roberti (1989), Grangeat et al. (1992)).
An important issue in US (ultrasonics) is to restore signals from austenitic welds because
they are severely corrupted by noise. Several techniques have become popular. They are
called "averaging", although this reference to a "linear" mix could be misleading. Spatial
averaging consists in selecting, for example, the minimum values in a number of
waveforms produced by different probe locations close to each other, while frequency
averaging does the same but from different frequency bands of a single signal. Both
methods are based on the reasonable assumption that signal (i.e. defect) responses are
coherent whereas noise (grain reflection) responses are not. Signal-to-noise
enhancements of up to 10 dB have been reported.
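A minimal numpy sketch of the spatial "averaging" just described, implemented here as a point-wise minimum over rectified A-scans taken at neighbouring probe positions; the synthetic defect-plus-grain-noise data are purely illustrative.

    import numpy as np

    def spatial_minimum(ascans):
        """Point-wise minimum of rectified A-scans from neighbouring probe positions.
        Coherent (defect) responses survive, incoherent grain noise is suppressed."""
        return np.min(np.abs(np.asarray(ascans)), axis=0)

    # Illustrative synthetic data: a coherent defect echo plus incoherent grain noise.
    rng = np.random.default_rng(0)
    t = np.arange(1024)
    defect = np.exp(-0.5 * ((t - 400) / 10.0) ** 2)          # same in every A-scan
    ascans = [defect + 0.3 * rng.standard_normal(t.size) for _ in range(8)]
    print(spatial_minimum(ascans).max())                     # defect peak still visible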
To filter the flattening noise from some steam generator tubes, specific
digital techniques have to be used. They are based on a noise reference either picked from the signal itself
or provided by an auxiliary signal (this is the so-called correlofilter). When the signal is
not stationary, one convenient way to filter it is to let a feedback loop estimate the filter
coefficients from the measured samples. The output of the filter can be used as a noise
estimate and subtracted from the original signal.
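The feedback-loop filtering described above can be sketched as a least-mean-squares (LMS) adaptive noise canceller: the filter coefficients are updated from the measured samples of a noise reference and the filter output, taken as a noise estimate, is subtracted from the primary signal. The step size, filter length and function name below are illustrative assumptions, not values from the text.

    import numpy as np

    def lms_noise_canceller(primary, reference, n_taps=16, mu=0.01):
        """Estimate the noise in `primary` from `reference` with an LMS adaptive
        filter and return the cleaned signal (primary minus noise estimate)."""
        primary = np.asarray(primary, float)
        reference = np.asarray(reference, float)
        w = np.zeros(n_taps)
        cleaned = np.zeros_like(primary)
        for n in range(n_taps, len(primary)):
            x = reference[n - n_taps:n][::-1]      # most recent reference samples
            noise_estimate = w @ x
            e = primary[n] - noise_estimate        # error = cleaned sample
            w += 2.0 * mu * e * x                  # feedback update of coefficients
            cleaned[n] = e
        return cleaned

    # usage sketch: cleaned = lms_noise_canceller(measured_signal, noise_reference)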
After clean signals have been recovered, an automatic decision about their nature is
desired. Besides the statistical techniques (principal component analysis, discriminant
analysis and others), a classification scheme gained tremendous favour at the end of the
70's: the Adaptive Learning Network (ALN). ALN is an empirical combination of
candidate parameters, in which a non-linear polynomial model is constructed. At each
iteration the model "grows", that is, the coefficients and the structure of the model are
determined simultaneously. The model's output can be either a classification or an
estimation of some parameter of interest. ALNs have been tested both for US
(ultrasonics) and EC (eddy currents) signals. In practice they performed more or less
like classical multidimensional statistics and have apparently disappeared from reports.
As the question of what are the optimal parameters remains open, another approach has
been proposed for EC signals. The idea, first used for hand print character recognition,
consists in retaining only the first terms of some sort of Fourier development of the EC
complex signal. These terms are then used as features for classification. Experience
with these Fourier descriptors applied to support plate discrimination is that they are
too global to allow an accurate localization of small flaws.
Since the rediscovery of Rosenblatt's perceptron in the 80's, NNs (neural networks) have
been proposed for numerous tasks. A particular combination has proved to be fruitful:
1. A three-layer architecture (one hidden layer).
2. The back-propagation algorithm to estimate the weights.
3. Classification purposes.
Whereas NNs have given an opportunity to revisit once more some classical problems,
their new features are:
1. Efficient hardware implementation.
2. Some preprocessing capability (Komatsu et al., 1992; Parpaglione, 1992).
Nevertheless, the amount of relevant examples has a paramount importance in the NN
approach as well as in other approaches. It is surprising that nobody has compared ALNs
and NNs, at least from a NDT point of view.
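A minimal numpy sketch of the combination listed above - a three-layer network (one hidden layer) trained by back-propagation for a two-class decision; the layer sizes, learning rate and the synthetic feature data are illustrative assumptions, not values from the text.

    import numpy as np

    rng = np.random.default_rng(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def train_mlp(X, y, n_hidden=8, lr=0.5, epochs=2000):
        """Three-layer perceptron (one hidden layer) trained by back-propagation."""
        n_in = X.shape[1]
        W1 = rng.normal(0, 0.5, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
        W2 = rng.normal(0, 0.5, (n_hidden, 1));    b2 = np.zeros(1)
        for _ in range(epochs):
            h = sigmoid(X @ W1 + b1)               # hidden layer
            p = sigmoid(h @ W2 + b2)               # output probability
            d2 = (p - y[:, None]) * p * (1 - p)    # output delta
            d1 = (d2 @ W2.T) * h * (1 - h)         # hidden delta (back-propagated)
            W2 -= lr * h.T @ d2 / len(X); b2 -= lr * d2.mean(0)
            W1 -= lr * X.T @ d1 / len(X); b1 -= lr * d1.mean(0)
        return lambda Xn: sigmoid(sigmoid(Xn @ W1 + b1) @ W2 + b2).ravel()

    # Illustrative features (e.g. echo amplitude, rise time) and defect labels.
    X = rng.normal(size=(40, 2)); y = (X[:, 0] + X[:, 1] > 0).astype(float)
    predict = train_mlp(X, y)
    print((predict(X) > 0.5).astype(int)[:10], y[:10].astype(int))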
Acoustic emission (AE) signal analysis. Acoustic emission signal analysis has yielded
important information in the detection of leaky components under pressure, in
pressurized heavy water reactors. In one of the above problems, the ratio of the spectral
energies present in different bands of the power spectrum of the AE signal is used in
order to detect the leaking component, since the signal-to-noise ratio (SNR) was very
poor. This is an example where problems due to poor SNR were overcome by
appropriate use of DSP.
In the NDE of rotating machinery, such as steam turbines and turbine generators, AE is
used to detect malfunctions such as rubbing and bearing tilt. In order to detect and
transmit AE signals from an operating rotor (to enable on-line processing), a wireless AE
monitor has been used, which can detect and transmit AE signals ranging from 50 kHz to
250 kHz. Acoustic emission parameters such as events, energy values, amplitude
distribution, frequency components, skewness and kurtosis values have also been
correlated with the "health" of cutting tools used in lathes. Failure prediction in
gearboxes by the processing and analysis of their vibration (rotational) signals has been
carried out with success. It has been concluded that imminent failure could be predicted
accurately using cepstrum analysis (see Chapter 1). Vibrations in the gear meshings have
been monitored to detect failure in gears, where the tooth meshing vibration components
and their harmonics are eliminated from the spectrum of the time domain average. The
reconstructed time signal shows the presence of defects (if present) which otherwise
cannot be seen in the time domain average. This again underlines the importance and
usefulness of SP in the field of acoustic testing.
Time of Flight Diffraction (TOFD) technique for defect sizing. When ultrasonic
waves encounter a crack-like defect, not only reflection but also production of scattered
and diffracted waves takes place, over a wide angular range from defect tips. The
separation of diffracted waves in space, and hence in time, relates directly to the size of the
defect. By knowing the delays for different waves, it is possible to compute the size and
location of the defect.
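A minimal sketch of the geometry behind such a computation, assuming a symmetric transmitter-receiver pair separated by 2s on the surface, straight ray paths and a longitudinal velocity of about 5900 m/s in steel: a tip at depth d produces a diffracted arrival at t = 2*sqrt(s^2 + d^2)/c, which can be inverted for d, and the crack height follows from the difference between the upper- and lower-tip depths. All numerical values are illustrative.

    import math

    def tip_depth(t_s, half_spacing_m, c_m_per_s=5900.0):
        """Depth of a diffracting tip from its TOFD arrival time.
        Assumes a symmetric probe pair, straight ray paths, no probe delays."""
        path_half = c_m_per_s * t_s / 2.0               # one-way path length to the tip
        return math.sqrt(max(path_half ** 2 - half_spacing_m ** 2, 0.0))

    # Illustrative arrival times for upper and lower crack-tip diffractions.
    s = 0.040                                            # half probe spacing, m
    d_top = tip_depth(29.0e-6, s)                        # upper tip
    d_bottom = tip_depth(33.0e-6, s)                     # lower tip
    print("crack height = %.1f mm" % ((d_bottom - d_top) * 1e3))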
This technique has been used and the results are found to be in conformity with ASME
XI with respect to determination of maximum acceptable defect height and depth of
upper and lower edges for internal defects which lie deeper than 30% of the specimen
thickness in steel exceeding 12 mm thickness. Again, the results are in conformity with
modified ASME XI for all defects in steel exceeding 10 mm thickness.
Synthetic Aperture Focusing Technique (SAFT) for increased resolution. In this
procedure, a large aperture focused probe is synthesized electronically, thereby
increasing the fundamental resolution and defect sizing accuracy of the technique. A
wide angle compression probe and a point flaw in the specimen are assumed for the
purpose of simplicity. When the transducer scans over the specimen, each reflected echo
for various scan positions with respect to the position of closest approach of transducer
to the flaw is delayed in time due to the greater distance travelled by the ultrasonic waves. If
the individual scans are shifted by an amount equal to their predicted time delays, they
will come into coincidence with each other and, when they are summed, the resultant will
be a large amplitude response. If the same procedure is repeated centered around another
position, the above time shift compensation does not produce a set of self-coincident
scans, which results in a significantly smaller response. The time shifts can be achieved
either electronically or digitally using a computer. This technique is an excellent example
of the advantages that accrue from the combination of conventional and advanced
techniques. Typical applications of this important technique, apart from radar, include the in-
service inspection of pressure retaining boundaries for accurate defect sizing.
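A minimal numpy sketch of the shift-and-sum principle: for each image pixel the A-scans are delayed by the predicted two-way travel time from the corresponding probe position to that pixel and then summed, so that echoes from a real reflector add coherently. The geometry, sampling rate and function signature are illustrative assumptions.

    import numpy as np

    def saft_image(ascans, probe_x, pixels_x, pixels_z, fs, c=5900.0):
        """Delay-and-sum (SAFT) reconstruction.
        ascans: array (n_positions, n_samples) of pulse-echo A-scans,
        probe_x: probe positions (m), pixels_x/pixels_z: image grid (m)."""
        img = np.zeros((len(pixels_z), len(pixels_x)))
        for iz, z in enumerate(pixels_z):
            for ix, x in enumerate(pixels_x):
                for ascan, xp in zip(ascans, probe_x):
                    t = 2.0 * np.hypot(x - xp, z) / c      # two-way travel time
                    k = int(round(t * fs))                 # nearest sample index
                    if k < ascan.shape[0]:
                        img[iz, ix] += ascan[k]            # coherent summation
        return img

    # usage sketch: image = saft_image(ascans, probe_x, x_grid, z_grid, fs=50e6)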
Reduction of random noise using split spectrum processing. This technique is
implemented by splitting the frequency spectrum of the received signal by using
Gaussian, overlapping band pass filters having central frequency at regular intervals. If
the inverse Fourier transform is taken for N filters, N time domain signals are obtained.
These N time domain signals are subjected to algorithms such as minimisation
and polarity thresholding for extracting useful information. The split spectrum
processing technique is widely applied in the analysis of signals from noisy materials like
centrifugally cast stainless steels, carbon epoxy composites, welded joints and cladded
materials.
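A minimal numpy sketch of split spectrum processing as described above: overlapping Gaussian band-pass filters with regularly spaced centre frequencies are applied in the frequency domain, the N filtered time signals are recovered by inverse FFT, and minimisation and polarity-thresholding recombinations are formed. The filter count and bandwidths are illustrative assumptions.

    import numpy as np

    def split_spectrum(signal, fs, f_lo, f_hi, n_bands=8, rel_bw=0.15):
        """Split-spectrum processing with Gaussian band-pass filters.
        Returns the minimisation and polarity-thresholding recombinations."""
        n = len(signal)
        freqs = np.fft.rfftfreq(n, 1.0 / fs)
        spectrum = np.fft.rfft(signal)
        centres = np.linspace(f_lo, f_hi, n_bands)
        bands = []
        for fc in centres:
            g = np.exp(-0.5 * ((freqs - fc) / (rel_bw * fc)) ** 2)   # Gaussian filter
            bands.append(np.fft.irfft(spectrum * g, n))              # filtered time signal
        bands = np.array(bands)                                      # (n_bands, n)
        minimisation = np.min(np.abs(bands), axis=0)
        same_sign = np.all(bands > 0, axis=0) | np.all(bands < 0, axis=0)
        polarity = np.where(same_sign, minimisation, 0.0)            # polarity thresholding
        return minimisation, polarity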
The ALOK technique. The ALOK technique was conceived and developed by the
Fraunhofer Institute for NDE techniques (IzfP), Saarbrücken, Germany. The principle of
this technique is to characterize a reflector by its time of flight characteristics rather than
on the basis of its reflected amplitudes. A modified version of this technique, developed
by Siemens, rapidly acquires a manifold of amplitude and corresponding time of flight
values in each A-scan, concentrating on the relevant A-scan information by a specific
pattern recognition process. ALOK provides remarkable advantages with respect to
general improvement of the inspection, increase in the information density (reduction of
documentation) and simplification of data evaluation.
Microstructure and mechanical properties characterization using acousto-
ultrasonics. This approach is based on the concept that spontaneously generated stress
waves produced during failure interact with material morphology. By introducing
ultrasonic waves into the material, simulated acoustic stress waves are produced which
are affected by the material condition. The waves are measured in the form of stress
wave factors (SWF), defined as the number of oscillations higher than a chosen threshold
in the ringdown oscillations of the output signal. The SWF is correlated to the
microstructure and mechanical strength. Damage in the specimen produces
corresponding changes in the signal attenuation, resulting in lower SWF readings.
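Since the stress wave factor is defined as the number of ringdown oscillations exceeding a chosen threshold, it reduces to counting threshold crossings; a minimal sketch with an illustrative synthetic ringdown and threshold follows.

    import numpy as np

    def stress_wave_factor(signal, threshold):
        """Number of ringdown oscillations exceeding `threshold`, counted as
        positive-going crossings of the threshold level."""
        above = np.asarray(signal) > threshold
        return int(np.count_nonzero(above[1:] & ~above[:-1]))

    # Illustrative decaying ringdown: SWF drops as attenuation (damage) increases.
    t = np.linspace(0, 1e-3, 5000)
    ringdown = np.exp(-4000 * t) * np.sin(2 * np.pi * 100e3 * t)
    print(stress_wave_factor(ringdown, threshold=0.1))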
6.2.5 Conclusions
The techniques mentioned in the above sections are a selection of some of the more
recent developments in NDT and NDE. By the turn of this century, the techniques
used in the field of NDE will be significantly larger in number, will draw on advanced signal
analysis methods, pattern and cluster analysis methods, and will exploit the advances that are
being made presently in the field of artificial intelligence (AI).
It can be conclusively stated that acoustic NDE methods at their present state of
development would gain significantly by intelligent and balanced use of these
advanced concepts.
6.3.1 Introduction
An optimal inspection and repair strategy for fatigue-sensitive structures has been proposed by Christensen and Sorensen (1987). The total cost of inspection and repair is minimized
with the constraints that the reliability of elements and/or of the structural system are
acceptable. The design variables are the time intervals between inspections and the
quality of the inspections. Numerical examples are presented to illustrate the
performance of the strategy. The strategy can be used for any engineering system where
inspection and repair are required.
Although the physico-metallurgical aspect is important for the understanding of rupture
mechanisms in structures, in the literature the phenomenological approach is more
commonly found, based on laboratory tests and semi-empirical models, using mainly
linear fracture mechanics (Lucia, 1985, Kozin and Bogdanoff, 1992). The fracture
mechanics relationships allow a link to be established between the defect dimensions, the
load level and the stress intensity. The stress intensity can be compared with the
material's resistance to rupture. Given the complexity of the phenomena involved
however, experimental support is essential for the definition of rupture mechanism
models. The problem arises at the moment of transferring the laboratory results, if
obtained in over-simplified environment and loading conditions, to component reality
and, on this basis, predicting the lifetime.
During the past twenty years a rather extensive effort has been devoted to developing
techniques that permit the accurate in-time prediction of structural fatigue life (see, e.g.,
"Theoretical and Applied Fracture Mechanics", "Engineering Fracture Mechanics",
"International Journal of Fracture", "Structural Safety" and "International Journal of
Fatigue"). As the knowledge related to fatigue of structures and materials expanded, it
became clear that in many cases fatigue could be treated from a propagation point of
view. This knowledge has led to the development of phenomenological, fracture
mechanics and probabilistic fracture mechanics, stochastic process, time-series analysis
and knowledge-based approaches to assess in-time the fatigue crack growth (FCG) and
the failure probability in structures. This is the "heart" of the in-time fatigue life
prediction leading to increased life of structures subjected to dynamic loads. These
approaches will be presented and discussed in the following.
The interpretation of damage as the birth and propagation of defects in the elementary
structure of the material, caused by the alternating stress field acting on imperfections of
the crystalline network, on distortions due to impurities, etc. is generally accepted.
Qualitatively one can distinguish:
• An initial nucleation stage (defect generation).
• A transition stage (defects coalescence).
• A propagation stage (unstable growth of the largest defects).
In general, NF being the number of cycles to rupture and N0, NT, NC the cycles at the end
of the nucleation, transition and propagation stages, one has:
(6.1)
which expresses the fact that each stage is determined by the level reached in the
preceding stage. The damage accumulation mechanisms are presumably different in the
three stages and in each one they are significantly dependent on the environmental
conditions and the stress field; for this reason the construction of a unified model of
interpretation seems unlikely. Furthermore, the relative importance (in terms of number
of cycles) of the above three fatigue stages depends on the intensity of the alternating
load. This rather complex picture can be partially clarified by the consideration that
experimentally evident fatigue failures in operating components are mainly caused by
fabrication defects, generally introduced during welding procedures. As a consequence,
the nucleation and transition stages are relatively less important than the propagation
stage, starting from the fabrication defects. These defects are, in fact, usually larger than
the defects of the elementary structure of the material.
To date, efforts have concentrated on development of independent models for cyclic
constitutive behavior, cyclic crack initiation and cyclic crack propagation. However the
transition between crack initiation and crack propagation has not been thoroughly
researched as yet, to be integrated in a unified life prediction method (Bhargava et al.
(1986), Lankford and Hudak (1987), Halford et al. (1989)).
Having said that, the problem is: how to estimate the time to failure of a cyclically loaded
structural component. The estimation of the lifetime (expressed, e.g., by the number of
load cycles allowed) is more than just a research problem; it is a practical problem of
current design, considered by the current standards. For this reason it appears opportune
to present first the criteria of fatigue design of the ASME Sect. III standards and the
acceptance criteria of the defect propagation rates of the ASME Sect. XI standards.
The fatigue design and residual life time criterion of the ASME standards coincides with
the Miner rule. It is still the most commonly used, not only for components subject to
ASME standards but also, and especially, for components in the aeronautical industry for
which the dimensioning for fatigue is often the main consideration.
The ASME standards for components of conventional (Section VIII, Div. 2) or nuclear
(Section III) installations are based on limitations imposed on a "cumulative factor of
use" U:

    U = Σ_i U_i = Σ_i n_i / N_i ≤ 1    (6.2)

where n_i is the number of cycles envisaged for the cyclic load of type i and N_i the
allowed number of cycles as deduced from the fatigue curves (S-N curves). An S-N
curve is given as

    N = K S^(-m)    (S > S_0)    (6.3)
where N is the number of cycles to failure under constant-amplitude stress range S, K and
m are the S-N curve parameters, and S_0 is a stress cut-off level below which no damage is
accumulated.
The S-N curves used are generally obtained from monoaxial cyclic load tests at constant
amplitude, on notched samples: a suitable safety factor, of the order of 3 or 4, is applied
to the mean experimental curve to account for the considerable scatter of results for
design purposes. With a safety factor of 4 the design reliability is of the order of 99.9%
for a standard deviation of 20%.
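A minimal sketch of the check expressed by eqs. (6.2) and (6.3): the allowed cycles N_i for each stress range are read from the S-N curve N = K·S^(-m) and the usage factors n_i/N_i are summed. The values of K, m, S_0 and the load spectrum below are illustrative assumptions, not values taken from any standard.

    # Cumulative usage factor U = sum(n_i / N_i) per eq. (6.2), with N_i from the
    # S-N curve N = K * S**(-m) of eq. (6.3). K, m and S0 below are illustrative.
    def allowed_cycles(S, K=1.0e12, m=3.0, S0=20.0):
        """Allowed cycles from the S-N curve; no damage below the cut-off S0 (MPa)."""
        return float("inf") if S <= S0 else K * S ** (-m)

    def usage_factor(load_spectrum, **sn):
        """load_spectrum: iterable of (n_i, S_i) pairs (cycles, stress range in MPa)."""
        return sum(n / allowed_cycles(S, **sn) for n, S in load_spectrum)

    spectrum = [(2.0e5, 120.0), (1.0e6, 60.0), (5.0e6, 15.0)]   # (cycles, MPa)
    U = usage_factor(spectrum)
    print("U = %.3f -> %s" % (U, "acceptable" if U <= 1.0 else "not acceptable"))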
The fatigue damage accumulation D follows Miner's law as
    D = Σ_{i=1}^{N_0} 1 / N_i(S_i) = (1/K) Σ_{i=1}^{N_0} S_i^m    (6.4)

in which N_0 is the total number of stress cycles. In a deterministic design, it is assumed
that failure occurs when D = 1.
To account for the uncertainty of model (6.4) by a random variable Δ, the fatigue limit state
function is defined as:

    g(Z) = Δ − D = Δ − (1/K) Σ_{i=1}^{N_0} S_i^m    (6.5)
If one further defines that any random variable x can be expressed as

    x = B_x x_c    (6.6)

in which x_c is the characteristic value of x and B_x is a normalized random variable
associated with x, the limit state function (6.5) then takes the following form:

    g(Z) = Δ − (1 / (B_K K_c)) Σ_{i=1}^{N_0} (B_S S_{i,c})^m    (6.7)

in which the S-N exponent m is usually treated as a deterministic constant.
Failure is defined by the event g(Z) ≤ 0, and g(Z) > 0 identifies a safe state. The failure
probability is

    P_f = P[g(Z) ≤ 0]    (6.8)

Computational procedures of FORM/SORM are given in Madsen et al. (1986), where the
reliability index β is applied, which is related to the failure probability by

    β = −Φ^{-1}(P_f)    (6.9)

in which Φ(·) is the standard normal distribution function.
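A crude Monte-Carlo sketch of eqs. (6.5), (6.8) and (6.9): the limit state g(Z) is sampled with Δ, K and an equivalent constant-amplitude stress range treated as random variables, P_f is estimated as the fraction of samples with g ≤ 0, and β = −Φ^{-1}(P_f). All distributions and parameter values are illustrative assumptions; this is not FORM/SORM and uses no data from the text.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    n_sim, n_cycles, m = 200_000, 2_000_000, 3.0

    delta = rng.lognormal(mean=0.0, sigma=0.3, size=n_sim)         # model uncertainty, eq. (6.5)
    K = rng.lognormal(mean=np.log(1.0e12), sigma=0.4, size=n_sim)  # S-N coefficient
    S = rng.weibull(2.0, size=n_sim) * 60.0                        # equivalent stress range, MPa
    damage = n_cycles * S ** m / K                                 # D = (1/K) * sum(S_i^m)
    g = delta - damage                                             # limit state g(Z)

    Pf = np.mean(g <= 0.0)                                         # eq. (6.8)
    beta = -norm.ppf(Pf) if 0.0 < Pf < 1.0 else float("nan")       # eq. (6.9)
    print("Pf = %.3f, beta = %.2f" % (Pf, beta))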
From technical or economical considerations, the required safety level in the design life
may sometimes have to be achieved by additional safety measures such as inspection, so
that the design safety can be updated to the required safety level. The principle of
reliability updating is based on the definition of conditional probability, P(F | I) = P(F ∩ I) / P(I), where F denotes the failure event and I the inspection event.
The prediction of the fatigue damage accumulation relies on the S-N curves where the
number of cycles to failure is based on a large visible crack size and the remaining fatigue
life is conservatively neglected. The prediction of the fatigue crack growth (FCG) using
the fracture mechanics, however, describes the crack growth physically and is able to
calculate the fatigue life up to fracture, accounting for possible inspection effects. This
approach is thus more sophisticated and its application is becoming widespread.
Some hundreds of more or less different relationships can be found in the literature for
expressing the fatigue growth of cracks (Hoeppner and Krupp (1974), Akyurek and Bilir
(1992)). Some of them are purely theoretical or based on microscopic properties of
material, but the most widely employed are semi-empirical and have been developed
mainly as interpretative models of experimental results: they allow the prediction of the
behavior of the crack size "a" as a function of "N" the number of stress cycles. This
prediction is based on the integration of the growth rate, for fixed initial conditions of the
defect. The result arrived at is not, however, in general, representative of the real growth
situations. This is due both to the fact that the initial conditions have a considerable
scatter of values and to the fact that, for the same initial conditions, there is an intrinsic
variability in the process of damage by fatigue which leads to a distribution of values "a",
at cycle N (Virkler et al., 1979). It thus appears natural to consider the relationships for
defining the growth rate as stochastic (Ghonem and Dore, 1987). In this context,
therefore, prediction methods can be seen as being based on the integration of the FCG
relationships with the parameters or the initial conditions or the loads represented by
random variables. These procedures lead to the determination of a distribution of
dimensions for the propagated defect, at cycle N, or of a distribution of the number of
cycles N for a given propagation from a_0 to a_f. These distributions are the basis for the
prediction of the residual life of structures stressed by fatigue. Three randomization
methods are presented by Lucia (1985), as an indication of the vast range of applications
of this methodological approach.
Probabilistic models for fatigue crack growth.
As mentioned before, the statistical variability in crack growth depends on many
undetermined factors which can be classified as small differences of material intrinsic
properties, loading environment, specimen geometry, measuring system, even
microstructure and the state of stress, etc. In general, the fatigue crack growth can be
expressed by the following nonlinear relation,
    da/dN = Q(ΔK, K_c, R, K_th, a, ...)    (6.11)

where,
    Q(·)   a non-negative function,
    a      half-crack length, mm
    N      number of fatigue cycles (cumulative load cycles)
    ΔK     stress intensity factor range at the crack tip, given by the relation
           ΔK = S (πa)^{1/2} F(a), MPa·m^{1/2}
    K_c    fracture toughness of the material, MPa·m^{1/2}
    R      stress ratio
    K_th   threshold stress intensity factor range, MPa·m^{1/2}
    S      applied stress (load) range, MPa
    F(a)   crack shape geometrical factor (see Verreman et al., 1987, Dufresne
           et al., 1988)
Two of the most widely used forms are the Paris-Erdogan law (6.12) and the Forman law (6.13):

    da/dN = C (ΔK)^m    (6.12)

    da/dN = C (ΔK)^m / [(1 − R) K_c − ΔK]    (6.13)

where the stress cut-off level S_0(ΔK_th) is a function of the threshold of the stress intensity
factor range (ΔK_th), below which there is no crack growth, and C, m are the crack
growth parameters.
Laws (6.12) and (6.13) are almost universally applied to stage-II FCG, that is, crack
growth at alternating stress intensity values somewhat larger than the threshold
alternating stress intensity value ΔK_th, but below the value of ΔK at which unstable crack
propagation begins to occur.
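For the Paris-Erdogan law (6.12) with ΔK = S(πa)^{1/2}F(a), the number of cycles to grow a crack from a_0 to a_f follows from integrating dN = da / [C(ΔK)^m]. The sketch below integrates this numerically, assuming a constant geometry factor F(a) = 1 and illustrative values of C, m and S of the order reported for aluminium alloys.

    import numpy as np

    def cycles_to_grow(a0, af, S, C, m, F=lambda a: 1.0, n_steps=5000):
        """Cycles needed to grow a crack from a0 to af (metres) under constant
        stress range S (MPa), by integrating the Paris-Erdogan law
        da/dN = C * dK**m with dK = S*sqrt(pi*a)*F(a) in MPa*sqrt(m)."""
        a = np.linspace(a0, af, n_steps)
        dK = S * np.sqrt(np.pi * a) * np.vectorize(F)(a)
        dN_da = 1.0 / (C * dK ** m)                         # cycles per metre of growth
        return float(np.sum(0.5 * (dN_da[1:] + dN_da[:-1]) * np.diff(a)))

    # Illustrative parameter values, of the order reported for aluminium alloys.
    print("N = %.0f cycles" % cycles_to_grow(a0=1e-3, af=20e-3, S=100.0, C=1e-11, m=3.0))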
All of the factors and parameters mentioned above are treated as random variables in
probabilistic fracture mechanics (PFM). Therefore, the statistical investigation and the
accumulation of statistical data of these parameters are very necessary and important for
the reasonable and economic design of fatigue structures. In particular, for fatigue
reliability analysis of structures, a probabilistic or stochastic model is required for fatigue
crack growth. As a result, many probabilistic or stochastic fracture mechanics models
have been developed to deal with the variability of crack growth (Lucia, 1985, Journet
and Pelloux, 1987, Ghonem and Dore, 1987, Cortie and Garrett, 1988, Zhu and Lin,
1992, Nisitani et al., 1992). These models have their realistic physical or microstructural
basis for some special conditions. A major problem with these models is the difficulty in
obtaining sufficient data due to time and money constraints. For this reason, some models are not
verified by experimental data, and it is difficult to apply some models in engineering.
The purpose here is to present a simple probabilistic model which is easy for designers to
use in predicting crack growth behavior. In the model, crack growth parameters C and m
in the Paris-Erdogan and Forman laws are considered as random variables, and their
stochastic characterizations are found from a crack growth experiment with small sample
size. Furthermore, using the COVASTOL computer program (more details on this
program are given later), the statistical distributions of the crack growth rate da/dN and of the
cycles to reach a given crack length are obtained. The experimental results are used to
verify the theoretical prediction of the statistical properties of fatigue crack growth
behavior for aluminum 2024-T3 test specimens.
Material inhomogeneity has long been considered to be an important factor in crack
initiation. However, it also has considerable influence on crack growth, which is not
commonly perceived in deterministic fracture mechanics. Material inhomogeneity is
usually negligible in crack growth under general laboratory conditions, especially under a
random spectrum, because the fatigue stress dominates the scatter aspects of crack
growth. However, there is considerable variability in a well-controlled test under a
constant amplitude spectrum. For a good probabilistic model of crack growth, material
inhomogeneity must be involved.
Several different approaches have been followed for the probabilistic modeling of
material inhomogeneity. The most common approach is to randomize the crack growth
parameters. For the crack growth equations of Paris-Erdogan and Forman there are
several randomizations possible: both C and m in the Paris-Erdogan and Forman laws
could be random variables; or C could be a random variable and m a constant; or m could
be a random variable and C a function of m. However, C is really not a material constant
(as was initially assumed by Paris), but it depends on the mean stress or stress ratio,
frequency, temperature, etc. In particular, the stress ratio R is recognized to have
significant influence on C.
Then, the Paris-Erdogan and Forman equations can be transformed, respectively, to yield:

    ln(da/dN) = ln(C) + m ln(ΔK)  =>  Y = ln(C) + m X    (6.14)

where Y = ln(da/dN), X = ln(ΔK), and

    ln(da/dN) = ln(C) + m ln(ΔK) − ln[(1 − R) K_c − ΔK]  =>  Y = ln(C) + m X    (6.15)

where Y = ln(da/dN) + ln[(1 − R) K_c − ΔK], X = ln(ΔK).
Alternative empirical relations fit the crack length directly as a function of N:

    a(N) = C_1 N^{m_1}
    a(N) = C_2 (log_10 N)^{m_2}    (6.16)
    a(N) = C_3 exp(m_3 N)

where C_j and m_j, j = 1, 2, 3, are functions of applied load, material characteristics,
geometrical configuration of the component and the initial quality of the product being
tested. Equations (6.16) can be rewritten as:

    ln a(N) = ln C_1 + m_1 ln N
    ln a(N) = ln C_2 + m_2 ln(log_10 N)    (6.17)
    ln a(N) = ln C_3 + m_3 N = C* + m_3 N
Thus, regression lines of various types can be obtained for: the crack growth data
reported for each test, for all data from a given specimen geometry, and for all data
considered as one group.
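A minimal sketch of such a regression in the form of eq. (6.14): the growth rate is approximated by finite differences of successive (a, N) readings and ln(da/dN) is fitted against ln(ΔK) by least squares to recover C and m. The data below are synthetic, generated from assumed "true" parameters purely to illustrate the fit.

    import numpy as np

    def fit_paris(a, N, S, F=1.0):
        """Least-squares fit of ln(da/dN) = ln(C) + m*ln(dK) to raw (a, N) data.
        a: crack half-lengths (m), N: cumulative cycles, S: stress range (MPa)."""
        a, N = np.asarray(a, float), np.asarray(N, float)
        dadN = np.diff(a) / np.diff(N)                      # finite-difference growth rate
        a_mid = 0.5 * (a[1:] + a[:-1])
        dK = S * np.sqrt(np.pi * a_mid) * F
        m, lnC = np.polyfit(np.log(dK), np.log(dadN), 1)    # slope = m, intercept = ln C
        return np.exp(lnC), m

    # Synthetic illustration: data generated from C = 1e-11, m = 3, then refitted.
    C_true, m_true, S = 1e-11, 3.0, 100.0
    a = np.linspace(2e-3, 15e-3, 40)
    dNda = 1.0 / (C_true * (S * np.sqrt(np.pi * a)) ** m_true)
    N = np.concatenate([[0.0], np.cumsum(0.5 * (dNda[1:] + dNda[:-1]) * np.diff(a))])
    print(fit_paris(a, N, S))    # should recover roughly (1e-11, 3.0)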
The raw data from a crack propagation test are the half crack length, a, and the number
of cumulative load cycles, N, needed to grow the crack to some crack length, a, from
some reference initial crack length.
The current interpretation of these data is to report the FCG rate, da/dN, vs. ΔK, ΔK
being the stress intensity factor range at the crack tip for each individual test. The
graphical representation of these data includes a log-log plot of da/dN vs. ΔK, leading to
the best fit straight line on this plot, see fig. 6.17. This data processing method is strictly
related to the use of the well-known Paris-Erdogan law or Forman law as a model for
the FCG rate.
The overall variability encountered in FCG rate data depends on the variability inherent
in both the data collection and data processing techniques. If C and m are taken as
random variables, C and m are related. Cortie and Garrett (1988) have shown that the C-
m correlation, while present, does not possess any fundamental significance and is purely
the result of, firstly, the logarithmic method conventionally used to plot the data, and,
secondly, the nature of the dimensions of the physical quantities used in the Paris-
Erdogan equation. In the light of probability theory, the distribution of the crack growth
rate da/dN as a function of ΔK can be deduced from the stochastic characterizations of C
and m as well as the above logarithmic equations. The crack growth rate da/dN can be taken to follow a log-normal distribution (i.e. ln(da/dN) follows a normal distribution), and its mean and variance are given, using the Paris-Erdogan equation, by the two-variable prediction method, where ΔK can be taken at any value and the means, variances and correlation ρ_Cm of C and m are taken from the statistical analysis of the raw FCG test data (Virkler et al., 1979, Stavrakakis et al., 1990).
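A minimal numerical sketch of this two-variable prediction method is given below; it assumes the standard mean and variance formulas for the linear combination Y = ln C + m ln(ΔK), and the numerical inputs are illustrative placeholders rather than values taken from the Virkler et al. data.

import numpy as np

def lnrate_stats(mu_lnC, var_lnC, mu_m, var_m, rho_cm, delta_K):
    # Mean and variance of Y = ln(da/dN) = ln C + m ln(dK) at a fixed dK,
    # treating (ln C, m) as correlated random variables.
    x = np.log(delta_K)
    mean = mu_lnC + mu_m * x
    var = var_lnC + x**2 * var_m + 2.0 * x * rho_cm * np.sqrt(var_lnC * var_m)
    return mean, var

# illustrative placeholder statistics for (ln C, m) and one dK level
m_y, v_y = lnrate_stats(mu_lnC=-26.0, var_lnC=0.25, mu_m=2.9, var_m=0.01,
                        rho_cm=-0.9, delta_K=15.0)
print(m_y, v_y)   # ln(da/dN) ~ Normal(m_y, v_y) under the log-normal assumption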
Figure 6.17 Summary of FCG rate data for the Virkler et al. (1979) case calculated by the ASTM E647-83 standard method.
If m is a constant and C a random variable, by the same principle the crack growth rate da/dN as a function of ΔK can be shown to follow a log-normal distribution with mean and variance given by the single-variable prediction method.
From eq. (6.20), N_{a_i|a_0}, the number of cycles required to grow the crack from the initial size a_0 to the size a_i, is a joint random variable of C and m. The Monte-Carlo simulation technique can be applied to obtain the distribution of N_{a_i|a_0} by simulating the distributions of C and m.
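A hedged sketch of such a Monte-Carlo calculation is given below: for each sampled (C, m) pair the Paris law is integrated numerically to give the number of cycles needed to grow the crack from a_0 to a_i. A geometry factor F = 1 (wide centre-cracked panel) and all numerical values are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

def cycles_to_length(C, m, a0, ai, dsigma, n_steps=2000):
    # Integrate dN/da = 1 / (C * dK^m) from a0 to ai, with dK = dsigma*sqrt(pi*a), F = 1.
    a = np.linspace(a0, ai, n_steps)               # crack sizes [m]
    dK = dsigma * np.sqrt(np.pi * a)               # stress intensity factor range
    dNda = 1.0 / (C * dK**m)                       # cycles per metre of crack growth
    return float(np.sum(dNda) * (a[1] - a[0]))     # N_{a_i | a_0}, rectangle rule

# joint sampling of (ln C, m); mean, covariance and units are illustrative assumptions
mu = np.array([-26.0, 2.9])
cov = np.array([[0.25, -0.045], [-0.045, 0.01]])
samples = rng.multivariate_normal(mu, cov, size=5000)
N = np.array([cycles_to_length(np.exp(lnC), m, a0=9e-3, ai=25e-3, dsigma=48.0)
              for lnC, m in samples])
print(N.mean(), N.std())                           # empirical distribution of N_{a_i|a_0}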
The distribution of crack lengths after a given service life (number of cycles). This
procedure computes the propagation of a given defect or distribution of defects in a
given position and the corresponding failure probability during accidental loading. It is
thus more meaningful for real-time fatigue life prediction than the previous one.
The competence for handling such problems of cumulative structural damage has been acquired in the Components Diagnostics and Reliability Sector at the Joint Research Center of the EEC, Ispra, Italy, through the development of analytical models for the representation of the cumulative damage process and for the estimation of lifetime distribution under fatigue loading. Two numerical codes have been developed to this end, namely COVASTOL and RELIEF. The COVASTOL code has been developed in the framework of a more general study on the in-time estimation of the residual lifetime and failure probabilities of nuclear reactor pressure vessels. It is based on the application of probabilistic linear elastic fracture mechanics to statistical distributions of data concerning flaws, material properties and loading conditions (see Dufresne et al., 1988).
The RELIEF code is based on the representation of the process of damage accumulation as a semi-Markovian stochastic process; no assumptions are made about the elementary mechanisms causing the accumulation of damage. The latter approach will be presented in the next section.
The COVASTOL code estimates the FCG rate by Paris' law with statistically distributed coefficients. The probability of onset of unstable crack propagation is estimated through the convolution of the distributions of the stress intensity factor and of the material resistance expressed by the static fracture toughness. The great advantage of this model is its simplicity, although tests are necessary to determine the coefficients m and C. It should, of course, be kept in mind that the Paris relationship does not generally describe correctly the behavior of cracks in the nucleation stage or near fracture; for small ΔK, for example, the propagation rate is overestimated (Nisitani et al., 1992). However, it should be pointed out that no model describes the crack propagation phenomenon in its entirety. Under these conditions, the definition of at least three ranges of ΔK should allow more accurate FCG predictions. In that respect it is also certain that the different methods of treating the original data (a, N) introduce a scatter connected to the more or less pronounced importance of the subjective factor in each method. The method in the COVASTOL code is as follows (see also the ASTM E647-83 standard):
• Starting with the experimental data in each of the ranges considered, (da/dN)_mean is computed for a certain number of ΔK_i levels.
• A linear regression is performed to determine, from these values, the parameters m and C of Paris' law for the ΔK ranges considered. In each of these classes of ΔK a mean value of m is computed and retained, and from this value the distribution of C is calculated. This distribution is presented in the form of a histogram of five class intervals. (A sketch of this regression step is given below.)
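A minimal sketch of this regression step (not the COVASTOL source code) might look as follows: within each ΔK class a straight line ln(da/dN) = ln C + m ln(ΔK) is fitted, the fitted m is retained, and the corresponding per-point C values are binned into a five-class histogram. The function name and the synthetic data are assumptions for illustration only.

import numpy as np

def fit_paris_per_range(dK, dadN, edges):
    # dK, dadN: processed FCG rate data; edges: boundaries of the dK ranges considered.
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (dK >= lo) & (dK < hi)
        x, y = np.log(dK[sel]), np.log(dadN[sel])
        m, lnC = np.polyfit(x, y, 1)                 # slope m, intercept ln C
        C_vals = np.exp(y - m * x)                   # per-point C with m held at the fitted value
        hist, bins = np.histogram(C_vals, bins=5)    # five-class histogram of C
        results.append({"range": (lo, hi), "m": m, "C_hist": (hist, bins)})
    return results

# illustrative synthetic data only
dK = np.linspace(6.0, 25.0, 200)
dadN = 5e-12 * dK**2.9 * np.exp(0.1 * np.random.randn(200))
print(fit_paris_per_range(dK, dadN, edges=[6.0, 10.0, 16.0, 26.0])[0]["m"])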
It is quite important to mention here that in the operations connected with the fatigue crack growth calculation, a special procedure is implemented for the combination of the histograms, as follows:
• a given pair of values (or class intervals) a_0, b_0 (elliptical defects are considered) is combined with every class interval of C (the coefficient of Paris' law) only for the first stress transient;
• after that, a_0, b_0 are transformed into one pair of histograms a_1, b_1. In the subsequent transients only combinations among class intervals of the same order are taken into account.
Concerning the width of the defects, because no data are usually available from manufacturers, its distribution is calculated by estimating the probability for two or more defects (each assumed to have the width of one weld bead) to overlap, both in the horizontal and the transversal section.
The defect length and width distributions so obtained correspond to the defects observed in a weld or a structure after fabrication and before repair, and are corrected
automatically in order to take into account the sample size, the accuracy of the measurement equipment, the size of acceptable defects according to the construction rules, and the reliability of the NDT methods (probability of having undetected and correspondingly unrepaired defects).
To consider all combinations among the a, b and C class intervals at every stress transient would in fact mean continuously mixing the material properties, whose scattering has, on the contrary, to be applied only once to the fatigue phenomenon considered as a whole.
The modeling defined above was introduced in the COVASTOL computer code, thus allowing calculation of the propagation along the two axes of an elliptical defect subjected to a periodical loading. Temperatures and stresses as a function of location and time are given as deterministic analytical functions for each situation.
The probability of onset of unstable crack propagation is calculated as the convolution of
fracture toughness and stress intensity factor histograms. Its evolution is followed during
the stress transients as well as the evolution of any defect.
The COVASTOL program outputs give, on the one hand, the crack growths and, if needed, the evolution of the defect size distribution and, on the other hand, the rupture probability associated with each defect size. The crack growth and rupture probability computation procedures for internal and surface defects, as well as test cases to calculate the rupture risk of welded steel pressure vessels, are presented and analyzed by Dufresne et al. (1986, 1988), to which the reader is referred for details.
The failure probability, when a sophisticated program like the COVASTOL code is not
available, can be calculated by using the limit state function concept. By integrating the
Paris-Erdogan law (6.12) one obtains:
∫_{a_0}^{a_N} da / [F(a)√(πa)]^m = C Σ_{j=1}^{N} S_j^m     (6.21)
where a_0 is the initial crack size and a_N is the crack size after N stress cycles. For a given critical crack size a_c, failure occurs when a_c - a_N ≤ 0. Hence the limit state function can be expressed as:

g = ∫_{a_0}^{a_c} da / [F(a)√(πa)]^m - C Σ_{j=1}^{N} S_j^m     (6.22)
A similar limit state function using the Forman law (6.13) can be easily evaluated. The failure probability of the structure can be evaluated using the equations of Section 6.3.2 and the statistical distributions of the parameter C (m can be considered constant) and of the load sequence S_j. All the above analysis concerns mainly FCG under static loading conditions.
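A hedged Monte-Carlo sketch of this limit-state calculation is given below; it uses the integral form of eq. (6.21) with F(a) = 1, a random Paris coefficient C and a random stress-range sequence S_j, and counts the fraction of samples for which the limit state function becomes non-positive. All distributions and numerical values are assumptions for illustration, not data from the cited references.

import numpy as np

rng = np.random.default_rng(1)

def limit_state(C, m, S, a0, ac, n_grid=2000):
    # g = integral_{a0}^{ac} da / [F(a)*sqrt(pi*a)]^m  -  C * sum_j S_j^m, with F(a) = 1
    a = np.linspace(a0, ac, n_grid)
    integrand = 1.0 / (np.sqrt(np.pi * a))**m
    lhs = float(np.sum(integrand) * (a[1] - a[0]))
    return lhs - C * float(np.sum(S**m))             # g <= 0 signals failure

n_sim, n_cycles, m = 200, 200_000, 2.9
failures = 0
for _ in range(n_sim):
    C = np.exp(rng.normal(-24.0, 1.0))               # assumed lognormal Paris coefficient
    S = rng.normal(48.0, 2.0, size=n_cycles)         # assumed stress-range sequence [MPa]
    failures += limit_state(C, m, S, a0=9e-3, ac=32.68e-3) < 0
print(failures / n_sim)                              # estimated failure probability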
Zhu and Lin (1992) propose a new analytical procedure to predict the fatigue life and reliability of a mechanical or structural component, with random material resistance to
crack growth and under random loading. The procedure is applicable when the fatigue crack growth is a slow process compared with the stress process, which is the case for high-cycle fatigue. In the special case in which the stress is a narrow-band stationary Gaussian process and a randomized Paris-Erdogan crack growth law is applicable, analytical expressions have been obtained for the probability densities of the fatigue crack size and the fatigue life, and for the reliability function. A numerical example is given for the case of a degrading system. The accuracy of the proposed analytical procedure is confirmed by comparing theoretical and simulation results.
Quality of the fatigue life prediction and failure prognosis. From the above
discussion it is clear that the quality of the prediction depends directly on the quality of
the method used to process the raw FCG experimental data and to estimate the
parameters of the probabilistic fracture mechanics model. A poor estimation of the
parameters will lead to an inaccurate prediction of the life-time, even if sophisticated
FCG prediction models are used (Stavrakakis et al., 1990).
The currently used standard method to estimate the parameters, ASTM E647-83, has several weak points. The determination of the derivative da/dN required by this method introduces a scatter in the FCG rate data which varies considerably with the data processing technique used. Thus, a significant variation in da/dN at a given ΔK level is introduced by the raw FCG data processing technique. The variability introduced in the estimated distributions of the FCG model parameters by the raw data processing technique leads to a pessimistic structural reliability assessment. Moreover, the ASTM E647-83 standard method is strictly related, and thus limited, to the application of the Paris law or Paris-like models to describe the FCG phenomenon.
Stavrakakis (1992) proposes a general method to process the raw FCG data based on non-linear regression techniques. In this method, the parameters of any probabilistic FCG rate model are estimated directly from its integral form, namely a = f(N). It is not restricted to the application of the Paris law for FCG predictions and, because of its generality, handles any probabilistic FCG rate relationship with any number of parameters. This method yields a significantly reduced contribution of the raw data processing method to the variation in da/dN at a given ΔK level. The performance of the method is evaluated using the integrated computer program COVASTOL for structural reliability assessment, when the FCG rate model coefficients are determined by the currently used ASTM E647-83 standard method and by the new technique proposed there.
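A hedged sketch of this idea, fitting an FCG model directly in its integral form a = f(N) instead of differentiating the raw data, is given below. The closed-form a(N) used here is the integrated Paris law for a centre crack with F = 1 (valid for m different from 2); the constant stress range, starting values and synthetic data are assumptions, and the sketch is not the procedure of the cited reference.

import numpy as np
from scipy.optimize import curve_fit

DSIGMA = 48.0                                        # assumed constant stress range [MPa]

def a_of_N(N, C, m, a0=9e-3):
    # integrated Paris law, a(N), for F = 1 and m != 2
    k = 1.0 - m / 2.0
    return (a0**k + k * C * (DSIGMA * np.sqrt(np.pi))**m * N)**(1.0 / k)

# one synthetic (N, a) sample function standing in for a real test record
N_data = np.linspace(0.0, 2e5, 50)
a_data = a_of_N(N_data, 2e-11, 2.9) * (1 + 0.01 * np.random.randn(50))

(C_hat, m_hat), _ = curve_fit(a_of_N, N_data, a_data, p0=[1e-11, 3.0])
print(C_hat, m_hat)        # estimates obtained without computing da/dN numerically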
The two methods were used to process the Virkler et al. experimental data. The Virkler et al. data comprise 68 replications with constant amplitude cyclic loading. The data consist of the number of cycles required to reach 164 crack lengths, starting at 9 mm and terminating at 49.8 mm, for each replication. Center-cracked (CCT) aluminum 2024-T3 test specimens were employed. The 68 sample functions of time (cycles) to reach a half crack length a are plotted, statistically analyzed and discussed in Virkler et al. (1979). The non-linear FCG rate model considered was the Paris law, for convenience.
In order to evaluate the influence of the FCG data processing method on the results of an FCG prediction program, the COVASTOL program was run for the same initial crack and stress transient conditions as in the Virkler et al. experiments, namely a_0 = 9 mm and Δσ = 48 MPa, except for the Paris law parameters, i.e. the mean value of m and the C-histograms (Stavrakakis, 1992). First, the prediction of the defect propagation after a service life of 2×10^5 cycles produced by the COVASTOL program is performed, when the Paris law parameters used are those derived by the standard ASTM E647-83. Then, the defect propagation prediction after 2×10^5 cycles is calculated by the COVASTOL program for the same initial and loading conditions, but with the C-histograms and m-mean values derived by the non-linear regression method. Finally, the real defect distribution (histogram) after a service life of 2×10^5 cycles, derived directly from the Virkler et al. experimental data, is given.
A comparison of the defect histogram predicted with the ASTM-derived parameters for the propagated crack length after 2×10^5 cycles with the real defect histogram has shown that, even if the real crack length classes are predicted, the predicted probability of the upper classes (crack length a between about 31 mm and 40 mm) is very high (about 50 per cent) compared to reality (less than about 10 per cent). Moreover, the prediction gives a small probability (about 8 per cent) of fast crack propagation and fracture (crack length a from about 55 mm up to about 71 mm) that does not exist in reality. This is a quite conservative (i.e. pessimistic) prediction.
A comparison of the defect histogram predicted with the non-linear regression parameters after 2×10^5 cycles with the real defect histogram has shown that the predicted probabilities of the different crack length classes differ by less than 10 per cent from those of reality, and a successful coincidence between the two histograms occurs.
Thus, it is obvious that even though the variability introduced by the raw FCG data processing technique of the ASTM E647-83 standard does not induce a significant amount of bias in the processed results, it can induce an unacceptable bias in the final FCG prediction and residual lifetime results, which makes them conservative and thus less realistic.
In the above experimental evaluation the Paris law was used because this is the case in the COVASTOL program. This is not restrictive in any way. An analysis examining the applicability of the unified fatigue crack propagation (FCP) approach proposed earlier to FCP in engineering plastics such as PMMA and PVC is described by Chow and Wond (1987).
A Paris-like formulation is proposed to characterize FCP in polymeric materials and it is found, using measurements, that it can assess satisfactorily the FCP in both PMMA and PVC materials.
In this way, all the considerations of this section can be easily extended, using this formulation, to assess in-time the FCP phenomenon in polymeric materials and plastic pipes.
In general both the loading actions and the resistance degradation mechanisms have the
characteristics of stochastic processes. They can thus be defined as random variables
which are functions of time. The particular load history which affects a component is one
of the possible realizations of the stochastic load process and the same applies for the
environmental condition or for the evolution of the dimensions of a defect inside the
component.
The prediction of the component lifetime is to a large extent based on the representation of the stochastic processes which act on the component. The damage accumulation mechanisms can, in general, be represented by a positive "damage rate" function such that the measure of damage is a monotonically increasing function of time.
The physical situation to be contemplated is as follows: a structural component is in
operation in a certain environment. During cyclic operation, irreversible changes occur.
These irreversible changes accumulate until the component can no longer perform
satisfactorily. The component is then said to have failed. The time at which the
component ceases to perform satisfactorily is called the time-to-failure or the lifetime of
the component.
The process by which the irreversible changes accumulate is called a cumulative damage
(CD) process. Fatigue, wear, crack growth, creep are examples of physical processes in
which CD takes place.
The particular damage process of interest here is FCG, as experienced for instance in failures of pressurized mechanical systems whose structure contains defects as a result of technological operations such as welding. The defect dimensions, although continuous variables, are in fact associated with a discrete level or state, which allows (without excessive restrictions) the use of well-known mathematical tools for discrete Markov processes.
The damage levels are represented by the states j = 1, 2, ..., b, b being the conventional rupture state. The loading process is represented at cycle x by the transition matrix P_x:
P_x = | p_1  q_1  0    ...  0        0       |
      | 0    p_2  q_2  ...  0        0       |
      | ...                                  |     (6.23)
      | 0    0    0    ...  p_{b-1}  q_{b-1} |
      | 0    0    0    ...  0        1       |

where p_j, q_j > 0 and p_j + q_j = 1; j = 1, 2, ..., b-1.
As the transitions between the states are governed by eq. (6.23), the damage state at cycle x is linked to that at cycle x-1 by:

p_x = p_{x-1} P_x     (6.24)

and thus,

p_x = p_0 Π_{k=1}^{x} P_k     (6.25)

which describes a unitary jump (UJ) stationary stochastic process.
The relationships (6.23)-(6.25) represent the mathematical basis of the discrete Markov process; from them one can easily find the probability distribution of the number of cycles to failure and of the damage level at a given number of cycles x (Bogdanoff and Kozin, 1985).
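A minimal sketch of this calculation for a unitary-jump B-model is given below: the state probability vector is propagated cycle by cycle through the transition matrix of eq. (6.23), and the cumulative distribution of the cycles to reach the rupture state b is read off its last component. The p_j values are illustrative assumptions, not parameters fitted to any data set.

import numpy as np

def b_model_failure_cdf(p_stay, n_cycles):
    # p_stay: probabilities p_1 ... p_{b-1} of remaining in each transient damage state
    b = len(p_stay) + 1
    P = np.zeros((b, b))
    for j, pj in enumerate(p_stay):
        P[j, j], P[j, j + 1] = pj, 1.0 - pj          # unitary jump: stay, or move one state up
    P[b - 1, b - 1] = 1.0                            # absorbing rupture state b
    p = np.zeros(b)
    p[0] = 1.0                                       # all probability mass in state 1 at x = 0
    cdf = np.empty(n_cycles)
    for x in range(n_cycles):
        p = p @ P                                    # eq. (6.24): p_x = p_{x-1} P
        cdf[x] = p[-1]                               # Pr{rupture state reached by cycle x}
    return cdf

cdf = b_model_failure_cdf(p_stay=np.full(20, 0.999), n_cycles=50_000)
print(int(np.searchsorted(cdf, 0.5)))                # median number of cycles to failure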
The sample functions (SFs), that is the functions a(N) of each sample of the set from FCG experiments, are the complete, even if elementary, representation of the damage process. Starting from a set of SFs and from the first two statistical moments (mean value and variance of the cycles) related to a given value of the crack size a, the Markovian model (the above three equations) of the FCG process can be defined. This is called a unitary-jump (UJ) stationary B-model of CD. The mathematical details of this operation may be found in Bogdanoff and Kozin (1985). The important point to be remarked here is the fundamental hypothesis of a Markovian process (that is, the statistical independence of damage states).
The Markovian hypothesis characterizes a process "without memory" of past events except those which occurred in the time immediately before. This assumption is purely theoretical, because any damage state depends on the past history (Lucia et al., 1987, Kozin and Bogdanoff, 1992, Bogdanoff and Kozin, 1985).
The disadvantages of the Markovian assumption are related mainly to an overestimation
of the variance of the predicted time-to-failure distributions, when the initial crack
population is different from the trivial case of a single crack, located at the origin of the
SFs set.
The way to overcome this limitation of the B-model of CD has been suggested by Bogdanoff and Kozin (1984), who consider the propagation of a population of cracks as the superposition of many elementary propagation processes, each one starting from a particular crack size a_k belonging to a given initial distribution. This can be done if one thinks in terms of many UJ stationary B-models, each starting from crack size a_k and considering as random variables the differences in cycles (N_j - N_k), where N_k is the cycle number corresponding to a_k. These new variables constitute a subset of the main random variable set N_j defined for every j ≠ k.
The statistical moments of first and second order corresponding to these variables are expressed taking into account the statistical dependence between N_j and N_k. The
application of the method of statistical moments to the random variables (N_j - N_k) and/or (T_j - T_k) (T_j being the holding time in state j) for the estimation of the parameters of each UJ Markovian model implicitly introduces the statistical dependence of the theoretical random variables N_j (or T_j).
A CD model having these characteristics is called a semi-Markovian B-model, because the fundamental assumption of an elementary Markovian model is disregarded and the dependence between the N_j levels is considered. With this difference in mind the computer code RELIEF 2.0 was developed at the Joint Research Center of the EEC at Ispra, Italy, which optimizes the efficiency of the Markovian scheme according to the above considerations. The calculation of the first and second order statistical moments to estimate the CD B-model parameters is now included in the code itself, due to the dependence of this calculation step on the current crack size. In particular, the evaluation of the covariances now has to be carried out in order to account for the statistical dependence between the numbers of cycles at the different crack sizes describing the process.
In their recent work, Kozin and Bogdanoff (1992) propose and study a probabilistic macro model of FCG based upon a micro result from reaction rate theory. A center crack panel under periodic tensile load is the basic physical situation considered. The model's explicit dependence on the temperature and the wave form of the periodic load indicates the importance of these two quantities in the evolution of the crack length. The straightforward relation of the semi-Markovian B-model parameters with the parameters of this probabilistic model illuminated many of the complexities that are experimentally observed in the FCG process.
The simplicity and flexibility of models based on Markov schemes is the reason for their frequent appearance in the literature. In cases where the emphasis is on the stochastic process of the loads and environmental conditions rather than on the mechanism of damage accumulation, the traditional techniques for the treatment of processes of this type become more important. It is in this context that the Caldarola and Bolotin methods are described representatively by Lucia (1985). Many others can be found in the literature.
The structural reliability, in its most stringent formulation, can be defined as the
probability that the largest of the loads envisaged is smaller than the smallest of the
resistances hypothesized. This means that what one needs to know is the distributions of
the extreme values of the loads and of the resistances, rather than their effective
distributions. This observation, together with the fact that the possible distributions of the extreme values of a random variable are asymptotically independent of the distribution of the variable itself, leads to the consideration of extreme value theory as a fundamental ingredient of structural reliability.
Some methods, all based on the hypothesis that the lowest resistance has a Weibull
distribution, have been proposed by Freudenthal, Ang and Talreja and presented by
Lucia, (1985).
The prediction performed by the RELIEF code, although more precise than that performed by the COVASTOL code (smaller scatter), is limited in its applicability with respect to complex real situations.
FCG predictions allowed by the RELIEF code are those concerning the SF sets (same
material, environment conditions, type of load) corresponding to the different loading
intensities (stress transients) which have been catalogued in the databank.
The principle underlying this methodology is that the fatigue crack growth data (N, a) occur in the form of a time series in which the observations are dependent. This dependency is not necessarily limited to one step (Markov assumption) but can extend many steps into the past of the series. Thus, in general, the current value N_a (the number of cycles at crack size a) of the process N can be expressed as a finite linear aggregate of previous values of the process and of the present and previous values of a random shock u (Solomos and Moussas, 1991), i.e.

N_a = φ_1 N_{a-1} + φ_2 N_{a-2} + ... + φ_p N_{a-p} + u_a - θ_1 u_{a-1} - ... - θ_q u_{a-q}     (6.26)
In eq. (6.26), N_a, N_{a-1}, N_{a-2}, ... and u_a, u_{a-1}, u_{a-2}, ... represent, respectively, the number of cycles and the value of the random shock at the indexing, equally spaced crack sizes a, a-1, a-2, ... The random shock u is modeled as a white noise stochastic process, whose distribution is assumed to be Gaussian with zero mean and standard deviation σ_u (specified by the structure's random loading conditions).
Defining the autoregressive operator of order p by

φ(B) = 1 - φ_1 B - φ_2 B^2 - ... - φ_p B^p

and the moving-average operator of order q by

θ(B) = 1 - θ_1 B - θ_2 B^2 - ... - θ_q B^q,

eqn. (6.26) can be rewritten compactly as

φ(B) N_a = θ(B) u_a.

It is recalled that B stands for the backward shift operator, defined as B^s N_a = N_{a-s}. Another closely related operator, to be used below, is the backward difference operator ∇, defined as ∇N_a = N_a - N_{a-1} and thus equal to 1 - B.
In an attempt to physically interpret the above equations and connect them to the observed inhomogeneous crack propagation properties, one could associate the autoregressive terms with the mean behavior of each individual test curve and the moving-average terms with the non-smoothness within it, which is due to the inhomogeneity of the material ahead of the crack tip. In this manner, this spatial irregularity is approximated by the homogeneous random field u.
In this equation the parameters p, d, q and the operators φ_p(B) and θ_q(B) are exactly as those defined for the ARIMA model and refer to the aforementioned point (i), while ∇_s = 1 - B^s, and Φ_P(B^s) and Θ_Q(B^s) are proper polynomials in B^s of degrees P and Q, respectively, representing the relationships of point (ii) above. This multiplicative process is said to be of order (p, d, q)×(P, D, Q)_s.
The building of the model for a specific physical problem is composed again of the same steps: identification, estimation and diagnostic checking. The general scheme for determining a model thus includes three phases:
• Model identification, where the values of the parameters p, d, q are defined.
• Parameter estimation, where the {φ} and {θ} parameters are determined in some optimal way, and
• Diagnostic checking, for controlling the model's performance.
As is stated, however, by Box and Jenkins (1976), there is no uniqueness in the ARIMA models for a particular physical problem. In the selection procedure, among potentially good candidates, one is aided by certain additional criteria. Among them are Akaike's information criterion (AIC) and Schwartz's Bayesian criterion (SBC). If L = L(φ_1, ..., φ_p, θ_1, ..., θ_q, σ_u) represents the likelihood function formed during the parameter estimation, the AIC and SBC are expressed, respectively, as

AIC = -2 ln L + 2k
SBC = -2 ln L + ln(n)·k     (6.29)

where k is the number of free parameters (= p + q) and n the number of residuals that can be computed for the time series. Proper choice of p and q calls for a minimization of the AIC and SBC. Last, in the overall efficiency of the model, the principle of parsimony should be observed. Inclusion of an excessive number of parameters might give rise to numerical difficulties (ill-conditioning of matrices, etc.), and might render the model too stiff and impractical.
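A hedged sketch of this identification step is given below: candidate ARIMA(p, d, q) orders are fitted to one cycles-versus-crack-size record and ranked by AIC and SBC (called BIC in the library). The statsmodels package is assumed to be available, and the doubly integrated synthetic series merely stands in for a real record such as one of the Virkler et al. sample functions.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
# placeholder record: cumulative cycles indexed by 164 equally spaced crack sizes
N_series = np.cumsum(np.cumsum(50.0 + rng.normal(0.0, 3.0, 164)))

best = None
for p in range(3):
    for q in range(2):
        fit = ARIMA(N_series, order=(p, 2, q)).fit()
        if best is None or fit.aic < best[0]:
            best = (fit.aic, fit.bic, (p, 2, q))
print(best)       # (AIC, SBC/BIC, order) of the candidate minimizing the AIC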
Model building. It is well known that the ARMA model (6.26) can be written for identification purposes in the form of an observation equation as follows (see also Chapter 3):

y = N_a = [N_{a-1}  N_{a-2} ... N_{a-p}  u_a  -u_{a-1} ... -u_{a-q}] [φ_1 ... φ_p  1  θ_1 ... θ_q]^T = u^T θ     (6.30)
An ARIMA model can also be written easily in a similar form, considering the new stationary process ∇^d N in the place of the process N.
On the other hand, as mentioned earlier in Section 6.3.3, the Paris-Erdogan and Forman logarithmic FCG equations (6.12) and (6.13) are the most suitable for accurate FCG prediction purposes because they can model satisfactorily the curves of fig. 6.17. The experimental points of fig. 6.17 do not form exactly a straight line. However, straight lines modeled by the Paris-Erdogan and Forman logarithmic FCG equations (6.14) and (6.15) of Section 6.3.3 can adequately represent large portions of them. The logarithmic equations (6.14) and (6.15) can also be rewritten in an observation form as follows:
The same considerations are obviously valid for the FCG laws of crack length as an
exponential function of the number of accumulated cycles, presented before.
It can therefore be claimed that quite efficient linear regression models for the fatigue
crack growth phenomenon have been constructed. In addition, they have the advantage
of being compact, easily presentable and implementable. They can thus serve in practical
situations, as they can readily furnish updated predictions of a component's residual
lifetime after periodic inspections.
Every such model is built on the primary form of information of the crack growth, i.e. the (N, a) sample functions, and consequently is suitable for a specific set of geometric and loading conditions. The possibility of utilizing the same model under different conditions, or of attaching physical significance to its parameters, can also be envisaged.
In particular, if one considers moving windows of data of appropriate length, iterative regression techniques can be used to track the varying conditions. In this way an adaptive prediction method is introduced by Stavrakakis and Pouliezos (1991), which is especially desirable in such cases, since the parameters of the logarithmic Paris-Erdogan (6.14), logarithmic Forman (6.15), ARMA, ARIMA and logarithmic exponential FCG models (6.17) change with time (number of cycles), due to the continuous variation of the conditions related to the FCG process (stress transients, random overloads, temperature, material properties, inspection technique variability, etc.).
To denote explicitly the dependence of the estimated parameters of the various regression models (6.14)-(6.17) on the number of cycles, the observation equation derived before for the various model cases may be written more accurately as

y = u^T θ(N)     (6.32)

For n pairs of (a(N), N) experimental points, the well-known linear least-squares regression formula gives

θ̂(N) = (U^T U)^{-1} U^T y     (6.33)

where U is the matrix whose rows are the regression vectors u^T and y is the vector of observations.
The FCG law ln a(N) = C* + m_3 N (see Section 6.3.3, eqs. (6.17)) is fitted by Stavrakakis and Pouliezos (1991) to the Virkler et al. (1979) data, using the linear moving-window regression technique described before. The "deterministic" value of the parameter m_3 is estimated to be 6.89×10^-6 and the mean value and variance of the parameter C* are estimated as 1.94 and 7.67×10^-3, respectively.
In this case, the failure probability of the structure or component can be calculated in closed form (see Stavrakakis and Pouliezos, 1991, for details).
Parameter A represents an estimate of the mean number of cycles needed to attain the critical crack length a_c.
Parameter B reflects the quality of the product, i.e. the variability of the properties of the material and of the loading and thermal conditions, or the measurement error introduced by the crack detection method.
The failure probability of a cracked aluminum 2024-T3 structure subjected to the loading conditions of the Virkler et al. (1979) experiment can be calculated using the above equations. The failure probability for Virkler's experiment at N = 2×10^5 cycles, and for a critical crack length a_c = 32.68 mm, is found by applying the above equations to be U(200000) = 0.0274. Parameters A and B were found to be 224958.15 and 12715.5 cycles, respectively. From the propagated crack length histogram at N = 2×10^5 cycles, derived directly from the Virkler et al. (1979) experiment, the same probability is evaluated as U(200000) = 0.0294. This represents a discrepancy of 6.8%.
The usefulness of the moving window method is illustrated using one set of Virkler's data. Simulation runs for the one-step-ahead predictor indicated that the optimum window length was n_w = 4. This produced a maximum absolute prediction error of 0.23 over the whole range of data. If predictions over a longer horizon are required, simulation runs could establish the corresponding optimum window length.
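A minimal sketch of such a moving-window scheme is given below: the law ln a(N) = C* + m_3 N is re-fitted by least squares over the last n_w points and used for a one-step-ahead prediction of the crack length. The synthetic record and noise level are assumptions for illustration; the sketch is not the implementation of the cited reference.

import numpy as np

def one_step_predictions(N, a, n_w=4):
    preds = []
    for i in range(n_w, len(N)):
        x, y = N[i - n_w:i], np.log(a[i - n_w:i])
        m3, c_star = np.polyfit(x, y, 1)             # window least-squares fit of ln a = C* + m3*N
        preds.append(np.exp(c_star + m3 * N[i]))     # predicted crack length at the next N
    return np.array(preds)

N = np.linspace(0.0, 2e5, 164)
a_true = 9.0 * np.exp(6.89e-6 * N)                   # exponential FCG law, crack length in mm
a_meas = a_true * (1 + 0.002 * np.random.randn(a_true.size))
err = np.abs(one_step_predictions(N, a_meas, n_w=4) - a_true[4:])
print(err.max())                                     # maximum absolute one-step prediction error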
In cases where crack length measurements are available on-line using appropriate
hardware equipment, the recursive nature of the method makes it suitable for an
integrated automatic safety alarm system.
Autoregressive integrated moving-average processes have been employed by Solomos
and Moussas (1991) for the modeling of the number of cycles over the crack size for the
fatigue crack propagation phenomenon. Even though no perfect stationarity conditions
have been obtained in the treatment of the Virkler et al. (1979) records, an overall good performance of the derived models has been observed. It has been found that a single record can be reproduced satisfactorily by an ARIMA process of order (p, d, q) = (2, 3, 1). The quality of the forecasts depends upon the origin; an early origin allows for short forecasts while a later origin yields unconditionally good forecasts. A multiplicative ARIMA process of order (p, d, q)×(P, D, Q)_s = (1, 2, 1)×(0, 1, 1)_89 has been found to represent very efficiently the whole set of the fatigue crack records. Its forecasting capabilities are excellent both at reproducing existing data, and at the monitoring and prediction of new experiments.
As has already been discussed, the mathematical FCG models available for representing the relevant physical processes are only approximate representations of physical reality, having peculiar, but often ill-defined, characteristics of precision, sensitivity and range of validity. Furthermore, they do not constitute an exhaustive representation of reality. The knowledge to be used, related to various fields, is not fully representable by algorithms or mathematical tools but also contains qualitative and heuristic parts. Any a priori estimate of the life span distribution of a structure therefore shows quite a large scatter, which can be progressively reduced by using proper updating techniques. Traditional algorithmic approaches are unable to cope with such a complex context.
Expert systems are potentially the breakthrough. Expert systems, roughly consisting of a procedure for inferring intermediate or definitive conclusions on structural damage and remnant lifetime, using the domain knowledge and the accumulating service data, can deal with real-world problems by properly incorporating all the knowledge which may become available. An expert system for structural reliability assessment must have the ability to analyse and interpret large quantities of information in order to achieve the following goals:
• Identification of the actual state of the structure and of the damage process actually
taking place.
• Prediction of the future behavior of the structure.
• Decision and planning of appropriate actions.
The backbone of the expert system can be thought of as a coordinator and manager of operators which mutually collaborate and supply the information the system needs. Each step of the assessment procedure (e.g. defect population identification, material properties selection, microdamage analysis, macrodamage analysis, etc.) can constitute one operator or be subdivided into more specialized operators. The user can exploit interactively the functions performed by the operators. Rules and decision criteria can be modified under a set of metarules. The modular array allows an easier representation of the knowledge base and an incremental construction of the system (see also Chapter 4 and Jovanovic et al. (1989)).
An expert system for assessing damage states of structures will consist of an interpreter, a data base and a rule base. All the rules involved are described through production rules with certainty factors. The inspection results are used as the input data. The inspection results regarding cracks are first input into the system, and rules concerning their damage degree, cause and expansion speed are applied to provide a solution for the damage assessment. This inference procedure is performed as shown in fig. 6.18.
The uncertainties involved in the input data and rules can be taken into account by introducing certainty factors. Damage pattern, damage cause and deterioration speed are employed to interpret the inspection data from a multi-aspect point of view.
Certainty factor. Most of the data available in damage assessment generally include certain kinds of uncertainty, and experience-based knowledge may be vague and ambiguous. Thus, an expert system should have the ability to treat these uncertainties in a logical manner. The certainty factor is calculated hereafter. Input data and production rules are written as follows, with certainty factors:

Data 1: C_1; Data 2: C_2; ...; Data p: C_p
IF Ant. 1, Ant. 2, ..., Ant. m
THEN Con. 1: C_1', Con. 2: C_2', ..., Con. n: C_n'

where Ant. and Con. denote antecedent and conclusion, respectively, and C_p and C_j' are certainty factors; p, m and n are the numbers of input data, antecedents and conclusions, respectively. At execution of the inference procedure using the rules, including the certainty factors, the following must be done:
1. Calculate the certainty factor for the resultant antecedent.
2. Calculate the certainty factor for the resultant conclusion.
3. Determine the final conclusion and calculate its certainty factor when two or more rules provide the same conclusion.
One can employ the following calculation methods corresponding to the items above (a short sketch in code follows this list):
1. C_in = min(C_1, C_2, ..., C_m), where C_in is the certainty factor for the resultant antecedent.
2. C_out,k = C_in × C_k', where C_k' is the original certainty factor for the k-th conclusion and C_out,k is the certainty factor of the k-th output.
3. The certainty factor C for the final conclusion is calculated as follows, using C_out,k: C = max(C_out,1, C_out,2, ..., C_out,k).
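A small sketch of this certainty-factor calculus is given below; the rule certainty factors and data values are illustrative assumptions chosen only to show steps (1)-(3), and the function name is hypothetical.

def fire_rule(antecedent_cfs, rule_cf):
    c_in = min(antecedent_cfs)          # step 1: CF of the resultant antecedent
    return c_in * rule_cf               # step 2: CF of this rule's conclusion

# two hypothetical rules that reach the same conclusion
c_out_1 = fire_rule([0.9, 0.7, 0.8], rule_cf=1.0)
c_out_2 = fire_rule([0.5], rule_cf=1.0)
final_cf = max(c_out_1, c_out_2)        # step 3: final CF over rules with the same conclusion
print(final_cf)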
Suppose that inspection data are as given in Table 6.2.
By dividing the data-base and rule-base into several groups, it becomes possible to
reduce the execution time which is proportional to the number of available rules.
Fig. 6.19 shows examples of rules for the inference process.
(damage-degree-2-1
if
(direction-of-cracks 2-directions=CF1)
(width-of-cracks middle=CF2)
(interval-of-cracks small=CF3)
then
(*deposit (damage-degree A (*times 1.0 (*min=CF1 =CF2=CF3)))))
(damage-degree-4-1
if
(fracture large=CF1)
then
(*deposit (damage-degree A(*times 1.0 (*min=CF1 )))))
Figure 6.19 Examples of rules for the damage degree of reinforced concrete bridge decks.
In practice, the values of the certainty factors involved in the input data and production rules are given by an expert who has been engaged in maintenance work for more than 20 years. First, matching succeeds in the rule (damage-degree-2-1), where 0.9, 0.5 and 0.7 are prescribed for =CF1, =CF2 and =CF3, respectively. The symbol = denotes that CF_i is a variable. Second, C_in is calculated as 0.7, using step (1), i.e., C_in = min(=CF1, =CF2, =CF3). According to step (2), C_out,k is obtained as 0.7 from 0.7×1.0. This leads to the conclusion that the damage state is A with CF = 0.7. Similarly, the rule (damage-degree-4-1) leads to another conclusion that the damage state is A with CF = 0.5. From these two conclusions, the final conclusion is that the damage state is A with CF = 0.7, using step (3).
In the MYCIN approach (see Chapter 4), the certainty factors are formally defined and extensively tested, and correct results/diagnoses have been obtained in many circumstances.
Evaluation method. The usual damage state evaluation is based only on the information obtained from visual inspection. If one desires high accuracy in this evaluation, the damage degree ought to be classified into several categories; however, too many categories may induce contradictions among the assessments made by different individuals and make the classification meaningless. To increase the evaluation accuracy, one can introduce three damage measures: damage pattern, damage propagation pattern and damage cause. An
appropriate damage pattern is chosen among prescribed basic damage patterns. Similarly,
the most probable damage propagation pattern is determined by using the inference
results of the crack occurrence time, crack pattern, cause of crack, and serviceability of
the concrete deck. Basic damage patterns are determined by considering the following:
• Pattern 1: Severe damage is seen all over the structure.
• Pattern 2: Severe damage is concentrated at the structure edges.
• Pattern 3: Severe damage is concentrated at both ends of a structure component.
• Pattern 4: Severe damage is concentrated at the overhang portions of the structure (if these portions exist).
• Pattern 5: Severe damage is concentrated in the structure's center region.
• Pattern 6: Severe damage is not seen all over the structure.
To demonstrate the usefulness of the expert system in FCG real-time assessment, a plate-girder bridge with four main girders and seven cross beams is employed by Shiraishi et al. (1991).
A large number of rules useful for the damage assessment could be acquired through intensive interviews with well-experienced engineers working on repair and maintenance. The use of certainty factors can lead to a reliable conclusion using vague and ambiguous data and rules. By introducing the three damage measures (damage pattern, damage propagation pattern and damage cause), it is possible to provide useful information for predicting the change of structural durability in the future.
The damage causes are estimated on the basis of damage degree, damage pattern, and loss of serviceability; this estimation is important for clarifying the mechanism of damage occurrence as well as useful for establishing an efficient repair and maintenance program (see Lucia and Volta, 1991).
Recently, Vancoille et al. (1993) have developed a new module that explicitly deals with corrosion troubleshooting. During the development of this module it was observed that expert systems are not always suited to carrying out part of the tasks involved in corrosion troubleshooting. Therefore, the possibilities of neural networks were investigated. It was realized that they have some potential that might open completely new perspectives in dealing with problems where expert systems tend to fail. The combination of expert system and neural network techniques gives rise to powerful architectures that can be used to solve a wide range of problems.
In cases where conventional analytical techniques cannot provide a useful means for the
evaluation of system reliability, techniques based on expert opinions may be used until
such time that either performance data can be obtained and/or mathematical modeling of
system reliability along with adequate field or laboratory data can be used. The expert
opinion technique can also be used in conjunction with an analytical approach in cases
where the performance data are sparse but system failure modes are well known
(Mohammadi et al. , 1991).
Specific examples of engineering systems for which the expert opinion approach can be
used in lieu of acquiring data from conventional sources are given next.
Bridge inspection. In this problem, the evaluation of bridge components, i.e., the determination of their levels of deterioration and extent of damage, is conducted by experts (bridge inspection personnel). The results of an inspection are then verified, analyzed and used along with structural analyses to arrive at a specific rating for a given bridge. The rating is indicative of the level of structural integrity of the bridge.
Interior gas piping systems. Interior gas piping systems operate under low pressure, 1.75 to 14.00 kPa (0.25 to 2.0 psi). Under normal operating conditions, the internal stresses are low and do not pose any safety problems. However, there are many factors (such as poor installation practice, component malfunction, loose joints due to external factors, etc.) that can contribute to system failure resulting in a leak. An expert opinion approach can effectively be used (i) to identify components' modes of failure; and (ii) to compile system performance data for reliability evaluation purposes (Mohammadi et al., 1991, Sandberg et al., 1989).
Human error. The impact of human error on the reliability of an engineering system is
another problem that may be investigated using the expert opinion approach. One typical
example is fabrication errors occurring during construction of a facility. Identification of
factors that may ultimately promote structural failure and evaluation of the likelihood of
occurrence of such factors can be done using the expert opinion approach.
In the above three examples the objective is well defined, i.e., the objective is to acquire information on the performance of a system and to determine its reliability. In certain non-engineering areas, however, the objectives may be unknown or not clear. Thus a separate expert opinion survey may be used only to arrive at a set of objectives and attributes for the problem being investigated.
In engineering problems, because the objectives are often well known, the expert opinion approach becomes simply a data collection process that can be used for one or more of the following tasks:
• Identification of failure modes in terms of component or system performance.
• Establishment of statistics or occurrence rates for individual modes of failure.
• Fault-tree and event-tree analyses and identification of the sequence of events (scenarios) whose occurrence would lead to the formation of a top event (in fault tree analysis) or a series of consequences (in event tree analysis).
The general process of the expert opinion method depends very much on the type of problem. As described earlier, in cases where the problem's objectives are well defined and the parameters influencing these objectives are also known, the procedure degenerates to a data collection scheme for ranking or scaling the objectives and their associated parameters. Many engineering problems fall under this category and represent cases each with a limited number of well-defined objectives. Each objective may then be expressed with a performance level and a series of attributes. In other extreme cases, where uncertainties exist in specific objectives and their attributes, the expert opinion approach may become very complicated. Generally, problems associated with societal or economic issues fall under this category. In such cases the method may have to be repeated for several rounds before a final decision on the objectives can be made.
The following list presents the basic elements of the method and can well be expanded for certain cases.
1. Discuss why the expert opinion approach is employed instead of other methods.
2. Identify a series of objectives in the study. If the objectives are not well defined, a separate expert opinion approach may be used to arrive at definite objectives.
3. Solicit expert opinions for ranking or scaling these objectives. At this stage the final refinement of the rankings may be done in more than one round, if time and money permit, and especially if a somewhat large discrepancy in the opinions is observed.
4. Summarize the findings in a form that can be used as a mathematical tool for the system risk analysis or merely as a support document. The findings may also be evaluated using statistical methods. Of course, prior to these steps, experts must be identified.
A case study is presented by Mohammadi et al. (1991) to demonstrate the applicability of the expert opinion approach in system reliability evaluation. In this case study, the risk associated with leak development in several interior gas piping systems is evaluated and the results are presented. The structure considered in the case study is a simple system made of components with binary modes of failure. For more complicated structures with multiple independent and/or dependent modes of failure, the reliability formulation and evaluation of results require additional analyses, including the translation of the expert opinion data into numerical values that can be used in the formulation of the individual modes of failure. One objective of the case study presented there was to compare an existing system (black steel piping system) with a new product (corrugated stainless steel tubing). In the absence of reliable performance data on these systems the expert opinion approach was employed. As demonstrated in this example, the approach offers an effective method for the analysis of the system reliability of each system and the evaluation and comparison of the performance of the two systems.
To treat the uncertainty and ambiguity involved in expressions in terms of natural language, it is useful to introduce the concept of fuzzy sets.
Garribba et al., (1988), present a specific application of fuzzy measures relevant to
structural reliability assessment for the treatment of imperfections in ultrasonic inspection
data.
Looking from a general point of view at the problem of combining multiple non-homogeneous sources of knowledge, whilst the structure of the composition problem can differ from one case to another, the preservation of a general pattern may be supposed. Thus, the investigation and characterization of this pattern can help to highlight the nature of the dependencies between the different sources.
Assessment of damaged structures is usually performed by experts through subjective judgments in which linguistic values are frequently used.
The fuzzy set concept is then used to quantify the linguistic values of the variables of the damage criteria and to construct the rules. Assessments from the same group of experts may result in rules with the following cases:
1. Similar antecedents and consequents.
2. Similar antecedents but different consequents.
a fuzzy relation such that R11 = MAJ×VSE, R31 = EXP×VSE, R1'1' = VSB×SEV, and R3'1' = VEX×SEV, where Rij is the fuzzy relation between ANT i and CONS j; R11 and R1'1' are contained in the classes of all fuzzy sets of (AR×DL); and R31 and R3'1' are contained in the classes of all fuzzy sets of (RC×DL).
The combined relations of R11 and R1'1' can be obtained through the use of the modified combined fuzzy relation method introduced by Boissonnade, which is an extension of Mamdani's approach which combined all relations through fuzzy disjunctions. The method uses modified Newton iterations to reach an optimal solution for the combined fuzzy relations. Details of these techniques can be found in Chapter 4 and Hadipriono and Ross (1987). Through the use of this method, the combined relations of R11 and R1'1' yield R111'1'. A similar procedure is performed for R31 and R3'1' to yield R313'1'. The fuzzy composition between R111'1' and R313'1' results in R131'3', contained in the classes of all fuzzy sets of (AR×RC). The fuzzy set values for AR and RC are the projections of R131'3' on the planes AR and RC, respectively. The result now yields two rules with similar antecedents but different consequents. Hence, similar procedures can be applied as in cases (2), (3) and (4).
A complete rule may require the participation of the three damage criteria. Therefore, the rules should also be combined to incorporate the functionality, repairability and structural integrity of the damaged structure. Zadeh developed the extension principle to extend the ordinary algebraic operations to fuzzy algebraic operations. One method based on this principle is the DSW technique introduced by Dong, Shah, and Wong (see Hadipriono and Ross, 1987). The technique uses the lambda-cut representations of fuzzy sets and performs the extended operations by manipulating the lambda-intervals. For brevity, further details of these techniques can be obtained from the above references.
In order to accommodate the effect of each damage criterion on the total damage, one can include the weighting factor of each criterion. For example, if the weights of the damage level assessed, based on the above three damage criteria, are assumed to be "high" (HIH), "fairly high" (FHI), and "moderate" (MOD), respectively, and the values of the damage level are DL1, DL2, and DL3, respectively, then the overall combined damage level becomes,
combined damage level becomes,
DL _ (HIHxDL1)+(FHlxDL2)+(MODxDL3)
(6.37)
tot - HIH+FHI+MOD
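A hedged sketch of evaluating eq. (6.37) with the lambda-cut idea behind the DSW technique is given below: each fuzzy weight and damage level is represented by its interval at a chosen lambda level, and the weighted average is evaluated with interval arithmetic. The triangular membership functions and their supports are assumptions for illustration only.

import numpy as np

def cut(tri, lam):
    # lambda-cut interval of a triangular fuzzy number tri = (left, peak, right)
    left, peak, right = tri
    return np.array([left + lam * (peak - left), right - lam * (right - peak)])

def weighted_damage_level(weights, levels, lam):
    w = [cut(t, lam) for t in weights]
    d = [cut(t, lam) for t in levels]
    num = sum(wi * di for wi, di in zip(w, d))   # interval products and sums (all values positive)
    den = sum(w)
    return np.array([num[0] / den[1], num[1] / den[0]])   # interval division for positive intervals

HIH, FHI, MOD = (0.70, 0.85, 1.00), (0.55, 0.70, 0.85), (0.35, 0.50, 0.65)
DL1, DL2, DL3 = (0.60, 0.80, 1.00), (0.40, 0.60, 0.80), (0.20, 0.40, 0.60)
print(weighted_damage_level([HIH, FHI, MOD], [DL1, DL2, DL3], lam=0.5))   # DL_tot interval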
Based on the complete rules, new or intermediate rules can be constructed through partial matching.
Consider the following production rule: "IF deformation (DF) is very severe (VSE),
THEN damage level (DL) is severe (SEV)". When a fact shows that DF is VSE, the
consequent is then realized. However, when the value of DF does not match exactly,
e.g., the fact shows that "DF is SEV", then partial matching is in order.
This can be performed by the following fuzzy logic operations: truth functional modification (TFM), inverse truth functional modification (ITFM), and modus ponens deduction (MPD). Brief descriptions of these operations follow.
TFM, first introduced by Zadeh, is a logic operation that can be used to modify the membership function of a linguistic value in a certain proposition with a known truth value. Suppose that the damage level (DL) is "negligible", or NNE, and is believed to be "false", or FA. This proposition can be expressed as

P: (DL is NNE) is FA;  NNE ⊂ DL, FA ⊂ T

where DL is a variable (universe of discourse), T is the truth space, and NNE and FA are the values of DL and T, respectively. The symbol ⊂ denotes "a subset of". Modification of this proposition yields,

P': (DL is DL1);  DL1 ⊂ DL

where DL1 is a value of DL. A graphical solution is shown in fig. 6.21, where the fuzzy sets NNE and FA are represented by Baldwin's model (1980) and plotted in figs 6.21.b and 6.21.a, respectively.
Note that the axes of fig. 6.21.a are rotated 90° counterclockwise from fig. 6.21.b. Since the elements of FA are equal to the membership values of NNE, they are represented by the same vertical axis in fig. 6.21. This means that for any given element of NNE, one can obtain the corresponding element of FA. Also, since the membership values of FA and DL1 are the same, the membership values of DL1 can be found as shown by the arrowheads and plotted in fig. 6.21.b.
ITFM is a logic operation that can be used to obtain the truth values of a conditional proposition. Suppose a proposition, P, is expressed as "damage level is negligible given damage level is severe"; then the proposition can be rewritten as,

P: (DL is NNE) | (DL is SEV);  NNE, SEV ⊂ DL

The ITFM reassesses the truth of (DL is NNE) by modifying this proposition to yield,

P': (DL is NNE) is T1;  T1 ⊂ T

where T1 is the new truth value for (DL is NNE). The truth value T1 can also be obtained through the graphical solution shown in fig. 6.22. Suppose NNE and SEV are again represented by Baldwin's model. The values NNE and SEV are first plotted as shown in fig. 6.22.b. Since the truth level is equal to the membership value of NNE, they lie on the same vertical axis. Hence, for each membership value of NNE, the corresponding element of T1 is also known. Then too, since the membership value of T1 equals that of SEV, for any given element of both NNE and SEV, one can find the corresponding element and membership value of T1. The truth value T1 in fig. 6.22.a is constructed by successively plotting the membership values of SEV (d1, d2, etc.) from fig. 6.22.b at each truth level. Note that the axes in fig. 6.22.a are rotated 90° counterclockwise from fig. 6.22.b.
Modus ponens deduction (MPD) is a fuzzy logic operation whose task is to find the value of the consequent in a production rule, given information about the antecedent. A simple MPD is: A implies B; given A, the conclusion is B. Consider again the proposition: "if deformation is very severe, then damage level is severe".
Brown and Yao (1983) developed an algorithm to illustrate the effect of qualitative parameters in existing structures. In their analysis, the quality Q_i is used to describe the condition, such as good, fair, poor, etc., of the i-th parameter or structural component.
and,
T(j,k) = max_i { min[Q_i(j), C_i(k)] }    (6.39)
in which Q_i(*) and C_i(*) are the membership or degree of belonging at the numerical
rating * of quality Q and consequence C, respectively, for the i-th parameter; T(j,k) is the
(j,k) element of the total effect matrix T; and the symbols ∪ and ∩ represent,
respectively, the relations union and intersection between two fuzzy events. A fuzzy
relation R is then developed relating the consequence to the safety reduction N. The
safety reduction N describes the level of resistance reduction, verbally, according to the
type of the resulting consequence. For example, a "catastrophic" consequence may lead
to a "very large" safety reduction. The fuzzy relation R can be calculated in the same
manner as the total effect T,
R(k,ℓ) = max_i { min[C_i(k), N_i(ℓ)] }    (6.40)
Once the total effect T and the fuzzy relation R are obtained, a safety measure S can be
computed by combining T with R through the operation called composition,
S = T ∘ R
and,
S(j,ℓ) = max_k { min[T(j,k), R(k,ℓ)] }    (6.41)
in which S(j,ℓ) is the (j,ℓ) element of the safety measure matrix S and R(k,ℓ) is the (k,ℓ)
element of the fuzzy relation matrix R. Using a fuzzifier, which in this case extracts the
element with the largest numerical value in each column of the safety measure matrix S,
yields a safety function F. The columns of the matrix S represent the levels of reliability
reduction. The safety function F shows the degree of belonging for each level of safety
reduction, which corresponds to an increase in the probability of failure. This function gives
engineers some idea of the possible reductions in the design reliability after an
inspection has been done. The engineers may use the results to assist them in deciding on the
priority of their actions or in allocating resources in order to maintain the current usage
of the structure.
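Equations (6.39)-(6.41) amount to fuzzy max-min compositions, which the short sketch below illustrates; the Q, C and R arrays are invented for illustration and are not taken from Brown and Yao.

```python
import numpy as np

def max_min_composition(A, B):
    """Fuzzy max-min composition: (A o B)(j, l) = max_k min(A[j, k], B[k, l])."""
    return np.array([[np.max(np.minimum(A[j, :], B[:, l]))
                      for l in range(B.shape[1])]
                     for j in range(A.shape[0])])

# Hypothetical membership data for two parameters (i = 1, 2):
# Q[i][j] = membership of quality rating j, C[i][k] = membership of consequence rating k
Q = np.array([[0.2, 0.8, 0.3], [0.6, 0.9, 0.1]])
C = np.array([[0.1, 0.7, 0.9, 0.4], [0.5, 0.8, 0.2, 0.0]])

# Total effect, eq. (6.39): T(j,k) = max_i min[Q_i(j), C_i(k)]
T = np.max(np.minimum(Q[:, :, None], C[:, None, :]), axis=0)

# Hypothetical fuzzy relation R between consequence and safety reduction, eq. (6.40)
R = np.array([[0.9, 0.3], [0.6, 0.5], [0.2, 0.8], [0.1, 0.9]])

S = max_min_composition(T, R)   # safety measure, eq. (6.41)
F = S.max(axis=0)               # safety function: largest value in each column of S
print("T =\n", T, "\nS =\n", S, "\nF =", F)
```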
An illustration of the Brown and Yao algorithm for structural damage assessment using
the fuzzy relation approach can be found in Chou and Yuan (1992), where a typical rigidly
connected plane frame was analyzed.
The fuzzy relation approach presented by Brown and Yao (1983), incorporating
qualitative parameters in assessing existing structures, was modified by applying a filter to
the total effect T.
Since the fuzzy relation approach presented by Brown and Yao failed to differentiate the
importance of various levels of consequences in the total effect T, a filtering process
was presented by Chou and Yuan (1992) which is used to emphasize the more critical
effects over the minor effects. The total effect T can be modified to
T_f(j,k) = max_i { (k/m) T_i(j,k) }    (6.42)
in which T_f(j,k) is the (j,k) element of the filtered total effect matrix T_f;
T_i(j,k) = min[Q_i(j), C_i(k)]; and m is the total number of numerical ratings used to define
consequence C. Note that this filtering equation assumes that the numerical ratings are in
ascending order of seriousness. That is, 0, 1, 2 and 3 are the numerical ratings for an
"insignificant" consequence, while 15, 16, 17 and 18 are the ratings for a "catastrophic"
consequence. Due to the filtering, the membership values for the less serious
consequences will be reduced substantially. This reduction may lead to a low membership
value in the safety function. However, if one is only interested in the relative degree of
belonging in the safety reduction that the existing structure may have, it is more
appropriate to normalize the safety function so that the highest membership is 1.
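A sketch of the filtering in eq. (6.42) with the linear filter k/m, again using hypothetical Q and C data; the normalization mentioned above is applied at the end.

```python
import numpy as np

# Hypothetical per-parameter memberships, as in the earlier sketch
Q = np.array([[0.2, 0.8, 0.3], [0.6, 0.9, 0.1]])            # Q_i(j)
C = np.array([[0.1, 0.7, 0.9, 0.4], [0.5, 0.8, 0.2, 0.0]])  # C_i(k)

Ti = np.minimum(Q[:, :, None], C[:, None, :])  # T_i(j,k) = min[Q_i(j), C_i(k)]
m = C.shape[1]                                 # total number of numerical ratings for C
k = np.arange(C.shape[1])                      # ratings 0 .. m-1

# Eq. (6.42) with the linear filter of fig. 6.24.a: Tf(j,k) = max_i {(k/m) * T_i(j,k)}
Tf = np.max((k / m) * Ti, axis=0)
Tf = Tf / Tf.max()                             # normalize so the highest membership is 1
print(Tf.round(3))
```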
A filtering process is applied to the total effect T because the focus here is on the overall
consequence to the structure. Six different filtering processes were considered in order to
determine whether the results would alter significantly. The filtering processes are shown
graphically in fig. 6.24. In each process, m is the same as defined for the filtering
equation and k is the numerical rating used for consequence C. The discontinuous
filtering functions (figs. 6.24.b and 6.24.c) yielded unsatisfactory results. The reason is
that the zero-slope region of the filtering function has the same effect as no filter. The
results from the remaining filtering processes (figs. 6.24.d, 6.24.f and 6.24.g) were
similar to those obtained from the linear filtering function of fig. 6.24.a.
It was suggested that perhaps a modified membership function for consequence C_i would
be fundamentally more sound. A membership function reflects the degree of belonging of a
numerical rating to a verbal description. An individual consequence C_i is not intended to
represent the integrated effect of a structure. It only contributes to the overall effect (that
is the function of the total effect T). Thus, modifying C_i in general is not desirable. In
addition, if the unanimous professional opinion of a bad consequence has a numerical
rating of 12 and a catastrophic consequence has a rating of 18, then these opinions
should not be altered just because a catastrophic consequence is more serious than a bad
consequence.
Figure 6.24 The filtering functions considered (filtering weight plotted against the numerical rating k, for ratings 0 to m).
Based on the representative cases studied, the filtered fuzzy relation algorithm yields
safety functions F which are in tune with intuition. The concept will enhance the current
practice of relying heavily on the inspector's experience to analyze the qualitative
information. Although a rigid frame was used to illustrate the algorithm presented, the
application of the concept is not constrained to buildings only. It can apply to any
structural system. The information required for the analysis is the condition (through
inspection) of the parameters within the system, and the consequence and effect on the
overall performance of the system associated with the condition of each parameter.
A sensitivity analysis on the shapes, degrees of curvature, shifting, gradient and
maximum membership values was performed. It was found that all but two of the factors
examined have no effect, or only very insignificant effects, on the safety function F. The
most pronounced effects are those due to changes in the gradient and due to shifting. In
the gradient study, the gradient was found to influence the range of the safety reductions. A
lower gradient value yields a wider range of safety reductions with a high
membership value. In the shifting study, the levels of safety reduction having a high
membership value shift in the same direction as the membership function for
consequence C or safety reduction N.
Based on the results of the membership function sensitivity analysis, the safety function F is, in
general, not significantly affected by the preciseness of the membership functions
developed for every parameter considered. Thus, the algorithm presented by Chou and
Yuan (1992) has practical applications in assessing aging infrastructure systems with
minimal expert input to establish the necessary membership functions.
The methods presented heretofore are among the most important for the representation
of reasoning under uncertainty. These methods, which the artificial intelligence
community refers to as uncertain inference schemes, also include Bayesian decision and
causal network approaches. Bayesian decision theory may be described as statistical
inference using the Bayesian position, i.e. a personalistic view of probability. If
probabilities are thought to describe orderly opinions, Bayes' theorem describes how the
opinions should be updated in the light of new information. Two major problems have
been identified with the implementation of Bayesian inference schemes: (1) combinatorial
explosion for realistic networks and (2) the estimation of prior probabilities. Because it
has been shown that humans display characteristics such as representativeness,
conservatism and the gambler's fallacy when dealing with probability, it has been
assumed that a subjective estimation of probability is not meaningful. These results may
be thought of as an impetus for the development of other uncertain inference schemes, in
the sense that alternative methods for modeling the "true" meaning of what humans term
"belief" have been sought. Others have attempted to develop methods for easier
assessment in order to use the Bayesian approach.
The causal network approach is a computationally tractable form of Bayesian decision
theory. A causal network is defined as an acyclical directed graph in which probabilistic
nodes are connected in a causal manner. "Directed" means that the nodes are connected
by arrows and "acyclical" means that the arrows cannot form a circle or cycle in the
graph or network. The connection in a causal manner allows for an easier assessment of
probabilities. In real-world usage, assessment is frequently made in the direction of,
observable → unobservable
or,
effect → cause
which is generally more difficult to assess than going from,
effect ← cause
The latter assessment is made in the causal direction.
A causal network is shown in fig. 6.25. In this figure, the conditional independence
assumption is illustrated:
p(e_k) = Σ_i Σ_j p(e_k | c_i, d_j) p(c_i) p(d_j)    (6.43)
where p(e_k | c_i, d_j) is the probability of e_k conditioned upon c_i and d_j. If the events
represented by nodes c_i and d_j are independent, then p(c_i, d_j) = p(c_i) p(d_j).
p(c_i) = Σ_m Σ_n p(c_i | a_1m, a_2n) p(a_1m) p(a_2n)    (6.44)
p(d_j) = Σ_m Σ_n p(d_j | b_1m, b_2n) p(b_1m) p(b_2n)
It can be seen from the above equations that node e is conditionally independent of nodes
a1, a2, b1 and b2 (because there are no arrows connecting these nodes), given that nodes
c and d are updated using nodes a1, a2, b1 and b2. The condition for the graph to be
acyclical, i.e. that it cannot contain any cycles, means that a node can never become
conditional upon itself. It is noted that a consequence of Bayes' theorem is that it
holds in both (arrow) directions.
In some instances, a node y may be conditional upon several, say n, nodes x_r, where r
= 1, 2, 3, ..., n. In order to reduce the assessment of 2^n probability values to n values, a
technique called the noisy OR gate was developed (Reed, 1993). In the noisy OR gate,
the probability of y conditional on the n nodes x_r, r = 1, 2, 3, ..., n, is estimated as
p(y | x_1, x_2, x_3, ..., x_n) = 1 − ∏_{r=1}^{n} (1 − p(y | x_r))    (6.45)
In this equation, p(y|x_1), p(y|x_2), p(y|x_3), ..., p(y|x_n) are assessed and then used to
estimate p(y | x_1, x_2, x_3, ..., x_n).
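The noisy OR combination of eq. (6.45) is straightforward to compute; the short sketch below uses hypothetical single-cause probabilities for illustration.

```python
from functools import reduce

def noisy_or(single_cause_probs):
    """Noisy OR gate, eq. (6.45): combine p(y|x_r) assessed for each parent alone.

    p(y | x_1, ..., x_n) = 1 - prod_r (1 - p(y|x_r))
    """
    return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), single_cause_probs, 1.0)

# Hypothetical assessments: three parent nodes individually causing y
print(noisy_or([0.3, 0.5, 0.2]))   # -> 0.72
```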
Dubois and Prade (1986) discuss specific features of probability and possibility theory
with emphasis on semantical aspects. They show how statistical data and possibility
theory can be matched. As a result, procedures for constructing weak possibilistic
substitutes of probability measures and for processing imprecise statistical data are
outlined. They provide new insights into the relationship between fuzzy sets and probability
theory.
In this way, fuzzy causal networks can be constructed to improve reasoning with
uncertainty in structural damage assessment. Probability, fuzzy set theory and the
Dempster-Shafer theory have been combined in SPERIL, developed by Yao (1985) and
Brown and Yao (1983), for evaluating post-earthquake structural system safety.
The Dempster-Shafer theory has been developed primarily to model measures of belief
when the probability distribution is incompletely known. This theory enables one to
include the consideration of lack of evidence. Dempster's rule for combining evidence
from different sources is provided in the method. The interval notation used in this
method suggests bounded limits on probability values. Although this method has been
combined successfully with others, the main criticism of it is that many consider it a
generalization of probability.
It should be obvious from the previous presentation that it is not simple to undertake a
definitive comparative study of the various uncertain inference schemes. First of all, it is
clear that the modeling approaches differ in their definitions and assessments of
uncertainty. Secondly, the manner in which the uncertainty (or uncertainties) is combined is
different. Converting from one scheme to the other numerically for all cases is not trivial,
if it is possible at all. However, measures for comparison have been defined (i.e. clarity,
completeness, hypothetical conditioning, complementarity, consistency, etc.) and
comparison results of fundamental properties of uncertain inference schemes are
summarized in Reed (1993).
Research on incidents occurring with conventional pressure vessels has shown that in
90% of cases the initial defects were located in a weld. For this reason, the present
analysis is primarily concerned with defects in welds; under-cladding defects have also
been considered in order to evaluate their harmfulness.
Data were collected from 3 European manufacturers: BREDA (Italy), FRAMATOME
(France) and Rotterdam Nuclear (Netherlands). Each manufacturer filled in, for each
weld, a standard form, giving complete information on NDT results (US or X-ray) before
and after repair: instrument calibration, weld size and description and position of the
defect in azimuth, in depth and in relation to the axis of symmetry of the weld. All of this
information was sent in confidence to the Ispra JRC of EEC-Italy, which processed and
harmonized the data.
A total of 338 meters of PWR and BWR shell were analyzed. The main conclusions are
as follows (Dufresne et al., 1986, 1988):
• Density of defects: the number of defects per weld varies from 0 to 50 (the mean value is
13).
• Position of the defects in the weld: there is no clear distribution of the defects
according to their depth and to their position in relation to the axis of symmetry of
the weld, but, for a given weld, defects are frequently gathered in some limited areas
of the weld, this probably being due to maladjustment of a parameter during the
welding process.
• Length of the defects: the cumulative distribution function before repair shows that,
for defects larger than 20 mm, the log-normal distribution is a good approximation.
With regard to the width distribution of the defects, unfortunately no data were
obtained from the manufacturers. After discussion with experienced welding operators, it
seems that a defect larger than a single pass is very unlikely. Therefore the number and
distribution of defects wider than one pass have been calculated by estimating the
probability for two or more defects to overlap, both in azimuthal and transversal section.
This probability is calculated using the Monte Carlo method.
The defect length and width distributions so obtained correspond to the observed defects
in a weld after fabrication and before repair, but the distribution to be incorporated in the
code must be processed in order to take into account the following factors: the sample
size, the accuracy of the measurement equipment, the reliability of the NDT methods and
equipment, and the size of acceptable defects according to the construction rules.
• Steam break, with different break sizes, with and without electrical power, and
different times to depressurize the primary circuit.
• Hot and cold overpressure during operation or start-up of the plant.
Concerning the pressure tests, the rupture probability has been computed for a test
pressure of 206 bars and at different temperatures between 40°C and 100°C.
For all the situations thus defined, a thermoelastic analysis is made at different crack
depths. Temperature and the three main stresses are computed at different steps of the
transient, and the stress intensity factor is computed with plastic zone correction.
Different approaches for rupture criteria have been considered: LEFM, EPFM and plastic
instability. After comparison of these three criteria with 141 representative experimental
data, it has been concluded that the criterion proposed by Dowling and Townley gives the
best evaluation of fracture conditions.
The toughness distribution as a function of the RTNDT temperature of a 508 Cl 3 steel has
been calculated from experimental results. The shift in transition temperature (ΔRTNDT)
due to irradiation has been calculated as a function of the neutron flux and of the
phosphorus and copper content, according to the RG 1.99 formula. From these data, it is
then possible to compute the toughness distribution at any point of the pressure vessel
and at each step of the transient.
The fracture probability is found by searching for the intersection of the toughness and stress
intensity coefficient histograms at each transient step. The probability of initiating
unstable crack growth for the considered defect is equal to the maximum value of the
various probabilities thus obtained during the various steps of the transient.
The modeling of the initiation of unstable crack growth during a transient has been
determined according to the following assumptions:
• When an internal defect becomes a through-wall defect, it is transformed into a semi-
elliptical surface crack having the same eccentricity and keeping its inner front at the
same position as the initial crack.
• When a defect becomes unstable at the crack tip of the major axis, the defect
becomes an infinite-length crack and the crack arrest criterion is applied at its inner
front.
• Every defect becoming unstable at its inner front is considered as leading to vessel
failure; the code does not consider the possibility of subsequent arrest in warmer and
less irradiated zones through the wall.
Calculations made with the COVASTOL code from the data compiled on French PWRs,
operated according to EDF rules, allow the following conclusions to be drawn:
• Rupture probabilities at 40 years vary from 10⁻⁸ to 10⁻¹² according to the weld
location and the safety injection water temperature; final results are presented in fig.
6.26.
In dealing with the quality control and/or the reliability of marine structures, attention
will be focused mainly on ships and offshore structures, leaving aside other types of
marine structures such as submersible vehicles, subsea installations and pipelines.
The techniques most frequently used for the detection of cracks in weld seams of underwater
constructions are the magnetic particle test methods and the ultrasonic technique (see
Section 6.2.2). The advantage of the magnetic particle testing technique is the high
detection capability in the indication of cracks in the weld seam and the simple evaluation
of the results. A disadvantage is the fact that the area to be inspected has to be cleaned
thoroughly. Carrying out the inspection with a heavy magnetizing yoke is very hard
work for the divers, since in most cases it is not possible to have a secure support at the
structure. In low water depths the sunlight hinders an evaluation of the indication, because
the fluorescence of the particles cannot be recognized. The inspection must then be postponed
to the evening or night. The most important disadvantage of the magnetic particle testing
technique, however, is that it is very difficult to integrate into manipulator systems.
Figure 6.26 Probability of rupture per year of a PWR pressure vessel after 40 years of
operation.
The ultrasonic technique can be applied easily in remote-controlled manipulators and is
actually often used in such systems. The main disadvantage of the ultrasonic technique
for the detection of fatigue cracks is that the complicated and permanently changing
geometries of the three-dimensional structure cause serious problems in the evaluation of
the signals. A further problem is the fact that ultrasonic inspection requires a very exact
sensor movement, which can be difficult for the diver, so that a mechanical guide or a
manipulator is necessary. It can be stated, however, that for the detection of the depth of cracks
with known position the ultrasonic technique can be applied with good success.
The eddy current testing method, which is a traditional technique for the detection of
surface cracks, has been applied offshore only for a very short time. The reason is that the signals of
weld seam roughness, and the changes of magnetic permeability and electric conductivity
caused by the welding process, are superposed on the crack signal, which reduces the detectability
of cracks. The fact that the eddy current technique is now more interesting for the
inspection of welds can be explained by the development of sensors and evaluation
techniques. The eddy current method makes very low demands on surface cleaning, so that
inspections without removing dirt layers are possible. The signal evaluation is
independent of ambient conditions and the eddy current method can be applied easily
in remote-controlled manipulators (Camerini et al., 1992).
Ship and offshore steel structures are designed to withstand the cyclic stresses created by
sea waves. Ship welding is a critical technology. It is difficult to manufacture welds in
which defects are few and far between. Some smaller defects will always be present
in the welds and there is a risk of having large defects. The welds frequently determine
the strength and usability of a welded product, more than any other single factor.
Therefore, the presence of severe stress in as-fabricated structures may cause fatigue
cracking during service. If a corrosion or a fatigue crack is detected in the steel or the
welds of the ship structure during in-service inspection, it is necessary to assess its
significance with respect to the safety of the facility. The key to the assessment of fatigue
cracks in marine structures is a validated analytical model that predicts the growth rate of
fatigue cracks and the corresponding probability of failure for the anticipated service
conditions (Stavrakakis, (1990), (1993)).
Fatigue crack growth behavior in metals is best described using the fracture mechanics
approach, in which fatigue crack growth rates, da/dN, are correlated with the applied stress
intensity factor range ΔK (see Section 6.3.3).
Cyclic stresses caused by sea-wave loading are random in nature, and fatigue analysis of
structures subjected to random loading is complicated. An equivalent-stress-range
scheme will be presented here, used to correlate fatigue crack growth results for
structures subjected to random loadings with those of structures subjected to constant-
amplitude loading.
If the crack growth per cycle is much less than the crack length (da/dN << a), and there
are no load-sequence interaction effects, then the total increment of crack growth in N
successive cycles, for a specific ΔK range, is
Σ_{i=1}^{N} Δa_i = C [(πa)^{1/2} F]^m Σ_{i=1}^{N} σ_i^m
This result is derived by summing the crack-length increment per stress cycle, calculated
using the Paris-Erdogan equation (see Section 6.3.3), for the N successive cycles, when
the applied stress range is σ_i, i = 1, 2, ..., N for the 1st, 2nd, ..., Nth cycle.
The average FCGR per stress cycle is then,
da/dN = (1/N) Σ_{i=1}^{N} Δa_i = C [(πa)^{1/2} F]^m (1/N) Σ_{i=1}^{N} σ_i^m
or
da/dN = C [(πa)^{1/2} F σ_eq]^m
where,
σ_eq = [ (1/N) Σ_{i=1}^{N} σ_i^m ]^{1/m}, i = 1, 2, ..., N
Figure 6.27 Example of a stress-range histogram obtained from a random load history.
From the histogram of fig. 6.27.a, σ_eq is evaluated from the equation,
σ_eq = [ Σ r σ^m ]^{1/m}
where σ is the stress range, r is the frequency of its occurrence, and m is the exponent in
the Paris-Erdogan FCG rate equation.
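A small sketch of this histogram-based equivalent stress range; the stress ranges and cycle counts below are hypothetical, and the counts are normalized to relative frequencies before applying the formula (an assumption about how r is defined).

```python
import numpy as np

def equivalent_stress_range(stress_ranges, counts, m):
    """Equivalent stress range from a stress-range histogram.

    sigma_eq = [ sum_j r_j * sigma_j**m ]**(1/m), with r_j the relative frequency
    of occurrence of stress range sigma_j and m the Paris-Erdogan exponent.
    """
    sigma = np.asarray(stress_ranges, dtype=float)
    r = np.asarray(counts, dtype=float)
    r = r / r.sum()                      # normalize counts to relative frequencies
    return (np.sum(r * sigma**m)) ** (1.0 / m)

# Hypothetical histogram: stress ranges (MPa) and observed cycle counts
print(equivalent_stress_range([20, 40, 60, 80], [5000, 2000, 400, 50], m=3.0))
```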
The other way to describe the random load history is to use the PSDF, as shown in fig.
6.28. The PSDF is obtained from a spectral analysis of the original random load
history. If the random load history is a stationary Gaussian process, as is commonly
assumed, then a PSDF exists, G(ω), which possesses all the statistical properties of the
original load history (Cheng, 1988).
The two most important parameters in random-loading fatigue analysis that can be
retrieved from the PSDF are the root-mean-square (rms) value of the load amplitude and
the irregularity factor, α, of the random load history. The rms value is the square root of
the area under the PSDF.
The irregularity factor is defined as the ratio of the number of positive-slope zero
crossings, N_0, to the number of peaks, F_0, per unit time in a load history:
α = N_0 / F_0
The exact values of N_0 and F_0 can be evaluated from G(ω) as follows:
N_0 = (M_2 / M_0)^{1/2}
and,
F_0 = (M_4 / M_2)^{1/2}
where M_0, M_2 and M_4 are the zeroth, second and fourth moments of G(ω) about the
origin (zero frequency) and are defined as,
M_0 = ∫_0^∞ G(ω) dω
M_2 = ∫_0^∞ ω² G(ω) dω
M_4 = ∫_0^∞ ω⁴ G(ω) dω
Thus,
α = M_2 / (M_0 M_4)^{1/2}
Figure 6.28 Example of a power spectral density function (double-peaked spectrum).
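The spectral moments and the irregularity factor defined above can be evaluated numerically from a sampled PSDF; the sketch below uses trapezoidal integration and a hypothetical single-peaked G(ω) purely for illustration.

```python
import numpy as np

def spectral_fatigue_parameters(omega, G):
    """Spectral moments and load-history parameters from a one-sided PSDF G(omega).

    Returns the rms value, the zero-crossing and peak rates N0 and F0,
    and the irregularity factor alpha = N0 / F0.
    """
    M0 = np.trapz(G, omega)
    M2 = np.trapz(omega**2 * G, omega)
    M4 = np.trapz(omega**4 * G, omega)
    rms = np.sqrt(M0)
    N0 = np.sqrt(M2 / M0)
    F0 = np.sqrt(M4 / M2)
    return rms, N0, F0, N0 / F0

# Hypothetical single-peaked spectrum for illustration
omega = np.linspace(0.01, 3.0, 500)
G = np.exp(-((omega - 0.6) / 0.15) ** 2)
print(spectral_fatigue_parameters(omega, G))
```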
The free surface elevation of the sea can be modeled by an ergodic Gaussian process for
adequately short periods of time. This short-term description implies that the process is
homogeneous in time and in space, that is, its probabilistic properties do not change with
time nor with location. Thus, it is equivalent to estimate those properties from several
sea surface elevation records made at different times at the same point or made at the
same time at different points.
Each sea state is completely described by a wave spectrum. These spectra result from
physical processes and are therefore amenable to theoretical modeling. Various
mathematical expressions to represent average sea spectra have been proposed in the past. The
one that has become generally accepted and that has been commonly used in response
analysis is due to Pierson and Moskowitz, although it is most commonly seen in the
parametric form proposed by the International Ship Structures Congress (ISSC); see
Guedes-Soares (1984) and Hogben et al. (1976). A sea state is often characterized by an
average wave period T_z and a significant wave height H_s, which are related to the
spectral moments by:
T_z = (1/2π) (m_0 / m_1)
Having a sea state defined by the four parameters H_s, T_z, H_R and T_R, where H_R and T_R
are the ratios of significant wave height and average period of the two spectral
components, each spectral component S_k, k = s, w, can be given by the equation:
S_k(f) = S_PM(f) γ^q    (m² · sec)
where S_PM is the ISSC spectrum:
S(f) = 0.11 H² T (Tf)^{-5} exp{ −0.44 (Tf)^{-4} }    (m² · sec)
and γ is the peak enhancement factor of the JONSWAP spectrum,
q = exp{ −(1.296 Tf − 1)² / (2σ²) }
f = ω / 2π
T = T_z · F_2
H = H_s / √F_1
The JONSWAP parameter σ is used at its mean values:
σ_a = 0.07 for f ≤ 1/(1.296 T)
σ_b = 0.09 for f > 1/(1.296 T)
The quantities F_1 and F_2 are two constants that correct for the difference in peak period
and area between a Pierson-Moskowitz (P-M) and a JONSWAP spectrum. The values
of these parameters depend on γ, as shown in Table 6.3.
The two additional parameters that define the double-peaked spectrum are H_R and T_R.
The latter is easily determined from a measured spectrum, as the ratio of the spectral
peak frequencies. Another easily obtainable quantity is the ratio between the two spectral
ordinates, S_R. To relate this ratio with H_R, it is necessary to obtain the expression of the
spectral ordinate at the peak frequency.
The peak frequency is determined by equating to zero the derivative of S_k with respect to
f. If this is done, it follows that the ratio of the two spectral peaks S_sp and S_wp is given
by:
correct for that effect, so as to make the procedure applicable regardless of the distance
between spectral peaks.
Having a double-peaked spectrum defined by its 4 parameters, one first determines the
ordinates of the theoretical spectrum at the two peak frequencies, and their estimated
ratio S̄_R. This value is larger than or equal to the value of the spectral parameter S_R. The value
of H_R to be used in the above equations is thus corrected by the factor k_R = S̄_R / S_R,
which upon substitution in the last equation results in:
H_R = (S_R / k_R)^{0.5}
If the sea spectrum describes a stationary Gaussian process, the assumption that the
transfer function is linear implies that the response is also a stationary Gaussian process,
and thus completely described by the response spectrum.
These theoretical formulations describe average spectral shapes expected to occur in the
presence of a given wind or in a sea state of known characteristic wave height and
period. There is, however, considerable uncertainty in the shape of an individual spectrum
due to the large variability of the generation process and of the estimation methods.
The sources of uncertainty in the spectral shape definition are discussed by Guedes-
Soares (1984), and a method is proposed there to model them and to determine their
effect on the uncertainty of the response parameters. This treatment accounts for both
fundamental and statistical uncertainties in the spectral shape. The results are given in
terms of the response quantities predicted by the standard method of calculation of ship
responses. They indicate the bias and the coefficient of variation of the standard
predictions, being thus a representation of the model uncertainty of that response
calculation method.
The main feature of the standard response method is, in this context, the use of the ISSC
spectrum to represent all sea states. Thus the results can also be interpreted as the model
uncertainty of the ISSC spectrum in representing all types of sea spectra.
In addition to the uncertainties related to the shape of the spectrum, ship responses are
also subject to other uncertainty sources such as the relative course and speed of the ship.
Thus it is often more meaningful to operate with expected values of responses than with
responses to specific sea states. Different possible formulations of mean response and
mean bias are examined there.
If the load history under consideration is in the form of a narrow-band random process
(irregularity factor α = 1), the value of the equivalent stress range σ_eq can be calculated
from the following closed-form expression (Cheng, 1988):
σ_eq = 2√2 (rms) [Γ(1 + m/2)]^{1/m}
where Γ(·) is the gamma function and m is the exponent in the Paris-Erdogan FCG rate
equation. There are no closed-form solutions for wide-band (α < 0.99) random loads.
For wide-band random loads, solutions derived from numerical analysis to convert the PSDF
to σ_eq are presented in graphical form in Cheng (1988).
FCG rates under sea-wave loading can thus be calculated using the probabilistic fracture
mechanics approach described in Section 6.3.3 and the equivalent-stress-range concept
described above.
The procedure starts with an anticipated (or assumed) stress spectrum acting on a
component of interest. The equivalent stress range for the sea-wave loading is then
calculated.
The COVASTOL computer program, presented in sections 6.3.3, 6.3.4 and 6.4.1, or any
similar FCG prediction program, can be applied in a straightforward manner to predict at
any time the residual life-time and to estimate, using inspection data, the failure probability
of marine structures under sea-wave action. The main objective of applying this program
is to provide information about the availability of marine structures, which is important
for their efficient design, for defining optimal after-service inspection and maintenance
policies, and for estimating the financial risk associated with the loss of marine structures.
In this section, a simple causal network is built for post-earthquake structural damage
assessment. "Uncertainty" in this context includes the description and prediction of
loading conditions, material properties of structural components, the difference between
the simplistic mathematical modeling of the structure and the actual behavior and loading
path, and the imprecision involved in construction. Evaluation of structural damage by an
expert involves all of these "uncertainties" in some implicit fashion, which is assumed to
be appropriately characterized as a degree of belief in the severity of damage from
observations. For example, assigning a degree of belief to an event such as "global loss of
strength" being "moderate" is assumed to be an accurate characterization of damage
assessment.
The causal form of the network corresponds to the physical "pathway" of damage, i.e.
the two causes of structural failure are inadequate stiffness and insufficient strength.
Creating a causal network in which the component damage "causes" global damage
which "causes" a level of structure failure extent corresponds to the physical model of
damage which is familiar to structural engineers. It also allows engineers to identify
in a concise manner how the damage would be evaluated at each stage.
It is cognitively easier to make (subjective) probability assessments in the causal
direction. If necessary, the noisy OR gate (eq. (6.45)) provides a method for generating
conditional probabilities.
The structural damage assessment network will be evaluated using fig. 6.25. Each node
or event can take on the value of none, slight, moderate or severe, i.e. the indices in
equations (6.44) will all be defined on the range from unity to four. In fig. 6.25, the
events that the nodes represent are as follows:
Node   Meaning
e      Structure failure extent
c      Global loss of strength
d      Global loss of stiffness
a1     Component loss of strength
b1     Component loss of stiffness
a2     Global damage to strength
b2     Global damage to stiffness
The network is interpreted as follows. The probability of "structure failure extent" taking
on the values "none", "slight", "moderate" or "severe" is denoted by p(e_k), where k = 1,
2, 3, 4 represents the four values. The event "structure failure extent" is "caused" by
global loss of strength and stiffness, nodes c and d, respectively. The events these nodes
represent are "caused" by "component loss of strength" and "component loss of
stiffness", respectively, and by "global structural changes" such as a change in the natural
frequency. It could be argued that component loss of stiffness is influenced by
component loss of strength; however, this influence is assumed to be small enough that
the two should be treated as different, distinct causes of failure. Preliminary conditional
probability values can be assessed on the basis of experience and limited consultation
with colleagues; additional values are estimated on the basis of the noisy OR gate. It is
necessary to normalize the probabilities generated in this manner.
This aspect of the evaluation process is the most complex, and the most uncertain. Given
the input or marginal probabilities for these nodes, the probability p(structure failure
extent) may be calculated using eqns. (6.43)-(6.45) of Section 6.3.6. First, marginal
probability values are assessed for nodes a1, a2, b1 and b2. These values are input into
eqns. (6.44). Multiplication and summation of the marginal probabilities yield p(c_i) and
p(d_j), where i, j = 1, ..., 4. These values are input into eqn. (6.43) for p(e_k), where
multiplication and summation yield p(e_k), k = 1, ..., 4.
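A compact sketch of this propagation: marginals for the component-level nodes are combined with conditional probability tables through eqns. (6.44) and then (6.43). The conditional tables below are randomly generated and all numerical values are hypothetical, purely to show the mechanics.

```python
import numpy as np

def propagate(p_parent1, p_parent2, cond):
    """Eqns. (6.43)/(6.44): p(child_i) = sum_m sum_n p(child_i | par1_m, par2_n) p(par1_m) p(par2_n).

    cond has shape (4, 4, 4): cond[i, m, n] = p(child = i | parent1 = m, parent2 = n),
    the four states being none / slight / moderate / severe.
    """
    joint = np.outer(p_parent1, p_parent2)   # p(par1_m) p(par2_n), independence assumed
    return np.tensordot(cond, joint, axes=([1, 2], [0, 1]))

# Hypothetical marginals for component-level nodes a1, a2, b1, b2 (none, slight, moderate, severe)
p_a1 = np.array([0.1, 0.2, 0.4, 0.3])
p_a2 = np.array([0.25, 0.25, 0.25, 0.25])
p_b1 = np.array([0.6, 0.3, 0.1, 0.0])
p_b2 = np.array([0.5, 0.3, 0.15, 0.05])

rng = np.random.default_rng(0)
def random_cpt():
    """Random conditional table; each column sums to 1 over the child index."""
    t = rng.random((4, 4, 4))
    return t / t.sum(axis=0, keepdims=True)

p_c = propagate(p_a1, p_a2, random_cpt())    # global loss of strength
p_d = propagate(p_b1, p_b2, random_cpt())    # global loss of stiffness
p_e = propagate(p_c, p_d, random_cpt())      # structure failure extent, eq. (6.43)
print(p_e.round(3), p_e.sum())
```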
For purposes of illustration, numerical examples are given in Table 6.4. It can be seen
that the degrees of belief in the structure failure extent seem reasonable for the given
inputs. For example, extreme inputs yield extreme results. For mixed inputs on the
component level, the structure failure extent appears to have the greatest degree of belief
associated with moderate damage, as would be expected. Given this information, a
structural engineer would be able to decide whether structural rehabilitation was
required. Structures for which the degree of belief in the structure failure extent being
moderate or severe was high would be required to undergo rehabilitation.
A more realistic causal network would include damage observations for each component
and individual types of global structural changes. The total component loss of strength
and stiffness would therefore be conditional upon the type and severity of damage
observed. Another extension of the present efforts should be to evaluate diagrams which
have horizontal arrows connecting nodes at each level. Any set of influences is permitted
in a causal network as long as the network is acyclic.
Table 6.4 Degrees of belief in the structure failure extent (none, slight, moderate, severe) for six illustrative sets of inputs:
[0, 0.33, 0.34, 0.33], [0, 0, 0.5, 0.5], [0, 0.33, 0.34, 0.33], [0, 0.29, 0.41, 0.3], [0, 0.14, 0.48, 0.38], [0, 0.19, 0.5, 0.32]
References
Akyurek T. and O.G. Bilir (1992). A survey of fatigue crack growth life estimation
methodologies. Engineering Fracture Mechanics, 42, 5, p. 797.
Al-Obaid Y.F. (1992). Fracture toughness parameter in a pipeline. Engineering Fracture Mechanics, 43, 3, p. 461.
Ben-Amoz M. (1992). Prediction of fatigue crack initiation life from cumulative damage tests. Engineering Fracture Mechanics, 41, 2, p. 247.
Bhargava V. et al. (1986). Analysis of cyclic crack growth in high strength roller bearings. Theoretical and Applied Fracture Mechanics, 5, p. 31.
Baldwin J.F. and B.W. Pilsworth (1980). Axiomatic approach to implication for approximate reasoning with fuzzy logic. Fuzzy Sets and Systems, 3, p. 193.
Bogdanoff J.L. and F. Kozin (1985). Probabilistic Models of Cumulative Damage. John Wiley and Sons, N.Y.
Bogdanoff J.L. and F. Kozin (1984). Probabilistic models of fatigue crack growth. Engineering Fracture Mechanics, 20, 2, p. 225.
Box G.E.P. and G.M. Jenkins (1976). Time Series Analysis, Forecasting and Control. Holden-Day, San Francisco, CA.
Brown C.B. and J.T.P. Yao (1983). Fuzzy Sets and Structural Engineering. Journal of Structural Engineering, ASCE, 109, 5, p. 211.
Camerini et al. (1992). Application of automated eddy current techniques for off-shore inspection. In C. Hallai - P. Kulcsar (Eds.), "Non-Destructive testing '92", Elsevier Science Publishers.
Cheng Y.W. (1985). The fatigue crack growth of a ship steel in saltwater under spectrum loading. International Journal of Fatigue, 7, 2, p. 95.
Cheng Y.W. (1988). Fatigue crack growth analysis under sea-wave loading. International Journal of Fatigue, 10, 2, p. 101.
Chou K.C. and J. Yuan (1992). Safety assessment of existing structures using a filtered fuzzy relation. Structural Safety, 11, p. 173.
Chow C.L. and K.H. Wong (1987). A comparative study of crack propagation models for PMMA and PVC. Theoretical and Applied Fracture Mechanics, 8, p. 101.
Cortie M.B. and G.G. Garrett (1988). On the correlation between the C and m in the Paris equation for fatigue crack propagation. Engineering Fracture Mechanics, 30, 1, p. 49.
D'Attelis C. et al. (1992). A bank of Kalman filters for failure detection using acoustic emission signals. In C. Hallai - P. Kulcsar (Eds.), "Non-destructive testing '92", Elsevier Science Publishers.
Dubois D. and H. Prade (1986). Fuzzy sets and statistical data. European Journal of Operational Research, 25, p. 345.
Dufresne J., Lucia A., Grandemange J. and A. Pellissier-Tanon (1986). The COVASTOL program. Nuclear Engineering and Design, 86, p. 139.
Dufresne J., Lucia A., Grandemange J. and A. Pellissier-Tanon (1988). Probabilistic study of the failure of pressurized water reactor (PWR) vessels. Report EUR No 8682, JRC-Ispra (Italy), Commission of the European Communities.
Fukuda T. and T. Mitsuoka (1986). Pipeline inspection and maintenance by applications
of computer data processing and Robotic technology. Computers in Industry, 7, p. 5.
Garribba S. et al. (1988). Fuzzy measures of uncertainty for evaluating non-destructive crack inspection. Structural Safety, 5, p. 187.
Georgel B. and R. Zorgati (1992). EXTRACSION: a system for automatic Eddy Current diagnosis of steam generator tubes in nuclear power plant. In C. Hallai - P. Kulcsar (Eds.), "Non-Destructive testing '92", Elsevier Science Publishing.
Ghonem H. and S. Dore (1987). Experimental study of the constant-probability crack growth curves under constant amplitude loading. Engineering Fracture Mechanics, 27, 1, p. 1.
Godfrey M.W., Mahcwood L.A. and D.C. Emmony (1986). An improved design for point contact transducer. NDT International, 19, 2.
Grangeat P. et al. (1992). X-Ray 3D cone beam tomography application to the control of ceramic parts. In C. Hallai - P. Kulcsar (Eds.), "Non-Destructive testing '92", Elsevier Science Publishers.
Guedes-Soares C. (1984). Probabilistic models for load effects in ship structures. Report UR-84-38, Marine Technology Dept., The Norwegian Institute of Technology, Trondheim, Norway.
Hadipriono F. and T. Ross (1987). Towards a rule-based expert system for damage assessment of protective structures. Proceedings of International Fuzzy Systems Association (IFSA) Congress, Tokyo, Japan, July 20-25.
Hagemaier D.J., Wendelbo A.H. and Y. Bar-Cohen (1985). Aircraft corrosion and detection methods. Materials Evaluation, 43, p. 426.
Halford et al. (1989). Fatigue life prediction modeling for turbine hot section materials. ASME Journal of Engineering for Gas Turbines and Power, 11, 1, p. 279.
Stavrakakis G.S., Lucia A.C. and G. Solomos (1990). A comparative study of the probabilistic fracture mechanics and the stochastic Markovian process approaches for structural reliability assessment. International Journal of Pressure Vessels and Piping, 41, p. 25.
Stavrakakis G.S. and A. Pouliezos (1991). Fatigue life prediction using a new moving window regression method. Mechanical Systems and Signal Processing, 5, 4, p. 327.
Stavrakakis G.S. and S.M. Psomas (1993). NDT data interpretation using Neural
Networks. In "Knowledge based system applications in power plant and structural
engineering", SMiRT 12 post-conference Seminar no. 13, August 23-25, Konstanz,
Germany.
Thoft-Christensen P. and J.D. Sorensen (1987). Optimal strategy for inspection and repair of structural systems. Civil Engineering Systems, 4, p. 17.
Van Dijk G.M. and J. Boogaard (1992). NDT reliability - a way to go. In C. Hallai and P. Kulcsar (Eds.), "Non-Destructive testing '92", Elsevier Science Publishers.
Vancoille M.J.S., Smets H.M.G. and F.L. Bogaerts (1993). Intelligent corrosion management systems. In "Knowledge based system applications in power plant and structural engineering", SMiRT 12 post-conference Seminar no. 13, August 23-25, Konstanz, Germany.
Verreman Y. et al. (1987). Fatigue life prediction of welded joints - a reassessment. Fatigue and Fracture of Engineering Materials and Structures, 10, 1, p. 17.
Virkler D.A., Hillberry B.M. and P.K. Goel (1979). The statistical nature of fatigue crack propagation. ASME Journal of Engineering Materials and Technology, 101, p. 148.
Yanagi C. (1983). Robotics in material inspection. The NDT Journal of Japan, 1, No 3, p. 162.
Yao J.T.P. (1985). Safety and Reliability of Existing Structures. Pitman Publishing, Marshfield.
Zhu W.Q. and Y.K. Lin (1992). On fatigue crack growth under random loading. Engineering Fracture Mechanics, 43, 1, p. 1.
Author index
Boden 344
A Bogdanoff464, 479, 480
Adamopoulos 368 Bonivento 101
Adams 101 Boogaard 434
Adelman261 Boose 265, 269
Ahlqvist 333 Bothe404
Akyurek 468, 469 Box 28, 484
Al-Obaid 469 Bradshaw 265
Ali 18, 19,20 Brailsford 317
Aljundi 416 Brown 499,501,505
Alty 269 Brole 271
Anderson 9, 16, 17, 19, 104, 105, 111, 273 Buchanan 261,263
An~aklis 138,285,371,408
Armstrong 171
C
Arreguy 338 Camerini 511
Ast 66 Cao 393
Athans 102, 113 Carlsson 211, 229
Carpenter 385, 389
B Carriero 280, 281
Baines 43 Cecchin 22
Ballard 375, 378 Chan 371
Bandekar 274 Chang 401, 414, 417
Baram 116 Chen 100, 139, 140, 148,330,331,401
Barschdorff371,404 Cheng 512, 513, 518
Bartlett 416 Cheon 417
Baskiotis 229 Chien 101
Basseville 3, 103, 118, 119, 120 Chin 57
Bavarian 401 Chitturi 18, 19
Beattie 163 ChoI67,247,382,393
Ben-Amoz 467 Chou 501, 503
Bennett 7,9 Chow93, 101, 129, 149,371,404,477
Benveniste 118, 120 Clark 125
Bems 371 Coats 167
Bhargava 465 Cohen 261, 262, 383
Bickel105 Console 275
Bierman 199 Contini 271,273
Bilir 468, 469 Cordero 208
Blazek 29, 37, 40 Cortie 470,471
Blount 271 Cue 50, 54, 64
Blumen 17 Cybenko 382
D Froechte 167
Fuchs 246
Daley 135
Dalla Molle 224, 225 G
Danai 57
Gaines 269
Darenberg 157
Gantmacher 142, 190
Davis 271, 401
Garrett 470, 471
De Kleer 277, 302, 356
Geiger 221,234,235
De Mello401
Gelernter 280, 281
Deckert 101, 120, 129
Gertler 93,95, 168,273
Dehoff229
Ghonem 468,470,481
DeLaat 161
Gien 401
Dernpster 505
Godfrey452
Desai 101
Goodwin 211,213,214
Dialynas 237, 406
Grangeat 450, 452, 457
Dixon 8
Gray 267,268
Dobner 167
Greene 113
Doel 322, 332
Grizzle 167
Dolins 288, 291
Grogono283
Dong496
Grossberg 383, 385, 389
Dore 468,470,481
Grober 261,262
Dorernus 418
Guedes-Soares 514, 515, 518
Dounias 271
Guo415,417
Dowdle 120
Gupta 277, 279, 282
Dubois 505
Gustaffson 120
Dufresne 434,468,473,475,506
H
E
Hadipriono 496
Edelmayer 301
Hagernaier 439
Elkasabgy 75
Halford465
Engell144
Hamilton 333
Eryurek412
Hammer 272, 329
F HasseImann 515
Hawkins 29
Favier 200
Hedrick 167
Feldman 375, 378
Henry 229
Feng 371
Hickman 260, 270
Fink 273, 275, 335
Himmelblau 26, 40, 100, 224, 370, 371,
Finke 401
396
Forbus 274
Hoeppner 468,469,471,507
Forsythe 261,263
Hoerl28
Fortescue 206
Hoey401
Frank 100, 124, 126, 127, 141, 148, 149,
Hoff 379
170,171,273
Hogben 514,515
Franklin 7, 9
Hopfield 373, 374, 378, 383, 384, 385
Freiling 275
Hoskins 371,396
Freyermuth 246, 273, 338
Hudak465 Kramer 28
Hudlieka 273 Krupp 468, 469, 471, 507
HuH 432, 442 Kuan 320
Hunter 29, 30 Kumamaru 102,223
Kusiak 402
I
Kwon 211,213,214
Ikeuem 56, 59
L
Ikonomopoulos 414,419
Ioqnnou 383 Lainiotis 102, 113
Irwin 204 Landez449
Iserman 181, 182, 183,246,247,273 Lankford 465
Ishiko 330, 331 Laws54
Lee 339, 343, 355, 356, 357, 415
J Lehr 378, 379
Janik 246 Lesser 273
Janssen 127 Li 66,70
Jenkin 111 Ligget 18
Jenkins 484 Lin470, 475
Johannsen 269 Ljund 36
Johnson265 Ljung 203,216
Jones 102, 116 Lo401
Journet 470 Lou 129, 146
Jovanovie 488 Loukis 71, 75
Lueas 29, 31
K
Lueia 434, 463, 464, 468, 469, 470, 479,
Kaiser 260 480,492,521
Kalouptsidis 203, 204 Ludwing457
Kalyanasundaram 459 Luger 260
Kaminski 249 Lusth 273, 275, 335
Kangethe 100 Lyon 43, 45, 53
Karakoulas 273, 297
Karkanias 138 M
Kasper 159 MaeDonald 268
Kawagoe 107 MaeNeilll7,18
Kendall 3, 4, 6, 9, 10, 12, 17, 19 Madsen 467
Kim 382, 393, 415, 417, 418 Maguire 204
Klein 264 Majstorovie 271
Kohonen 371, 374, 378, 387, 390, 392, Marei 469
393,401,402,403,415 Marsh 275
Komatsu 458 Maruyama 297,300,301,364
Konik 144 Massoumnia 100, 129
Konstantopoulos 371, 409 Matsumoto 405
Kosko 385 Mayne 208
Kosmatopoulos 383, 385 MeClelland 378, 380, 379
Kouvaritakis 138 Mehra 12, 101, 103
Kozin 464,479,480 Merrill 161,229
Merrington 229 Patton 100, 136, 139, 140, 145, 148, 153,
Miguel371 154
Milne 271 Pelloux 470
Minsky 374, 379 Peng 273
Mirchandani 393 Pengelly 66
Mironovskii 10 1 Peschon 101, 103
Mitchell 48, 50, 55, 62, 63 Pignatiello 29,35
Mitsuoka 452 Polycarpou 383
Mohammadi 492, 493, 494 Pomeroy285
Monostori 338 Pot 200
Moon401 Potterl0l, 135, 197,202
Moore 140 Pouliezos 12, 102, 106, 111, 112, 116, 121,
Morpurgo 271 122, 182, 195, 218, 221, 273, 339,
Moskwa 167 485,486,487
Moussas 482, 483, 487 Prade 505
Müller 102, 116 Prasad271
Mussi 271 Prock 296
Protopapas 334
N
Psomas 462
Naidu 371, 411
Namioka453
R
Narayanan 270 Raj 459
Nawab 261 Randa1l45, 50, 55, 59, 63, 64, 66, 71
Nett 408, 409, 411 Randles 16, 17
Neumann 246,337,338 Rasmussen 264
Nielsen445 Rauch 371
Nikiforov 103 Ray 101
Nisitani 470, 474 Reed 401,504,505,521
Nold 183 Reese 288, 291
Noore 325, 327 Reggia273
Novak334 Reiss 246
o Rhodes204,273,297
Rizzoni 167
Obreja302 Robert 18
Ogi 406 Roberti 457
Ohga 417 Roh 418
Ono 102 Ross 496
p Roth264
Rouse 261
Palm 28 Rummelhart 378, 379, 380
Pandelidis 277,356
Pao 370,382,390,415 S
Papert 374, 379 Saccucci 29, 31
Pappis 368 Sahraoui 335, 338
Parpaglione 458 Sakaguchi 405
Passino 285, 350, 371 Sandberg 493
Yashchin 37
Ydstie 207
Yeh205
Yoon 272, 329
Yoshimura 101
Young 182
Yuan 501, 503
Z
Zeilingold 401
Zhu 470, 475
Subject index
testing 101
A
cluster 373, 385, 391, 392, 394,415
accumulated cycles 471, 485
diagrams 390
activation function 374, 376, 388, 389
COBRA416
activity 374, 381
competition 388
adaptive resonance theory 374, 385, 393,
condensed nearest neighbor 405
401
connectionist expert system 418
algorithm
content addresable memory model 373
modified Gram Schmidt 249
continuous spectrum 438
square root 197,204
control chart
U-D factorisation 198
CUSUM 29, 34, 36
analytical redundancy 99
exponentially weighted moving average
ARMAX model 187
29
ARX model 188
multivariate 33
ASCOS scheme 127
multivariate Shewhart 34
associative memory model 373
univariate Shewhart 26
attentional phase 387
correct detection 97
autocorrelation matrix 110
correlation coefficient 111
autoregressive model with exogenous signals
cost function 192
188
covanance
autoregressive moving average model with
matrix 197
exogenous signals 187
instability 196
autospectrum 60, 61
singularity 200
B cross spectrum 54, 60, 61
back-propagation 373, 380, 383, 393, 401, crosspower spectrum 459
405,406,410,411,413,417,418,419 CTLS 193
backward likelihood ratio function 107 curve analysis fauIt diagnosis 287
Bayes rule 114 cyc1e-counting 512
bearings failure diagnosis 48, 50, 66, 68, cyc1es to failure 466, 467
322 cyc1es to rupture 465
bilevel function 376 cyc1ic load 465
black box identification 215
D
bubble 391
data weights 187, 190,205,226
C decision function 100, 158, 159, 160
causal network 286, 519 deconvolution 148
chi -squared decoupling
distribution 105, 109 approximate 145, 147
non-central 119 dedicated observer scheme 124
random variable 119 departure from nuclear boiling ratio 415